Reducing Qualitative Errors with the SOUNDEX Function

Naming deviations and manual entry errors such as typos and spelling mistakes are time sinks that data engineers spend their time working around. How many times have you personally had to deal with data discrepancies arising from the fact that some entries in your database have similar but different spellings, even though they mean the same thing, sound the same when spoken, and should be grouped together? Well, you are in luck as there happens to be a simple yet elegant solution for this specific issue – the Soundex function. 


On October 14th, 2020, Google BigQuery updated its standard SQL and implemented support for this function. Now it is publically available and straightforward for everyone to use. Please continue reading to find out how Soundex works, what it is for, and some of its common uses.

How Does Soundex Work?

Soundex is a SQL function that converts any text such as a name or an alphanumeric string into a code representing what it would sound like when spoken aloud in English. Its simple phonetic algorithm encodes a word into a four-character code, starting with a letter and ending with three numbers, such as M-233 (which stands for “Masthead”). The first character corresponds to the word’s first letter, and the three numbers represent everything except for vowels and letters ‘H’ and ‘W.’

 

1 – B, F, P, V

2 – C, G, J, K, Q, S, X, Z

3 – D, T

4 – L

5 – M, N

6 – R

 

This way, “Masthead” gets turned into a simple and short M-232, and it would not matter if “Maasdhead” or “Masted” were used instead. It is easy to see how this method of grouping names together is simple to implement yet effective.

Soundex Makes Names Easier

Where Soundex shines in its effectiveness is reducing qualitative errors, more precisely, typos and misspellings. However, it is also uniquely useful when working with human names specifically. Searching for and grouping people with similar-sounding names can be a grueling task in some cases, and Soundex helps alleviate manual entry and automate things.


For example, let’s say you are trying to find a customer named Nickolas in your database. However, you are unsure if their name happens to be spelled as Nicholas, Nicolas, or Nikolas instead. Instead of having to guess, you could just easily store all names additionally as Soundex strings. In this case, a short code like N-242 makes grouping all of those similarly named people together easy. This function is a commonplace use of Soundex.

Using Soundex with UTM Codes

In short, UTM codes are the tags at the end of website links. Most commonly used in advertising, they help the business track where website visits are coming from, what advertising campaign or platform was responsible for the visits, and what search term keywords are used to find the website.

This link below is what UTM looks like: 

www.mastheadata.com/?utm_medium=medium&utm_source=source&utm_campaign=campaign&utm_term=searchterm

The Soundex function is one of the best ways to make working with UTM codes just a little bit easier. With its help, tasks like grouping multiple website sources and search terms together become much faster. Using Soundex reduces the time you or your team will be hardcoding things. Automation is key, and spending time on repetitive and tedious tasks is a waste.

Data quality is an everyday issue for data engineers, analysts, and stakeholders. Using Soundex, you can take data products at your company to the next level by allowing this effective and simple function to decode typos and other human errors. Relying on data is hard, and there is often bias from technology-resistant team members in organizations when it comes to trusting data. At Masthead, we are building a reliability platform that helps catch data errors before data issues end up in data consumers’ dashboards. Talk to us now to learn more about how Masthead delivers data integrity!

Post Tags :

Soundex, SQL

CEO, Masthead

Stay in touch

Learn more about how Masthead secures your data quality!