Enriching addresses: supplementing data from a second source

Address data is more valuable the more information each individual data record contains. However, the information required for a specific purpose is often spread across several data sources. This could be the telephone number associated with an address, the email address, the cancellation date or information on whether a specific address belongs to a customer or not (→ customer number). Gathering this information manually is usually far too time-consuming and therefore not a viable option. In order to enrich addresses, i.e. to transfer data from one database to another, specifically to the data record to which this information belongs, you need a way of doing this that is as automated as possible.

Unfortunately, addresses are data that are not usually easy to match with other data records. As a rule, there is no clear criterion for this. This is because addresses are not only made up of different components, they are also often recorded differently. Spelling mistakes are just one of the problems here. There are also transposed words, abbreviations, omissions and additions, and in extreme cases even synonyms such as nicknames. In order to be able to carry out address enrichment, however, you need as clear a relationship as possible between the two data sets to be used for this purpose. This greatly limits the selection of tools that can be used for this purpose.

Finally, there is the problem that the data sets to be used for enriching the addresses may be available in different formats. The target data source for enrichment could be, for example, an address list stored in an Excel file. And the data source could be a text file. Or it could be data from a database managed by a database server such as MySQL or SQL Server. Of course, before the actual enrichment, you could first ensure that both data sets are in the same format by exporting and/or converting them accordingly. However, it is much easier to skip this step and process the data directly in the format in which it is available.

DataQualityTools, among others, can do all this and much more. To enrich address data, proceed as follows:

If you have not already done so, download DataQualityTools free of charge here. Install the program and request a trial activation. Then you can work with the program for one whole week without any restrictions.
The function we need to enrich the address list can be found in the menu in the "Compare two tables" block. Select the "Compare by postal address" function for address enrichment:
After starting this function, the project management appears. Here, you create a new project with a project name of your choice and then click on the 'Next' button.
In the next step, use the "Open file" button to open the file containing the address list to be processed.

Excel, Access, dBase, CSV and text files are possible here.

For database servers (MS SQL Server, MySQL, MariaDB, Oracle, Azure SQL or PostgreSQL), the corresponding database server must first be selected from the selection list under "Format / Access to". Then enter the name of the database server. After clicking on the "Connect to serve" button, enter the access data. Finally, select the desired database and table from the corresponding selection lists.
Next, you must tell the programme which column of the table contains which information, for example, which column contains the street name or the name of the town. To do this, select the data field from the drop-down list of column names that best matches the label to its left:

The programme automatically pre-assigns these fields based on the column names. Since we want to search for duplicates using the postal address, we must also specify the columns in the table to be processed in which this information is stored for all components of the postal address. The result of the field assignment can be checked using the "Check field assignment" function on the right-hand side of the screen.
To specify the second table, click on the "Next" button. The dialogue box that appears looks exactly like the previous one and is used in exactly the same way. Specify the second table here and assign the fields in the same way as for the first table.
Clicking on the "Next" button takes us to the dialogue box where the actual function is configured. First and foremost, the threshold value for the maximum permissible deviation between two addresses must be specified here:

Furthermore, individual components of the postal address can be excluded from the comparison. Of course, for each component of the postal address to be compared, a column from the two tables must also be specified in the field assignment made in the two previous steps.

Note: The two tables between which the search for duplicates is to be carried out can be structured in very different ways. For example, in one table, the house number could be in the same column as the street, and in the other table, this information could be in two separate columns. The only important thing is that all three columns are specified in the field assignment for these two tables.
The table we specified to the programme first is normally the table in which the addresses from the second table are searched for.

By default, data from this table is transferred to the second table during enrichment. However, this can be changed. Firstly, the direction in which the comparison is to be made can be changed at this point in the programme. If the direction is changed, the data records from the second table are no longer searched for in the first table, but rather the data records from the first table are searched for in the second. In addition, when configuring the address enrichment, you can also specify later which table the data should be transferred from to the other table.
Clicking on the "Nex" button then starts the search for duplicates. It does not take long and a summary of the results is displayed.

If the programme has found duplicates between the tables to be processed, clicking on the "OK" button opens a dialogue box with functions that can be used to further process the results. Otherwise, the threshold value for the degree of match should be set lower and the comparison restarted.
In the dialogue box with the functions that can be used to further process the result, there is a button at the top labelled "Manual post-processing". Here, the result of the comparison is displayed in the form of a table. The data records that are to be deleted are marked with a red cross, which can be deleted if necessary.
Finally, the result must be further processed. As mentioned above, we want to enrich address data, i.e. transfer certain information from one database to another based on the comparison result. To do this, we select the appropriate function by first clicking on "Enrichment functions":

And then on "Enrich in the source table":
In our case, the table into which the data is written (= target table) is the table "sample1.xls":

The data from the data record without a deletion mark is transferred to the data record with a deletion mark (see table with the result as described in point 10):

To enrich the addresses, we now only need to tell the programme which information to write where. To do this, we add one or more column pairs for this purpose at this point:

For example, the information from the "Tel" column in the source table could be written to the "Result_A" data field in the target table and the information from the "Email" column to the "Result_B" data field.

In both cases, the content of the target data field should be overwritten with the information from the source table if necessary (= "Action"):
Optionally, information can be stored with the data records enriched with additional information that provides details about where this data originates. This information is required in order to fully comply with the right to information in the case of personal data and thus to meet the requirements of various data protection laws, such as the GDPR (General Data Protection Regulation).

This means that all the information required for address enrichment is now available. Click on the "Process data" button to start the process.

Note: There is a video tutorial available for DataQualityTools, introducing the user to the operation of the program based on the example of a duplicate detection within one table.