howto

Prepare recipe - Advanced joins

June 27, 2017

There are two powerful types of joins in the Prepare recipe that do not operate on exact keys.

Fuzzy join

The Fuzzy join processor performs a join where the values of the join key do not need to match precisely, but are “close” according to the Damerau-Levenshtein distance.

For example, you may want to enrich a customer database with information on the economies of the countries where your customers live. Using the country names as the key in a traditional join may be difficult, because the country names in one dataset may not exactly match the country names in the other, either because of typos or differences in use of the official or common name of a country. A fuzzy join can help you to successfully merge this data.

With Damerau-Levenshtein distance of 1, United States will match with the typo Untied States, but a distance of 2 is required to match more typos like Untied Stats, and a greater distance to match United States of America. Of course, the difficulty at that point is that increasing the Damerau-Levenshtein distance to match United States and United States of America can also create a match with United Kingdom, so you may need to recode values.

Geo-join

The Geo-join processor performs a geographic nearest-neighbour join between two datasets with geo coordinates.

For example, you may have determined the locations of customers from their IP addresses. You can enrich this data with information about the closest airport to each customer by using a Geo-join that finds the airport with the shortest distance based on geographic location.

The latitude and longitude of the locations to be joined are used as the keys, and must be available in both datasets. If you have a geopoint, you can use the Extract lat/lon from GeoPoint processor. You can select which columns from the other dataset should be copied to this one, optionally including the distance between the joined locations.