Building a Data Pipeline to Clean Dirty Data

Technology | Data Preparation | February 11, 2016 | Robert

A data pipeline is a series of steps that your data moves through. The output of one step becomes the input of the next. Raw data goes in one side, passes through a series of steps, and pops out the other end ready for use or already analyzed.

pipeline

The steps of a data pipeline can include cleaning, transforming, merging, modeling and more, in any combination.
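In code terms, that chaining is simple: each step is a function whose output feeds the next. Here's a minimal sketch — the step names and records are illustrative, not part of the DSS project:

```python
def clean(records):
    # Drop records that are missing a name entirely.
    return [r for r in records if r.get("name")]

def transform(records):
    # Normalize names to title case.
    return [{**r, "name": r["name"].title()} for r in records]

def run_pipeline(records, steps):
    # Feed the output of each step into the next.
    for step in steps:
        records = step(records)
    return records

raw = [{"name": "jane doe"}, {"name": None}]
result = run_pipeline(raw, [clean, transform])
print(result)  # [{'name': 'Jane Doe'}]
```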

Is this voodoo magic? No. However, data pipelines can get highly complex. Let's simplify things and see how you can build a data pipeline in Data Science Studio to help prepare a dataset for analysis.

Common Problems We're Going to Solve

This isn't a comprehensive list of the problems I've encountered in datasets I've received, but these are the most common:

  • Missing data
  • Multiple fields in a single column
  • Non-unique column headers
  • Non-standardized data: column headers, names, dates
  • Extra whitespace around text
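To make these concrete, here are a few hypothetical rows exhibiting some of the problems above, plus a quick way to spot them in plain Python (the field names and values are made up for illustration, not taken from the post's dataset):

```python
# Hypothetical rows exhibiting the problems above (not the real dataset).
dirty_rows = [
    {"full name": "  jane doe ", "zipcode": "02139"},      # extra whitespace, non-standard case
    {"full name": "John Smith", "zipcode": None},          # missing data
    {"full name": "Ann Lee, Apt 4B", "zipcode": "10001"},  # two fields in one column
]

missing_zip = [r for r in dirty_rows if not r["zipcode"]]
padded_names = [r for r in dirty_rows if r["full name"] != r["full name"].strip()]
print(len(missing_zip), len(padded_names))  # 1 1
```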

Let's see how we can clean these up.

Pipeline Overview

Here's what we're going to build using Data Science Studio:

dss data cleaning pipeline overview

For this example I've created a fake dataset containing 10,000 records, made to mimic a partial database dump of patient information for a healthcare provider. You can get the code to generate a similar dataset and the DSS project I created on GitHub.

Alright. Let's clean up this data!

Requirements

In order to follow along, you'll need a few things:

  1. Python >= 3.5.1
  2. Java >= 1.8.0_74
  3. Data Science Studio >= 2.3

You only need Python if you want to run the Jupyter Notebook to create your own fake dataset. The project on GitHub already has all the data ready for you.
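If you'd rather not run the notebook, a minimal stand-in generator using only the standard library might look like the following. The field names and value formats here are assumptions for illustration, not the project's actual schema:

```python
import random

random.seed(42)  # reproducible runs

FIRST_NAMES = ["Jane", "John", "Ann", "Maria"]
LAST_NAMES = ["Doe", "Smith", "Lee", "Garcia"]

def fake_phone():
    # Deliberately mix formats, mimicking the dirty phone column in the post.
    n = random.randint(0, 9999)
    return random.choice([
        f"555-010-{n:04d}",
        f"(555) 010-{n:04d}",
        f"555.010.{n:04d} x{random.randint(1, 99)}",
    ])

def fake_record():
    return {
        "full name": f"{random.choice(FIRST_NAMES)} {random.choice(LAST_NAMES)}",
        "zipcode": f"{random.randint(0, 99999):05d}",  # keep as text: leading zeros matter
        "home phone": fake_phone(),
    }

records = [fake_record() for _ in range(10000)]
print(len(records))  # 10000
```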

Pipeline Build-out

First, create a new dataset and view the data. During this step we aren't going to do any manipulation of the column names, only import and preview the dataset.

dss initial data import

Next, create a new recipe to split the full name into first and last names, and rename the columns. For the zipcode field, change its meaning to Text, since we aren't going to do math on this field. The output of this step is a new dataset with the cleaned names.
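Outside of DSS, the same split can be sketched in a few lines of Python. This assumes a simple "First Last" layout — real names are messier — and it also shows why the zipcode should stay text:

```python
# Split a full name into first and last; assumes a simple "First Last" layout.
def split_name(full_name):
    first, _, last = full_name.strip().partition(" ")
    return first, last

print(split_name("Jane Doe"))  # ('Jane', 'Doe')

# Why zipcode should stay text: treating it as a number drops leading zeros.
print(int("02139"))  # 2139
```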

dss prepared data cleaned names

After that, clean up the home address fields. Do this by creating a new recipe for the cleaned names dataset and:

  1. Trim the address
  2. Use regex to extract the apartment or suite number
  3. Rename all of the address columns
  4. Replace the apartment or suite number in the original address field so that only the line 1 portion remains
  5. Rename the rest of the fields
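In plain Python, steps 1, 2, and 4 of that recipe might look like this. The regex and the output column names are my assumptions; the recipe's actual patterns may differ:

```python
import re

# Hypothetical apartment/suite pattern; the recipe's actual regex may differ.
APT_RE = re.compile(r"\s*(?:Apt\.?|Suite|Ste\.?|Unit)\s*#?\s*(\w+)\s*$", re.IGNORECASE)

def clean_address(raw):
    addr = raw.strip()                    # step 1: trim the address
    match = APT_RE.search(addr)           # step 2: extract the apt/suite number
    unit = match.group(1) if match else None
    line1 = APT_RE.sub("", addr)          # step 4: leave only the line 1 portion
    return {"address_line_1": line1, "address_line_2": unit}

print(clean_address("  12 Main St Apt 4B "))
# {'address_line_1': '12 Main St', 'address_line_2': '4B'}
```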

dss prepared data cleaned address

Now turn your attention to the first phone column. This column has phone numbers in many formats. Extracting the parts of each format is beyond the scope of this post. For now, let's simply rename this column to home_phone and remove what appears to be an erroneous extension. Do this by creating a preparation recipe that renames the column and uses a regex to find and replace an extension.
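The find-and-replace in that recipe is essentially the following. The extension pattern is an assumption based on the common `x123` / `ext. 123` styles:

```python
import re

# Assumed extension pattern: a trailing "x123" or "ext. 123".
EXT_RE = re.compile(r"\s*(?:x|ext\.?)\s*\d+\s*$", re.IGNORECASE)

def drop_extension(phone):
    # Remove an erroneous trailing extension, leaving the number itself.
    return EXT_RE.sub("", phone)

print(drop_extension("(555) 010-0200 x12"))  # (555) 010-0200
```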

dss prepared data cleaned home phone

Great work so far! We're almost finished cleaning this dataset, so let's keep going.

With all of the home information cleaned we can now finish by cleaning up the work information and the account creation date.

The fields for the work data are, perhaps not surprisingly, similar to the home information fields. This allows us to leverage the preparation scripts we've already created. Let's first clean up the work address.

View your current flow diagram and double-click on the prepare_dss_dde_cleaned_address script. Under the actions menu in the top right, select Copy. For the input dataset, select dss_dde_cleaned_home_phone, and for the output dataset create a new output dataset named dss_dde_cleaned_work_address.

dss create cleaned work address

Now we have a complete copy of all the work we did for the home address. All that is left is to update all of the column names to apply our transformations to the work columns.
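The renames follow a mechanical pattern — each home_* column becomes its work_* counterpart — so in code form they'd be a simple mapping (the column names here are illustrative):

```python
# home_* -> work_* renames are mechanical; column names are illustrative.
home_cols = ["home_address_line_1", "home_address_line_2", "home_city", "home_state"]
rename_map = {c: c.replace("home_", "work_", 1) for c in home_cols}
print(rename_map["home_city"])  # work_city
```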

dss prepared data cleaned work address

To clean up the work phone, do the same thing, but this time use the dataset you just created as the input and create a new preparation script that operates on the prepare_dss_dde_cleaned_work_address dataset. In that recipe, we'll do the following:

  1. Rename the column from 'work phone' to 'work_phone'
  2. Extract the extension using regex, and put it into its own column
  3. Replace the found extension in the main phone number
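The three steps above can be sketched in Python like so (the extension regex is an assumption based on common formats):

```python
import re

# Assumed extension pattern; captures the digits so they can become a column.
EXT_RE = re.compile(r"\s*(?:x|ext\.?)\s*(\d+)\s*$", re.IGNORECASE)

def clean_work_phone(raw):
    # Steps 2 and 3: pull the extension into its own field,
    # then remove it from the main number.
    match = EXT_RE.search(raw)
    ext = match.group(1) if match else None
    return {"work_phone": EXT_RE.sub("", raw), "work_phone_ext": ext}

print(clean_work_phone("555-0100 ext. 42"))
# {'work_phone': '555-0100', 'work_phone_ext': '42'}
```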

Only one more step to go! Let's clean up the account created on field.

Do that by going back to the flow view, selecting the dss_dde_cleaned_work_phone dataset, and clicking the Prepare icon in the right-side menu. Name this final, cleaned dataset account_data.

Once you're on the Script page, scroll over so you can see the account created on field. Click on the column header and, from the drop-down, click Parse date... in the Script section. DSS will then automatically determine the date formats.

dss parse date

Next, add a few more formats to pick up all of the date formats used in the column, and specify the output column. Specifying the output column also standardizes the column name without needing an additional step.
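Outside of DSS, trying several formats in turn looks like this. The format list is an assumption about what appears in the column; swap in whatever formats your data actually uses:

```python
from datetime import datetime

# Assumed formats appearing in the column; DSS detects these automatically,
# but in plain Python you try each one in turn.
FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %b %Y"]

def parse_date(text):
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None  # leave unparseable values empty rather than guessing

print(parse_date("02/11/2016").date())  # 2016-02-11
print(parse_date("11 Feb 2016").date())  # 2016-02-11
```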

dss data parsing formats

Last but not least we format the date to make it readable to a human.
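In Python terms, this final formatting step is a single strftime call. The output format here is one readable choice among many:

```python
from datetime import datetime

# One possible human-readable output format.
d = datetime(2016, 2, 11)
print(d.strftime("%B %d, %Y"))  # February 11, 2016
```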

dss date formatting

And we're done!

Summary

Phew! It took me longer to write this post than to perform the work. That's because Data Science Studio makes it easy to create data pipelines, especially for preparing data.

What we've created here is a very simple example of what can be accomplished. I suggest you take this a step further and cluster the data by attributes such as city or state. If you had payment information, you could combine the two datasets, which could later be used to create a predictive model.

There's also a lot of cool visualization stuff you could do as well.

Your Turn

Have you used Data Science Studio to create a data pipeline? If so, what for? Let us know below.

If you haven't, try the free version of Data Science Studio today!

try data science studio
