Too Big for Excel? A Fear from The Past

Data Preparation|Technology|Data Science|Data| April 26, 2016| Robert Dempsey

Big data. The term means many things to many people. The best definition I've heard is data of the size that won't fit on your laptop. With 1 terabyte hard drives available by "fit" I mean it's too big to process on your laptop. A top-of-the-line MacBook pro packing a quad-core processor and 16GB of RAM can analyze a lot of data, however it's easy to surpass even those mighty specs. For instance, if you want to download the Airline On-Time Performance data, you'll need 12GB. Want the Reddit Comment Corpus? You'll need 250GB just to store the compressed data. Want to uncompress it? You had better purchase an external hard drive. And analyzing it? You're going to need some power to do that.

So What About Excel?

A number of years ago datasets could easily be opened and worked with in Excel. There are a number of powerful features in Excel that allow you to clean, standardize and munge data. However even the mighty spreadsheet has it's limitations. But what are they? Sure, a 100 megabyte spreadsheet seems to take forever to open even on a Windows machine packed with memory, but what do we mean by "limitations"?

Since Excel 2007, the following have been the max capabilities of the application*:

  • Worksheet size: 1,048,576 rows by 16,384 columns
  • Number of worksheets: limited by available memory
  • Column width: 255 characters
  • Number of characters in a single cell: 32,767
  • Number of characters in the header or footer: 255

What about the PowerPivot Excel add-in you ask? That supports files up to 2GB in size and enables you to work with up to 4GB of data in memory*. That's pretty good, but not good enough to handle even one of today's small datasets.

So what do we do if our dataset is larger than what Excel can handle?

Let's Replace Excel and Keep the Familiarity

data problem solving

The bottom line is this - Excel is very easy to use, which is one reason so many businesses use it. But as we've seen it has limitations. Many of today's datasets simply won't work in it.

Enter Dataiku DSS.

One of the things I love about Dataiku DSS, or DSS for short, is how easy it is to view and work with my data, whether it's in a CSV file, a SQL database, or Hadoop cluster.

Even better than seeing my data in front of me, as I write my recipes to prepare my data for analysis, I can see how the transformations I'll apply will affect the data, all without changing the underlying data!

data preparation in dataiku dss home address before

data preparation in dataiku dss home address after

How many times have you made a major update to a spreadsheet, saved and closed it, and then realized you didn't want to save the update, but it was already too late?

I'll admit, I've done it. With DSS we don't have to worry about that.

What About My Macros?

More advanced Excel users have created macros with Excel VBA that help automate parts of the spreadsheet. I've seen macros that will:

  • Manipulate data and text
  • Apply style and formatting
  • Pull data from various data sources (databases, text files, etc.).
  • Create entirely new spreadsheets
  • Do all of the above in any given order

In Dataiku DSS we have steps, which we can chain together to create recipes.

example of a recipe in dataiku dss

A data cleaning recipe for a full address might be composed of the following steps:

  1. Trim the whitespace from both ends of the string
  2. Extract the zip code
  3. Extract the state
  4. Standardize the formatting of the state
  5. Extract the city
  6. Extract the apartment or suite number
  7. Split the address into the house number and street name

Going back to what I said earlier about saving changes you don't want, if you don't want to apply a particular step you simply turn it off.

task turned on in dataiku dss

task turned off in dataiku dss

No more try and undo!

As Easy as Excel with a Lot More Power

Excel has it's merits and it's place in the data science toolbox. For many companies it's the go-to tool for working with small, clean datasets.

When you're working with data that's big or messy or both, and you need a familiar way to clean it up and analyze it, that's where Dataiku DSS comes in.

Download Dataiku DSS to Get More From Your Data

Download Dataiku DSS

Receive success story

Please fill out the form below to receive the success story by email:

Contact us

How can we come back to you ?