Big data. The term means many things to many people. The best definition I've heard is data of a size that won't fit on your laptop. With 1-terabyte hard drives readily available, storage isn't really the problem; by "fit" I mean it's too big to process on your laptop. A top-of-the-line MacBook Pro packing a quad-core processor and 16GB of RAM can analyze a lot of data, but it's easy to surpass even those mighty specs. For instance, if you want to download the Airline On-Time Performance data, you'll need 12GB. Want the Reddit Comment Corpus? You'll need 250GB just to store the compressed data. Want to uncompress it? You had better purchase an external hard drive. And analyzing it? You're going to need some serious power.
A number of years ago, most datasets could easily be opened and worked with in Excel. Excel has a number of powerful features that let you clean, standardize, and munge data. However, even the mighty spreadsheet has its limitations. But what are they, exactly? Sure, a 100-megabyte spreadsheet seems to take forever to open even on a Windows machine packed with memory, but what do we mean by "limitations"?
Since Excel 2007, the application's maximum capacity has been 1,048,576 rows by 16,384 columns per worksheet*.
What about the PowerPivot Excel add-in, you ask? That supports files up to 2GB in size and lets you work with up to 4GB of data in memory*. That's pretty good, but still not enough to handle even the smaller of today's big datasets.
So what do we do if our dataset is larger than what Excel can handle?
The bottom line is this: Excel is very easy to use, which is one reason so many businesses rely on it. But as we've seen, it has limitations, and many of today's datasets simply won't work in it.
Enter Dataiku DSS.
Even better than seeing my data in front of me: as I write recipes to prepare my data for analysis, I can see how each transformation will affect the data, all without changing the underlying data!
How many times have you made a major update to a spreadsheet, saved and closed it, and then realized you didn't want to save the update, but it was already too late?
I'll admit, I've done it. With DSS we don't have to worry about that.
More advanced Excel users have created VBA macros to automate parts of a spreadsheet. I've seen macros that will:
In Dataiku DSS we have steps, which we can chain together to create recipes.
A data cleaning recipe for a full address might be composed of the following steps:
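To make the idea of chained steps concrete, here is a minimal Python sketch of such a recipe. In DSS you would build these steps visually in a Prepare recipe rather than write code; the field layout and function names here are hypothetical, assuming a US-style "street, city, state ZIP" input.

```python
import re

# Hypothetical cleaning steps, chained like recipe steps in DSS.
def trim_whitespace(addr):
    # Collapse runs of whitespace into single spaces.
    return " ".join(addr.split())

def extract_zip(addr):
    # Pull a 5-digit US ZIP code (optionally ZIP+4) off the end, if present.
    m = re.search(r"\b(\d{5})(?:-\d{4})?$", addr)
    return (addr[: m.start()].rstrip(" ,"), m.group(1)) if m else (addr, None)

def split_city_state(addr):
    # Assume a "street, city, ST" layout once the ZIP is removed.
    parts = [p.strip() for p in addr.split(",")]
    if len(parts) >= 3:
        return parts[0], parts[1], parts[2].upper()
    return addr, None, None

def clean_full_address(raw):
    # Each step feeds the next, just as recipe steps run in order.
    addr = trim_whitespace(raw)
    addr, zip_code = extract_zip(addr)
    street, city, state = split_city_state(addr)
    return {"street": street, "city": city, "state": state, "zip": zip_code}

record = clean_full_address("  123 Main St,  Springfield,  il  62704 ")
# record -> {'street': '123 Main St', 'city': 'Springfield',
#            'state': 'IL', 'zip': '62704'}
```

Because each step is a separate function, any one of them can be skipped or reordered without touching the others, which is exactly the flexibility DSS steps give you.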
Going back to what I said earlier about saving changes you didn't want: if you don't want to apply a particular step, you simply turn it off.
No more try and undo!
Excel has its merits and its place in the data science toolbox. For many companies it's the go-to tool for working with small, clean datasets.
When you're working with data that's big or messy or both, and you need a familiar way to clean it up and analyze it, that's where Dataiku DSS comes in.