Robert Dempsey is that cool data-driven developer/marketer/explorer that everyone (or almost-everyone) wants to be. In part two of the talk I had with him last September, he had lots of words of advice for people getting started in data science.
Robert Dempsey (RD) is that guy who loves getting dirty in code, and looking for solutions constantly, but who also enjoys explaining what he does to business people and believes that everything data should be accessible to everyone. He got into data science from the business end by working on Analytics-driven Marketing in his software development company.
Now he’s a consultant for ARPC and is constantly looking for ways to make his team more collaborative and efficient. I was very lucky to talk to him about his work process in a previous interview. He discussed how his team works on data science projects and how Data Science Studio helps collaborate on these projects by getting various standalone languages and technologies to work together. Today, he has things to say to aspiring data scientists out there.
RD: A lot of the technology is becoming a lot easier to set up and more powerful these days. Take Spark for example. I can run Spark, and I have, on a little cluster at my house with 3 different laptops and a Raspberry Pi. It’s becoming so much easier to set up these systems and to do these things. That’s also why more people are doing them!
However, for people starting out it’s still difficult to know where to start because there are so many resources. To make it simple, there are two languages leading the day: it’s either Python or R. You do have to pick one and people argue both. I’m a Python guy all the way and I’ve seen R but I’m pretty meh. Some people say you must know R if you’re going to be doing data science, but I’ve always gotten along just fine with Python.
Learning one of those languages is definitely the way to go. You do have to learn one before you’re able to do anything solid and put custom work into production. Outside of that, I would say stay away from them as much as humanly possible. Keep within the tools that make the work much easier, tools like Data Science Studio.
Luckily there's a trend in data science today and companies like Dataiku are creating tools that make data science a lot easier and more approachable for beginners. That was one of the things that led me to Data Science Studio.
For a long time, to do data science you either had to be good at Python development, or good at R, or have a degree in statistics. A lot of people out there are doing data sciency things at their jobs or trying to learn to do data analysis, and they have no time to go back and learn statistics, or don’t even care to. Our business users for example are not going to go back to school to learn stats. They need to use these new tools that make certain aspects of data science much more approachable and not quite so mystical.
In DSS I can do all of my data wrangling in a visual format, without any code. The Studio makes it really easy to apply work that I’ve created on one file to another file with the same file format. I can do all sorts of things with DSS that, if I had to do them in code, would be an incredible amount of quite annoying work and very time consuming.
RD: Sure. For example, I designed my first predictive algorithm with the help of DSS. I wanted to do a scoring system. I had lots of training data that I could use a supervised training method on, so I thought I’d try and figure out how to do predictive modeling.
I’m already a pretty good developer but I was not going to go back to school to learn stats. So I did my research. What I found was that I spent HOURS and hours (read weeks) reading blog posts and books before I even got to writing my algorithm. And then when I tested it I was getting some weird results. This was the first model I’d ever created, so I was thinking I might be doing it right but maybe not.
I fed all my data to DSS and looked at what it spat out and noticed it was ignoring part of my data. I then updated my model from what DSS had done, tested it on additional data, and then put that into production. I’ve been using the predictive model ever since, so that’s worked out really well!
When I showed the people I work with the modeling and what DSS had done, they said, "How does DSS even do that?" I told them, "Well, they have a team of data science people who know what they’re doing, so they built all of this for us!"
I ultimately learned what that was called (factor analysis), but DSS did that before I even knew what it was, so I didn’t have to do it myself. It makes the whole process a lot easier.
Of course I’m a developer and I wanted to put it into production, so I couldn’t do everything in Data Science Studio; I had to get my hands dirty. Thanks to the Studio I did enough to figure out what I was doing wrong and then implement that in the production system. It’s a great learning tool!
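Editor's illustration: the sketch below shows, in very rough terms, what a supervised scoring model of the kind Robert describes can look like. It is not his production model and not what DSS does internally. The data is made up, the feature names (`engagement`, `noise`) are hypothetical, and the logistic regression is hand-rolled with the Python standard library only so that the example runs anywhere. In practice a library such as scikit-learn would normally do this work.

```python
import math
import random


def sigmoid(z):
    """Map any real number to a score between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_scoring_model(rows, labels, epochs=300, learning_rate=0.5):
    """Fit logistic-regression weights with plain batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(rows, labels):
            prediction = sigmoid(bias + sum(w * x for w, x in zip(weights, features)))
            error = prediction - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        # Average the gradients over the training set, then take one step.
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias


def score(weights, bias, features):
    """Return a 0-to-1 score for one example."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


# Made-up training data: the first feature decides the label,
# the second feature is pure noise (both names are hypothetical).
rng = random.Random(0)
rows, labels = [], []
for _ in range(200):
    engagement = rng.uniform(-1, 1)
    noise = rng.uniform(-1, 1)
    rows.append([engagement, noise])
    labels.append(1 if engagement > 0 else 0)

weights, bias = train_scoring_model(rows, labels)
high = score(weights, bias, [0.8, 0.0])
low = score(weights, bias, [-0.8, 0.0])
print("weights:", weights)
print("score for a strong example:", high)
print("score for a weak example:", low)
```

Comparing the two learned weights is a simple version of the check described above: the weight on the noise feature stays small next to the weight on the informative one, which is one way to see that a model is effectively ignoring part of its input.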
RD: I come from Marketing but I don’t really consider myself a ‘marketer’ per se, so I can say: you should watch out for the marketing people who think they’re doing a lot of analysis but aren’t. One of the things I tell people at my job: if their definition of data analysis is running a report, that is not data analysis. Apart from that, a lot of business people have been trying to do real data analytics with Excel for years. Excel isn’t a good tool for serious data analysis either. There are so many more advanced data science techniques. Making them more approachable and easier is definitely key.
For instance, take how I replaced our scoring method with a predictive model: you can use that for so many business problems. It is more advanced, but if you get a tool that can help figure that stuff out for you, then you don’t have to worry about getting it wrong. But always remember: garbage in, garbage out. The quality of the underlying data can make or break you.
In general, the learning curve in data science is extremely steep for someone who knows nothing about statistics. Creating my first predictive model, for example, took me weeks of work because there was no one place I could go to learn all the things I had to learn. You learn one thing, and then you think “ok, now I know Python, so I could look at scikit-learn”, and then “now I can look at modeling, but which way of doing modeling should I check?”, and “what does all that even mean?” And then all of a sudden it’s weeks later, and you haven’t gotten anything done because you’ve spent so much time reading and trying to figure things out. It seems that it should be so much easier, but today it’s not.
The problem is there’s no one resource where you can go, and find what you need to know to do just what you want to do. This is an issue because a lot of business people function just like me. They don’t wait until they know everything to start taking action. They learn just enough as they go. I don’t think I should need to be a stats wizard just to create a predictive model! It should be easier.
RD: I generally just believe things need to get better. It is difficult for an average person to learn all this stuff because when you do start looking it seems like there’s so much stuff to learn. Most people don’t have time to learn all the things. I don’t think they should have to get another degree just to do it either, because data analysis is so important. Data has been growing by leaps and bounds; people have been saying that for years. Today more and more people are finally realizing that they can learn so much from their data, and tools like Data Science Studio make that easier and more approachable.
Thank you for reading! If you'd like to contribute to our blog and write your own advice for data science beginners, or any other topic you like, don't hesitate to email me! If you enjoyed hearing what Robert Dempsey had to say, find him on Twitter, and stay tuned, Dataiku has more stuff coming up with him super soon!