Spark and stream processing are the words on everybody's lips these days in the Big Data engineering world. Here are a few words of wisdom from somebody who actually works with Spark every day: the great Helena Edelson! We caught up with her at Spark Summit in Amsterdam last month to hear what she really thinks of Spark.
Since the exciting release of Data Science Studio 2.1, which integrated Spark, we’ve been enjoying discovering the Spark community and being part of that dynamic ecosystem. With the release of 2.2 and the Prediction API server, we've gotten into the business of stream processing as well, and we're joining yet another dynamic community!
We could particularly sense this when we attended Spark Summit for the first time in Amsterdam at the end of October. We got to meet so many people who were passionate about the technology and eager to discuss their diverse use cases. I was particularly excited to speak with Helena Edelson, VP of Product Engineering at Tuplejump and a speaker at Spark Summit. We, of course, talked about Spark; here is what she had to say!
HE: I recently joined Tuplejump as their VP of Product Engineering, having been a cloud engineer and worked in big data for a long time. Tuplejump has two parts: we do international consulting for companies, and we have a platform and services supporting big data blending and fast analytics. We do sophisticated data collection and blending, combining machine learning and analytics to understand the analyst's intention and provide a unified view of your data from multiple sources and locations, both streaming and non-streaming, for fast, easy, advanced data analysis.
In this way anyone, anytime, can feed many disparate data sources into the system easily and start deriving meaning from their data. We present a holistic view of all of your data and then based on your queries we can do the rest of the work of engineering and data science for you via the platform, in real time. And right now we’re using Spark for a lot of this.
HE: Well, after a decade or so in distributed messaging engineering, and then many more years as a senior cloud engineer doing large-scale cloud applications and infrastructure automation, I accepted a role on the cloud engineering team of a big data cyber security company. I was really interested in working with the data itself rather than just the applications moving it around. So I started working on a new project doing big data analysis.
I added a new Scala layer to our Hadoop-based analytics system and automated it end to end. I had always worked in completely asynchronous, event-driven environments, so this batch-and-schedule world seemed odd to me. After this I started a new project, still big data analytics but with streaming, and I didn’t want to use Hadoop: just Spark over Cassandra with Akka. We also leveraged Elasticsearch, but that is no longer needed now that columnar storage gives us fast querying.
HE: I liked the fact that it was built in Scala and used Akka, and it still is, even though I know they’re removing Akka. I’ve been using Scala and Akka in production for over six years. I also like that Spark is very intuitive for a Scala engineer: working with its data collections is very similar to working with Scala’s. It’s also very intuitive after using Scalding, a Scala batch analytics framework built on Cascading. And I love that, as an engineer rather than a trained data scientist, Spark lets me take analytics requirements and implement them very easily.
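As a hypothetical illustration of that similarity (the words and values here are invented for the example): the same transformation chain reads almost identically on a plain Scala collection and on a Spark RDD. The RDD version is shown only as a comment, since it assumes a running `SparkContext`.

```scala
// Filter, map, and aggregate on a plain Scala collection.
val words = List("spark", "akka", "scala", "cassandra")
val totalLongChars = words
  .filter(_.length > 4) // keep words longer than 4 characters
  .map(_.length)        // word -> length
  .sum                  // aggregate
println(totalLongChars)

// The same chain on a Spark RDD looks nearly identical
// (assumes an existing SparkContext `sc`; not run here):
// val totalLongChars = sc.parallelize(words)
//   .filter(_.length > 4)
//   .map(_.length)
//   .sum()
```

This is part of why a Scala engineer can pick up Spark quickly: the distributed API deliberately mirrors the standard collections API.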
That’s one thing that I’ve found interesting here at Spark Summit. I met a data scientist who was also a speaker, and coming from the data science side, she’s been the one introducing Spark to the engineers, whereas in my experience it was the engineer introducing Spark. So it works both ways, and I think that’s really interesting for a product. It’s very accessible from both sides.
And the other really great thing about Spark is that I can integrate my streaming and my batch computations very easily, in the same application. That’s very useful. I can even replace my batch infrastructure altogether and do everything in my streaming layer, removing the need for ETL completely. That can save a company millions of dollars, for several reasons.
HE: That’s true. To answer your question, here’s a good example of what stream processing can be used for. Years ago I was working on scheduled Hadoop batch jobs. Some jobs were daily aggregations of different events, and some were more in-depth analyses on top of that. So you’re collecting a day’s worth of data, and every hour more data is being stored. By the time you actually run the computation, 99% of your data is completely stale. With streaming, you can constantly see what’s happening.
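The staleness point can be sketched in a few lines of plain Scala (not Spark API code; the event shape and numbers are invented for illustration): a batch job produces one answer only after the whole day has been collected, while a streaming fold keeps a current answer after every event.

```scala
// Hypothetical event: something counted at a given hour of the day.
case class Event(hour: Int, value: Long)

// Batch style: wait for the full day, then aggregate once.
def dailyBatchTotal(day: Seq[Event]): Long = day.map(_.value).sum

// Streaming style: fold events as they arrive, so an up-to-date
// total exists after every single event, not just at day's end.
def runningTotals(stream: Seq[Event]): Seq[Long] =
  stream.scanLeft(0L)((acc, e) => acc + e.value).tail

val day = Seq(Event(0, 10), Event(1, 20), Event(2, 5))
println(dailyBatchTotal(day))  // one answer, available only after the day ends
println(runningTotals(day))    // a fresh answer after each arriving event
```

In a real Spark Streaming job the fold would run over micro-batches of a live stream rather than an in-memory sequence, but the contrast is the same: the batch answer is hours old by the time it is computed, while the streaming answer is always current.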
When you think about it that way it’s very interesting. When you need to know immediately about particular anomalies so that you can react, or, with machine learning, when you want to predict in the stream that something will probably occur so that you can respond proactively, in domains like cyber security, it’s extremely relevant.
Everyone loves Spark.
HE: I’ve read a lot of complaints from people who were unhappy or just very confused by the error messages. I’ve been unfazed myself: I find that a lot of those messages are related to Akka, and having handled Akka error messages for years, they make sense to me. Doing error handling properly is difficult in any language and any product. I don’t see it as a Spark-specific problem, but I do understand where these people are coming from.
HE: It really depends. I generally work for technology providers, so we’re producing technology. But I have talked to a few people lately that have noticed that trend. There are definitely people trying to automate the role of the engineer more, to make it easier and more accessible.
HE: Right, and that’s something that we’re trying to make more seamless at Tuplejump. If you’re an analyst, we’re trying to make Tuplejump (the engineering and data science) completely transparent in your workflow. Whatever your favourite tool is, you can work on it and hopefully not even be aware of what Tuplejump is doing. We want to make it all seamless, intuitive and fast.
HE: We’re at Spark Summit right now, so my talk focuses on Spark streaming technologies, their use cases, and how to integrate them. But there are lots of streaming products today, like Apache Flink, and Gearpump is another new one. I’m also speaking next at QCon San Francisco, in a track on streaming at scale.
There are so many different use cases that call for different technologies. For instance, we all know Netflix does a lot of streaming, but not all of their streaming work is analytics-based. Since their streaming isn’t all set up specifically for data science, it doesn’t make sense to apply what they’ve done to just any business.
HE: What I can say is that everything is moving very fast. You can start working on a prototype using one technology and before you’ve even gotten to MVP, you hear about some new thing that allows you to do more or just differently. It’s all a lot to keep up with. Particularly when you’re a producer of technology. You’ve got to get your solution out to the market before everyone else!
HE: There are many choices out there today. As with anything in software, we have many solutions available for any given problem. You really have to consider what your use cases, requirements, and constraints are, and then be aware of the different facets of the technologies available. When you’re picking a particular stack, it’s also really important to think about how the different technologies work together.
Also keep in mind that there’s always some kind of give and take. You might have to compromise on a desired functionality with one technology because it really helps you in another area that has more importance for you. It’s really about knowing what you’re working with and what you really need. It’s not about what you heard X or Y is doing, or what technology everyone is talking about or what the well-known companies are using. You should really look into what you’re trying to do and what’s the best answer for it.
HE: Everyone is coming up with really great ideas for solving things very quickly. I’ve been involved in open source for a very long time, so I always appreciate how everyone is collaborative. I’ve been speaking at more and more conferences and you see people from all around the world, talking and sharing and working together. It’s truly great.
For more awesome interviews by Dataiku, you can check out Olivier Grisel's talk on scikit-learn and big data technologies, and our conversation with Robert Dempsey on data wrangling and teamwork.