Startup Genome helps forward-looking geographies catalyze their own startup ecosystems. Its challenge is sorting through the anecdotal information and dispersed data surrounding startups to produce precise reports from which policymakers can draw insights.
Challenges in Leveraging Data for Research
The data and analytics team at Startup Genome performs both primary and secondary data collection on the startup ecosystem. From that data, the team builds large collections of datasets and analyses, which researchers use to produce the annual Global Report as well as many deep-dive reports for specific clients.
Given the nature of their work, Startup Genome faces several unique challenges:
- Approximately 30% of the time, structured, readily available datasets don’t exist for the types of analyses they want to do, so the team spends considerable time hunting for potentially relevant datasets that could ultimately produce interesting analysis.
- When they do find data, it’s often incomplete. That means they have to run the data through a set of business rules to fill in the missing values. For example, the first step might be to manually hunt for any missing data, and the second might be to apply a standard estimate for whatever is still missing.
- When doing data analysis, the team at Startup Genome has to minimize bias and consider the context of their data in order to truly draw meaning from it. For example, they might need to determine whether there is a meaningful correlation between, say, engineering graduate data and the number of startups in a region.
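The two-step rule for missing data and the correlation check described above can be illustrated with a minimal Python sketch. This is not Startup Genome's actual method or a Dataiku API: the region names, figures, the `fill_missing` and `pearson` helpers, and the choice of the median as the "standard estimation" are all assumptions made for illustration.

```python
from math import sqrt
from statistics import median

# Hypothetical sample data: None marks a value missing from the source dataset.
records = [
    {"region": "A", "graduates": 1200, "startups": 150},
    {"region": "B", "graduates": None, "startups": 90},
    {"region": "C", "graduates": 400, "startups": 40},
    {"region": "D", "graduates": None, "startups": 70},
    {"region": "E", "graduates": 2000, "startups": 260},
]

# Step 1 input: values found by manual research (hypothetical).
manual_overrides = {"B": 800}


def fill_missing(rows, field, overrides):
    """Fill `field` in two passes: manual values first, then a median estimate."""
    filled = []
    for row in rows:
        row = dict(row)  # copy so the input rows stay untouched
        row["source"] = "reported"
        if row[field] is None and row["region"] in overrides:
            row[field] = overrides[row["region"]]
            row["source"] = "manual"
        filled.append(row)

    # Step 2: estimate whatever is still missing (assumed rule: the median
    # of all values known after the manual pass).
    estimate = median(r[field] for r in filled if r[field] is not None)
    for row in filled:
        if row[field] is None:
            row[field] = estimate
            row["source"] = "estimated"
    return filled


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


filled = fill_missing(records, "graduates", manual_overrides)
r = pearson([row["graduates"] for row in filled],
            [row["startups"] for row in filled])

for row in filled:
    print(row["region"], row["graduates"], row["source"])
print("correlation:", round(r, 3))
```

Recording a `source` for every value keeps estimated figures distinguishable from reported ones, which is one simple way to keep the bias introduced by estimation visible when interpreting the correlation.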
What Dataiku Brings to the Table
Startup Genome uses Dataiku as its centralized system for all database and analytics needs (data governance, data blending, manipulation & feature engineering, and predictive model creation).
Dataiku ensures that everyone works in one place, with no data floating on local machines. Keeping everything in the same tool also ensures consistency and quality of analysis. Thanks to the data preparation features in Dataiku, the team at Startup Genome can use visual analysis for about 70% of their work, limiting coding to only about 30%, which speeds up analysis. With Dataiku, Startup Genome follows a standard data pipeline and can iterate quickly, reducing the time that iterations on data analysis take by an estimated 40-50%.