
There is a tremendous number of Python packages devoted to all sorts of applications, from web development to data analysis and pretty much everything in between. We list here the packages we have found essential for data science.

There are six fundamental packages for data science in Python:

- numpy
- scipy
- matplotlib
- IPython
- pandas
- scikit-learn

If you have already done data science in Python, you probably already know them. If you are new to Python, and need a quick introduction to these packages, you can check out our Getting started with Python post. We also list there tutorials and useful resources to help you get started.

Scikit-learn is the main Python package for machine learning. It contains many unsupervised and supervised learning algorithms for discovering patterns in your data or building predictive models.

However, besides scikit-learn, there are several other packages for more advanced, specific applications. Packages like networkx for graph data, nltk for text data, or statsmodels for temporal data nicely complement scikit-learn, either for feature engineering or even for modeling.

Some packages also offer a different approach to data analysis and modeling, such as statsmodels for traditional statistical analysis, or PyMC for Bayesian inference.

Graph analytics is particularly useful for social network analysis: uncovering communities, finding central agents in the network.

Networkx is the most popular Python package for graph analytics. It contains many functions for generating, analyzing and drawing graphs.

However, networkx may not scale well to large graphs. For such graphs, you should also consider igraph (available in R and Python), graph-tool or GraphLab Create™.

This starter code illustrates how you can include networkx in your data processing flow. As a starting point, you will generally have a dataframe representing links in a network. A link could denote, for example, that the two users are friends on Facebook.

You first need to convert the links dataframe into a graph. You can then, for example, find the connected components of the graph, sorted by size. You can also restrict your analysis to a subgraph, for instance the largest connected component.

To find the most influential people in the network, you can explore several centrality measures, such as degree, betweenness, and pagerank. Finally, you can easily output the centrality measures in a dataframe, for further analyses.
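The steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the `links` dataframe, its column names `user_a` and `user_b`, and the user names are all made up for the example.

```python
import networkx as nx
import pandas as pd

# Hypothetical links dataframe: one row per friendship between two users.
links = pd.DataFrame({
    "user_a": ["ann", "bob", "ann", "cat", "eve"],
    "user_b": ["bob", "cat", "cat", "dan", "fay"],
})

# Convert the links dataframe into an undirected graph.
graph = nx.from_pandas_edgelist(links, source="user_a", target="user_b")

# Connected components, sorted by size (largest first).
components = sorted(nx.connected_components(graph), key=len, reverse=True)

# Restrict the analysis to the largest connected component.
largest = graph.subgraph(components[0]).copy()

# Collect centrality measures in a dataframe indexed by node.
centrality = pd.DataFrame({
    "degree": nx.degree_centrality(largest),
    "betweenness": nx.betweenness_centrality(largest),
}).sort_values("degree", ascending=False)
```

PageRank can be added as another column with `nx.pagerank(largest)`; it is left out here only because recent networkx releases appear to require SciPy for it.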

In many applications, predictions are affected by temporal factors: seasonality, an underlying trend, lags. The purpose of time-series analysis is to uncover such patterns in temporal data, and then build models upon them for forecasting.

Statsmodels is the main Python package for time-series analysis and forecasting. It integrates nicely with pandas time series. This package also contains many statistical tests, such as ANOVA or the t-test, used in traditional approaches to statistical data analysis.

In the code below, taken from the examples section of statsmodels, we fit an autoregressive model to the sunspots activity data, and use it for forecasting. We also plot the autocorrelation function, which reveals that values are correlated with past values.

Analyzing text is a difficult and broad task. The nltk package is a very complete package for that purpose. It implements many tools useful for natural language processing and modeling, such as tokenization, stemming, and parsing, to name a few.
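A small sketch of tokenization and stemming with nltk is given here. The sentence is made up, and `wordpunct_tokenize` and `PorterStemmer` are chosen only because they work without downloading extra nltk data; other tokenizers and stemmers may suit your text better.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

text = "The runners were running quickly through the parks."

# Split the sentence into word and punctuation tokens.
tokens = wordpunct_tokenize(text)

# Reduce each token to its stem, e.g. "running" becomes "run".
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]
```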

Finally, PyMC is a Python package devoted to Bayesian inference. This package allows you to easily construct, fit, and analyze your probabilistic models.

If you are not familiar with Bayesian inference, we recommend the excellent Probabilistic Programming and Bayesian Methods for Hackers by Cameron Davidson-Pilon. The book is entirely written as IPython notebooks, and contains lots of concrete examples using PyMC code.

Here’s one simple example, taken from the notebook. The purpose is to infer whether a user has changed their text-message behavior, based on a time series of text-message count data.

The first step in Bayesian inference is to propose a probabilistic model for the data. Here, for instance, it is assumed that the user changed their behavior at a time `tau`: before the event, they were sending messages at a rate `lambda_1`, and after the event, at a rate `lambda_2`. The Bayesian approach then lets you infer the whole probability distribution of `tau`, `lambda_1` or `lambda_2` given the observed data, not just a single estimate.

The probability distributions are usually obtained by Markov Chain Monte Carlo sampling, as done in the example code.
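To make the idea concrete without depending on a particular PyMC version, here is a hand-written Gibbs sampler (one flavor of Markov Chain Monte Carlo) in plain NumPy. It is a sketch under stated assumptions, not PyMC's API and not the notebook's code: the counts are simulated, both rates get an Exponential prior with rate `alpha`, and `tau` gets a uniform prior over the possible switch days.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated daily message counts (illustrative, not the book's data):
# 40 days at 15 messages/day, then 30 days at 25 messages/day.
counts = np.concatenate([rng.poisson(15, size=40), rng.poisson(25, size=30)])
n = len(counts)

alpha = 1.0 / counts.mean()      # rate of the Exponential prior on both lambdas
cumulative = np.cumsum(counts)   # cumulative[k - 1] is the sum of counts[:k]
total = cumulative[-1]
candidates = np.arange(1, n)     # switch days leaving both segments non-empty

n_samples, burn_in = 4000, 1000
tau = n // 2                     # arbitrary starting point
samples = np.empty((n_samples, 3))

for i in range(n_samples):
    before = cumulative[tau - 1]
    after = total - before
    # Exponential prior + Poisson counts give a Gamma conditional posterior.
    lambda_1 = rng.gamma(1 + before, 1.0 / (alpha + tau))
    lambda_2 = rng.gamma(1 + after, 1.0 / (alpha + n - tau))
    # Conditional posterior of tau: Poisson log-likelihood of each split.
    s1 = cumulative[candidates - 1]
    log_w = (s1 * np.log(lambda_1) - candidates * lambda_1
             + (total - s1) * np.log(lambda_2) - (n - candidates) * lambda_2)
    weights = np.exp(log_w - log_w.max())
    tau = int(rng.choice(candidates, p=weights / weights.sum()))
    samples[i] = tau, lambda_1, lambda_2

# Discard the burn-in; each column is a sample from a posterior distribution.
posterior = samples[burn_in:]
tau_mean, lambda_1_mean, lambda_2_mean = posterior.mean(axis=0)
```

The columns of `posterior` hold samples of `tau`, `lambda_1` and `lambda_2`, so histograms of each column show the full distributions rather than point estimates.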

Many websites contain a lot of interesting and useful data. Unfortunately, the data is rarely available in a nice tabular format to download! It is only displayed, scattered across the web page, or even spread over several pages.

Suppose you wish to retrieve data on the most popular movies of 2014, displayed on the IMDb site. Unfortunately, the movie title, its rating and so on are scattered across the web page. In Chrome, if you right-click on any element, such as the movie title, and select “Inspect element”, you will see which part of the HTML code it corresponds to.

The goal of web scraping is to systematically recover data displayed on websites. Several Python packages are useful to this end.

First, requests or urllib2 (urllib.request in Python 3) allows you to retrieve the HTML content of the pages.

You can then automate your browsing, and systematically fetch the related content. Then, BeautifulSoup or lxml allows you to efficiently parse the HTML content. If you understand the page structure, you can then easily get each datum displayed on the page.

In the code below for instance, we first parse the HTML content to get the list of all movies. Then for each movie, we retrieve its ranking number, title, outline, rating and genre.
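The parsing step can be sketched as follows with BeautifulSoup. The HTML snippet, its tag and class names, and the movie entries are all hypothetical stand-ins so that the example runs offline; a real page will be structured differently, and its content would typically come from something like `requests.get(url).text`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
html = """
<div class="movie">
  <span class="rank">1.</span>
  <a class="title">Example Movie A</a>
  <span class="rating">8.1</span>
  <span class="genre">Drama</span>
</div>
<div class="movie">
  <span class="rank">2.</span>
  <a class="title">Example Movie B</a>
  <span class="rating">7.4</span>
  <span class="genre">Comedy</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# First get the list of all movies, then extract the fields of each one.
movies = []
for item in soup.find_all("div", class_="movie"):
    rank_text = item.find("span", class_="rank").get_text(strip=True)
    movies.append({
        "rank": int(rank_text.rstrip(".")),
        "title": item.find("a", class_="title").get_text(strip=True),
        "rating": float(item.find("span", class_="rating").get_text(strip=True)),
        "genre": item.find("span", class_="genre").get_text(strip=True),
    })
```

The resulting list of dictionaries can be passed directly to `pandas.DataFrame(movies)` for further analysis.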

One of our data scientists used this kind of web scraping to build his **personalized movie recommender system**!

The standard plotting package in Python is matplotlib, which lets you make simple plots rather easily. Matplotlib is also a very flexible plotting library. You can use it to make arbitrarily complex plots and customize them at will.

However, using matplotlib can be frustrating at times, for two reasons. First, matplotlib's default aesthetics are not especially attractive, and you may end up doing a lot of manual tweaking to get awesome-looking plots. Second, matplotlib is not well suited for exploratory data analysis, when you want to quickly analyze your data across several dimensions. Your code will often end up being verbose and lengthy.

Fortunately, additional packages let you make better visualizations, and more easily.

First, there are the plotting capabilities of pandas, the data manipulation package. They greatly simplify exploratory data analysis, as you get visualizations straight from your dataframes.
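A minimal sketch of plotting straight from a dataframe, assuming matplotlib is installed. The dataframe and its columns `A`, `B` and `C` are made up for illustration, and the `Agg` backend is selected only so that the snippet also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: C depends linearly on A, B is independent noise.
df = pd.DataFrame({"A": rng.normal(size=200), "B": rng.normal(size=200)})
df["C"] = 2 * df["A"] + rng.normal(scale=0.5, size=200)

# One histogram per column.
hist_axes = df.hist(bins=20)

# Scatter plot of two columns.
scatter_ax = df.plot.scatter(x="A", y="C")

# One line per column, here for the cumulative sums.
line_ax = df.cumsum().plot()
```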

Second, seaborn is a great way to enhance the aesthetics of your matplotlib visualizations. Indeed, simply import seaborn and apply its default theme at the beginning of your notebook, and all your matplotlib plots will be much prettier! Seaborn also comes with better color palettes and utility functions for removing chartjunk. It also has a lot of very useful functions for exploratory data analysis, such as the clustermap, the pairplot, or the corrplot and lmplot.

Be sure to check out the gallery for many more examples.

Finally, there is the ggplot package, which is based on the R ggplot2 package. Built on the grammar of graphics, it allows you to build visualizations from a dataframe with a very clear syntax. You can, for instance, draw a scatter plot of A versus B and add a trend line.