More Python packages for Data Science

May 15, 2015

There is a tremendous number of Python packages, devoted to all sorts of applications: from web development to data analysis and pretty much everything in between. We list here the packages we have found essential for data science.

The basic stack

There are six fundamental packages for data science in Python:

  • NumPy: basic array manipulation
  • SciPy: scientific computing in Python, including signal processing and optimization
  • Matplotlib: visualization and plotting
  • IPython: write and run Python code interactively in a shell or a notebook
  • pandas: data manipulation
  • scikit-learn: machine learning

If you have done data science in Python, you probably already know them. If you are new to Python and need a quick introduction to these packages, check out our Getting started with Python post. We also list there tutorials and useful resources to help you get started.

Analytics: networkx, nltk, statsmodels, PyMC

Scikit-learn is the main Python package for machine learning. It contains many unsupervised and supervised learning algorithms for discovering patterns in your data or building predictive models.
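As a quick taste (a minimal sketch, not from the original post), here is how you could run an unsupervised algorithm, KMeans clustering, on the classic iris dataset shipped with scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

# the classic iris dataset: 150 flowers, 4 numeric features each
iris = load_iris()

# unsupervised learning: group the flowers into 3 clusters
km = KMeans(n_clusters=3, random_state=0).fit(iris.data)

# one cluster label per flower
labels = km.labels_
```

Supervised algorithms follow the same fit/predict pattern, which makes it easy to swap models in and out.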

However, besides scikit-learn, there are several other packages for more advanced, specific applications. Packages like networkx for graph data, nltk for text data, or statsmodels for temporal data nicely complement scikit-learn, either for feature engineering or even for modeling.

Some packages also offer a different approach to data analysis and modeling, such as statsmodels for traditional statistical analysis, or PyMC for Bayesian inference.

Graph analytics

Graph analytics is particularly useful for social network analysis: uncovering communities or finding central agents in the network.

Networkx is the most popular Python package for graph analytics. It contains many functions for generating, analyzing and drawing graphs.

However, networkx may not scale well to large graphs. For such graphs, you should also consider igraph (available in R and Python), graph-tool, or GraphLab Create™.

Sample code

This starter code illustrates how you can include networkx in your data processing flow. As a starting point, you will generally have a dataframe representing the links in a network. A link could denote, for example, that two users are friends on Facebook.

Links dataframe

You first need to convert the links dataframe into a graph. You can then, for example, find the connected components of the graph, sorted by size. You can also restrict your analysis to a subgraph, for instance the largest connected component.

To find the most influential people in the network, you can explore several centrality measures, such as degree, betweenness, and PageRank. Finally, you can easily output the centrality measures in a dataframe for further analysis.

import networkx as nx
import pandas as pd

# build graph from links dataframe
# ('user_1' and 'user_2' are the assumed column names of the links dataframe)
g = nx.from_edgelist(links[['user_1', 'user_2']].values)

# connected components sorted by size
cc = sorted(nx.connected_components(g), key=len, reverse=True)
print "number of connected components: ", len(cc)
print "size of largest connected component: ", len(cc[0])
print "size of second largest: ", len(cc[1])

# largest connected component
G = g.subgraph(cc[0])

# output centrality measures in a dataframe
centrality = pd.DataFrame({'user': G.nodes()})
centrality['degree'] = centrality['user'].map(dict(nx.degree(G)))
centrality['pagerank'] = centrality['user'].map(nx.pagerank(G))
centrality['betweenness'] = centrality['user'].map(nx.betweenness_centrality(G))

Time series analysis and forecasting

In many applications, predictions are affected by temporal factors: seasonality, an underlying trend, lags. The purpose of time-series analysis is to uncover such patterns in temporal data, and then build models upon them for forecasting.

Statsmodels is the main Python package for time series analysis and forecasting. It nicely integrates with pandas time series. This package also contains many statistical tests, such as ANOVA or the t-test, used in traditional approaches to statistical data analysis.

Sample code

In the code below, taken from the examples section of statsmodels, we fit an autoregressive model to the sunspot activity data and use it for forecasting. We also plot the autocorrelation function, which reveals that values are correlated with past values.

import pandas as pd
import statsmodels.api as sm

# sunspots activity data
print sm.datasets.sunspots.NOTE
data = sm.datasets.sunspots.load().endog
dates = sm.tsa.datetools.dates_from_range('1700', '2008')
ts = pd.TimeSeries(data, index=dates)

# plot the autocorrelation function
sm.graphics.tsa.plot_acf(ts, lags=40)

# fit an AR model and forecast
ar_fitted = sm.tsa.AR(ts).fit(maxlag=9, method='mle', disp=-1)
ts_forecast = ar_fitted.predict(start='2008', end='2050')

Statsmodels chart output: sun activity forecast and autocorrelation

Natural language processing

Analyzing text is a difficult and broad task. The nltk package is a very complete package for that purpose. It implements many tools useful for natural language processing and modeling, like tokenization, stemming, and parsing, to name a few.
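As a quick taste of nltk (a minimal sketch, not from the original post; the sentence is made up), here is how you might tokenize a sentence and reduce each word to its stem with the Porter stemmer:

```python
from nltk.stem import PorterStemmer

sentence = "the cats are chasing mice in the garden"

# naive whitespace tokenization; nltk.word_tokenize is smarter,
# but requires downloading the punkt tokenizer data first
tokens = sentence.split()

# reduce each token to its stem, e.g. 'cats' -> 'cat'
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]
```

Stemming like this is a common preprocessing step before feeding text features into scikit-learn models.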

Bayesian inference

Finally, PyMC is a Python package devoted to Bayesian inference. This package allows you to easily construct, fit, and analyze your probabilistic models.

If you are not familiar with Bayesian inference, we recommend the excellent Probabilistic Programming and Bayesian Methods for Hackers by Cameron Davidson-Pilon. The book is entirely written as IPython notebooks, and contains lots of concrete examples using PyMC code.

Sample code

Here’s one simple example, taken from the notebook. The purpose is to infer whether the user has changed his text-messaging behavior, based on a time series of text-message counts.

Time series data graph: texting habits over time

The first step in Bayesian inference is to propose a probabilistic model for the data. Here, for instance, it is assumed that the user changed his behavior at some time tau: before that time, he was sending messages at a rate lambda_1, and afterwards at a rate lambda_2. The Bayesian approach then allows you to infer the whole probability distribution of tau, lambda_1, and lambda_2 given the observed data, not just single point estimates.

Posterior probability distributions of the variables

The probability distributions are usually obtained by Markov Chain Monte Carlo sampling, as done in the example code.

import numpy as np
import pymc as pm

# probabilistic model
alpha = 1.0 / count_data.mean()  # count_data is the variable
                                 # that holds our txt counts
n_count_data = len(count_data)

lambda_1 = pm.Exponential("lambda_1", alpha)
lambda_2 = pm.Exponential("lambda_2", alpha)

tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data)

@pm.deterministic
def lambda_(tau=tau, lambda_1=lambda_1, lambda_2=lambda_2):
    out = np.zeros(n_count_data)
    out[:tau] = lambda_1  # lambda before tau is lambda_1
    out[tau:] = lambda_2  # lambda after (and including) tau is lambda_2
    return out

observation = pm.Poisson("obs", lambda_, value=count_data, observed=True)
model = pm.Model([observation, lambda_1, lambda_2, tau])

# MCMC sampling
mcmc = pm.MCMC(model)
mcmc.sample(40000, 10000, 1)
lambda_1_samples = mcmc.trace('lambda_1')[:]
lambda_2_samples = mcmc.trace('lambda_2')[:]
tau_samples = mcmc.trace('tau')[:]

Web scraping: beautifulsoup, urllib2, …

Many websites contain a lot of interesting and useful data. Unfortunately, that data is rarely available in a nice tabular format to download! It is only displayed, scattered across the web page, or even spread over several pages.

Suppose you wish to retrieve data on the most popular movies of 2014, displayed on the IMDb site. Unfortunately, as you can see, the movie title, its rating, and so on are scattered across the web page. In Chrome, if you right-click on any element, such as the movie title, and select “Inspect element”, you will see which part of the HTML code it corresponds to.

Screenshots: Most Popular Feature Films Released in 2014 and snippet of the HTML code

The goal of web scraping is to systematically recover data displayed on websites. Several Python packages are useful to this end.

First, requests or urllib2 allows you to retrieve the HTML content of the pages. You can then industrialize your browsing and systematically fetch the related content.

Then, BeautifulSoup or lxml allows you to efficiently parse the HTML content. Once you understand the page structure, you can easily extract each datum displayed on the page.

Sample code

In the code below for instance, we first parse the HTML content to get the list of all movies. Then for each movie, we retrieve its ranking number, title, outline, rating and genre.

One of our data scientists used this kind of web scraping to build his personalized movie recommender system!

import pandas as pd
import urllib2
from bs4 import BeautifulSoup

# get the html content
url = ""  # url of the IMDb page to scrape
page = urllib2.urlopen(url).read()

# parse the HTML
soup = BeautifulSoup(page)

# find all movies
movies = soup.find("table", {"class": "results"}).findAll('tr')

# get information for each movie
records = []
for movie in movies:
    record = {}
    record['number'] = movie.find('td', {"class": "number"}).text
    record['title'] = movie.find('a')['title']
    record['rating'] = movie.find('div', {"class": "rating-list"})['title']
    record['outline'] = movie.find('span', {"class": "outline"}).text
    record['credit'] = movie.find('span', {"class": "credit"}).text
    record['genres'] = movie.find('span', {"class": "genre"}).text.split('|')
    records.append(record)

# output in a dataframe
df = pd.DataFrame(records)

Visualization: seaborn, ggplot

The standard plotting package in Python is matplotlib, which lets you make simple plots rather easily. Matplotlib is also a very flexible plotting library: you can use it to make arbitrarily complex plots and customize them at will.

However using Matplotlib can be frustrating at times, for two reasons. First, the Matplotlib default aesthetic is not especially attractive, and you may end up doing a lot of manual tweaking to get awesome-looking plots. Second, Matplotlib is not well suited for exploratory data analysis, when you want to quickly analyze your data across several dimensions. Your code will often end up being verbose and lengthy.

Fortunately, there are additional packages to make better visualizations, and more easily.

First, there are the plotting capabilities of pandas, the data manipulation package. This greatly simplifies exploratory data analysis, as you get visualizations straight from your data frames.

import numpy as np
import pandas as pd

# a sample dataframe with four numeric columns
df = pd.DataFrame(np.random.rand(50, 4), columns=['A', 'B', 'C', 'D'])

# bar plot
df['A'].plot(kind='bar')
# kernel density estimate
df['A'].plot(kind='kde')
# scatter plot of A vs B, color given by D, and size by C
df.plot(kind='scatter', x='A', y='B', s=100*df['C'], c='D');

Second, seaborn is a great way to enhance the aesthetics of your Matplotlib visualizations. Indeed, simply add this at the beginning of your notebook:

import seaborn as sns

and all your Matplotlib plots will be much prettier! Seaborn also comes with better color palettes and utility functions for removing chartjunk. In addition, seaborn has a lot of useful functions for exploratory data analysis, such as the clustermap, the corrplot, or the pairplot and lmplot as in the example below.

import seaborn as sns
# pairplot of iris data
df = sns.load_dataset("iris")
sns.pairplot(df, hue="species", size=2.5)
# faceted logistic regression of titanic data
df = sns.load_dataset("titanic")
pal = dict(male="#6495ED", female="#F08080")
g = sns.lmplot("age", "survived", col="sex", hue="sex", data=df,
               palette=pal, y_jitter=.02, logistic=True)
g.set(xlim=(0, 80), ylim=(-.05, 1.05))

Seaborn pairplot Seaborn lmplot

Be sure to check out the gallery for many more examples.

And finally, there is the ggplot package, which is based on R's ggplot2 package. Built on the grammar of graphics, it lets you build visualizations from a data frame with a very clear syntax. For instance, this is how you can do a scatter plot of A vs B and add a trend line.

from ggplot import *
p = ggplot(aes(x='A', y='B'), data=df)
p + geom_point() + stat_smooth(color='blue')