
Getting started with Python for Data Science

May 15, 2015

Many Python packages are useful for data science, and if you are new to Python, you may feel lost and wonder which one to learn first. This post lists the most important packages for data science to help you get started with Python.

Six packages are fundamental to data science with Python. Together they form the basic scientific Python stack:

  • NumPy: basic array manipulation
  • SciPy: scientific computing in Python, including signal processing and optimization
  • Matplotlib: visualization and plotting
  • IPython: write and run Python code interactively in a shell or a notebook
  • pandas: data manipulation
  • scikit-learn: machine learning

Figure: The basic scientific Python stack
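
A quick way to see which of these six packages are already installed is a small check like the one below. It is only a sketch: the helper name check_stack is made up for this post, and the only non-obvious detail is that scikit-learn is imported under the name sklearn.

```python
import importlib.util

# Import names for the six packages of the basic stack.
# scikit-learn is imported as "sklearn"; IPython keeps its capital letters.
STACK = {
    "NumPy": "numpy",
    "SciPy": "scipy",
    "Matplotlib": "matplotlib",
    "IPython": "IPython",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
}


def check_stack(packages=STACK):
    """Hypothetical helper: map each package name to True if it can be imported."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in packages.items()
    }


if __name__ == "__main__":
    for name, installed in check_stack().items():
        print(f"{name:<14}{'installed' if installed else 'missing'}")
```

Anything reported as missing can usually be installed with pip or conda, or all at once through a scientific Python distribution.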

At this point, you may feel overwhelmed by the number of packages to master. Happily, there are many helpful resources and excellent tutorials to get you started. We highly recommend following the tutorials of the SciPy conference, or the scientific Python lectures by JR Johansson. In a few days, you’ll be all set for data science with Python!

You’ll first need to learn how to use the IPython notebook, as it is the IDE used in many tutorials. It’s also the main way to prototype your Python code in Dataiku DSS. Furthermore, there is an impressive gallery of notebooks to learn and get inspiration from.

If you need to set priorities, you should probably focus on pandas, scikit-learn and Matplotlib.

Most of a Python data scientist’s time is spent on pandas for data preparation and aggregation, scikit-learn for machine learning, and Matplotlib for plots.

NumPy and SciPy are rarely used directly, but it is useful to know at least a bit about them, since they are the fundamental building blocks of the other packages.
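
To make the building-block relationship concrete, here is a minimal sketch of how a pandas column relates to a NumPy array. The column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up measurements, just to have something to work with.
df = pd.DataFrame(
    {
        "height_cm": [150.0, 160.0, 170.0],
        "weight_kg": [50.0, 60.0, 70.0],
    }
)

# A pandas column can be handed back as a plain NumPy array.
heights = df["height_cm"].to_numpy()
print(type(heights))

# NumPy functions accept pandas columns directly.
col_mean = np.mean(df["weight_kg"])
print(col_mean)

# Element-wise NumPy functions applied to a column return a pandas Series.
log_weights = np.log(df["weight_kg"])
print(type(log_weights))
```

In day-to-day work you would mostly stay at the pandas level, but error messages and documentation often mention arrays and dtypes, which is where a little NumPy knowledge pays off.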

If you need a quickstart for pandas, you can follow the 10 Minutes to pandas tour. If you are coming from SQL or R, most pandas data manipulation will seem familiar to you. The Comparison with R / R libraries and Comparison with SQL sections of the pandas documentation provide a useful introduction.
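
As a taste of how closely pandas maps to SQL, the sketch below writes a filter, group, aggregate, and sort query in pandas, with the equivalent SQL in a comment. The sales table and its columns are hypothetical, and a reasonably recent pandas is assumed for the named aggregation syntax.

```python
import pandas as pd

# Hypothetical sales table, invented for illustration.
sales = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south", "east"],
        "amount": [100, 200, 150, 50, 300],
    }
)

# SQL equivalent:
#   SELECT region, SUM(amount) AS total
#   FROM sales
#   WHERE amount > 60
#   GROUP BY region
#   ORDER BY total DESC;
totals = (
    sales[sales["amount"] > 60]             # WHERE
    .groupby("region", as_index=False)      # GROUP BY
    .agg(total=("amount", "sum"))           # SUM(amount) AS total
    .sort_values("total", ascending=False)  # ORDER BY total DESC
)
print(totals)
```

Each step in the chain corresponds to one SQL clause, which is why the pattern tends to feel familiar to SQL users.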

For Matplotlib, besides tutorials, it is often useful to check out the gallery. You will quickly find starter code for the most common plots!
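
Starter code from the gallery typically looks something like the sketch below: create a figure, draw on its axes, label it, and save it. The output file name first_plot.png is an arbitrary choice, and the Agg backend line is only there so the script also runs on a machine without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# One figure with a single set of axes.
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)", linestyle="--")

# Labels, title and legend make the plot readable on its own.
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.set_title("A first Matplotlib plot")
ax.legend()

# Arbitrary file name chosen for this example.
fig.savefig("first_plot.png", dpi=100)
```

In a notebook, you can drop the backend line and the savefig call, since the figure is displayed inline.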

Finally, learning scikit-learn is easy! The documentation site is impressively rich: tutorials, many examples, and a very complete user guide.
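
Part of what makes scikit-learn easy to pick up is that most models follow the same fit, predict, score pattern. The sketch below shows it on the iris dataset bundled with the library. The choice of a random forest and its settings is arbitrary, and a reasonably recent scikit-learn is assumed (train_test_split lives in sklearn.model_selection).

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Features and labels of the bundled iris dataset, as NumPy arrays.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples to evaluate the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Every scikit-learn estimator exposes the same fit / predict / score methods.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Swapping in another classifier usually means changing only the import and the line that creates the model.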

If you want to go further, head over to our post on more Python packages for data science.