Getting started with Python for Data Science

May 15, 2015

Many Python packages are useful for data science. If you are new to Python, you may feel lost and wonder which package to learn first. Here we list the most important packages for data science to help you get started with Python.

Six packages are fundamental for data science with Python. Together they form the basic scientific Python stack:

  • numpy: basic array manipulation
  • scipy: scientific computing in Python, including signal processing and optimization
  • matplotlib: visualization and plotting
  • IPython: write and run Python code interactively in a shell or a notebook
  • pandas: data manipulation
  • scikit-learn: machine learning
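
Before diving into tutorials, it can help to check which of these packages are already available in your environment. The snippet below is a minimal sketch using only the standard library. It assumes the usual import names, which differ from the package name in one case: scikit-learn is imported as sklearn.

```python
import importlib.util

# Import names of the six packages listed above.
# Note: scikit-learn is imported under the name "sklearn".
packages = ["numpy", "scipy", "matplotlib", "IPython", "pandas", "sklearn"]

# find_spec returns None when a top-level package cannot be found,
# so this check does not fail if something is missing.
status = {name: importlib.util.find_spec(name) is not None for name in packages}

for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

Anything reported as missing can usually be installed with your package manager of choice, for example pip or conda.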

At this point, you may feel overwhelmed by the number of packages to master. Happily, there are many helpful resources and excellent tutorials to get you started. We highly recommend following the tutorials of the SciPy conference, or the scientific Python lectures by JR Johansson. In a few days, you’ll be all set for data science with Python!

You’ll first need to learn how to use the IPython notebook, as many tutorials are built on it. It’s also the main way to prototype your Python code in DSS. Furthermore, there is an impressive gallery of notebooks to learn and get inspiration from.

If you need to set priorities, you should probably focus on pandas, scikit-learn and matplotlib.

A Python data scientist spends most of their time with pandas for data preparation and aggregation, scikit-learn for machine learning, and matplotlib for plots.

Numpy and scipy are rarely used directly, but it is useful to know at least a bit about them, as they are the fundamental building blocks of the other packages.
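
To make the "building block" idea concrete, here is a small sketch showing that pandas data converts to a numpy array and that numpy functions work on pandas objects. The column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for this example.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0],
    "weight_kg": [50.0, 60.0, 70.0],
})

# A pandas column can be converted to a plain numpy array...
heights = df["height_cm"].to_numpy()
print(type(heights))

# ...and numpy functions accept pandas objects directly.
mean_height = np.mean(df["height_cm"])
print(mean_height)  # 160.0
```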

If you need a quickstart for pandas, you can follow the 10 Minutes to pandas tour. If you are coming from the SQL or R world, most of the data manipulation in pandas will seem familiar to you. The Comparison with R / R libraries and Comparison with SQL sections of the documentation will give you a useful introduction.
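
As an illustration of how familiar pandas can feel to SQL users, here is a minimal sketch of a filter plus GROUP BY. The sales table, its columns, and its values are hypothetical, and the SQL in the comment is only a rough equivalent.

```python
import pandas as pd

# Hypothetical sales table, invented for this example.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "amount": [10, 20, 30, 40],
})

# Roughly equivalent to:
#   SELECT region, SUM(amount) AS total
#   FROM sales
#   WHERE amount > 10
#   GROUP BY region;
totals = (
    sales[sales["amount"] > 10]             # WHERE amount > 10
    .groupby("region", as_index=False)      # GROUP BY region
    ["amount"]
    .sum()                                  # SUM(amount)
    .rename(columns={"amount": "total"})    # AS total
)
print(totals)
```

The boolean mask plays the role of the WHERE clause, and groupby followed by an aggregation plays the role of GROUP BY.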

For matplotlib, besides the tutorials, it is often useful to check out the gallery. You will easily find starter code for the most common plots!
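
For reference, starter code for a basic line plot typically looks like the sketch below. The data, labels, and output file name are made up for illustration. The Agg backend is selected only so that the script runs without a display, and you can drop that line in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Made-up data for illustration.
x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A first line plot")
ax.legend()

# Hypothetical output file name.
fig.savefig("first_plot.png")
```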

Finally, learning scikit-learn is easy! The documentation site is impressively rich: tutorials, many examples, and a very complete user guide.
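
To give a feel for the library, here is a minimal sketch of the usual scikit-learn workflow: split the data, fit an estimator, then score it. The choice of the bundled iris dataset, logistic regression, and the split parameters is ours for illustration, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small example dataset bundled with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples to evaluate the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Most scikit-learn estimators follow the same fit / predict / score pattern.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Mean accuracy on the held-out samples.
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Because estimators share this interface, swapping in another model usually means changing only the line that creates it.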

If you want to go further and learn about more Python packages for data science, head over to our more Python packages for Data Science post.