Python has many packages useful for data science. If you are new to Python, you may feel lost and wonder which package to learn first. Here we list the most important packages for data science to help you get started.
Six packages are fundamental for data science with Python. Together they form the basic scientific Python stack: NumPy, SciPy, pandas, IPython, Matplotlib, and scikit-learn.
At this point, you may feel overwhelmed by the number of packages to master. Happily, there are many helpful resources and excellent tutorials to get you started. We highly recommend following the tutorials from the SciPy conference or the Scientific Python lectures by J.R. Johansson. In a few days, you’ll be all set for data science with Python!
You’ll first need to learn how to use the IPython notebook, as it is the interactive environment used in many tutorials. It’s also the main way to prototype your Python code in Dataiku DSS. Furthermore, there is an impressive gallery of notebooks to learn from and draw inspiration from.
If you need to set priorities, you should probably focus on pandas, scikit-learn and Matplotlib.
Most of a Python data scientist’s time is spent on pandas for data preparation and aggregation, scikit-learn for machine learning, and Matplotlib for plots.
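To make this division of labor concrete, here is a minimal sketch of the three packages working together. The toy data, the column names, and the choice of a linear regression are assumptions made purely for illustration, not a recommended workflow.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: advertising spend and sales, with one missing value.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "spend": [1.0, 2.0, None, 4.0, 5.0],
    "sales": [2.0, 4.0, 6.0, 8.0, 10.0],
})

# pandas: data preparation (drop the incomplete row) and aggregation.
clean = df.dropna(subset=["spend"])
sales_by_region = clean.groupby("region")["sales"].sum()

# scikit-learn: fit a simple model on the prepared data.
model = LinearRegression()
model.fit(clean[["spend"]], clean["sales"])

# Matplotlib: plot the observations and the fitted line.
fig, ax = plt.subplots()
ax.scatter(clean["spend"], clean["sales"], label="observations")
ax.plot(clean["spend"], model.predict(clean[["spend"]]), label="linear fit")
ax.set_xlabel("spend")
ax.set_ylabel("sales")
ax.legend()
```

In a notebook you would usually skip the `matplotlib.use("Agg")` line and let the figure render inline.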
NumPy and SciPy are rarely used directly, but it is useful to know at least a bit about them, since they are the fundamental building blocks of the other packages.
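As a small illustration of that building-block role, the sketch below shows a pandas DataFrame exposing its data as a NumPy array and NumPy functions operating on pandas objects. The DataFrame contents are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data.
df = pd.DataFrame({"x": [1.0, 4.0, 9.0], "y": [4.0, 5.0, 6.0]})

# The values of a DataFrame can be obtained as a plain NumPy array.
values = df.to_numpy()

# NumPy functions work on that array...
col_means = np.mean(values, axis=0)

# ...and also directly on pandas objects, returning a pandas Series here.
roots = np.sqrt(df["x"])
```

Even if you rarely call NumPy yourself, recognizing array shapes and dtypes helps when reading error messages from pandas or scikit-learn.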
If you need a quickstart for pandas, you can follow the 10 Minutes to pandas tour. If you have experience with SQL or R, most data manipulation in pandas will seem familiar. The Comparison with R / R libraries and Comparison with SQL sections of the pandas documentation provide a useful introduction.
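For readers coming from SQL, the following sketch shows how two common queries might translate to pandas. The `orders` table and its columns are hypothetical, and the SQL in the comments is only an approximate equivalent.

```python
import pandas as pd

# Hypothetical table of orders.
orders = pd.DataFrame({
    "customer": ["a", "b", "a", "c", "b"],
    "amount": [10, 20, 30, 40, 50],
})

# SQL: SELECT * FROM orders WHERE amount > 15;
large = orders[orders["amount"] > 15]

# SQL: SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer;
totals = (
    orders.groupby("customer", as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "total"})
)
```

Boolean indexing plays the role of `WHERE`, and `groupby` followed by an aggregation plays the role of `GROUP BY`.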
If you want to go further, head over to our post on more Python packages for data science.