5 Best Python Libraries For Data Science

So, you have decided to learn Python as your programming language.

“What are the different Python libraries available to perform data analysis?”

This will probably be the next question on your mind. There are many Python libraries for data analysis, but don't worry: you don't have to learn all of them. Knowing just five libraries is enough to do most data analysis tasks. I will give a short introduction to each of these libraries and point you to some of the best tutorials for learning them.

So let’s get started.


NumPy

NumPy is the foundation on which all higher-level tools for scientific Python are built. Here are some of the functionalities it provides:

  1. The N-dimensional array, a fast and memory-efficient multidimensional array providing vectorized arithmetic operations.
  2. You can apply standard mathematical operations to entire arrays of data without writing loops.
  3. It is very easy to transfer data to external libraries written in a low-level language (such as C or C++), and for those libraries to return data to Python as NumPy arrays.
  4. Linear algebra, Fourier transforms, and random number generation.

Although NumPy does not provide high-level data analysis functionality, an understanding of NumPy arrays and array-oriented computing will help you use tools like Pandas much more effectively.
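To make the first two points above concrete, here is a minimal sketch of NumPy's vectorized arithmetic; the array names are made up for illustration:

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0, 40.0])
quantities = np.array([1, 2, 3, 4])

# Element-wise multiplication over the entire arrays: no explicit loop.
totals = prices * quantities

# Standard mathematical operations also apply element-wise.
discounted = totals * 0.9

print(totals)             # [ 10.  40.  90. 160.]
print(discounted.sum())   # 270.0
```

The same expressions would require an explicit `for` loop over plain Python lists; with NumPy arrays the loop happens in compiled code, which is both faster and more memory efficient.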


Tutorials

  1. Scipy.org provides a brief description of the NumPy package.
  2. Here is an amazing tutorial that focuses entirely on the usability of NumPy.


SciPy

The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation. SciPy is built to work with NumPy arrays and provides many user-friendly and efficient numerical routines, such as routines for numerical integration and optimization. It has modules for optimization, linear algebra, integration, and other common tasks in data science.
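As a small illustration of the numerical routines mentioned above, here is a sketch using `scipy.integrate` and `scipy.optimize` (the example functions are arbitrary):

```python
from scipy import integrate, optimize

# Numerical integration: integrate x**2 from 0 to 1 (exact answer: 1/3).
area, error = integrate.quad(lambda x: x ** 2, 0, 1)

# Optimization: find the minimum of (x - 3)**2, which lies at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)

print(area)        # ~0.3333
print(result.x)    # ~3.0
```

Both routines accept plain Python callables and return results built on NumPy's numeric types, which is what "built to work with NumPy arrays" means in practice.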


Tutorials

I couldn’t find any good tutorial other than Scipy.org. It is the best resource for learning SciPy.


Pandas

Pandas contains high-level data structures and tools designed to make data analysis fast and easy. It is built on top of NumPy and is easy to use in NumPy-centric applications. Some of its key features:

  1. Data structures with labeled axes, supporting automatic or explicit data alignment. This prevents common errors resulting from misaligned data and makes it easier to work with differently indexed data coming from different sources.
  2. Pandas makes it easier to handle missing data.
  3. Merge and other relational operations found in popular databases (SQL-based, for example).

Pandas is the best tool for data munging.
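The features listed above can be sketched in a few lines; the column names and values here are invented for illustration:

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    "city": ["Boston", "Chicago", "Denver"],
    "revenue": [100.0, np.nan, 250.0],   # one missing value
})
regions = pd.DataFrame({
    "city": ["Boston", "Denver"],
    "region": ["East", "West"],
})

# Handle missing data: fill the NaN with the column mean.
sales["revenue"] = sales["revenue"].fillna(sales["revenue"].mean())

# SQL-style join on the shared "city" column, aligned by label.
merged = sales.merge(regions, on="city", how="left")
print(merged)
```

The `merge` call behaves like a SQL `LEFT JOIN`: cities absent from `regions` simply get a missing `region` value rather than raising an error, which is the kind of data munging Pandas is built for.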


Tutorials

  1. Quick intro to Pandas.
  2. Alfred Essa has a series of videos on Pandas. These videos should give you a good grasp of the basic concepts.
  3. Also, don’t miss this tutorial by Shane Neeley; it gives you a comprehensive intro to NumPy, SciPy, and Matplotlib.


Matplotlib

Matplotlib is a Python module for visualization. It allows you to easily make line graphs, pie charts, histograms, and other professional-grade figures, and to customize every aspect of a figure. When used within IPython, Matplotlib has interactive features like zooming and panning. It supports different GUI backends on all operating systems, and can also export graphics to common vector and raster formats: PDF, SVG, JPG, PNG, BMP, GIF, etc.
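Here is a minimal sketch of a line graph exported to PNG. The `"Agg"` backend is chosen so the script runs without a GUI, and the filename is arbitrary:

```python
import os

import matplotlib
matplotlib.use("Agg")          # headless backend: render without a window
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Export to one of the common formats mentioned above.
fig.savefig("squares.png")
print(os.path.exists("squares.png"))  # True
```

Swapping `"squares.png"` for `"squares.pdf"` or `"squares.svg"` is all it takes to get a vector export instead.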


Tutorials

  1. ShowMeDo has a good tutorial on Matplotlib.
  2. I also recommend the Matplotlib cookbook from Packt Publishing. It is an amazing book for someone getting started with Matplotlib.


Scikit-learn

Scikit-learn is a Python module for machine learning built on top of SciPy. It provides a set of common machine learning algorithms through a consistent interface, helping you quickly apply popular algorithms to your dataset. Have a look at the list of algorithms available in scikit-learn, and you will quickly realize that it includes tools for many standard machine learning tasks (such as clustering, classification, and regression).
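The "consistent interface" mentioned above is the `fit`/`predict`/`score` pattern every estimator follows. A short sketch using the bundled iris dataset and a k-nearest-neighbors classifier (the choice of algorithm and parameters here is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Every scikit-learn estimator exposes the same fit/predict/score API.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(accuracy)
```

Because the interface is uniform, swapping `KNeighborsClassifier` for, say, a decision tree or logistic regression requires changing only the constructor line.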


Tutorials

  1. Introduction to Scikit-learn.
  2. Tutorials from Scikit-learn.org.


There are also other libraries, such as NLTK (the Natural Language Toolkit), Scrapy for web scraping, Pattern for web mining, and Theano for deep learning. But if you are getting started in Python, I would recommend first getting familiar with these five libraries. The tutorials I have mentioned are beginner friendly; before going through them, make sure you are familiar with the basics of Python programming.

Becoming a Data Scientist

Data Science, Machine Learning, Big Data Analytics, Cognitive Computing… well, all of us have been avalanched with articles, skills-demand infographics, and points of view on these topics (yawn!). One thing is for sure: you cannot become a data scientist overnight. It’s a journey, and certainly a challenging one. But how do you go about becoming one? Where do you start? When do you start seeing light at the end of the tunnel? What is the learning roadmap? What tools and techniques do you need to know? How will you know when you have achieved your goal?

Given how critical visualization is for data science, it is ironic that I was not able to find (with a few exceptions) a pragmatic yet visual representation of what it takes to become a data scientist. So here is my modest attempt at creating a curriculum, a learning plan that you can use on your journey to becoming a data scientist. I took inspiration from metro maps and used them to depict the learning path. I organized the overall plan progressively into the following areas/domains:

  1. Fundamentals
  2. Statistics
  3. Programming
  4. Machine Learning
  5. Text Mining / Natural Language Processing
  6. Data Visualization
  7. Big Data
  8. Data Ingestion
  9. Data Munging
  10. Toolbox

Each area/domain is represented as a “metro line”, with the stations depicting the topics you must learn/master/understand in a progressive fashion. The idea is that you pick a line, catch a train, and go through all the stations (topics) until you reach the final destination (or switch to the next line). I have marked each line 1 through 10 to indicate the order in which you travel. You can use this as an individual learning plan to identify the areas you most want to develop and the skills you want to acquire. By no means is this the end, but it is a solid start. Feel free to leave your comments and constructive feedback.

PS: I did not want to impose the use of any commercial tools in this plan, so I have based it on tools/libraries that are, for the most part, available as open source. If you have access to commercial software such as IBM SPSS or SAS Enterprise Miner, by all means go for it. The plan still holds.

PS: I originally wanted to create an interactive visualization using D3.js or InfoVis, but I wanted to get this out quickly. Maybe I will do an interactive map in the next iteration.

Road to data scientist

Source: http://webcache.googleusercontent.com/search?q=cache:g_Tt8sMiLdsJ:nirvacana.com/thoughts/becoming-a-data-scientist/+&cd=1&hl=en&ct=clnk&gl=vn