Becoming a Data Scientist

Data Science, Machine Learning, Big Data Analytics, Cognitive Computing … well, all of us have been inundated with articles, skills-demand infographics and points of view on these topics (yawn!). One thing is for sure: you cannot become a data scientist overnight. It's a journey, and certainly a challenging one. But how do you go about becoming one? Where do you start? When do you start seeing light at the end of the tunnel? What is the learning roadmap? What tools and techniques do you need to know? How will you know when you have achieved your goal?

Given how critical visualization is for data science, ironically I was not able to find (with a few exceptions) a pragmatic yet visual representation of what it takes to become a data scientist. So here is my modest attempt at creating a curriculum, a learning plan that one can use on the journey to becoming a data scientist. I took inspiration from metro maps and used that idea to depict the learning path. I organized the overall plan progressively into the following areas / domains:

  1. Fundamentals
  2. Statistics
  3. Programming
  4. Machine Learning
  5. Text Mining / Natural Language Processing
  6. Data Visualization
  7. Big Data
  8. Data Ingestion
  9. Data Munging
  10. Toolbox

Each area / domain is represented as a “metro line”, with the stations depicting the topics you must learn / master / understand in a progressive fashion. The idea is that you pick a line, catch a train and go through all the stations (topics) until you reach the final destination, or switch to the next line. I have numbered the lines 1 through 10 to indicate the order in which you travel. You can use this as an individual learning plan to identify the areas you most want to develop and to acquire the skills. This is by no means the end, but it is a solid start. Feel free to leave your comments and constructive feedback.

PS: I did not want to impose the use of any commercial tools in this plan. I have based it, for the most part, on tools and libraries available as open source. If you have access to commercial software such as IBM SPSS or SAS Enterprise Miner, by all means go for it. The plan still holds.

PS: I originally wanted to create an interactive visualization using D3.js or InfoVis, but I wanted to get this out quickly. Maybe I will do an interactive map in the next iteration.

Road to data scientist



Tôi bận đọc (I'm Busy Reading) – Nguyễn Thị Ngọc Minh

How to Study

Dedicated to my first-year students

I learned this from a teacher of English for literary theory, who did his undergraduate studies in Sweden and his master's in England and is 7 years younger than me. He never ceases to astonish me with his breadth of knowledge, his clear thinking and his deeply insightful way of looking at problems.

Many of my students complain: “There are too many books at university and we don't have enough time to read them. How can we possibly cope when, in a single semester, a course such as Western Literature or Russian Literature alone requires reading more than a dozen novels, each several hundred pages long, not to mention all the core textbooks and reference materials?” The young teacher of…


Study Source Code: Episode 1 –


Whenever I talk to other people, I say, “I love open source software and hate proprietary software.” But have you ever read the source code yourself? I rarely read the source code of the tools that I use, yet I have benefited a great deal whenever I took a look. Also, these days, our team is stuck with our Hadoop environment. Some people complain that it is all screwed up, but they cannot correctly identify what is going wrong. Other people say everything is fine, but they cannot face the fact that all the Hive queries take much longer than everyone expected. Based on the two things mentioned above, I decided to take a look at the source of Hadoop/HDFS, etc. I am more of a Python programmer, and my knowledge of Java goes no further than `System.out.println()` and `java -jar` to run a jar file. However, since…
