“I hear and I forget. I see and I remember. I do and I understand” – Confucius
With the plethora of free (or at least reasonably priced) high-quality massive open online courses (MOOCs), free online textbooks and tutorials, the tools available to aspirant data science apprentices are many and varied. From courses offered by Coursera to freely available eBooks and code examples to download from GitHub, there are many useful resources at our disposal.
Demand for data science skills remains consistently high. IBM predicts that appetite for data scientists will grow 28% by 2020. Job postings for data science skills in South Africa are rising rapidly as companies begin to realise the true value of their data initiatives.
According to IBM, the current most desirable and lucrative skills include machine learning, data science, Hadoop, Hive, Pig and MapReduce. It is interesting to note just how many data engineering type skills are in demand. I recently started to set up a data lab at the Foundery based on the Hortonworks distribution of Hadoop, and I can understand why this is true – (big) data engineering is unnecessarily complicated!
Over the last few years, I have completed (and sometimes part-completed) some data science MOOCs and tutorials. I have downloaded free eBooks and textbooks – some good and some not so good. These, along with the MOOCs, have become my primary source of knowledge and skills development in the data science domain. I am finding this form of online learning to be a very efficient and effective way to grow my knowledge and expertise. However, my choice of which courses to do has been haphazard at best and having this much choice has also made it difficult to find the right courses to pursue, often leading to me abandoning classes or not learning as well as I should.
The purpose of this blog, therefore, is twofold: to create a thoughtful and considered curriculum that I can follow to elevate my data science mastery, and to share with you some of the resources that I have collated in researching this proposed curriculum. Whether you are a seasoned data science expert or an absolute beginner in the field, I believe there is value in some, if not all, of the topics in the curriculum.
The ultimate ambition of completing this proposed curriculum is to vastly (and more efficiently) improve my mathematics, statistics, algorithm development, programming and data visualisation skills to go from a journeyman-level understanding of data science to full-on mastery of advanced data science concepts.
I want to DO so that I can better UNDERSTAND. Eventually, I’d like to understand and implement advanced machine learning and deep learning concepts (both from a theoretical and practical perspective) as well as obtain more in-depth expertise in big data technology. I also aim to improve my data visualisation skills so that I can have more impactful, interesting and valuable discussions with our business stakeholders and clients.
The day that I can have a debate with my maths colleagues about advanced mathematical concepts, compete with the computer scientists on HackerRank coding challenges, run my models on a big data platform that I have set up, create beautiful and insightful visualisations AND make this all understandable to my wife and daughter is the day when I know I have been successful in this endeavour.
I proposed this curriculum based on the skills that are commonly acknowledged to be required for data science as well as on course ratings, popularity, participant reviews and cost. I have tried to be as focussed as possible and my thinking is that this is the most efficient plan to get deep data science skills.
This curriculum will be based on open-source programming languages only, namely Python and R. My initial focus will be on improving my Python skills where possible as I want to get this up to a level where I can implement Python-based machine learning models in NumPy/SciPy. I do acknowledge, however, that for many of the stats and maths related courses, R is often preferred and in that event, I will switch.
Given my work commitments and the fact that we have a new (and very loud) addition to our family, I think that I would likely only be able to devote 10 hours a week to this challenge. My proposed timetable will, therefore, be based on this estimate. The current estimate to fully complete the curriculum is at 110 weeks or just over 2 years! This is going to be a long journey…
I plan to update this blog periodically as and when I complete a course. My updates will include a more detailed summary of the course, an in-depth review and score, how much it cost me as well as tracking how long the course took to complete relative to the advised timeframe provided by the course facilitators. My time estimates will be slightly more conservative relative to the time estimates for each course as, in my experience, it always takes longer than suggested.
Thank you for reading this far. If you wish to join me in growing your data science skills (almost for free) and help keep me honest and accountable in completing this curriculum, then please do read on.
Data Science Curriculum
0. Supplementary resources and setup
Sticking to the blog’s theme of finding low-cost resources for this curriculum wherever possible, I have found a few high-quality free online maths and stats textbooks. These will serve as useful reference material for the bulk of the curriculum. They are:
- Think Stats – a freely downloadable introductory book on Probability and Statistics for Python. Code examples and solutions are provided via the book’s GitHub repository.
- An Introduction to Statistical Learning with Applications in R – another freely available book that is described as the “how to” manual for statistical learning. This textbook provides an introduction to machine learning and contains code examples written in R. There is online course material that accompanies this book, which can be found here as well as here. I will use this manual and potentially its associated MOOCs as a reference when I begin the machine learning component of this curriculum.
- Introduction to Linear Algebra is the accompanying linear algebra reference book for the MIT Open Courseware MOOC. Unlike the books above, it is not free and will have to be purchased should the MOOC require it.
- Although not free, Python Machine Learning by Sebastian Raschka has good reviews as a reference book for machine learning applications in Python. The book also has an accompanying code repository on GitHub.
1. Start by focusing on maths and stats
The first section of the curriculum will allow us to concentrate on redeveloping fundamentals in mathematics and statistics as they relate to data science. University, over a decade ago now, was the last time I did any proper maths (yes, engineering maths is ‘proper mathematics’ to all you engineering-maths naysayers).
Regarding learning the mathematics and statistics required for data science and machine learning, I will focus on the following courses.
- Statistics with R Specialisation – There were many courses available to improve my stats knowledge. Ultimately, I settled on this Coursera specialisation by Duke University as it seemed the most comprehensive and the textbook seems a good companion book. This specialisation comprises 5 courses – Introduction to Probability and Data, Inferential Statistics, Linear Regression and Modelling, Bayesian Statistics and a capstone project written in R. Each course will take 5 weeks and will require 5-7 hours of effort per week. I will use this set of courses to improve my R skills, and I will audit courses if possible or I may have to pay for the specialisation. [Total time estimate: 250 hours]
- Multivariable Calculus – (Ohio State University) referencing the Multivariable calculus sections of the Khan Academy where required. This highly rated course (average rating of 4.8 out of 5 stars) will provide me with a refresher of calculus and will take approximately 25 hours to watch all the videos. I think I can safely add the same amount of time to go through all the tutorials and exercises putting the length of this study at 50 hours. [Total time estimate: 50 hours]
- Linear Algebra – (MIT Open Courseware) referencing the Linear Algebra sections of the Khan Academy where required. I don’t know how long this should take to complete, so I will base my estimate on the previous course’s estimate of 50 hours. I chose this course because the author of the Introduction to Linear Algebra textbook, MIT Professor Gilbert Strang, conducts this MOOC. [Total time estimate: 50 hours]
2. Time to improve my management skills of data science projects, experiments and teams
A large part of my work at my previous employer and at my current job at the Foundery is to manage various data science projects and teams. I have a lot of practical experience in this domain, but I don’t think it would hurt to go back and refresh some of the core concepts that relate to effective data science project management. To this end, I managed to find an appropriate Coursera specialisation that aims to help data science project managers “assemble the right team”, “ask the right questions”, and “avoid the mistakes that derail data science projects”.
- Executive Data Science Specialization – Johns Hopkins University. The entire specialisation is only 5 weeks long, and requires 4-6 hours a week of effort. The courses that are on offer are titled “A Crash Course in Data Science”, “Building a Data Science Team”, “Managing Data Analysis”, “Data Science in Real Life” and “Executive Data Science Capstone”. I wasn’t able to obtain rating information for this specialisation. [Total time estimate: 50 hours]
3. Improve my computer science and software engineering skills
When I first started out, I managed to pick up a few Unix skills (just enough to be dangerous as evidenced when I once took out a production server with an errant Unix command). Since then, and over time, I have lost the little that I knew (luckily for production support teams).
New and exciting software engineering paradigms such as DevOps have emerged, and code repository solutions like GitHub are now commonly used in both the data science and development industries. As such, I thought that some study in this domain would be useful in my journey.
I would also like to increase my knowledge of data structures and algorithms from both a practical and theoretical perspective. To this end, I have found an exciting and challenging University of California San Diego Coursera specialisation called “Master Algorithmic Programming Techniques”.
The courses that I am planning to complete to improve my computer science and software engineering skills are:
- How to use Git and GitHub – a freely available course offered by Udacity with input from GitHub. This course is a 3-week MOOC, is rated 4.5 out of 5 stars from 41 student reviews and will require 6 hours of commitment per week. It introduces Git and GitHub and will help me to learn how to use source control better, which in turn will greatly assist with project delivery of medium to large sized data science projects. [Total time estimate: 30 hours]
- Introduction to Linux – a freely available course from edX. This is an 8-week course rated 4 out of 5 stars from 118 student reviews, with over 250 000 students enrolled. Thoroughly covering this material will take between 40 and 60 hours per the course notes. Gaining a firm understanding of Linux will allow me more control when using the open source data science environments and tools. [Total time estimate: 60 hours]
- Introduction to DevOps – Udacity. This free course introduces the concept of DevOps and explains how to implement continuous integration, continuous testing, continuous deployment and release management processes into your development workflow. I am very interested to see how this could be applied to the data science world. The course does not have a rating and is 3 weeks in length requiring 2-3 hours per week of effort. [Total time estimate: 10 hours]
- Master Algorithmic Programming Techniques – This Coursera specialisation by the University of California San Diego comprises 6 courses: Algorithmic Toolbox, Data Structures, Algorithms on Graphs, Algorithms on Strings, Advanced Algorithms and Complexity, and Genome Assembly Programming Challenge. Each course is 4 weeks of study at 4-8 hours per week. The individual courses were rated between 3.5 and 4.5 stars.
What excited me about this specialisation is that I would get an opportunity to learn and implement over 100 algorithms in a programming language of my choice from the ground up. I think that this would certainly improve both my knowledge about algorithms as well as my programming skills.
After looking a bit deeper at the course structure, it seems as if this specialisation is paid for at $49 per month until you complete it. So, the faster I do this, the cheaper it’ll be – nice incentive! [Total time estimate: 235 hours]
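As a rough sketch of that incentive, the snippet below estimates the subscription bill for different weekly study commitments. The $49 monthly fee and the 235-hour estimate come from the text above; billing per started month, the 52/12 weeks-per-month conversion and the function name `subscription_cost` are illustrative assumptions rather than Coursera's actual billing rules.

```python
import math

MONTHLY_FEE_USD = 49          # from the specialisation's pricing quoted above
WEEKS_PER_MONTH = 52 / 12     # assumption: average month length in weeks


def subscription_cost(total_hours, hours_per_week):
    """Estimate the total fee, assuming every started month is billed in full."""
    weeks_needed = total_hours / hours_per_week
    months_billed = math.ceil(weeks_needed / WEEKS_PER_MONTH)
    return months_billed * MONTHLY_FEE_USD


if __name__ == "__main__":
    # Compare the planned 10 hours a week with a faster 20 hours a week.
    for pace in (10, 20):
        cost = subscription_cost(235, pace)
        print(f"{pace} hours/week -> about ${cost}")
```

Under these assumptions, doubling the weekly pace halves the number of billed months, which is the whole point of the incentive.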
4. Improve my base data science skills and up my Python coding abilities
At this stage of the curriculum, I would have solidified my maths and stats skills, improved my computer science and software engineering skillset, and brushed up on some data science project management theory. Before embarking on intensive machine learning material, I think that it might be a good decision to get back to basics and look at improving my base data science and visualisation skills and upping my Python coding abilities while at it.
One of my goals for this curriculum was to improve my communication skills by becoming a real data story-teller. An effective way to do this is to learn how to visualise data in a more concise, meaningful and, I guess, beautiful manner. I say beautiful because of an amazing data visualisation website called Information is Beautiful. Check it out; you won’t regret it.
- Learning Python for Data Analysis and Visualisation – Udemy. Jose Portilla’s Udemy course is highly rated at 4.6 stars out of 5 from over 4 220 student reviews. Over 47 812 students have enrolled in the course. The length of the videos on this course is 21 hours, so until I can estimate this better, I will add 100% to my time estimate for completing the course. The course is focussed on Python and introduces topics such as Numpy, Pandas, manipulating data, data visualisation, machine learning, basic stats, SQL and web scraping. Udemy often run specials on their courses, so I expect to pick this one up between $10 and $20. [Total time estimate: 50 hours]
- Data Visualization and D3.js – Communicating with Data – Udacity. This free course is part of Udacity’s Data Analyst nanodegree programme. The course provides a background in visualisation fundamentals, data visualisation design principles and will teach you D3.js. It is an intermediate level course that will take approximately 7 weeks to complete at 4-6 hours per week. [Total time estimate: 50 hours]
- HackerRank challenges – HackerRank is a website that provides a very entertaining, gamified way to learn how to code. HackerRank offers daily, weekly and monthly coding challenges that reward you for solving a problem. The difficulty of the questions ranges from “Easy” to “Hard”, and I plan to use this to test my new-and-improved Python skills. Every now and then I will use this form of learning Python as a “break” from the academic slog. [Total time estimate: n/a]
5. Learn the basics of machine learning from both a practical and theoretical perspective
The resurgence of machine learning (the science of “teaching” computers to act without explicitly being programmed) is one of the key factors in the popularity of data science and drives many of the biggest companies today, including the likes of Google, Facebook and Amazon. Machine learning is used in many recent innovations including self-driving cars, natural language processing and advances in medical diagnoses, to name a few. It is a fascinating field, and as such, I want to gain a solid foundational understanding of this topic. It will also lay the foundation to understand more advanced machine learning theory such as deep learning, reinforcement learning and probabilistic graphical models.
- Machine Learning – Stanford University. Taught by Andrew Ng, this 10-week course is one of Coursera’s most popular courses and is rated 4.9 out of 5 from 39 267 student reviews. A commitment of 4-6 hours per week will be required.
Andrew Ng provides a comprehensive and beginner-friendly introduction to machine learning, data mining and pattern recognition and is based on several case studies and real-world applications. Supervised and unsupervised learning algorithms are explained and implemented from first principles, and machine learning best practices are discussed.
This course is a rite of passage for all aspirant data scientists and is a must-do. If you are on a more advanced level of machine learning understanding, look for the handouts of the CS229 Machine Learning course taught at Stanford (also by Andrew Ng) for further material. [Total time estimate: 80 hours]
- Machine Learning A-Z Hands-On Python & R In Data Science – Udemy. This is a highly rated, practical machine learning course on Udemy. It is rated 4.5 stars out of 5 based on 11 798 student reviews, and 86 456 students had signed up to the course at the time of writing. The videos total 41 hours and, as before, I will double this for my effort estimate.
The course is very hands on, and comprehensively covers topics such as data pre-processing, regression, classification, clustering, association rule learning, reinforcement learning, natural language processing, deep learning, dimensionality reduction and model selection. It can be completed in either R or Python. Again, I will look to pick this one up on a special for between $10 – $20. [Total time estimate: 80 hours]
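To give a flavour of what implementing an algorithm "from first principles" can look like, here is a minimal sketch of a straight-line fit trained with batch gradient descent in plain Python. The toy data, the function name `fit_line`, the learning rate and the step count are illustrative assumptions and are not taken from either course.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=5000):
    """Fit y = w * x + b by minimising mean squared error with gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors for the current parameters.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


if __name__ == "__main__":
    # Toy data generated from y = 2x + 1 with no noise (assumption for illustration).
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [1.0, 3.0, 5.0, 7.0, 9.0]
    slope, intercept = fit_line(xs, ys)
    print(f"slope ~ {slope:.3f}, intercept ~ {intercept:.3f}")
```

Libraries such as NumPy would make the same loop shorter and faster, but writing it out by hand once is a useful way to see what the library calls are doing.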
6. Our capstone project – let’s dive into the deep learning end
We have finally made it to what I regard as the curriculum’s capstone project – a practical course on deep learning:
- Practical Deep Learning For Coders, Part 1 – fast.ai. Out of all the courses that I have looked at, I am probably the most excited about this one. Fast.ai’s deep learning course is a very different MOOC to the rest in that the content is taught top down rather than bottom up. What this means is that you are taught how to use deep learning to solve a problem in week 1 but only taught why it works in week 2.
The course is run by Jeremy Howard who has won many Kaggle challenges and is an expert in this field. The problems solved and datasets used in this course come from previously run Kaggle challenges, which allows you to easily benchmark your solution against the best submitted entries.
A significant time commitment is required for this course – 10 hours a week for 7 weeks. The course teaches you some cool stuff such as how to set up a GPU server in the cloud using Amazon Web Services, and how to use the Python Keras library. As per the homepage for Keras, “Keras is a high-level neural networks API developed with a focus on enabling fast experimentation. It is written in Python and is capable of running on top of either TensorFlow, CNTK or Theano.”
As Jeremy Howard says, all you need to succeed in this course is pragmatic programming, tenacity, an open mind and high school math so good luck and well done on getting to this stage! [Total time estimate: 100 hours]
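For anyone wanting to check the overall timetable, the sketch below adds up the bracketed "Total time estimate" figures from each section and converts them to weeks at the 10 hours a week mentioned earlier. The hours are copied from the curriculum above; the dictionary layout and shortened course labels are just an illustrative way of organising them.

```python
# Hours per course, copied from the "Total time estimate" figures in the curriculum.
ESTIMATED_HOURS = {
    "Statistics with R Specialisation": 250,
    "Multivariable Calculus": 50,
    "Linear Algebra": 50,
    "Executive Data Science Specialization": 50,
    "How to use Git and GitHub": 30,
    "Introduction to Linux": 60,
    "Introduction to DevOps": 10,
    "Master Algorithmic Programming Techniques": 235,
    "Learning Python for Data Analysis and Visualisation": 50,
    "Data Visualization and D3.js": 50,
    "Machine Learning (Stanford)": 80,
    "Machine Learning A-Z": 80,
    "Practical Deep Learning For Coders, Part 1": 100,
}

HOURS_PER_WEEK = 10  # the weekly study budget stated earlier in the post

total_hours = sum(ESTIMATED_HOURS.values())
total_weeks = total_hours / HOURS_PER_WEEK
total_years = total_weeks / 52

if __name__ == "__main__":
    print(f"Total: {total_hours} hours, {total_weeks:.1f} weeks, {total_years:.2f} years")
```

The HackerRank challenges are left out because they have no time estimate attached.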
So, we have finally made it to the end – well done! I have reviewed countless courses in compiling this curriculum, and there were so many more that I wanted to add, including these more advanced topics:
- Deep Natural Language Processing – University of Oxford
- Mining Massive Datasets – Stanford University
- Machine Learning Engineer Nanodegree – Udacity
- Artificial Intelligence Engineer Nanodegree – Udacity
- Deep Learning Nanodegree – Udacity
- Self-Driving Car Engineer – Udacity
I have also not touched on related topics such as big data, data engineering, data management, data modelling or database theory of structured and unstructured sets of data. An understanding of these topics is nonetheless vital to understand the end-to-end spectrum that makes up the data analytics continuum. Nor have I chatted about the myriad data science tutorials and Kaggle-like data science challenges out there.
I intend to look at relevant tutorials and Kaggle problems where they relate to parts of this curriculum, and where possible I will try to implement some of these solutions on a big-data platform. While discussing this topic with one of my colleagues, he suggested also trying to build something big enough to encompass all of the above, so that I have an end-target in mind, don’t get bored and implement something that I am passionate about from the ground up. This is certainly something that I will also consider.
This challenge will start on 10 July 2017. According to my estimate, this curriculum will take 110 weeks or just over 2 years!! As daunting as this sounds, I take heart from Andrew Ng, the machine learning expert, when he said the following in an interview with Forbes magazine:
“In addition to work ethic, learning continuously and working very hard to keep on learning is essential. One of the challenges of learning is that it has almost no short-term rewards. You can spend all weekend studying, and then on Monday your boss does not know you worked so hard. Also, you are not that much better at your job because you only studied hard for one or two days. The secret to learning is to not do it only for a weekend, but week after week for a year, or week after week for a decade. The time scale is measured in months or years, not in weeks. I believe in building organizations that invest in every employee. If someone joins us, I can look them in the eye and say, ‘If you come work with me, I promise that in six months you will know a lot more and you will be much better at doing this type of work than you are today.’”
I hope that this quote resonates with you too and that the blog has helped or motivated you to improve your data science skills. Thank you for reading this and please keep me honest in terms of completing this challenge. Please post a comment if you think I should add to or change the curriculum in any way, and post your own course reviews — let me know if there are any other books and textbooks that I should consider. Expect updates soon!
by Nicholas Simigiannis