A curriculum for growing your data science skills (almost) for free

I hear and I forget. I see and I remember. I do and I understand – Confucius

With the plethora of free (or at least reasonably priced) high-quality massive open online courses (MOOCs), free online textbooks and tutorials, the tools available to aspirant data scientists are many and varied. From courses offered by Coursera to freely available eBooks and code examples to download from GitHub, there are many useful resources at our disposal.

Demand for data science skills remains consistently high. IBM predicts that appetite for data scientists will grow 28% by 2020. Job postings for data science skills in South Africa are rising rapidly as companies begin to realise the true value of their data initiatives.

According to IBM, the current most desirable and lucrative skills include machine learning, data science, Hadoop, Hive, Pig and MapReduce. It is interesting to note just how many data engineering type skills are in demand. I recently started to set up a data lab at the Foundery based on the Hortonworks distribution of Hadoop, and I can understand why this is true – (big) data engineering is unnecessarily complicated!

Over the last few years, I have completed (and sometimes part-completed) some data science MOOCs and tutorials. I have downloaded free eBooks and textbooks – some good and some not so good. These, along with the MOOCs, have become my primary source of knowledge and skills development in the data science domain. I am finding this form of online learning to be a very efficient and effective way to grow my knowledge and expertise. However, my choice of which courses to do has been haphazard at best, and having this much choice has also made it difficult to find the right courses to pursue, often leading me to abandon classes or not learn as well as I should.

The purpose of this blog, therefore, is twofold: to create a thoughtful and considered curriculum that I can follow to elevate my data science mastery and to share with you some of the resources that I have collated in researching this proposed curriculum. Whether you are a seasoned data science expert or an absolute beginner in the field, I believe there is value in some, if not all, of the topics in the curriculum.


The ultimate ambition of completing this proposed curriculum is to vastly (and more efficiently) improve my mathematics, statistics, algorithm development, programming and data visualisation skills – to go from a journeyman-level understanding of data science to full-on mastery of advanced data science concepts.

I want to DO so that I can better UNDERSTAND. Eventually, I’d like to understand and implement advanced machine learning and deep learning concepts (both from a theoretical and practical perspective) as well as obtain more in-depth expertise in big data technology. I also aim to improve my data visualisation skills so that I can have more impactful, interesting and valuable discussions with our business stakeholders and clients.

The day that I can have a debate with my maths colleagues about advanced mathematical concepts, compete with the computer scientists on HackerRank coding challenges, run my models on a big data platform that I have set up, create beautiful and insightful visualisations AND make this all understandable to my wife and daughter is the day when I know I have been successful in this endeavour.

I have proposed this curriculum based on the skills that are commonly acknowledged to be required for data science, as well as on course ratings, popularity, participant reviews and cost. I have tried to be as focussed as possible, and my thinking is that this is the most efficient plan to gain deep data science skills.

This curriculum will be based on open-source programming languages only, namely Python and R. My initial focus will be on improving my Python skills where possible, as I want to get them to a level where I can implement Python-based machine learning models in NumPy/SciPy. I do acknowledge, however, that for many of the stats and maths related courses, R is often preferred, and in that event, I will switch.
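As a concrete (and heavily simplified) illustration of what I mean by implementing models directly in NumPy, here is a minimal sketch of ordinary least-squares linear regression fitted via the normal equations. The data and variable names are entirely my own invention rather than anything prescribed by a particular course.

```python
import numpy as np

# Generate a small synthetic regression problem: y = 2*x1 - 3*x2 + 1 + noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 2 * X[:, 0] - 3 * X[:, 1] + 1 + rng.normal(scale=0.1, size=200)

# Add an intercept column and solve the least-squares problem
X_design = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("intercept and weights:", coeffs)  # should be close to [1, 2, -3]
```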

Given my work commitments and the fact that we have a new (and very loud) addition to our family, I think that I would likely only be able to devote 10 hours a week to this challenge. My proposed timetable will, therefore, be based on this estimate. The current estimate to fully complete the curriculum is at 110 weeks or just over 2 years! This is going to be a long journey…


I plan to update this blog periodically as and when I complete a course. My updates will include a more detailed summary of the course, an in-depth review and score, how much it cost me as well as tracking how long the course took to complete relative to the advised timeframe provided by the course facilitators. My time estimates will be slightly more conservative relative to the time estimates for each course as, in my experience, it always takes longer than suggested.

Thank you for reading this far. If you wish to join me in growing your data science skills (almost for free) and help keep me honest and accountable in completing this curriculum, then please do read on.

Data Science Curriculum

0. Supplementary resources and setup

Sticking to the blog’s theme of finding low-cost resources for this curriculum wherever possible, I have found a few high-quality free online maths and stats textbooks. These will serve as useful reference material for the bulk of the curriculum. They are:

  • Think Stats – a freely downloadable introductory book on Probability and Statistics for Python. Code examples and solutions are provided via the book’s Github repository.
  • An Introduction to Statistical Learning with Applications in R – another freely available book that is described as the “how to” manual for statistical learning. This textbook provides an introduction to machine learning and contains code examples written in R. There is online course material that accompanies this book, which can be found here as well as here. I will use this manual, and potentially the associated MOOCs, as a reference when I begin the machine learning component of this curriculum.
  • Introduction to Linear Algebra is the accompanying linear algebra reference book for the MIT OpenCourseWare MOOC. This book may also have to be purchased should the MOOC require it.
  • Although not free, Python Machine Learning by Sebastian Raschka has good reviews as a reference book for machine learning applications in Python. The book also has an accompanying code repository on GitHub.

1.  Start by focusing on maths and stats

The first section of the curriculum will allow us to concentrate on redeveloping fundamentals in mathematics and statistics as they relate to data science. University, over a decade ago now, was the last time I did any proper maths (yes, engineering maths is ‘proper mathematics’ to all you engineering-maths naysayers).

Regarding learning the mathematics and statistics required for data science and machine learning, I will focus on the following courses.

  • Statistics with R Specialisation – There were many courses available to improve my stats knowledge. Ultimately, I settled on this Coursera specialisation by Duke University as it seemed the most comprehensive and the textbook seems a good companion book. This specialisation comprises 5 courses – Introduction to Probability and Data, Inferential Statistics, Linear Regression and Modelling, Bayesian Statistics and a capstone project written in R. Each course will take 5 weeks and will require 5-7 hours of effort per week. I will use this set of courses to improve my R skills, and I will audit courses if possible or I may have to pay for the specialisation. [Total time estimate: 250 hours]
  • Multivariable Calculus – (Ohio State University) referencing the Multivariable calculus sections of the Khan Academy where required. This highly rated course (average rating of 4.8 out of 5 stars) will provide me with a refresher of calculus and will take approximately 25 hours to watch all the videos. I think I can safely add the same amount of time to go through all the tutorials and exercises putting the length of this study at 50 hours. [Total time estimate: 50 hours]
  • Linear Algebra – (MIT OpenCourseWare) referencing the Linear Algebra sections of the Khan Academy where required. I don’t know how long this should take to complete, so I will base my estimate on the previous course’s estimate of 50 hours. I chose this course because it is taught by MIT Professor Gilbert Strang, the author of the Introduction to Linear Algebra textbook mentioned above (see the short NumPy sketch after this list for the flavour of material covered). [Total time estimate: 50 hours]
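To make the linear algebra refresher a little more concrete, here is a minimal NumPy sketch (Python rather than R, purely for illustration) of two operations at the heart of Strang's course: solving a linear system and checking an eigendecomposition. The matrix values are arbitrary examples of my own, not exercises from the MOOC.

```python
import numpy as np

# Solve Ax = b for a small, arbitrary example matrix
A = np.array([[4.0, 1.0], [2.0, 3.0]])
b = np.array([1.0, 2.0])
x = np.linalg.solve(A, b)
print("solution x:", x, "residual:", A @ x - b)

# Eigendecomposition: check that A v = lambda v holds for each eigenpair
eigenvalues, eigenvectors = np.linalg.eig(A)
for lam, v in zip(eigenvalues, eigenvectors.T):
    print("eigenvalue", lam, "satisfies A v = lambda v:", np.allclose(A @ v, lam * v))
```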

2.  Time to improve my management of data science projects, experiments and teams

A large part of my work at my previous employer, and at my current job at the Foundery, is to manage various data science projects and teams. I have a lot of practical experience in this domain, but I don’t think it would hurt to go back and refresh some of the core concepts that relate to effective data science project management. To this end, I managed to find an appropriate Coursera specialisation that aims to help data science project managers “assemble the right team”, “ask the right questions”, and “avoid the mistakes that derail data science projects”.

  • Executive Data Science Specialization – Johns Hopkins University. The entire specialisation is only 5 weeks long and requires 4-6 hours a week of effort. The courses on offer are titled “A Crash Course in Data Science”, “Building a Data Science Team”, “Managing Data Analysis”, “Data Science in Real Life” and “Executive Data Science Capstone”. I wasn’t able to obtain rating information for this specialisation. [Total time estimate: 50 hours]

 3.  Improve my computer science and software engineering skills

When I first started out, I managed to pick up a few Unix skills (just enough to be dangerous as evidenced when I once took out a production server with an errant Unix command). Since then, and over time, I have lost the little that I knew (luckily for production support teams).

New and exciting software engineering paradigms such as DevOps have emerged, and code repository solutions like GitHub are now commonly used in both the data science and development industries. As such, I thought that some study in this domain would be useful in my journey.

I would also like to increase my knowledge of data structures and algorithms from both a practical and theoretical perspective. To this end, I have found an exciting and challenging University of California San Diego Coursera specialisation called “Master Algorithmic Programming Techniques”.

The courses that I am planning to complete to improve my computer science and software engineering skills are:

  • How to use Git and GitHub – a freely available course offered by Udacity with input from GitHub. This course is a 3-week MOOC, is rated 4.5 out of 5 stars from 41 student reviews and will require 6 hours of commitment per week. This course introduces Git and GitHub and will help me learn to use source control properly, which in turn will greatly assist with project delivery of medium to large sized data science projects. [Total time estimate: 30 hours]
  • Introduction to Linux – a freely available course from edX. This is an 8-week course rated 4 out of 5 stars from 118 student reviews, with over 250 000 students enrolled. Thoroughly covering this material will take between 40 and 60 hours, according to the course notes. Gaining a firm understanding of Linux will allow me more control when using open-source data science environments and tools. [Total time estimate: 60 hours]
  • Introduction to DevOps – Udacity. This free course introduces the concept of DevOps and explains how to implement continuous integration, continuous testing, continuous deployment and release management processes into your development workflow. I am very interested to see how this could be applied to the data science world. The course does not have a rating and is 3 weeks in length, requiring 2-3 hours per week of effort. [Total time estimate: 10 hours]
  • Master Algorithmic Programming Techniques – This Coursera specialisation by the University of California San Diego comprises 6 courses – Algorithmic Toolbox, Data Structures, Algorithms on Graphs, Algorithms on Strings, Advanced Algorithms and Complexity, and the Genome Assembly Programming Challenge. Each course is 4 weeks of study at 4-8 hours per week. The individual courses were rated between 3.5 and 4.5 stars.

What excited me about this specialisation is that I would get an opportunity to learn and implement over 100 algorithms in a programming language of my choice from the ground up. I think that this would certainly improve both my knowledge about algorithms as well as my programming skills.

After looking a bit deeper at the course structure, it seems as if this specialisation is paid for at $49 per month until you complete it. So, the faster I do this, the cheaper it’ll be – nice incentive! [Total time estimate: 235 hours]
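To give a flavour of the kind of from-scratch implementation the specialisation asks for, here is a small Python sketch of breadth-first search for shortest paths in an unweighted graph, one of the staple problems in the Algorithms on Graphs course. The graph itself is a made-up example of mine, not an exercise taken from the course.

```python
from collections import deque

def bfs_shortest_paths(adjacency, source):
    """Return the minimum number of edges from source to every reachable node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour not in distances:      # first visit is the shortest path
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances

# A small, made-up undirected graph stored as an adjacency list
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c", "e"], "e": ["d"]}
print(bfs_shortest_paths(graph, "a"))  # {'a': 0, 'b': 1, 'c': 1, 'd': 2, 'e': 3}
```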

4.  Improve my base data science skills and up my Python coding abilities

At this stage of the curriculum, I would have solidified my maths and stats skills, improved my computer science and software engineering skillset, and brushed up on some data science project management theory. Before embarking on intensive machine learning material, I think that it might be a good decision to get back to basics and look at improving my base data science and visualisation skills and upping my Python coding abilities while at it.

One of my goals for this curriculum was to improve my communication skills by becoming a real data story-teller. An effective way to do this is to learn how to visualise data in a more concise, meaningful and, I guess, beautiful manner. I say beautiful because of an amazing data visualisation website called Information is Beautiful. Check it out; you won’t regret it.

  • Learning Python for Data Analysis and Visualisation – Udemy. Jose Portilla’s Udemy course is highly rated at 4.6 stars out of 5 from over 4 220 student reviews. Over 47 812 students have enrolled in the course. The length of the videos on this course is 21 hours, so until I can estimate this better, I will add 100% to my time estimate for completing the course. The course is focussed on Python and introduces topics such as NumPy, Pandas, manipulating data, data visualisation, machine learning, basic stats, SQL and web scraping (see the short Pandas sketch after this list for a taste of this material). Udemy often runs specials on its courses, so I expect to pick this one up for between $10 and $20. [Total time estimate: 50 hours]
  • Data Visualization and D3.js – Communicating with Data – Udacity. This free course is part of Udacity’s Data Analyst nanodegree programme. The course provides a background in visualisation fundamentals and data visualisation design principles, and will teach you D3.js. It is an intermediate-level course that will take approximately 7 weeks to complete at 4-6 hours per week. [Total time estimate: 50 hours]
  • HackerRank challenges – HackerRank is a website that provides a very entertaining, gamified way to learn how to code. HackerRank offers daily, weekly and monthly coding challenges that reward you for solving a problem. The difficulty of the questions ranges from “Easy” to “Hard”, and I plan to use this to test my new-and-improved Python skills. Every now and then I will use this form of learning Python as a “break” from the academic slog. [Total time estimate: n/a]
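As a taste of the NumPy/Pandas material covered in the Udemy course above, here is a minimal, self-contained sketch of loading data into a DataFrame, aggregating it and plotting the result. The column names and values are invented purely for the example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# A tiny, invented dataset of study hours per week, split by topic
data = pd.DataFrame({
    "week": [1, 1, 2, 2, 3, 3],
    "topic": ["stats", "python", "stats", "python", "stats", "python"],
    "hours": [4, 6, 5, 5, 3, 7],
})

# Aggregate hours per topic and visualise the result as a bar chart
summary = data.groupby("topic")["hours"].sum()
print(summary)

summary.plot(kind="bar", title="Study hours by topic")
plt.tight_layout()
plt.show()
```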

5.  Learn the basics of machine learning from both a practical and theoretical perspective

The resurgence of machine learning (the science of “teaching” computers to act without being explicitly programmed) is one of the key factors in the popularity of data science and drives many of the biggest companies today, including the likes of Google, Facebook and Amazon. Machine learning is used in many recent innovations, including self-driving cars, natural language processing and advances in medical diagnosis, to name a few. It is a fascinating field, and as such, I want to gain a solid foundational understanding of this topic. It will also lay the foundation to understand more advanced machine learning theory such as deep learning, reinforcement learning and probabilistic graphical models.

Machine Learning – Stanford University. Taught by Andrew Ng, this 10-week course is one of Coursera’s most popular courses and is rated 4.9 out of 5 from 39 267 student reviews. A commitment of 4-6 hours per week will be required.

Andrew Ng provides a comprehensive and beginner-friendly introduction to machine learning, data mining and pattern recognition, and the course is based on several case studies and real-world applications. Supervised and unsupervised learning algorithms are explained and implemented from first principles, and machine learning best practices are discussed.

This course is a rite of passage for all aspirant data scientists and is a must-do. If you are on a more advanced level of machine learning understanding, look for the handouts of the CS229 Machine Learning course taught at Stanford (also by Andrew Ng) for further material. [Total time estimate: 80 hours]
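As an example of what “implemented from first principles” looks like in practice, here is a short sketch of batch gradient descent for univariate linear regression, one of the first algorithms covered in the course. I have written it in Python/NumPy for consistency with the rest of this curriculum, even though the course's own programming exercises use Octave/MATLAB, and the data below is synthetic.

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.1, steps=1000):
    """Fit y ~ w*x + b by minimising mean squared error with batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        error = (w * x + b) - y
        w -= learning_rate * (2 / n) * np.dot(error, x)   # d(MSE)/dw
        b -= learning_rate * (2 / n) * error.sum()        # d(MSE)/db
    return w, b

# Synthetic data drawn from y = 3x - 2 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3 * x - 2 + rng.normal(scale=0.05, size=100)
print(gradient_descent(x, y))  # should be close to (3.0, -2.0)
```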

Machine Learning A-Z: Hands-On Python & R In Data Science – Udemy. This is a highly rated, practical machine learning course on Udemy. It is rated 4.5 stars out of 5 based on 11 798 student reviews, and 86 456 students had signed up to the course at the time of writing. The videos total 41 hours, and as before I will double this for my effort estimate.

The course is very hands-on and comprehensively covers topics such as data pre-processing, regression, classification, clustering, association rule learning, reinforcement learning, natural language processing, deep learning, dimensionality reduction and model selection. It can be completed in either R or Python. Again, I will look to pick this one up on a special for between $10 and $20. [Total time estimate: 80 hours]
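Because the A-Z course is pitched at the practical end, most of its Python examples lean on libraries such as scikit-learn rather than hand-rolled algorithms. A minimal sketch of that style of workflow (pre-processing, fitting a classifier and checking it on held-out data) might look like the following; the dataset and parameter choices are mine and purely illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Load a toy dataset and hold out a test set for honest evaluation
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Pre-processing and classification chained into a single pipeline
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

print("held-out accuracy:", model.score(X_test, y_test))
```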

6.  Our capstone project – let’s dive into the deep learning end

We have finally made it to what I regard as the curriculum’s capstone project – a practical course on deep learning:

  • Practical Deep Learning For Coders, Part 1 – fast.ai. Out of all the courses that I have looked at, I am probably the most excited about this one. Fast.ai’s deep learning course is a very different MOOC to the rest in that the content is taught top down rather than bottom up. What this means is that you are taught how to use deep learning to solve a problem in week 1 but only taught why it works in week 2.

The course is run by Jeremy Howard, who has won many Kaggle challenges and is an expert in this field. The problems solved and the datasets used in this course come from previously run Kaggle challenges, which allows you to easily benchmark your solutions against the best submitted entries.

A significant time commitment is required for this course – 10 hours a week for 7 weeks. The course teaches you some cool stuff, such as how to set up a GPU server in the cloud using Amazon Web Services and how to use the Python Keras library. As per the Keras homepage, “Keras is a high-level neural networks API developed with a focus on enabling fast experimentation. It is written in Python and is capable of running on top of either TensorFlow, CNTK or Theano.”
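To give an idea of why Keras makes the course's top-down approach workable, here is a minimal sketch of defining and training a small fully connected network on random placeholder data. The layer sizes and the data are invented for illustration; the fast.ai lessons themselves work through real Kaggle datasets with far larger models.

```python
import numpy as np
from tensorflow import keras

# Random placeholder data: 1 000 samples, 20 features, binary labels
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 20)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

# A tiny fully connected network defined with the Keras Sequential API
model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

print("training accuracy:", model.evaluate(X, y, verbose=0)[1])
```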

As Jeremy Howard says, all you need to succeed in this course is pragmatic programming, tenacity, an open mind and high school math – so good luck, and well done on getting to this stage! [Total time estimate: 100 hours]


7.  Conclusion

So, we have finally made it to the end – well done! I have reviewed countless courses in compiling this curriculum, and there were so many more that I wanted to add, including several more advanced topics.

I have also not touched on related topics such as big data, data engineering, data management, data modelling or database theory for structured and unstructured data. An understanding of these topics is nonetheless vital to understanding the end-to-end spectrum that makes up the data analytics continuum. Nor have I chatted about the myriad data science tutorials and Kaggle-like data science challenges out there.

I intend to look at relevant tutorials and Kaggle problems where they relate to parts of this curriculum, and where possible I will try to implement some of these solutions on a big-data platform. While discussing this topic with one of my colleagues, he suggested building something ambitious enough to encompass all of the above, so that I have an end-target in mind, don’t get bored, and get to implement something that I am passionate about from the ground up. This is certainly something that I will also consider.

This challenge will start on 10 July 2017. According to my estimate, this curriculum will take 110 weeks or just over 2 years!! As daunting as this sounds, I take heart from Andrew Ng, the machine learning expert, when he said the following in an interview with Forbes magazine:

In addition to work ethic, learning continuously and working very hard to keep on learning is essential. One of the challenges of learning is that it has almost no short-term rewards. You can spend all weekend studying, and then on Monday your boss does not know you worked so hard. Also, you are not that much better at your job because you only studied hard for one or two days. The secret to learning is to not do it only for a weekend, but week after week for a year, or week after week for a decade. The time scale is measured in months or years, not in weeks. I believe in building organizations that invest in every employee. If someone joins us, I can look them in the eye and say, “If you come work with me, I promise that in six months you will know a lot more and you will be much better at doing this type of work than you are today.”

I hope that this quote resonates with you too and that the blog has helped or motivated you to improve your data science skills. Thank you for reading this and please keep me honest in terms of completing this challenge. Please post a comment if you think I should add to or change the curriculum in any way, and post your own course reviews — let me know if there are any other books and textbooks that I should consider. Expect updates soon!

by Nicholas Simigiannis
Three Spheres: Science, Design and Engineering

In the world of finance, the Foundery stands out as a pioneering challenger to the traditional financial institution – think suits, three-letter acronyms and legacy software housed in massive, skyline-dominating buildings. Although the Foundery isn’t alone in this endeavour, the digital financial organisation is still in its earliest days and there are many unanswered questions and unsolved challenges that lie ahead. This is the nature of the challenge that the Foundery has accepted: there will be no obvious answers or solutions.


The key to success, however, is to recognise that with uncertainty comes opportunity – the opportunity to break new technological ground and seek new digital pathways that will one day reshape the world of finance.

This blogpost, however, isn’t about those challenges. Rather, it is about the pioneering spirit, embodied by three overlapping spheres of innovation: science, design and engineering.

Science

We understand science as both the body of knowledge and the process by which we try to understand the world. Science is humanity’s attempt to organise the entire universe into testable theories from which we can make predictions about the world.

Here the universe is taken to include the natural world – such as physics and biology – the social world – such as economics and linguistics – and the abstract world, such as mathematics and computer science [link].

If the goal of science is to formulate testable theories from which we can make predictions, how does it relate to the Foundery’s challenge of transforming the world of banking?

Science is the sphere that embodies the process of discovery. It is curiosity coupled with the discipline to establish truths and meaning in the world in which we live – including the world of digital disruption which the Foundery inhabits.

The pioneering spirit requires not only the curiosity to break new ground, but also a special kind of scientific curiosity to turn this new ground into groundbreaking discoveries.

Design

Design is the conceptual configuration of an idea, process or object. It is understood as the formulation of both the aesthetic and functional specifications of the object, idea or process.

To put it more simply in the words of the late Steve Jobs, arguably one of the most significant pioneers of the 21st century:

“Design is not just what it looks and feels like. Design is how it works.”

Whereas science is concerned with trying to understand the world that humanity occupies, design is concerned with the things – objects, ideas and processes – which humanity adds to the world, and how they look and how they work.

At the Foundery, the pioneering spirit is more than just breaking new ground: it is the creation of accessible pathways, including new solutions and disruptive technologies. Design is the process of creating new solutions – not just planning and configuring what these solutions are, but experimenting with how they look and work.

Thus design is the sphere which embodies experimentation. It is the courage to try something new, unencumbered by the fear of failure. It is the willpower to try over and over again until something great can be achieved.

Engineering

Engineering is the application of science to solve problems in the real world. At one level, engineering is the intersection of science and design – combining scientific knowledge with principles from design – but taken as a whole engineering is more than that: it encompasses the design, control and scaling of constructive and systematic solutions to real-world problems.

In the past, engineering was typically associated with physical systems such as chemical processes and mechanical engines. In today’s technological age, we also associate engineering with abstract information systems and computer programmes.

Now financial institutions can be viewed as massive, highly complex and highly specialised information systems. So from this perspective, one part of the Foundery’s task is to engineer the processes, interfaces and information networks of the bank of the future.

Engineering is the sphere which embodies problem solving. It is one thing to break new ground and make new discoveries and experiment with new solutions, but something else entirely to translate the pioneering spirit into technologies and systems with the potential to change the world.

Bringing the Spheres Together

On their own, science, design and engineering represent different aspects of the creation process: science is the process of discovery, design is the process of experimentation and refinement, and engineering is the process of problem solving. But this view alone suggests that there is a linear order to the creation process: that each process must take place in phases.

This isn’t my view and certainly isn’t the aim of this blogpost. Rather, my interpretation of science, design and engineering is that they are abstract, multi-dimensional spheres which embody the creative process. They are self-contained concepts which exist in their own right, but with clear points of intersection which link science, design and engineering. Together they are a whole which is greater than the sum of its parts.

Whether it is the blockchain exchange, the novel application of machine learning to existing financial services or even our partnership-based organisational structure, science, design and engineering are very much at the Foundery’s core. These three spheres embody the pioneering spirit which drives our purpose: from the curiosity to explore more, to the courage to try more and the resolve to do more.

by Jonathan Sinai

The Dimensions Of An Effective Data Science Team


The Need for Data Science

Organisations worldwide are increasingly looking to data science teams to provide business insight, understand customer behaviour and drive new product development. The broad field of Artificial Intelligence (AI) including Machine Learning (ML) and Deep Learning (DL) is exploding both in terms of academic research and business implementation. Some of the world’s biggest companies including Google, Facebook, Uber, Airbnb, and Goldman Sachs derive much of their value from data science effectiveness. These companies use data in very creative ways and are able to generate massive amounts of competitive advantage and business insight through the effective use of data.

Have you ever wondered how Google Maps predicts traffic? How does Facebook know your preferences so accurately? Why would Google give a platform as powerful as Gmail away for free? Having data and a great idea is a start – but the likes of Facebook and Google have figured out that a key step in the creation of amazing data products (and the resultant business value generation) is the formation of highly effective, aligned and organisationally-supported data science teams.

Effective Data Science Teams

How exactly have these leading data companies of the world established effective data science teams? What skills are required and what technologies have they employed? What processes do they have in place to enable effective data science? What cultures, behaviours and habits have been embraced by their staff and how have they set up their data science teams for success? The focus of this blog is to better understand at a high level what makes up an effective data science team and to discuss some practical steps to consider. This blog also poses several open-ended questions worth thinking about. Later blogs in this series will go into more detail in each of the dimensions discussed below.

Drew Harry, Director of Science at Twitch, wrote an excellent article titled “Highly Effective Data Science Teams”. He states that “Great data science work is built on a hierarchy of basic needs: powerful data infrastructure that is well maintained, protection from ad-hoc distractions, high-quality data, strong team research processes, and access to open-minded decision-makers with high leverage problems to solve” [1].

In my opinion, this definition accurately describes the various dimensions that are necessary for data science teams to be effective. As such, I would like to attempt to decompose this quote further and try to understand it in more detail.

Drew Harry’s Hierarchy of Basic Data Science Needs

Great data science requires powerful data infrastructure

A common pitfall of data science teams is that they are sometimes forced, either through a lack of resources or through a lack of understanding of the role of data scientists, to do time-intensive data wrangling activities (sourcing, cleaning and preparing data). Additionally, data scientists are often asked to complete ad-hoc requests and build business intelligence reports. These tasks should ideally be removed from the responsibilities of a data science team to allow them to focus on their core capabilities: utilising their mathematical and statistical abilities to solve challenging business problems and find interesting patterns in data, rather than expending their efforts on housekeeping work. To do this, data scientists should ideally be supported by a dedicated team of data engineers. Data engineers typically build robust data infrastructures and architectures, and implement tools to assist with data acquisition, data modelling, ETL, data architecture and so on.


An example of this is at Facebook, a world leader in data engineering. Just imagine for a second the technical challenges inherent in providing over one billion people a personalised homepage full of various posts, photos and videos on a near-real time basis. To do this, Facebook runs one of the world’s largest data warehouses storing over 300 petabytes of data [2] and employs a range of powerful and sophisticated data processing techniques and tools [3]. This data engineering capability enables thousands of Facebook employees to effectively use their data to focus on value enhancing activities for the company without worrying about the nuts and bolts of how the data got there.

I realise that we are not all blessed with the resources and data talent inherent in Silicon Valley firms such as Facebook. Our data landscapes are often siloed, and our IT support teams, where data engineers traditionally reside, mainly focus on keeping the lights on and putting out fires. But this model has to change – set up your data science teams to have the best chance of success. Co-opt a data engineer onto the data science team. If this is not possible due to resource constraints, then at least provide your data scientists with the tools to easily create ETL code and rapidly spin up bespoke data warehouses, enabling them to run experiments quickly. Whatever you do, don’t let them be bogged down in operational data sludge.
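For what it is worth, even a lightweight extract-transform-load step – the kind of thing a supporting data engineer (or a well-tooled data scientist) would automate – can be sketched in a few lines of Python. The file name, column names and SQLite target below are all invented for illustration.

```python
import sqlite3
import pandas as pd

# Extract: read a raw export (the file and its columns are hypothetical examples)
raw = pd.read_csv("transactions_raw.csv", parse_dates=["transaction_date"])

# Transform: basic cleaning plus a derived monthly aggregate per customer
raw = raw.dropna(subset=["customer_id", "amount"])
monthly = (
    raw.assign(month=raw["transaction_date"].dt.to_period("M").astype(str))
       .groupby(["customer_id", "month"], as_index=False)["amount"].sum()
)

# Load: write the cleaned aggregate into a small analytics sandbox database
with sqlite3.connect("sandbox.db") as conn:
    monthly.to_sql("monthly_spend", conn, if_exists="replace", index=False)
```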

Great data science requires easily accessible, high-quality data


Data should be trusted and of a high quality. Additionally, there should be enough data available to allow data scientists to execute experiments. Data should be easily accessible, and the team should have processing power capable of running complex code in reasonable time frames. Data scientists should, within legal boundaries, have easy, autonomous access to data. Data science teams should not be precluded from using data on production systems; mechanisms need to be put in place to allow for this, rather than access being refused simply because “hey – this is production – don’t you dare touch!”

In order to support their army of business users and data scientists, eBay, one of the world’s largest auction and shopping sites, has successfully implemented a data analytics sandbox environment separate from the company’s production systems. eBay allows employees that want to analyse and explore data to create large virtual data marts inside their data warehouse. These sandboxes are walled off areas that offer a safe environment for data scientists to experiment with both internal data from the organisation as well as providing them with the ability to ingest other types of external data sources.

I would encourage you to explore the creation of such environments in your own organisations in order to provide your data science teams with easily accessible, high quality data that does not threaten production systems. It must be noted that to support this kind of environment, your data architecture must allow for the integration of all of the organisation’s (and other external) data – both structured and unstructured. As an example, eBay has an integrated data architecture that comprises an enterprise data warehouse that stores transactional data, a separate Teradata deep-storage database for semi-structured data, as well as a Hadoop implementation for unstructured data [4]. Other organisations are creating “data lakes” that allow raw, structured and unstructured data to be stored in vast, low-cost data stores. The point is that the creation of such integrated data environments goes hand in hand with providing your data science team with analytics sandbox environments. As an aside, all the efforts going into your data management and data compliance projects will also greatly assist in this regard.

Great data science requires access to open-minded decision-makers with high leverage problems to solve


DJ Patil stated that “A data-driven organisation acquires, processes, and leverages data in a timely fashion to create efficiencies, iterate on and develop new products, and navigate the competitive landscape” [5]. This culture of being data-driven needs to be driven from the top down. As an example, Airbnb promotes a data-driven culture and uses data as a vital input in its decision-making process [6]. They use analytics in their everyday operations, conduct experiments to test various hypotheses, and build statistical models to generate business insights, to great success.

Data science initiatives should always be supported by top-level organisational decision-makers. These leaders must be able to articulate the value that data science has brought to their business [1]. Wherever possible, co-create analytics solutions with your key business stakeholders.  Make them your product owners and provide feedback on insights to them on a regular basis. This will help keep the business context front of mind and allows them to experience the power and value of data science directly. Organisational decision-makers will also have the deepest understanding of company strategy and performance and can thus direct data science efforts to problems with the highest business impact.

Great data science requires strong team research processes

Data science teams should have strong operational research capabilities and robust internal processes. This will enable the team to execute controlled experiments with high levels of confidence in their results. Effective internal processes help promote a culture of failing fast, learning quickly and feeding valuable lessons back into the business experiment/data science loop. Google and Facebook have mastered this in their ability to, amongst other things, aggregate vast quantities of anonymised data, conduct rapid experiments and share these insights internally with their partners, generating substantial revenues in the process.

Think of this as employing robust software engineering principles to your data science practice. Ensure that your documentation is up to date and of a high standard. Ensure that there is a process for code review, and that you are able to correctly interpret the results that you are seeing in the data. Test the impact of this analysis with your key stakeholders. As Drew Harry states, “controlled experimentation is the most critical tool in data science’s arsenal and a team that doesn’t make regular use of it is doing something wrong” [1].
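As a small illustration of what “controlled experimentation” can mean in code, here is a sketch of comparing a metric between a control group and a treatment group with a two-sample t-test. The data is simulated, and the 5% significance threshold is simply the conventional choice rather than a recommendation from Harry's article.

```python
import numpy as np
from scipy import stats

# Simulated metric values for a control group and a treatment group
rng = np.random.default_rng(7)
control = rng.normal(loc=10.0, scale=2.0, size=500)
treatment = rng.normal(loc=10.3, scale=2.0, size=500)

# Welch's two-sample t-test: does the treatment shift the mean?
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

if p_value < 0.05:
    print("Difference is statistically significant at the 5% level.")
else:
    print("No significant difference detected; keep experimenting.")
```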

In Closing

This blog is based on a decomposition of Drew Harry’s definition of what enables great data science teams. It provides a few examples of companies doing this well and some practical steps and open-ended questions to consider.

To summarise: A well-balanced and effective data science team requires a data engineering team to support them from a data infrastructure and architecture perspective. They require large amounts of data that is accurate and trusted. They require data to be easily accessible and need some level of autonomy in accessing data. Top level decision makers need to buy into the value of data science and have an open mind when analysing the results of data science experiments. These leaders also need to be promoting a data-driven culture and provide the data science team with challenging and valuable business problems. Data science teams also need to keep their house clean and have adequate internal processes to execute accurate and effective experiments which will allow them to fail and learn quickly and ultimately become trusted business advisors.

Some Final Questions Worth Considering and Next Steps

In writing this, some intriguing questions come to mind. Surely there is an African context to consider here? What are we doing well on the African continent, and how can we start becoming exporters of effective data science practices and talent? Other questions that come to mind include: To what extent does all of the above need to be in place at once? What is the right mix of data scientists, engineers and analysts? What is the optimal mix of permanent, contractor and crowd-sourced resources (e.g. Kaggle-like initiatives [7])? Academia, consultancies and research houses are beating the drum of how important it is to be data-driven, but to what extent is this always necessary? Are there some problems that shouldn’t be using data as an input? Should we be purchasing external data to augment the internal data that we have, and if so, what data should we be purchasing? One of our competitors recently launched an advertising campaign explicitly stating that their customers are “more than just data”, so does this imply that some sort of “data fatigue” is setting in for our clients?

My next blog will explore, in more detail, the ideal skillsets required in a data engineering team and how data engineering can be practically implemented in an organisation’s data science strategy. I will also attempt to tackle some of the pertinent open-ended questions mentioned above.

The dimensions discussed in this blog are by no means exhaustive, and there are certainly more questions than answers at this stage. I would love to see your comments on how you may have seen data science being implemented effectively in your organisations or some vexing questions that you would like to discuss.

References

[1] https://medium.com/mit-media-lab/highly-effective-data-science-teams-e90bb13bb709

[2] https://blog.keen.io/architecture-of-giants-data-stacks-at-facebook-netflix-airbnb-and-pinterest-9b7cd881af54

[3] https://www.wired.com/2013/02/facebook-data-team/

[4] http://searchbusinessanalytics.techtarget.com/feature/Data-sandboxes-help-analysts-dig-deep-into-corporate-info

[5] https://books.google.co.za/books?id=wZHe0t4ZgWoC&printsec=frontcover#v=onepage&q&f=false

[6] https://medium.com/airbnb-engineering/data-infrastructure-at-airbnb-8adfb34f169c?s=keen-io

[7] https://www.kaggle.com/

by Nicholas Simigiannis