This week’s Ask-A-Data-Scientist column answers two short questions from students. Please email your data science questions to [email protected]. Note that questions are edited for clarity and brevity. Previous posts include:
- Does Machine Learning as a Service (MLaaS) work? Do you need a PhD?
- How to change careers and become a data scientist
- How to structure your data science and engineering teams
- Advice to a student interested in deep learning
Q1: I have a BS and MS in aerospace engineering and have been accepted to a data science bootcamp for this summer. I have been spending 15 hours/week on MIT’s 6.041 edx.org probability course, which is the hardest math course I’ve ever taken. I feel like my time could be better spent elsewhere. What about teaching myself the concepts as needed on the job? Or maybe you could recommend certain areas of probability to focus on? I’d like to tackle a personal project (either dealing with fitness tracker data or bitcoin) and maybe put probability on the backburner for a bit.
A: It sounds like you already know the answer to this one: yes! Your time could be better spent elsewhere.
Let your coding projects motivate what you do, and learn math on an as-needed basis. There are 3 reasons this is a good approach:
- For most people, the problems you’re working on are the best motivation for learning.
- The real test of whether you understand something is whether you can use it and build with it. So the projects you’re working on are needed to cement your understanding.
- By learning on an as-needed basis, you study what you actually need, and don’t waste time on topics that may end up being irrelevant.
The only exceptions: if you want to be a math professor or work at a think tank. (For most of my math PhD, my goal was to become a math professor, so I see the appeal, but I was also totally unaware at the time of the breadth of awesome and exciting jobs that use math.) And sometimes you need to brush up on math for whiteboard interviews.
Q2: I am currently pursuing a Master’s degree in Data Science. I am not that advanced in programming and am new to most of the concepts of machine learning and statistics. Data science is such a vast field that most of my friends advise me to concentrate on a specific branch. Right now I am trying everything and becoming a jack of all trades and ace of none. How can I approach this to find a specialty?
A: There is nothing wrong with being a jack of all trades in data science; in some ways, that is what it means to be a data scientist. As long as you are spending the vast majority of your time writing code for practical projects, you are on the right track.
My top priorities of things to focus on for aspiring data scientists:
- Focus on Python (including NumPy, pandas, and Jupyter notebooks).
- Try to focus on 1 main project. Extend something that you did in class. It can be difficult if you are mostly doing scattered problem sets in a variety of areas. For self-learners, one of the risks is jumping around too much, starting scattered tutorials across a range of sites but never going deep enough with any one thing. Pick 1 Kaggle competition, personal project, or extension of a school project and stick with it. I can think of a few times I continued extending a class project for months after the class ended, because I was so absorbed in it. This is a great way to learn.
- Start with decision tree ensembles (random forests and gradient boosting machines) on structured data sets. I have deeply conflicted feelings on this topic. While it’s possible to do these in Python using sklearn, I think R still handles structured datasets and categorical variables better. However, if you are only going to master one language, I think Python is the clear choice, and most people can’t focus on learning 2 new languages at the same time.
- Then move on to deep learning using the Python library Keras. To quote Andrew Ng, deep learning is “the new electricity” and a very exciting, high impact area to be working in.
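To make the starting point above concrete, here is a minimal sketch of a random forest on a small structured dataset with a categorical column, assuming pandas and scikit-learn are installed. The dataset and column names (a hypothetical "churn" problem) are invented purely for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Build a small synthetic structured dataset (illustrative only)
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "plan": rng.choice(["basic", "pro", "enterprise"], n),  # categorical column
    "monthly_usage": rng.normal(50, 15, n),
})
# Make the target depend on the features so there is something to learn
df["churned"] = ((df["monthly_usage"] < 45) & (df["plan"] == "basic")).astype(int)

# In Python/sklearn, categorical variables need explicit encoding
X = pd.get_dummies(df.drop(columns="churned"), columns=["plan"])
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

The explicit `get_dummies` step is an example of the extra handling of categorical variables mentioned above; R's data frames and formula interface do much of this automatically.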
In terms of tips, there are a few things you can skip since they aren’t widely used in practice, such as support vector machines/kernel methods, Bayesian methods, and theoretical math (unless it’s explicitly necessary for a practical project you are working on).
Note that this answer is geared towards data scientists and not data engineers. Data engineers put algorithms into production and have a different set of skills, such as Spark and HDFS.