This week’s Ask-A-Data-Scientist column is from a medical doctor. Email your data science advice questions to mailto:[email protected]. Previous posts include:
- How to change careers and become a data scientist
- Advice to parents to encourage your child in STEM
- How to structure your data science and engineering teams
- Advice to a student interested in deep learning
- Alternatives to a degree to prove yourself
- Does Machine Learning as a Service (MLaaS) work? Do you need a PhD?
Q: I’m a medical doctor (MD-PhD). I do a mix of clinical work and basic science research. My research primarily involves small scale animal studies for hypothesis testing, although others in my lab do some statistical clinical studies, such as paired cohort analysis. I’m interested in AI and wondering if and how it can be applied to my field?
A: AI is being applied to several fields of medicine, including:
Diabetic retinopathy is the fastest growing cause of blindness. The first step in the screening process is for ophthalmologists to examine a picture of the back of the eye, yet in many parts of the world there are not enough specialists available to do so. Researchers at Google and Stanford have used deep learning to create computer models that are as accurate as human ophthalmologists. This technology could help doctors screen a larger number of patients faster, helping to alleviate the global shortage of doctors.
In 2012, Merck sponsored a drug discovery competition where participants were given a dataset describing the chemical structure of thousands of molecules and asked to predict which were most likely to make for effective drugs. Remarkably, the winning team had only decided to enter the competition at the last minute and had no specific knowledge of biochemistry. They used deep learning.
In a 2016 New York Times article, it was shared that medical start-up Enlitic, founded by fast.ai’s Jeremy Howard, was 50 percent more accurate than human radiologists in making lung cancer diagnoses.
Jeremy Howard, photographed by Jason Henry for the New York Times
Fast.ai remote fellow Xinxin Li is working with Ikaishe and Xeed to develop wearable devices for patients with Parkinson’s Disease. Traditionally, doctors observe the patient walking to assess disease progression, and wearable devices will allow for much more data and more precise data to be collected.
Deep learning can classify skin cancer with dermatologist-level accuracy, as published in Nature earlier this year.
Cardiogram is an app for Apple Watch that screens users’ cardio health and is able to detect atrial fibrillation, a common form of heart irregularity, with 97 percent accuracy.
Does this mean I need “big data”? No.
Currently, when news articles talk about “AI”, they are often referring to deep learning, one particular family of algorithms.
Although the above examples involve relatively large datasets, deep learning is being effectively applied to smaller and smaller datasets all the time. Here are a few examples that I listed in a previous blog post: Francois Chollet, creator of the popular deep learning library Keras and now at Google Brain, has an excellent tutorial entitled Building powerful image classification models using very little data in which he trains an image classifier on only 2,000 training examples. At Enlitic, Jeremy Howard led a team that used just 1,000 examples of lung CT scans with cancer to build an algorithm that was more accurate at diagnosing lung cancer than a panel of 4 expert radiologists. The C++ library Dlib has an example in which a face detector is accurately trained using only 4 images, containing just 18 faces!
Fast.ai student Ben Bowles wrote a post on how data platform Quid uses deep learning with small data to improve the quality of one of their data sets.
Transfer learning is a powerful technique in which models trained on larger data sets (by teams with more computational resources) can be fine-tuned to fit other problems, with less data and less computation. For instance, models originally trained on ImageNet (images in a set of 1,000 categories) are a good starting point for other computer vision problems (such as analyzing a CT scan for lung cancer). Transfer learning is a major focus of our Practical Deep Learning for Coders course.
How does machine learning compare to hypothesis testing or paired cohort analysis?
Machine learning shines at handling messy data. Techniques such as controlled studies or paired cohort analysis rely on carefully controlling for different variables in your experiment set-up (or in finding the pairings), whereas machine learning is an excellent choice when this isn’t possible.
Random Forests
Deep learning is just one kind of machine learning. Another machine learning algorithm is the random forest, which is great for observational studies.
The original random forest paper successfully tested the algorithm on a number of small data sets, including 569 images of a finite aspirate of a breast cancer mass, data from 345 men with liver disorders, and 336 ecoli samples.
Getting started
If you are at a university, seek out collaborators or interns. Note, if you have a student who knows how to code, they can learn deep learning. If you are looking for a collaborator, you do not need to find a deep learning expert. All you need is someone with a year or two of coding experience who is interested in your project and wants to learn deep learning. It’s even better if they are familiar with your research (for instance, perhaps a student who may already be working in your lab and knows how to code).
I recommend that you learn to code. Even if it’s not your area of focus and you will be collaborating with programmers, knowing some code will help you better understand what’s possible and have a better sense of what the programmers you’re collaborating with are doing.
For doctors, I think the best way to start coding is by learning R: R has the easiest to use implementation of random forests (which is a great general purpose machine learning algorithm) and R is commonly used by statisticians, so you will most likely meet biostatisticians who use it. Rstudio is a relatively user-friendly, free environment in which to use R (although it still requires writing code). This free coursera class, taught by biostatisticians from Johns Hopkins University, is one good way to get started. Folks that know me may be surprised by this recommendation: in general, I recommend that people interested in becoming data scientists learn Python; I recommend that teenagers or anyone who likes art or games learn JavaScript; and now I’m recommending that doctors learn R. You will need to learn Python if you start learning deep learning, but random forests in R are a great place to get started with machine learning (and random forests still produce top quality results in many areas– they are not just for beginners!)