7 Steps To Mastering Machine Learning With Python
Where to begin?
This post aims to take a newcomer from minimal knowledge of machine learning in
Python all the way to knowledgeable practitioner in 7 steps, all while using freely
available materials and resources along the way. The prime objective of this outline
is to help you wade through the numerous free options that are available; there are
many, to be sure, but which are the best? Which complement one another? What is
the best order in which to use selected resources?
Moving forward, I make the assumption that you are not an expert in:
Machine learning
Python
Any of Python's machine learning, scientific computing, or data analysis
libraries
It would probably be helpful to have some basic understanding of one or both of
the first 2 topics, but even that won't be necessary; some extra time spent on the
earlier steps should help compensate.
KDnuggets' own Zachary Lipton has pointed out that there is a lot of variation in
what people consider a "data scientist." This actually is a reflection of the field of
machine learning, since much of what data scientists do involves using machine
learning algorithms to varying degrees. Is it necessary to intimately
understand kernel methods in order to efficiently create and gain insight from a
support vector machine model? Of course not. Like almost anything in life, required
depth of theoretical understanding is relative to practical application. Gaining an
intimate understanding of machine learning algorithms is beyond the scope of this
article, and generally requires a substantial investment of time in a more
academic setting, or via intense self-study at the very least.
The good news is that you don't need to possess a PhD-level understanding of the
theoretical aspects of machine learning in order to practice, in the same manner
that not all programmers require a theoretical computer science education in order
to be effective coders.
Andrew Ng's Coursera course often gets rave reviews for its content; my
suggestion, however, is to browse the course notes compiled by a former student
of the online course's previous incarnation. Skip over the notes specific to
Octave (a MATLAB-like language unrelated to our Python pursuits). Be warned that these are
not "official" notes, but do seem to capture the relevant content from Andrew's
course material. Of course, if you have the time and interest, now would be the
time to take Andrew Ng's Machine Learning course on Coursera.
Unofficial Andrew Ng course notes
There are all sorts of video lectures out there if you prefer, alongside Ng's course
mentioned above. I'm a fan of Tom Mitchell, so here's a link to his recent lecture
videos (along with Maria-Florina Balcan), which I find particularly approachable:
Tom Mitchell Lecture Videos
10 Minutes to Pandas
You will see some other packages in the tutorials below, including, for example,
Seaborn, which is a data visualization library based on matplotlib. The
aforementioned packages form (again, subjectively) the core of a wide array of
machine learning tasks in Python; understanding them should let you
adapt to additional and related packages without confusion when they are
referenced in the following tutorials.
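For a quick taste of what pandas (linked above) provides, here is a minimal sketch; the column names and values are invented purely for illustration:

import pandas as pd

# Build a small DataFrame from a dictionary (invented example data)
df = pd.DataFrame({
    'height_cm': [150, 160, 170, 180],
    'weight_kg': [50, 60, 70, 80]
})

print(df.head())      # first rows of the table
print(df.describe())  # summary statistics per column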
With a foundation laid in scikit-learn, we can move on to some more
in-depth explorations of common and useful algorithms. We start with
k-means clustering, one of the most well-known machine learning algorithms. It is a
simple and often effective method for solving unsupervised learning problems:
k-means Clustering, by Jake VanderPlas
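As a taste of what the tutorial covers, here is a minimal k-means sketch using scikit-learn; the six sample points are invented for illustration:

import numpy as np
from sklearn.cluster import KMeans

# Two loose groups of invented 2-D points
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.7]])

# Ask k-means to find 2 clusters; random_state fixes the seed for repeatability
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # learned cluster centroids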
Next, we move back toward classification, and take a look at one of the most
historically popular classification methods:
We've gotten our feet wet with scikit-learn, and now we turn our attention to some
more advanced topics. First up are support vector machines, a not-necessarily-linear
classifier that relies on kernel transformations of the data into a
higher-dimensional space.
Support Vector Machines, by Jake VanderPlas
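Before diving in, a minimal scikit-learn sketch may help frame the idea; it fits an SVM with an RBF kernel to the bundled iris dataset (the dataset choice and parameters here are mine, not the tutorial's):

from sklearn import datasets
from sklearn.svm import SVC

# Load the iris dataset bundled with scikit-learn
iris = datasets.load_iris()
X, y = iris.data, iris.target

# The RBF kernel implicitly maps the data into a higher-dimensional
# space, allowing a nonlinear decision boundary
clf = SVC(kernel='rbf', C=1.0)
clf.fit(X, y)

print(clf.predict(X[:5]))  # predicted classes for the first five samples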
Next, we examine random forests, an ensemble classifier, via a Kaggle Titanic
competition walk-through:
Kaggle Titanic Competition (with Random Forests), by Donne Martin
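The walk-through itself uses the Titanic data; as a self-contained stand-in, here is a minimal random forest sketch on the iris dataset (dataset and parameters are mine, chosen only so the code runs as-is):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Iris stands in for the Titanic data so the sketch is self-contained
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An ensemble of 100 decision trees, each fit on a bootstrap sample
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))  # accuracy on the held-out split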
Dimensionality reduction is a method for reducing the number of variables being
considered in a problem. Principal Component Analysis is a particular form of
unsupervised dimensionality reduction:
Dimensionality Reduction, by Jake VanderPlas
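As a minimal illustration of PCA, here is a sketch that projects the four iris features down to two principal components (again, the dataset choice is mine):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # 150 samples, 4 features each

# Project onto the 2 directions of greatest variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component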
Using Python and its machine learning libraries, we have covered some of the most
common and well-known machine learning algorithms (k-nearest neighbors, k-
means clustering, support vector machines), investigated a powerful ensemble
technique (random forests), and examined some additional machine learning
support tasks (dimensionality reduction, model validation techniques). Along with
some foundational machine learning skills, we have started filling a useful toolkit
for ourselves.
We will add one more in-demand tool to that kit before wrapping up.
Theano is a Python library that allows you to define, optimize, and evaluate
mathematical expressions involving multi-dimensional arrays efficiently.
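To give a flavor of that define-then-evaluate workflow, here is a minimal Theano sketch; it builds a symbolic expression and compiles it into a callable function:

import theano
import theano.tensor as T

# Declare two symbolic matrices of doubles
x = T.dmatrix('x')
y = T.dmatrix('y')

# Define an expression over them, then compile it to a callable function
z = x + y
f = theano.function([x, y], z)

print(f([[1.0, 2.0]], [[3.0, 4.0]]))  # [[4. 6.]]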
The following introductory tutorial on deep learning in Theano is lengthy, but it is
quite good, very descriptive, and heavily commented:
Caffe is a deep learning framework made with expression, speed, and modularity in
mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by
community contributors.
This tutorial is the cherry on top of this article. While we have undertaken a few
interesting examples above, none likely compete with the following, which
implements Google's #DeepDream using Caffe. Enjoy this one! After
understanding the tutorial, play around with it to get your processors dreaming on
their own.
Dreaming Deep with Caffe via Google's GitHub
I didn't promise it would be quick or easy, but if you put the time in and follow the
above 7 steps, there is no reason that you won't be able to claim reasonable
proficiency and understanding in a number of machine learning algorithms and
their implementation in Python using its popular libraries, including some of those
on the cutting edge of current deep learning research.