Friday, 8 February, 2019 UTC


Summary

Machine learning and deep learning are very popular topics with a very wide range of use cases.  I’m not going to spend too much time explaining what machine learning is; instead, I’m going to assume you already know that part.  However, just in case, here is a good high level overview of what machine learning is and how it relates to deep learning.
My goal in this article is to get you up and running as quickly as possible.  From here on out, I am assuming you know at least the definitions of data science and machine learning and, at a high level, what they are.  The rest of this article covers how to start exploring the landscape of machine learning and how you might go about solving real problems.
Find your motivation
The field of machine learning is gigantic and ever changing.  To avoid being overwhelmed, I recommend finding one topic or problem that you can focus your energy on.  This focus will help you make steady progress, even if it is slow going at first.
Here are some ideas for choosing an interesting topic:
  •  Google for “common machine learning use cases” or “common deep learning use cases”.
  •  Do you have a hard problem at work or as part of a hobby?
  •  Do you have access to an interesting data set?
  •  There are ML techniques for images, audio, video, text, tabular, time series, and other types of data. Can you create or collect any of these kinds of data?
  •  Kaggle.com hosts ML competitions where some of the best in the industry compete for prize money and recognition.  The list of current and past competitions is a great source of interesting problems.
  •  A quick search for “public data sets” will return many curated lists of interesting data, maybe seeing these data sets will help you generate some interesting hypotheses.
 
There is a misconception that you need huge amounts of data to do machine learning, but this is not the case.  There are techniques that use pre-trained models, where you might only need 20 or 30 images to get good results.  So as you are brainstorming for problems you might solve, or data sets you might use, don’t assume you need gigabytes of data to do something useful or interesting.
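To make the small-data point concrete, here is a toy sketch of the idea behind using pre-trained models: the pre-trained network turns each image into a feature vector (an “embedding”), and even a very simple classifier on top of those vectors can work with only a handful of labeled examples. The 2-D embeddings and labels below are made up for illustration; in practice a library like fastai produces real embeddings with hundreds of dimensions.

```python
# Toy illustration: with good features from a pre-trained model,
# even a nearest-centroid classifier can learn from a few examples.
# The 2-D "embeddings" below are made up; real ones come from a
# pre-trained network and have hundreds of dimensions.

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def nearest_centroid_classifier(examples):
    """examples: {label: [embedding, ...]} -> prediction function."""
    centroids = {label: centroid(vecs) for label, vecs in examples.items()}

    def predict(embedding):
        def dist2(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        return min(centroids, key=lambda label: dist2(embedding, centroids[label]))

    return predict

# A "data set" of just three labeled embeddings per class:
examples = {
    "cat": [[0.9, 0.1], [0.8, 0.2], [1.0, 0.0]],
    "dog": [[0.1, 0.9], [0.2, 0.8], [0.0, 1.0]],
}
predict = nearest_centroid_classifier(examples)
print(predict([0.85, 0.15]))  # a new cat-like embedding -> "cat"
```

The heavy lifting happens in the pre-trained model that produced the embeddings; that is why so few labeled examples are needed on top of it.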
To recap, having a specific problem to solve will give you focus and help narrow your initial learning curve.  Solving your first problem will force you to become familiar with many of the common tools, brush up on some of the math, and develop a workflow for iterating on a problem.
Sign up for a MOOC
Some time early in your new machine learning career, I would recommend taking one of the popular ML MOOCs.  MOOC stands for “massive open online course”; these are generally free online courses with hundreds or thousands of students at any given time. They are usually self-paced, meaning you can start at any time.  However, some of them do have recommended start dates so that you can go through the course with other students at the same time. Here are some details on two very popular ML MOOCs:

Andrew Ng’s MOOC on Coursera:

https://www.coursera.org/learn/machine-learning/home/welcome
  • The course is free, but you can pay to receive a certificate with a passing grade.
  • This course follows a bottom up learning approach where you start with details and build up to applications.
  • This is how math is taught in school (arithmetic then algebra then calc…).
  • The course uses GNU Octave as its programming environment of choice.
  • Coursera has lessons, quizzes, and deadlines; it’s just like a college class.
  • No official pre-reqs, but familiarity with calculus, linear algebra, statistics, and programming would definitely be helpful.
 

Jeremy Howard’s Fast.ai

http://fast.ai
  • This course is free, and no certificate is available upon completion.
  • This course follows a top down learning approach where you start with interesting high level applications and work toward theory.
  • This is how baseball is taught: on your first day, they put you at the plate and tell you to swing.
  • The course consists of 7 lessons, 2 hours each, and lots of sample code.
  • Jeremy recommends 10 hours of study and coding per lesson; however, even just skimming the videos provides a lot of value.
  • Jeremy expects you to be self-motivated; there are no tests or quizzes.
  • Jeremy and his team created the fastai Python package as a companion to the course.
  • The fastai package is a wrapper around PyTorch that is very high level and very productive.
 
I personally found Jeremy’s top-down approach more engaging.  I liked that he had his students solving state-of-the-art problems within a couple of hours.  Developing practical skills is a great motivator to keep digging deeper. If you are very mathematically inclined and enjoy learning the details before the applications, then Andrew’s course might be a better place to start.
No matter where you choose to start, I would definitely consider taking both courses.  I found it useful to see the same material from two different points of view. Lastly, it’s worth pointing out that the more you put into each course, the more you will get out of it.  Try to avoid the temptation of breezing through the videos without taking notes, writing any code, or solving any problems. Pay attention, take notes, write code, and participate on the forums.
Choose a place to run your code
Many of the open source ML and DL software packages are optimized to use GPUs (video cards) rather than CPUs.  This means you will need access to a machine with an NVIDIA GPU, a lot of RAM, and 100 GB or more of disk space.  I specifically mention NVIDIA because most GPU-accelerated ML code targets NVIDIA’s CUDA platform and will not work with an AMD GPU.
Your laptop is not an ideal place to train ML models; I would recommend not even trying. Even if it does have an NVIDIA GPU, it probably isn’t fast enough, and training models will take forever.  It may even end up overheating.  Instead, I would consider one of the many cloud providers that will sell you hourly access to a machine with a nice GPU.
Free Jupyter notebooks:
  • Google Colab – https://colab.research.google.com
  • Kaggle notebooks – https://www.kaggle.com/kernels
Cloud Virtual Machine + GPU providers:
  • Paperspace (cheap and good) – https://www.paperspace.com/pricing
  • AWS – https://aws.amazon.com/ec2/instance-types/p3/
  • Google GCP – https://cloud.google.com/gpu/
  • Azure – https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/
Build your own workstation:
  • Building a PC for ML – http://timdettmers.com/2018/12/16/deep-learning-hardware-guide/
  • Choosing a GPU for ML – http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/
 
Get familiar with some of the popular tools
Again, the ML field is very large and very active, and for better or worse, this results in tons of code being written in many languages. The pace of it reminds me of the JavaScript world a few years back.
If you were going to choose one language to focus on, I would choose Python.  Many would argue it is the de facto language for data science and machine learning, and it is also a great language for prototyping. Many tutorials and code samples are distributed as Jupyter notebooks, which most commonly run Python.  There are other languages that execute faster, but you really only need speed once you have a successful model and are ready for production. When the time comes, you might rewrite your algorithm in Java, but for now you will want the ability to quickly prototype and iterate.
Popular languages for ML:
  • Python – The de facto language of data science and machine learning.
  • R – Very popular statistical computing environment, heavily used in statistics, data science and ML.
  • Julia – A new language, purpose built for scientific and numerical computing.
  • Java – Java is fast and robust, and is often used to run models in production.
  • Scala – Scala is often used in production for the same reasons as Java.
  • Octave – An open source numerical programming environment that is mostly compatible with MATLAB.
Popular Python packages:
  • Fastai – A productive, high-level API on top of PyTorch.
  • PyTorch – Very popular machine learning library, open sourced by Facebook.
  • Jupyter – Interactive prototyping and publishing tool.
  • NumPy – Low level, N-dimensional array library used by many other packages.
  • Pandas – Data analysis library with labeled, tabular data structures, built on top of NumPy.
  • Matplotlib – Popular charting and graphing package, often used in conjunction with Jupyter.
  • TensorFlow – An older but still popular machine learning package created by Google.
  • Keras – A higher level API with many available backends including Tensorflow.
  • Scikit-Learn – Popular statistical and machine learning package with excellent docs.
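To give a flavor of the ecosystem, here is a tiny NumPy example; vectorized, N-dimensional array operations like these are the foundation most of the higher-level packages above build on:

```python
import numpy as np

# Vectorized operations replace explicit Python loops:
# center each column of a small data matrix to zero mean,
# a common preprocessing step before modeling.
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

col_means = data.mean(axis=0)   # mean of each column -> [2., 20.]
centered = data - col_means     # broadcasting subtracts per column

print(centered)
```

Pandas wraps this kind of array in labeled rows and columns, and scikit-learn, PyTorch, and TensorFlow all accept or mimic these arrays, so time spent learning NumPy pays off across the whole stack.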
 
Hone your skills
After you have some base knowledge under your belt and have gone through one or both of the MOOCs, you are probably ready to begin honing your skills even further.  A popular way of doing that is by competing in Kaggle competitions. Kaggle is a site where companies post prize money for the best models. The competition is lively, as you are competing with some very smart people from around the world.  In any given competition, only the top three to five finishers actually end up “in the money”, so I wouldn’t go in with the expectation of earning anything. Instead, go in with a goal of learning something new. Follow the forums, read other people’s kernels, write your own code, submit it for review, rinse and repeat.
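Competitions are scored on a submission file, which is usually just a CSV mapping each test-row id to your prediction. The column names below are hypothetical (each competition’s sample submission dictates the real ones), but Python’s standard csv module is enough to produce one:

```python
import csv
import io

# Hypothetical predictions keyed by test-set id; in a real competition
# these would come from your trained model, and the competition's
# sample submission file dictates the exact column names.
predictions = {1: 0.12, 2: 0.87, 3: 0.5}

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "prediction"])  # header row
for row_id, score in sorted(predictions.items()):
    writer.writerow([row_id, score])

submission = buffer.getvalue()
print(submission)
```

In practice you would write to a file and upload it through the competition page, which immediately scores it against a held-out test set and places you on the public leaderboard.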
Most people in the field seem to agree that ranking in or winning a Kaggle competition is not enough to get you a job in the ML field, but it shouldn’t be ignored either.  At minimum, it gives you a set of interesting problems to talk about in an interview.
Further reading
There is no lack of great resources out there.  If anything, there is too much: great research, code, blog posts, tutorials, and white papers are being created every day.  It can be quite a challenge to keep up and absorb it all. Here are some of the sources I have personally found helpful or interesting so far.
Resources:
  • Machine learning wikipedia page.
  • Fast.ai MOOC, excellent intro to ML and DL
  • Companion forum for fast.ai MOOC
  • Very good list of additional courses, mostly on Youtube
  • Recently released “100 page book on ML”.
  • Many good articles on current ML and DL topics
  • Chris Albon’s data science notes in a nice format.  Many code snippets and definitions
  • Hands-On Machine Learning with Scikit-Learn and TensorFlow
  • Deep Learning: A Practitioner’s Approach
  • Scikit-learn examples page shows many interesting use cases and code samples.
  • AWS re:Invent 2018: Deep Learning for Developers: An Introduction, Featuring Samsung SDS (AIM301-R)
 
Conclusion
My goal was to lay out one possible path to start you on your machine learning journey.  I discussed limiting your initial scope by choosing one interesting problem to solve. I recommended some online courses to build up your base knowledge.  Once that base has been established, I suggested Kaggle competitions as a great place to continue to learn and grow your skills. I also recommended a short list of tools to look at, and gave some further reading.
It’s been said that “data is the new oil”, which is an interesting idea.  Being able to find meaning in, and make use of, all of this digital oil is a very valuable skill to have. Any time spent learning these skills will not be wasted.  If you enjoy learning, programming, and challenging problems, then let me welcome you to the world of machine learning.

About Me

My name is Curt Larson, and I am the Director of Cloud Engineering at Object Partners Inc. in Omaha, NE.  I help partners on their journey into the cloud.
You can find me on LinkedIn https://www.linkedin.com/in/curt-larson-4176043/