Machine Learning 101

Computational methods in biology in general

Computational methods are essential whenever you have big and complex data sets. In some cases, we need computational methods to replicate what people do very well (like looking at images or reading text), but automated so it is faster and scales to more instances. In other cases, we need computational methods to find patterns or trends, using mathematical principles, in cases where people struggle to perform these operations (like finding recurring subsequences in what seem like random strings).

In many cases, we write algorithms to perform these operations. An algorithm, fundamentally, is a recipe: it takes a structured input and produces a specific output, with a method in the middle that provides an efficient process for performing the computation.

In biology, algorithmic methods can compute measures of similarity between two nucleotide sequences. Algorithmic methods can construct phylogenetic trees from sequence data. Algorithmic methods can find common motifs in biological networks.
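
As a toy illustration of this recipe idea (not a standard bioinformatics algorithm), here is a short Python sketch that scores the similarity of two equal-length nucleotide sequences as the fraction of positions at which they agree; real tools use more sophisticated alignment algorithms:

    def sequence_similarity(seq_a, seq_b):
        # Fraction of positions where two equal-length sequences agree
        if len(seq_a) != len(seq_b):
            raise ValueError("sequences must be the same length")
        matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
        return matches / len(seq_a)

    print(sequence_similarity("ACGTACGT", "ACGTTCGT"))  # prints 0.875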

To build these approaches, we encapsulate our understanding into an approach or a set of rules or operators, explicitly coding the approach based on what we know and how we expect inputs to map into outputs.

But what if we don't know what to look for in mapping inputs to outputs for a specific problem? Or what if this mapping is so complex that we can't code for all possible cases?

Machine learning: learning from 'examples'

Machine learning is a subfield of AI whose basic goal is to train computational models that encapsulate information from a dataset, something the model has automatically 'learned' from examples. We can then use this model to make predictions from new data, or even to generate new data instances that are similar in fundamental ways.

Learning from examples is useful when:

  • It is hard for people to explicitly write the 'rules' for making decisions (e.g., not like calculating your taxes, where there is a clear calculation that can be performed, step by step, based on your income, geographic location, and expenses)
  • The solution depends on lots of complex cases (e.g., biometrics with lots of cases and features, where programming all of these cases would be far too complex for a programmer)
  • We don't have the expertise to fully write 'the rules' but we have lots of examples (e.g., we may not know what people who read Harry Potter books will want to read next, because there is no 'rule' in our knowledge base to predict this. But if we have consumer data, including data for Harry Potter readers, we can use this past customer purchasing data to make suggestions to customers with similar reading habits).

How do computers 'learn' from examples?

Let's assume you have a dataset that looks like this: two variables (x and y) and a classification (circle vs. X). To make this concrete, perhaps x is the number of missed classes and y is the number of missed assignments for a college course, and students drawn with a blue circle 'passed' the course, while those drawn with a red X failed the course. In the picture below we can see that missing lots of classes and assignments is associated with not passing the course. But what is the line that divides passing from failing?

This is a supervised learning problem, where we have data instances (each student's number of missed classes and number of missed assignments) associated with a label (pass or fail the course). Our goal is to take these examples and use them to make predictions about whether a given student, perhaps one who is in the middle of the semester, with a particular number of missed assignments and missed classes, will pass or fail the course.
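
Concretely, a dataset like this can be represented as a matrix of features and a vector of labels. The Python sketch below uses numbers that are made up purely for illustration:

    import numpy as np

    # Each row is one student: [number of missed classes, number of missed assignments]
    X = np.array([[1, 0], [2, 1], [3, 0], [12, 5], [10, 4], [15, 6]])

    # One label per student: 1 = passed the course, 0 = failed
    y = np.array([1, 1, 1, 0, 0, 0])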

Learning through linear separation

In some cases, as above, a model may learn a line to separate students who pass from students who fail. This is seen in many methods, including support vector machines (SVMs), where learning a linear boundary between two classes is the goal.
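
As a sketch of what this looks like in practice, here is how a linear SVM could be fit with scikit-learn on the made-up student data from above (the numbers are illustrative, not real course data):

    import numpy as np
    from sklearn.svm import SVC

    # Made-up data: [missed classes, missed assignments]; 1 = pass, 0 = fail
    X = np.array([[1, 0], [2, 1], [3, 0], [12, 5], [10, 4], [15, 6]])
    y = np.array([1, 1, 1, 0, 0, 0])

    model = SVC(kernel="linear")  # learn a linear boundary between the two classes
    model.fit(X, y)

    # Predict pass/fail for a new student with 8 missed classes and 2 missed assignments
    print(model.predict([[8, 2]]))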

In other cases, a model may learn a set of rules (if a student misses more than 10 classes and more than 3 assignments, then: fail) to classify students. This is seen in decision trees, where the goal is to produce rules that reduce entropy and increase purity with each rule.
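
A decision tree learned from the same made-up data recovers rules of exactly this form; the sketch below prints the learned rules (the thresholds it finds depend on the toy data, not on any real course):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    X = np.array([[1, 0], [2, 1], [3, 0], [12, 5], [10, 4], [15, 6]])
    y = np.array([1, 1, 1, 0, 0, 0])  # 1 = pass, 0 = fail

    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

    # Print the learned if/then rules in readable form
    print(export_text(tree, feature_names=["missed_classes", "missed_assignments"]))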

In other cases, models may use some distance measure to learn which 'group' a student is closest to, and use the most common result (pass or fail) in that group to predict what this student will do. This is seen in nearest-neighbor classification methods.
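
A k-nearest-neighbor classifier makes this concrete: it predicts the majority label among the closest training points. A minimal sketch on the same made-up data:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    X = np.array([[1, 0], [2, 1], [3, 0], [12, 5], [10, 4], [15, 6]])
    y = np.array([1, 1, 1, 0, 0, 0])  # 1 = pass, 0 = fail

    knn = KNeighborsClassifier(n_neighbors=3)  # vote among the 3 nearest students
    knn.fit(X, y)

    print(knn.predict([[9, 2]]))  # majority vote of the 3 closest training points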

All of these models are classifiers, though they use different measures (linear separators, rules, and distance) to make the decision.

In other cases, a model may learn a mapping from data to a real-valued result, as shown here in this regression. Suppose that instead of predicting whether a student would 'pass' or 'fail', you wanted to predict their final grade on a scale of 0.0-100.0.

Since regression learns a function f that maps a set of data values x to outputs y, learning is fundamentally a matter of optimization: choosing the f that minimizes the 'loss', a measure of how far the model's predictions f(x) fall from the true values y.
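
For example, ordinary linear regression chooses f to minimize the squared-error loss. The sketch below fits a linear model to made-up grade data and reports the mean squared error of its predictions (all numbers are invented for illustration):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Made-up data: [missed classes, missed assignments] -> final grade (0.0-100.0)
    X = np.array([[0, 0], [2, 1], [4, 1], [8, 3], [12, 5], [15, 6]])
    grades = np.array([95.0, 88.0, 82.0, 70.0, 55.0, 40.0])

    reg = LinearRegression().fit(X, grades)  # fitting minimizes the squared-error loss

    predictions = reg.predict(X)
    loss = np.mean((predictions - grades) ** 2)  # mean squared error on the training data
    print(reg.predict([[6, 2]]), loss)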

You can assess the quality of your model by collecting lots of student performance data and splitting it into two groups, the training set and the testing set. The training set 'builds' the model, and the testing set is used to evaluate how well the model works.
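
One common way to do this with scikit-learn (again on invented student data) is shown below; the held-out test set gives an estimate of how well the model generalizes to students it has never seen:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Made-up student data: [missed classes, missed assignments]; 1 = pass, 0 = fail
    X = np.array([[1, 0], [2, 1], [3, 0], [0, 1], [12, 5], [10, 4], [15, 6], [11, 3]])
    y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

    # Hold out 25% of the students as a test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    model = SVC(kernel="linear").fit(X_train, y_train)  # 'build' the model on the training set
    print(model.score(X_test, y_test))  # accuracy on the unseen test set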

I want to learn more, where can I go?

Here are some resources for machine learning in general:

Piyush Rai Lectures on Machine Learning