Machine learning is one of the most influential and rapidly growing fields in computer technology. If you're interested in getting into machine learning, you'll need to familiarize yourself with the basic concepts relevant to the field, like algorithms, variables, and different types of model analysis. While there is much more to learn than what is covered in this article, this brief crash course in machine learning will provide you with some definitions and intuitions regarding machine learning's core concepts.
What Is Machine Learning?
Let's start off by defining machine learning. After all, it is difficult to learn more about the field if you don't even understand what the term "machine learning" means. Put simply, machine learning is the process of enabling a computer to carry out a task without being explicitly programmed to do so. Instead of hand-coding every rule needed to carry out the task, a machine learning specialist will create a system that takes some data as an input, learns the patterns of that data, and then outputs a decision about what to do with that data. The computer analyzes patterns in the data and "learns" how the input data is related to the output data, so it can generalize this pattern to new data in the future.
Defining Features And Labels
A machine learning system has three components:
- Inputs
- Algorithms
- Outputs
Machine learning literature frequently talks about features and labels, but what are these concepts? Features are parts of the input data. They can be considered the variables of interest for the machine learning task: the variables the system will analyze to learn patterns. Labels are the outputs. In training data, a label is the known correct answer for an example, and at prediction time it is the model's prediction, such as the class to which the input data belongs.
Two Types Of Machine Learning: Supervised And Unsupervised
There are two different types of machine learning tasks: unsupervised learning and supervised learning. What is the difference? The primary difference is that in supervised learning the data the model is learning from is labeled with a ground truth, while in unsupervised learning it is not. To put this another way, in supervised learning the correct output values for a set of inputs are known, but in unsupervised learning the output values are not known, so the goals of the two types of learning differ.
In supervised learning, the goal of training a model is to determine some function that best represents the relationship between the input data and the observed output data. In other words, the goal is to find a model that minimizes the error between the ground truth and the model's prediction. In contrast, an unsupervised learning algorithm must infer the relationships among the data points on its own, as it doesn't have access to labeled outputs.
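As a rough illustration of "finding the function that minimizes the error", here is a minimal sketch that fits a straight line to labeled data with the closed-form least-squares solution. The toy data and the function names `fit_line` and `mean_squared_error` are made up for this example, not taken from any particular library.

```python
# Hypothetical toy data: inputs xs with known (labeled) outputs ys,
# generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between the ground truth and the model's prediction."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


slope, intercept = fit_line(xs, ys)
print(slope, intercept, mean_squared_error(xs, ys, slope, intercept))
```

Real data is noisy, so the error will usually not reach zero. The point is only that training picks the parameters that make it as small as possible.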
The primary uses for supervised learning algorithms are classification, where inputs need to be linked to discrete output labels, and regression, where inputs need to be mapped to continuous output values. By contrast, unsupervised learning is usually applied to tasks such as representation learning, clustering, and density estimation. These tasks require the model to determine the structure of the data even though it isn't explicitly labeled.
Understanding Common Supervised Learning Algorithms
Some of the most common algorithms for supervised learning include:
- Naïve Bayes
- Support Vector Machines
- Logistic Regression
- Random Forests
- Artificial Neural Networks
Let's take a closer look at each of these algorithms.
Support Vector Machines
A support vector machine (SVM) separates data points into classes by drawing a boundary between them: a line in two dimensions, or a hyperplane when there are more features. The data points found on one side of the boundary belong to one class, while those on the other side belong to another. The distance from the boundary to a point can be read as the classifier's confidence in that point's class, and during training the classifier tries to maximize the margin, the distance between the boundary and the closest points on either side of it (the support vectors).
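To make the idea of "which side of the line" concrete, here is a small sketch of how a linear SVM classifies a point once it has been trained. The weights `w` and bias `b` are hypothetical values picked by hand for illustration. Finding them is the training step, which this sketch leaves out.

```python
import math

# Hypothetical parameters of an already-trained linear SVM in two dimensions.
# The decision boundary is the line w[0]*x + w[1]*y + b = 0.
w = [1.0, 1.0]
b = -3.0


def decision_value(point):
    """Positive on one side of the boundary, negative on the other."""
    return sum(wi * xi for wi, xi in zip(w, point)) + b


def predict(point):
    """Assign class +1 or -1 depending on the side of the boundary."""
    return 1 if decision_value(point) >= 0 else -1


def distance_to_boundary(point):
    """Geometric distance from the point to the separating line."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(point)) / norm


for point in [(1.0, 1.0), (3.0, 2.0)]:
    print(point, predict(point), round(distance_to_boundary(point), 3))
```

Points far from the boundary get a large distance, which matches the intuition of a more confident classification.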
Logistic Regression
Logistic regression is an algorithm that makes predictions about data points by giving them a binary label, either one or zero. The model outputs a probability between 0 and 1 for each data point: if that probability is 0.5 or above, the point is classified as belonging to Class 1, and if it is below 0.5, the point is labeled 0. Logistic regression is most appropriate when the two classes can be separated reasonably well by a linear boundary, because the model assumes a linear relationship between the features and the log-odds of the outcome.
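The prediction step can be sketched in a few lines. The coefficients below are hypothetical stand-ins for values that training would normally produce. The sketch only shows how a weighted sum is squashed into a probability and then cut off at 0.5.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients; a real model would learn these from training data.
weights = [0.8, -0.4]
bias = -0.2


def predict_proba(features):
    """Probability that the data point belongs to Class 1."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def predict(features, threshold=0.5):
    """Label the point 1 if the probability reaches the threshold, else 0."""
    return 1 if predict_proba(features) >= threshold else 0


print(predict_proba([2.0, 1.0]), predict([2.0, 1.0]))
print(predict_proba([0.0, 2.0]), predict([0.0, 2.0]))
```

The 0.5 threshold is a default rather than a rule. It can be moved when one kind of mistake is more costly than the other.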
Decision Trees/Random Forests
Decision trees work by dividing a data set into increasingly smaller subsets based on different sorting criteria. With every division of the data set, new subsets are created and the number of examples in each subset gets smaller. Splitting stops once the subsets are small or pure enough (in the extreme case, when they contain single data points), and each final subset, or leaf, assigns a class to the examples that fall into it. This is the way a single decision tree classifier works. A random forest classifier combines many decision trees, each trained on a random sample of the data and features, and takes a vote among their predictions.
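One way to picture a single division of the data is to score every candidate threshold on a feature and keep the one that leaves the purest subsets. The sketch below uses Gini impurity as the sorting criterion, which is a common choice but an assumption here, and the helper names are made up. `majority_vote` hints at how a forest would combine the answers of several trees.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 when every label in the subset is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best `value <= threshold` split."""
    best = (None, float("inf"))
    n = len(values)
    for threshold in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


def majority_vote(predictions):
    """Combine the predictions of several trees by taking the most common one."""
    return Counter(predictions).most_common(1)[0][0]


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, score = best_split(values, labels)
print(threshold, score)
print(majority_vote(["a", "b", "a"]))
```

A full tree would apply `best_split` again to each resulting subset until a stopping rule is met.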
Naïve Bayes
The Naïve Bayes classifier uses Bayes' theorem to calculate the probability that a data point belongs to each class, given the feature values observed for that point. It then places the data point in the class with the highest probability. The “naïve” assumption is that the predictors are independent of one another given the class, so each feature contributes to the probability on its own, regardless of the values of the other features.
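Here is a minimal sketch of the idea for categorical features. The toy weather data is invented for illustration. The add-one (Laplace) smoothing and the `n_values=2` parameter, which assumes every feature has exactly two possible values, are choices made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: features are (outlook, windy), label is "play?".
rows = [
    (("sunny", "no"), "yes"),
    (("sunny", "yes"), "no"),
    (("rainy", "yes"), "no"),
    (("sunny", "no"), "yes"),
    (("rainy", "no"), "yes"),
    (("rainy", "yes"), "no"),
]


def train(rows):
    """Count how often each class, and each feature value within a class, occurs."""
    class_counts = Counter(label for _, label in rows)
    feature_counts = defaultdict(Counter)  # key: (feature index, class label)
    for features, label in rows:
        for i, value in enumerate(features):
            feature_counts[(i, label)][value] += 1
    return class_counts, feature_counts


def predict(features, class_counts, feature_counts, n_values=2):
    """Return the most probable class and the unnormalized score of every class."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior probability of the class
        for i, value in enumerate(features):
            # Smoothed P(value | class). The factors are simply multiplied
            # because the features are assumed independent given the class.
            seen = feature_counts[(i, label)][value]
            score *= (seen + 1) / (count + n_values)
        scores[label] = score
    return max(scores, key=scores.get), scores


class_counts, feature_counts = train(rows)
print(predict(("sunny", "no"), class_counts, feature_counts))
print(predict(("rainy", "yes"), class_counts, feature_counts))
```

The scores are not normalized to sum to one, which is fine for picking the most probable class.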
Artificial Neural Network
An artificial neural network (ANN) is a system inspired by the human brain. ANNs are made up of layers of “neurons” connected together, each neuron representing a mathematical function. There is an input layer, an output layer, and one or more “hidden” layers in between. The hidden layers are where most of the learning happens: the data passes through many mathematical functions joined together, which lets the network learn more complex patterns than many other algorithms can.
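The sketch below shows a forward pass through a tiny network with one hidden layer. The weights are hypothetical and hand-picked so that the network computes XOR, a pattern that a single straight-line boundary cannot capture. A real network would learn its weights during training, which is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer: each neuron is a weighted sum passed through a sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


# Hypothetical hand-picked weights (not learned).
# Hidden neuron 1 acts like OR, hidden neuron 2 like NAND, the output like AND.
hidden_weights = [[20.0, 20.0], [-20.0, -20.0]]
hidden_biases = [-10.0, 30.0]
output_weights = [[20.0, 20.0]]
output_biases = [-30.0]


def forward(inputs):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(inputs, hidden_weights, hidden_biases)
    return layer(hidden, output_weights, output_biases)[0]


for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pair, round(forward(pair)))
```

Each neuron on its own is simple. The useful behavior comes from chaining them through the hidden layer.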
Learning About Unsupervised Algorithms
The most frequently used algorithms in unsupervised learning tasks include:
- K-means clustering
- Autoencoders
- Principal Component Analysis
K-Means Clustering
The goal of K-means clustering is to separate data points into distinct clusters. This uncovers underlying patterns: the data points in one cluster are more similar to each other than they are to points in another cluster.
The number of desired clusters is chosen by the user, and the algorithm starts off by placing that number of “centroids” (hypothetical cluster centers). Each point in the dataset is then assigned to its nearest centroid, and each centroid is moved to the mean of the points assigned to it. These two steps repeat, reducing the total distance from points to their centroids, until the assignments stop changing. The result depends on where the centroids start, so the algorithm finds a good placement but not necessarily the best possible one.
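The assign-then-update loop can be sketched in plain Python as follows. The starting centroids are passed in explicitly to keep the example deterministic, whereas real implementations usually pick them randomly. The function names and toy points are made up for illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, centroids, max_iterations=100):
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [squared_distance(p, c) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its points.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            mean_point(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # nothing moved, so the assignments are stable
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

With these well-separated points the loop settles after two passes. Messier data can take more iterations and can end in different clusters for different starting centroids.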
Principal Component Analysis
Principal Component Analysis (PCA) is a technique for unsupervised learning that reduces the dimensionality of the data, “squeezing” the data down into a smaller feature space. This is helpful because it preserves as much of the variation in the original data as possible, so relationships such as the distances between points are approximately retained, but the points now sit in a much smaller, easier-to-analyze space.
Deriving the “principal components” generates new features for the dataset. Each one is a combination of the original features, ordered by how much of the data's variance it captures, and these new features can make later learning tasks easier.
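As a rough sketch of what deriving a principal component involves, the code below centers the data, builds its covariance matrix, and uses power iteration to find the direction of greatest variance. Power iteration is just one simple way to do this, and libraries typically use more robust linear algebra routines. The function names and toy data are assumptions for the example.

```python
import math


def center(data):
    """Subtract the mean of each feature so the data is centered on the origin."""
    n = len(data)
    means = [sum(column) / n for column in zip(*data)]
    return [[x - m for x, m in zip(row, means)] for row in data]


def covariance_matrix(centered):
    n = len(centered)
    d = len(centered[0])
    return [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_principal_component(cov, iterations=200):
    """Power iteration: repeatedly multiply by the covariance matrix and normalize.

    Assumes the data has some variance and that the starting vector is not
    exactly perpendicular to the leading direction.
    """
    vector = [1.0] * len(cov)
    for _ in range(iterations):
        product = [sum(c * v for c, v in zip(row, vector)) for row in cov]
        norm = math.sqrt(sum(p * p for p in product))
        vector = [p / norm for p in product]
    return vector


# Hypothetical toy data: 2D points that all lie on the line y = 2x.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
centered = center(data)
component = first_principal_component(covariance_matrix(centered))

# Project each 2D point onto the component to get a single new feature.
projected = [sum(x * c for x, c in zip(row, component)) for row in centered]
print(component)
print(projected)
```

Because the toy points lie on one line, a single new feature describes them completely. Real data loses some information when dimensions are dropped.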
Autoencoders
Autoencoders are a special application of neural networks that is useful for unsupervised learning tasks. An autoencoder takes unlabeled inputs, encodes them into a compressed representation, and then decodes that representation back into an approximation of the original input.
The autoencoder tries to reconstruct the inputs as accurately as possible, so it must work out which features are the most important to keep. This means that the autoencoder extracts the most relevant features from unlabeled data. To put that another way, autoencoders supply their own training targets: the input itself is the output the network is trained to produce.
Types Of Analysis
After you have decided on an algorithm to use and applied it to your training and testing data, how can you determine how well your model is performing? Analysis is where metrics come in.
The metric you should use depends on the type of learning problem you're attempting. Common analysis types for classification tasks:
Classification accuracy gives you the number of correct predictions divided by the total number of predictions. Classification accuracy is commonly used since it is so simple, but it works best when the number of examples per class is roughly equivalent.
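A short sketch shows both the calculation and why balanced classes matter. The labels are hypothetical: nine negatives and one positive, scored against a "model" that always predicts the negative class.

```python
def accuracy(y_true, y_pred):
    """Number of correct predictions divided by the total number of predictions."""
    correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
    return correct / len(y_true)


# Hypothetical imbalanced data: 9 examples of class 0 and 1 example of class 1.
y_true = [0] * 9 + [1]
always_zero = [0] * 10  # a "model" that never predicts class 1

print(accuracy(y_true, always_zero))
```

The score looks high even though the model never finds the one positive example, which is why accuracy alone can mislead on imbalanced data.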
The confusion matrix is a table that breaks down your model's predictions for each class. Confusion matrices can be a little hard to interpret at first, but in a common layout the predicted classes are on the x-axis and the actual classes are on the y-axis, with the correct predictions running along the diagonal from top left to bottom right. Everything off the diagonal is a misclassification.
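Building one by hand makes the layout easier to read. This sketch puts the actual classes on the rows and the predicted classes on the columns. That is a common convention, but some tools flip it, so treat the orientation as an assumption. The labels are invented.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows are actual classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


# Hypothetical predictions for a three-class problem.
classes = ["cat", "dog", "bird"]
y_true = ["cat", "cat", "dog", "dog", "dog", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "bird"]

matrix = confusion_matrix(y_true, y_pred, classes)
for name, row in zip(classes, matrix):
    print(name, row)

# Correct predictions sit on the diagonal.
correct = sum(matrix[i][i] for i in range(len(classes)))
print(correct, "of", len(y_true), "correct")
```

Reading across a row shows where the examples of one actual class ended up, which reveals the pairs of classes the model mixes up.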
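A classification report delivers the recall, precision, and F1-score metrics. Precision is the share of the model's positive predictions that are correct, recall is the share of actual positives the model found, and the F1-score is the harmonic mean of the two. It's one of the most useful evaluation methods as it returns multiple types of valuable information.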
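These three metrics can be computed directly from counts of true positives, false positives, and false negatives. The sketch below does so for a binary problem with invented labels. Returning 0.0 when a denominator is zero is a choice made for this example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1-score for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for actual, pred in pairs if actual == positive and pred == positive)
    fp = sum(1 for actual, pred in pairs if actual != positive and pred == positive)
    fn = sum(1 for actual, pred in pairs if actual == positive and pred != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

A full classification report repeats this calculation once per class, treating each class in turn as the positive one.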
Common analysis types for regression tasks:
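Mean absolute error (MAE) is the average of the absolute differences between actual values and predictions. Mean squared error (MSE) is very similar, except that it averages the squared differences, which penalizes large errors more heavily; taking the square root of MSE (RMSE) converts the error back to the units of the original output. The r-squared metric gives a “goodness of fit”, where 1 is a perfect fit and 0 means the model does no better than always predicting the mean of the actual values.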
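The regression metrics are short enough to compute by hand. The sketch below uses invented values and a made-up helper name, `regression_metrics`, and assumes the actual values are not all identical (otherwise r-squared is undefined).

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and r-squared for paired actual/predicted values."""
    n = len(y_true)
    errors = [actual - pred for actual, pred in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)  # back in the units of the original output

    mean_true = sum(y_true) / n
    total_ss = sum((actual - mean_true) ** 2 for actual in y_true)
    residual_ss = sum(e * e for e in errors)
    r2 = 1.0 - residual_ss / total_ss

    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Hypothetical actual values and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Note that the single largest error (2.0) contributes most of the MSE, which illustrates how squaring emphasizes big misses.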
In terms of unsupervised learning, there is no single standard metric, because there are no ground-truth labels to compare against and success depends on what the results will be used for. That said, some common practices include visualizing the output data to look for trends or using generated features in a clustering task to see how they represent data relationships.
Conclusion
To summarize, machine learning has two types of tasks: unsupervised learning and supervised learning. Each learning type has its own algorithms and metrics of analysis.
Machine learning is a powerful technology and a rapidly growing field. If you are interested in taking your study of machine learning further, Andrew Ng’s machine learning course on Coursera and the Udacity Intro to Machine Learning course both provide excellent coverage of machine learning algorithms, variables, and forms of analysis.