What is Supervised Learning?
Supervised learning is the most common type of machine learning in use today, and most new machine learning practitioners begin their journey with supervised learning algorithms. Supervised learning is a type of machine learning that trains an algorithm on a labeled dataset. A labeled dataset simply means data along with the correct answers. If you are new to machine learning, I would suggest going through the basic machine learning concepts first.
So as the name suggests, in supervised learning humans act as a guide, teaching the algorithm what conclusions it should come to. Think of a teacher instructing students: the training dataset contains the input data together with its correct answers. During the training phase, the algorithm searches for relations in the data that correlate with the desired outputs.
The aim of a supervised learning model is to predict the correct label for newly presented input data.
Difference between supervised and unsupervised learning:
The basic differences between supervised learning and unsupervised learning are:
- Learning process: In supervised learning, we supervise the model while it is being trained. In unsupervised learning, on the other hand, we do not supervise the model; instead, we let it work on its own to discover patterns and categories within the input data and label them.
- Training data: Supervised learning uses labeled data that contains both input and output variables, whereas unsupervised learning uses unlabeled data: only the inputs are given, and the algorithm has to discover categories in the data and label them itself.
How Supervised Machine Learning Works
To understand how it works, let's consider how a small baby learns to identify different fruits. While learning, the baby captures all the information about a fruit, such as its shape, color, size, and texture, and stores all of this information under the name of the fruit. In the future, whenever the baby is shown a picture of that fruit, he immediately identifies it, since he has all the information (data) about the fruit along with its name (label).
Step by step process:
So we know how a human being learns to identify fruits. For a machine to learn the same thing, however, it needs data and statistics. Let's try to develop a supervised learning model for the fruit-identification problem. The steps are:
- Data collection and labeling: Solving any machine learning problem starts with data collection. In our case, during the data collection stage we collect all the feature information of the fruits, such as skin color, shape, size, and patterns or textures, along with each fruit's corresponding name (label).
- Splitting the dataset: After data collection, we split the dataset into two parts (80% and 20%). The first 80% of the dataset is used for training the algorithm, so this portion is called the training dataset. The remaining 20% is held back as the test dataset: its labels are hidden from the model during prediction and are used after training to evaluate the model's performance and accuracy.
- Model selection and training: Based on the problem, we select a suitable algorithm. In our case the problem is to identify fruits, so we will most likely select a classification algorithm. During the training phase, the machine starts to find relations in the training dataset: for example, the color of an apple may vary from green to red, watermelons carry distinctive patterns, and grapes can have different sizes and colors depending on the variety.
- Evaluation: Now it's time to check our model. After training is complete, we get a model that is ready to identify the name of a fruit from its picture. The model takes in the test-dataset inputs and tries to predict the answers, based purely on what it learned from the training data. In our case, we provide new photos of fruits and see whether the model is able to name them.
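The steps above can be sketched in code. This is a minimal toy version of the fruit workflow using scikit-learn; the feature values and fruit labels below are invented for illustration, and a real system would of course extract features from images.

```python
# Toy sketch of the fruit-identification workflow: collect labeled data,
# split 80/20, train a classifier, and evaluate on the held-out part.
# All feature numbers here are made up for illustration.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Features: [skin_color_code, size_cm, has_pattern]
X = [
    [0, 7, 0], [0, 8, 0], [1, 7, 0], [1, 8, 0],      # apples
    [2, 25, 1], [2, 30, 1], [2, 28, 1], [2, 26, 1],  # watermelons
    [1, 2, 0], [3, 2, 0], [1, 1, 0], [3, 1, 0],      # grapes
]
y = ["apple"] * 4 + ["watermelon"] * 4 + ["grape"] * 4

# 80/20 split: train on one part, hold the rest back for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluation: compare predictions on the held-out test set to true labels.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

On such a tiny, cleanly separated dataset the accuracy will typically be perfect; real datasets are much noisier.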
Now that we know how supervised learning works, the next thing to understand is the different types of supervised learning algorithms.
Types of supervised machine learning algorithms:
Supervised learning algorithms are broadly classified into two main categories:
- Classification (Group the items)
- Regression (Predict a value)
As the name suggests, classification algorithms classify or group the output items into specific classes or categories. For example, in a color-identification problem: is the shown color "Red", "Blue", or "Yellow"? Or, in the case of an online shopping portal: based on a customer's browsing pattern and history, will he purchase the product or not? In weather forecasting, it would be: will it rain today or not? The answer is in a binary format such as Yes or No.
The basic aim of a classification algorithm is to label the input with a specific class. If the algorithm labels the input with one of two distinct classes, it is called binary classification. If it classifies items into more than two classes, it is referred to as multiclass classification. A few examples of multiclass classification: Gmail classifies incoming emails into spam, promotions, updates, etc., and the color-identification model above is another.
The regression technique predicts a continuous-valued output; it is used to predict a number. We can use it to predict the value of a house from inputs such as its locality and size, to predict the price of a stock, or to predict the time it will take to reach home based on traffic and weather conditions.
The basic aim here is to predict a value as close as possible to the actual output. If the difference between the predicted and actual values is very small, our regression model is highly accurate.
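The "difference between predicted and actual" idea can be made concrete with a few lines of plain Python. The house prices below are invented numbers, used only to show how mean absolute error measures how close a regression model's predictions are:

```python
# Minimal illustration of "predicted vs. actual" for a regression model.
# The house prices below are invented numbers, just to show the idea.
actual = [250_000, 310_000, 180_000]
predicted = [245_000, 320_000, 175_000]

# Mean absolute error: the smaller it is, the closer the model's
# predictions are to the true values.
# Here the per-house errors are 5000, 10000, and 5000.
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
print("mean absolute error:", mae)
```

A model with a lower mean absolute error on unseen data is the more accurate one.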
Common supervised learning algorithms:
Linear Regression algorithm:
Linear regression is one of the most widely used supervised learning algorithms. As the name suggests, it is a "regression" algorithm. It is used to find a linear relationship between the target and one or more predictor variables, for example, the relationship between the height and weight of an individual.
Depending upon the number of variables, there are two basic types of linear regression: simple and multiple.
Simple linear regression tries to establish the relationship between two variables using a straight line. It attempts to draw the line that comes closest to the data points by finding the slope and intercept that define the line and minimize the regression error. The distance between the regression line and a data point is that point's error, so the regression line is best when all the data points lie close to it.
Examples: engine run time and fuel consumption of a vehicle, hours spent studying and marks obtained, or the price of a house based on its size.
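The slope and intercept described above have a simple closed-form solution, which we can compute by hand in plain Python. The hours-vs-marks numbers below are invented for illustration:

```python
# Least-squares fit of a straight line y = slope * x + intercept,
# illustrating how simple linear regression picks the line that
# minimizes the error. Hours studied vs. marks are invented numbers.
hours = [1, 2, 3, 4, 5]
marks = [52, 57, 61, 68, 72]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(marks) / n

# Closed-form least-squares estimates for slope and intercept.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, marks))
         / sum((x - mean_x) ** 2 for x in hours))
intercept = mean_y - slope * mean_x

print(f"marks = {slope:.2f} * hours + {intercept:.2f}")
```

Once the slope and intercept are known, predicting marks for a new number of study hours is just plugging into the line equation; no training data needs to be kept around.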
Decision Tree algorithm:
The decision tree is a type of supervised learning algorithm used for both classification and regression problems. The approach is to identify the different conditions on which to split the dataset.
When it is used for a classification problem, i.e. to split or classify the data, it is known as a classification tree (the answers are in a Boolean format, e.g. Yes or No). When the answer is a continuous number, the tree is called a regression tree.
It uses a tree-like representation to solve the problem (e.g. is the customer going to buy an insurance policy?).
- Internal node: Each internal node corresponds to a test, i.e. a question asked at that node (e.g. the age, gender, or income of an individual).
- Branches: The outcomes of a test create further branches and finally connect us to a leaf node.
- Leaf node: Each leaf node represents a class or a label (buy a policy or not).
- The path from the root to a leaf node represents the classification rule.
Examples: does a person like a sports channel, based on input information such as age, gender, occupation, and recent browsing history? Or: is a person medically fit, based on data such as eating, sleeping, and exercise habits, weight, height, and age?
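A decision tree can be written out by hand as a chain of questions. The sketch below hand-codes a tiny tree for the insurance-policy example; the questions, thresholds, and answers are invented purely to show how nodes, branches, and leaves map onto code:

```python
# A hand-written decision tree for the "will this person buy a policy?"
# example. Each if-test is an internal node (a question), each outcome a
# branch, and each returned string a leaf (the class label). The
# questions and thresholds are invented for illustration.
def buys_policy(age: int, income: int) -> str:
    if age < 25:                 # internal node: test on age
        return "no"              # leaf
    if income >= 40_000:         # internal node: test on income
        return "yes"             # leaf
    return "no"                  # leaf

# Following one path from the root down to a leaf applies one
# classification rule, e.g. age >= 25 and income >= 40000 -> "yes".
print(buys_policy(age=30, income=50_000))
```

Real decision-tree learners build exactly this kind of structure automatically, choosing the tests and thresholds that best split the training data.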
Naive Bayes Classification:
The next supervised learning algorithm is Naive Bayes classification. As the name suggests, it is a "classification" algorithm. It applies Bayes' theorem to calculate probabilities and conditional probabilities.
It assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Using Bayes' theorem, we can find the probability of A happening given that B has occurred. The assumption made here is that the predictors/features are independent, i.e. the presence of one feature does not affect another, which is why the algorithm is called naive.
It is mostly used in sentiment analysis, email spam filtering, recommendation systems on online shopping portals, etc. It is fast and easy to implement, but its biggest disadvantage is the requirement that the predictors be independent. In most real-life problems the predictors are dependent, which limits the performance of the classifier.
Example: classify whether a given person is male or female based on measured features such as height, weight, and foot size.
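That example can be sketched with scikit-learn's Gaussian Naive Bayes classifier. The measurements below are invented training samples chosen to keep the two classes clearly separated:

```python
# Sketch of the male/female example using scikit-learn's GaussianNB.
# The measurements below are invented training samples.
from sklearn.naive_bayes import GaussianNB

# Features: [height_cm, weight_kg, foot_size_eu]
X = [
    [182, 82, 44], [178, 79, 43], [185, 88, 45], [175, 75, 42],  # male
    [160, 55, 37], [165, 60, 38], [158, 52, 36], [168, 58, 39],  # female
]
y = ["male"] * 4 + ["female"] * 4

clf = GaussianNB().fit(X, y)

# Per-feature likelihoods are multiplied together, which is exactly the
# "naive" independence assumption described above.
print(clf.predict([[177, 74, 42]]))
```

Because each feature contributes an independent likelihood term, training is just a matter of estimating a mean and variance per feature per class, which is why Naive Bayes is so fast.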
Support vector machines (SVM):
This supervised machine learning algorithm is used for both "regression and classification" problems. With the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. Then we perform classification by finding the hyper-plane that best separates the two classes.
Example: check an applicant's loan-eligibility status on the basis of information such as gender, marital status, education, number of dependents, income, loan amount, and credit history.
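A minimal SVM sketch, again with scikit-learn. To keep the hyper-plane idea visible, the loan example is reduced to two invented features (income in thousands and credit score divided by 100); a real model would use many more:

```python
# Minimal SVM sketch: a linear SVC separating two classes of 2-D points.
# Features and labels are invented: [income_thousands, credit_score/100].
from sklearn.svm import SVC

X = [
    [20, 4.0], [25, 4.5], [22, 3.8], [30, 4.2],   # not eligible
    [60, 7.5], [75, 8.0], [65, 7.2], [80, 7.8],   # eligible
]
y = ["no", "no", "no", "no", "yes", "yes", "yes", "yes"]

# With a linear kernel, the learned decision boundary is a hyper-plane
# (here, a line in 2-D) between the two groups of points.
clf = SVC(kernel="linear").fit(X, y)

# A new applicant is classified by which side of the hyper-plane they fall on.
print(clf.predict([[70, 7.6]]))
```

The SVM specifically chooses the hyper-plane with the largest margin, i.e. the one farthest from the nearest points of both classes.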
Random Forest algorithm:
We already know the basics of the decision tree, which is the main building block of the random forest algorithm. Random forest is a supervised learning algorithm that consists of many decision trees rather than a single one. The core idea is to make the final decision based on the majority vote of the trees.
Example: consider a person who wants to buy a new car. He reads different reviews online, visits different car dealers, and discusses with his friends and family. Finally, based on the majority of positive reviews and comments, he decides which car to buy. Each review or comment he gets is like a single decision tree, and the complete set of information is the random forest.
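The majority-vote idea can be sketched with scikit-learn's random forest. The features and labels below are invented (a count of positive reviews and a price rank for cars a buyer compared):

```python
# The majority-vote idea behind a random forest, sketched with
# scikit-learn. Each tree in the ensemble is trained on a different
# random sample of the (invented) data; the forest's answer for a new
# input is the class that the majority of trees vote for.
from sklearn.ensemble import RandomForestClassifier

# Features: [positive_reviews, price_rank] for cars a buyer compared.
X = [[9, 1], [8, 2], [9, 2], [7, 1],   # bought
     [2, 5], [3, 4], [1, 5], [2, 4]]   # not bought
y = ["buy", "buy", "buy", "buy", "skip", "skip", "skip", "skip"]

# 25 trees, each seeing a bootstrap sample of the data.
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# The prediction is the majority vote across all 25 trees.
print(forest.predict([[8, 1]]))
```

Averaging many slightly different trees is what makes a random forest more robust than any single decision tree.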
Advantages and Disadvantages:
Advantages of supervised learning:
- Number of classes: The labeled dataset gives us an exact idea of the number of classes present in the training data.
- Easy to understand: The supervised learning process is easy to understand, which is not the case with unsupervised learning, where we cannot easily see what is happening inside the model and how it is learning.
- Precision: We can be very specific about the definition of the classes, i.e. we can train the classifier so that it has a precise decision boundary to distinguish the different classes accurately.
- After training is complete, there is no need to keep the training data in memory. Instead, we can keep the decision boundary as a mathematical formula.
- Supervised learning is very helpful for classification problems. It is also used to predict a target numerical value from given data and labels.
Disadvantages of supervised learning:
- Data labeling: Supervised learning needs a labeled training dataset, and labeling big datasets is a real problem: labeling or classifying large amounts of training data is time-consuming and costly.
- Data analysis: Supervised learning cannot extract unknown information from the training data the way unsupervised learning can. It cannot discover hidden patterns or categories in the data, and it cannot cluster or classify data by discovering its features on its own, unlike unsupervised learning.
- Variety in the dataset:
- If we do not anticipate and include variety in the training data, the model may give us incorrect results. In classification, for instance, it may fail when given an input that does not belong to any of the classes in the training data; in such situations, the output will be a wrong class label.
- For example, say you trained an image classifier on cat and dog data. If you then give it the image of a rat, the output will be either cat or dog, which is not correct.
- Accuracy of the model: The accuracy of the model depends on the variety and quantity of the data. While training the classifier, we need to select many good examples from each class; otherwise, the model's accuracy will be low. This is difficult when dealing with a large amount of training data.
- Computation time: Training usually needs a lot of computation time, and so does classification, especially when the dataset is very large. This will test your machine's efficiency, and your patience as well.
- Supervised learning is limited in the variety of tasks it can handle; it cannot handle some of the more complex tasks in machine learning.