# Ensemble Techniques

What are Ensemble Techniques?

● Ensemble techniques combine several machine learning models to obtain better predictions. In other words, multiple learning algorithms are applied to the same task.

● Ensemble learning works by integrating multiple models to improve the machine learning outcome. This approach can produce better predictive performance than a single model.

● Ensemble approaches have been used to win well-known competitions such as the Netflix Prize and many Kaggle competitions.

● Consider one example:

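○ Suppose we want to build a classifier that labels food images as hot dog or not hot dog.

○ Suppose we have trained models with different machine learning algorithms, and they give the following accuracies:

■ Support Vector Machine: 75%

■ Logistic Regression: 65%

■ Decision Tree Algorithm: 70%

■ Now let us combine the error rates, assuming the three models make their mistakes independently of each other:

■ $1 - (0.25 \times 0.35 \times 0.30) = 1 - 0.02625 = 0.97375 = 97.375\%$

○ So at least one of the three models is correct about 97.4% of the time, which is higher than the accuracy of any individual model. This is an optimistic ceiling for the ensemble: in practice model errors are correlated, and a majority vote lands at or below this figure.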

● The accuracy of the ensemble model should be better than that of the individual models. Otherwise it makes no sense to train multiple models and spend extra computation when a single algorithm gives better accuracy.
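The arithmetic in the hot dog example can be checked with a few lines of Python. This is a minimal sketch that assumes the three models make errors independently (a simplifying assumption, rarely true in practice). It computes two quantities: the chance that at least one model is right, and the accuracy of a simple majority vote. The function names are illustrative only.

```python
from itertools import product

# Individual accuracies from the example (SVM, logistic regression, decision tree).
accuracies = [0.75, 0.65, 0.70]


def prob_at_least_one_correct(accs):
    """Chance that at least one model is right, assuming independent errors."""
    p_all_wrong = 1.0
    for a in accs:
        p_all_wrong *= 1.0 - a
    return 1.0 - p_all_wrong


def prob_majority_correct(accs):
    """Chance that more than half of the models are right, assuming independent errors."""
    total = 0.0
    # Enumerate every right/wrong combination across the models.
    for outcome in product([True, False], repeat=len(accs)):
        if sum(outcome) > len(accs) / 2:
            p = 1.0
            for is_correct, a in zip(outcome, accs):
                p *= a if is_correct else 1.0 - a
            total += p
    return total


at_least_one = prob_at_least_one_correct(accuracies)
majority = prob_majority_correct(accuracies)
print(f"at least one correct: {at_least_one:.5f}")
print(f"majority correct:     {majority:.5f}")
```

Under these assumptions the majority vote comes out near 78.5%, better than each single model but well below the 97.375% ceiling.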

Why use ensemble models?

○ Better accuracy (lower error)

○ Higher consistency (helps avoid overfitting)

○ Reduced bias and variance errors

When and where to use ensemble models?

○ When a single model overfits

○ When the result (better accuracy) is worth the extra training

○ For both classification and regression problems, since ensemble models can be used for either

Popular ensemble techniques

○ Bootstrap Aggregating (Bagging)

○ Boosting

○ Stacking

Bootstrap Aggregating (Bagging):

● What are variance error and bias error?

Variance:

○ Variance error is the variability of the learned target function’s form with respect to different training sets. Models with small variance error will not change much if you replace a couple of samples in the training set. Models with high variance may be affected by even slight changes in the training set.

○ Decision trees are examples of models with low bias and high variance. A tree makes almost no assumptions about the target function (low bias), but it is highly susceptible to variance in the data.

● Bagging is used when the goal is to reduce the variance of a decision tree classifier. The aim is to build multiple subsets of data by sampling the training set randomly with replacement. Each subset is used to train its own decision tree. As a result, we obtain an ensemble of different models. The predictions of all the trees are then aggregated (averaged for regression, majority vote for classification), which is more reliable than relying on a single decision tree.

Bagging Steps:

○ Suppose the training data set has N observations and M features. A sample is taken from the training data set randomly with replacement.

○ A subset of the M features is selected randomly, and whichever feature gives the best split is used to split the node. This is repeated at each node.

○ The tree is grown to its largest possible size.

○ The above steps are repeated n times, and the prediction is obtained by aggregating the predictions of the n trees.

○ Advantages:

■ Reduces overfitting of the model.

■ Handles higher dimensionality data very well.

■ Maintains accuracy for missing data.

○ Disadvantages:

■ Since the final prediction is based on the mean of the predictions from the individual trees, it won’t give precise values for the classification and regression model.

○ Implementation:

■ We can use the sklearn (scikit-learn) library in Python to implement bagging.
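A minimal sketch of bagging with scikit-learn's `BaggingClassifier` follows. The synthetic dataset, the 50 trees and the train/test split are assumptions chosen for illustration, not values from these notes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real training set (an assumption for this sketch).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The base tree is passed positionally because its keyword name differs
# between scikit-learn versions (base_estimator vs. estimator).
bagging = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,   # number of trees, each trained on its own bootstrap sample
    bootstrap=True,    # sample the training set with replacement
    random_state=0,
)
bagging.fit(X_train, y_train)

accuracy = bagging.score(X_test, y_test)
print(f"bagging test accuracy: {accuracy:.3f}")
```

Each fitted tree is available in `bagging.estimators_`, which makes it possible to compare a single tree against the aggregated prediction.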

Boosting

● ‘Boosting’ refers to a family of algorithms that turn weak learners into strong learners.

○ A weak learner is a classifier that is only slightly correlated with the true classification (it does a little better than random guessing).

○ A strong learner is a classifier that is arbitrarily well-correlated with the true classification.

○ Boosting is an ensemble method that improves the model predictions of any given learning algorithm.

● Types of boosting:

○ AdaBoost

○ Gradient Boosting

○ XGBoost

● AdaBoost

○ AdaBoost integrates multiple weak learners into one strong learner, with each new learner correcting the mistakes of the previous one.

○ The weak learners in AdaBoost are decision trees with a single split, called decision stumps.

○ When AdaBoost creates its first decision stump, all observations are weighted equally.

○ In the next round, the observations that were incorrectly classified carry more weight than the observations that were correctly classified, so that the next stump concentrates on correcting the previous errors.

○ AdaBoost algorithms can be used for both classification and regression problems.
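The reweighting idea can be illustrated with one round of the classic discrete AdaBoost update. The notes do not give the formula, so the update rule below and the list of right/wrong outcomes are assumptions for illustration only.

```python
import math

# Hypothetical outcome of the first decision stump on 10 observations:
# True means the stump classified that observation correctly.
correct = [True, True, True, False, True, True, False, True, False, True]

n = len(correct)
weights = [1.0 / n] * n  # all observations start with equal weight

# Weighted error of the stump and its "amount of say" (alpha).
error = sum(w for w, ok in zip(weights, correct) if not ok)
alpha = 0.5 * math.log((1.0 - error) / error)

# Increase the weight of misclassified observations, decrease the rest,
# then renormalize so the weights sum to 1 again.
updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
total = sum(updated)
updated = [w / total for w in updated]

weight_wrong = updated[3]  # a misclassified observation
weight_right = updated[0]  # a correctly classified observation
print(f"misclassified: {weight_wrong:.4f}, correct: {weight_right:.4f}")
```

After the update the three misclassified observations together hold half of the total weight, so the next stump is pushed towards getting them right.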

● Gradient Boosting

○ Gradient Boosting works by adding predictors sequentially to an ensemble, each one correcting its predecessor.

○ However, instead of adjusting the weights of the incorrectly classified observations at every iteration as AdaBoost does, Gradient Boosting fits each new predictor to the residual errors made by the previous predictor.

○ GBM (Gradient Boosting Machine) uses gradient descent to find the shortcomings in the predictions of the previous learner. The GBM algorithm can be summarised in the following steps:

○ Fit a model to the data, F1(x) ≈ y

○ Fit a model to the residuals, h1(x) ≈ y − F1(x)

○ Create a new model, F2(x) = F1(x) + h1(x)

○ By combining weak learner after weak learner in this way, the final model accounts for much of the error of the original model and reduces this error over time.
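The three steps (fit F1, fit h1 to the residuals, add the two) can be sketched with shallow regression trees as the weak learners. The noisy sine data, the tree depth and the variable names are assumptions for illustration; a real GBM repeats the step many times and scales each correction by a learning rate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical regression data: a noisy sine wave.
rng = np.random.default_rng(0)
X = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# Step 1: fit a first weak model F1 to the data.
f1 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
pred_f1 = f1.predict(X)

# Step 2: fit a second weak model h1 to the residuals y - F1(x).
residuals = y - pred_f1
h1 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)

# Step 3: the combined model is F2(x) = F1(x) + h1(x).
pred_f2 = pred_f1 + h1.predict(X)

mse_f1 = np.mean((y - pred_f1) ** 2)
mse_f2 = np.mean((y - pred_f2) ** 2)
print(f"training MSE after F1: {mse_f1:.4f}")
print(f"training MSE after F2: {mse_f2:.4f}")
```

The training error after F2 is never higher than after F1, which is the "decreases this error over time" behaviour described in the steps.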

● XGBoost

○ XGBoost stands for eXtreme Gradient Boosting. It is an implementation of gradient boosted decision trees designed for speed and performance. Gradient boosting machines are generally slow to train because the models are built sequentially, so they do not scale well. XGBoost therefore focuses on computational speed and model performance. XGBoost provides:

■ Parallelization of tree construction during training, using all available CPU cores.

■ Distributed Computing for training very large models using a cluster of machines.

■ Out-of-Core Computing for very large datasets that do not fit into memory.

■ Cache Optimization of data structures and algorithms to make the best use of hardware.

○ Implementation:

○ We can implement boosting using the sklearn (scikit-learn) library in Python.
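A minimal sketch with scikit-learn's `AdaBoostClassifier` and `GradientBoostingClassifier` follows. The synthetic data and the hyperparameter values are assumptions for illustration. XGBoost is distributed as a separate `xgboost` package and is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real training set (an assumption for this sketch).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    # By default AdaBoostClassifier boosts decision stumps (depth-1 trees).
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    # Each new tree is fitted to the errors left by the trees before it.
    "GradientBoosting": GradientBoostingClassifier(
        n_estimators=50, learning_rate=0.1, random_state=0
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name} test accuracy: {scores[name]:.3f}")
```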

Decision Tree Algorithm

● A decision tree is a supervised machine learning algorithm. It can be used for both regression and classification problems, but it is mostly used for classification.

● “A decision tree is a graphical representation of all the possible solutions to a decision based on certain conditions”.

● It is called a decision tree because, like a tree, it starts from a root and then branches off into a number of solutions.

● A decision tree follows a set of if-else conditions to visualize the data and classify it according to the conditions.

Important terminology of decision tree

Root Node:

■ It represents the entire population or sample, which is then divided into two or more homogeneous sets.

Branch/sub-tree

■ A subsection of the entire tree is called a branch or sub-tree.

Splitting

■ It is the process of dividing a node into two or more sub-nodes.

Decision Node

■ When a sub-node splits into further sub-nodes, it is called a decision node.

Leaf/Terminal node

■ Nodes with no children (no further split) are called leaf or terminal nodes.

Pruning

■ When we remove the sub-nodes of a decision node, the process is called pruning. It is the opposite of splitting.

Parent and Child Node

■ A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are the children of the parent node.

Here A is the parent of B and C.

● Steps to implement decision tree

○ Begin the tree with the root node.

○ Find the best attribute in the dataset using an Attribute Selection Measure (ASM).

○ Divide the root node into subsets, one for each possible value of the best attribute.

○ Generate the decision tree node that contains the best attribute.

○ Recursively build new decision trees using the subsets of the dataset created in step 3. Continue this process until the nodes cannot be classified any further; the final nodes are called leaf nodes.

● Attribute Selection Measure:

○ While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve this problem there is a technique called the Attribute Selection Measure (ASM). With this measure we can select the best attribute for each node of the tree. There are two popular techniques for ASM:

■ Information Gain

■ Gini Index

○ Information Gain:

■ Information gain is the measurement of changes in entropy after the segmentation of a dataset based on an attribute.

■ It calculates how much information a feature provides us about a class.

■ According to the value of information gain, we split the node and build the decision tree.

■ A decision tree algorithm always tries to maximize the value of information gain, and a node/attribute having the highest information gain is split first. It can be calculated using the below formula:

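● $\text{Information Gain} = \text{Entropy}(S) - \sum_{v} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)$, where $S_v$ is the subset of $S$ in which the attribute takes the value $v$, so the sum is the weighted average of the entropy of each subset.

● Entropy: Entropy is a metric that measures the impurity of the data at a node. It specifies the randomness in the data. Entropy can be calculated as:

● $\text{Entropy}(S) = -P(\text{yes})\log_2 P(\text{yes}) - P(\text{no})\log_2 P(\text{no})$

● Where,

● $S$ = the set of samples at the node

● $P(\text{yes})$ = probability of yes

● $P(\text{no})$ = probability of no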
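The entropy and information gain calculations can be sketched in plain Python. The helper names and the toy yes/no data are hypothetical. The entropy helper generalises the two-class formula to any number of classes.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, feature_values):
    """Entropy before the split minus the weighted average entropy after it."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


# Hypothetical toy data: four samples, two candidate attributes.
labels = ["yes", "yes", "no", "no"]
perfect_attribute = ["a", "a", "b", "b"]  # separates the classes completely
useless_attribute = ["a", "b", "a", "b"]  # tells us nothing about the class

gain_perfect = information_gain(labels, perfect_attribute)
gain_useless = information_gain(labels, useless_attribute)
print(f"entropy of labels: {entropy(labels):.3f}")
print(f"gain (perfect attribute): {gain_perfect:.3f}")
print(f"gain (useless attribute): {gain_useless:.3f}")
```

The attribute with the higher gain is the one the tree would split on first.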

○ Gini Index:

■ Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.

■ An attribute with a low Gini index should be preferred over one with a high Gini index.

■ The CART algorithm creates only binary splits, and it uses the Gini index to choose them.

■ It is calculated by subtracting the sum of squared probabilities of each class from one.

■ The Gini index can be calculated using the formula below:

■ $\text{Gini Index} = 1 - \sum_{j} P_j^2$
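The Gini formula translates directly into code. In this sketch the helper names and the toy labels are hypothetical. The second helper shows how a binary split would be scored by weighting each side by its share of the samples.

```python
from collections import Counter


def gini_index(labels):
    """One minus the sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_gini(left, right):
    """Gini index of a binary split, weighting each side by its size."""
    n = len(left) + len(right)
    return len(left) / n * gini_index(left) + len(right) / n * gini_index(right)


mixed = ["yes", "yes", "no", "no"]    # evenly mixed node
pure = ["yes", "yes", "yes", "yes"]   # pure node

gini_mixed = gini_index(mixed)
gini_pure = gini_index(pure)
gini_split = weighted_gini(["yes", "yes"], ["no", "no"])
print(gini_mixed, gini_pure, gini_split)
```

A lower value means a purer node, so the split with the lowest weighted Gini index is preferred.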

● For example,

● According to this image, the task is to decide whether a person should accept a new job offer, and a decision tree has been created for this.

● Here the decision tree starts with the base condition, the root node, which asks whether the salary is greater than \$50,000. If no, decline the offer. If yes, it is further checked whether the commute is more than 1 hour. If yes, decline the offer; if no, it is further checked whether the job offers free coffee. If yes, accept the offer; if no, decline the offer.
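Because a decision tree is a chain of if-else conditions, the job offer tree can be written directly as a small function. The function and parameter names are hypothetical; the thresholds are the ones stated in the example.

```python
def decide_job_offer(salary, commute_hours, free_coffee):
    """Walk the job offer tree from the root node down to a leaf."""
    # Root node: is the salary greater than 50,000?
    if salary <= 50_000:
        return "decline"
    # Decision node: is the commute more than 1 hour?
    if commute_hours > 1:
        return "decline"
    # Decision node: does the job offer free coffee?
    if free_coffee:
        return "accept"
    return "decline"


print(decide_job_offer(60_000, 0.5, True))   # good salary, short commute, coffee
print(decide_job_offer(60_000, 2.0, True))   # commute too long
```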

● To build a decision tree, we first have to identify the set of questions that the tree will ask.

● Decision trees are examples of models with low bias and high variance.

● We can implement a decision tree using the sklearn library in Python.
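A minimal sketch of such a classifier follows. It assumes the Iris dataset bundled with scikit-learn as stand-in data and a 75/25 train/test split, neither of which comes from these notes. The classifier is created with `criterion="entropy"` and `random_state=0`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Iris is used only as convenient example data (an assumption for this sketch).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Create the classifier object and fit it to the training data.
classifier = DecisionTreeClassifier(criterion="entropy", random_state=0)
classifier.fit(X_train, y_train)

# Predict the held-out samples and report the accuracy.
y_pred = classifier.predict(X_test)
accuracy = classifier.score(X_test, y_test)
print(f"decision tree test accuracy: {accuracy:.3f}")
```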

● The classifier object is created with two main parameters:

● criterion='entropy': the criterion used to measure the quality of a split, here the information gain computed from entropy.

● random_state=0: the seed for the random number generator, which makes the results reproducible.

Random Forest Algorithm

● Random forest is a supervised learning algorithm used for both classification and regression, although it is mainly used for classification problems.

● Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.

● Steps of random forest algorithm:

○ It starts with the selection of random samples from a given dataset.

○ Next, the algorithm constructs a decision tree for every sample and gets a prediction from every tree.

○ Then a vote is taken over the predicted results.

○ Finally, the prediction with the most votes is selected as the final prediction (for regression, the tree predictions are averaged instead).

● Implementation of Random Forest Algorithm:

○ We can implement the random forest algorithm with the help of the sklearn library.
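A minimal sketch with scikit-learn's `RandomForestClassifier` follows. The Iris data, the split and the choice of 100 trees are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Iris is used only as convenient example data (an assumption for this sketch).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Build 100 trees, each on its own random sample of the training data.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

accuracy = forest.score(X_test, y_test)
print(f"random forest test accuracy: {accuracy:.3f}")
```

Note that scikit-learn combines the trees by averaging their predicted class probabilities rather than counting one hard vote per tree, which in most cases leads to the same final class.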