Random Forest: How random forest works for classification and regression

Preface:

Random forest (RF) is a supervised machine learning algorithm. It is flexible and easy to use, and it often returns good accuracy even without careful hyperparameter tuning. The name combines two ideas. “Random” refers to the two sources of randomness in the algorithm: bagging (bootstrap aggregating), which trains each tree on a random sample of rows drawn with replacement, and the random subset of features considered at each split. “Forest” refers to a group of decision trees working together to predict a target variable. Random forest belongs to the family of ensemble learning methods.

How random forest works:

Random forest trains several decision trees independently of one another, so they can be built in parallel. Let’s say we have a dataset with 10 features and 100 samples.

Step 1:

Creating a bootstrapped dataset: Select random samples from the original dataset with replacement. The derived dataset (also known as a bootstrapped dataset) keeps the same 10 features and, in the usual formulation, has the same 100 rows as the original. Because a row can be drawn more than once, it contains fewer than 100 distinct samples (about 63 on average). The samples (rows) that were not drawn for a given tree are called the out-of-bag samples for that tree.
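The sampling in this step can be sketched with the standard library alone. This is a minimal illustration, not how any particular library implements it: the helper name bootstrap_indices and the fixed seed are assumptions made for the example, and it assumes the common convention of drawing as many rows as the original dataset has.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

rng = random.Random(0)   # fixed seed, only so the sketch is repeatable
n_samples = 100          # the 100-row dataset from the running example

in_bag = bootstrap_indices(n_samples, rng)
# rows that were never drawn for this particular bootstrap sample
out_of_bag = sorted(set(range(n_samples)) - set(in_bag))

print("rows drawn:", len(in_bag))
print("distinct rows drawn:", len(set(in_bag)))
print("out-of-bag rows for this sample:", len(out_of_bag))
```

The number of distinct rows changes with the seed, but it is always at most 100, and the out-of-bag rows are exactly the ones missing from the draw.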

Step 2:

Creating a tree: randomly select 3 features (a, b, c) out of 10 from the bootstrapped dataset. Let’s say feature “a” gives the best split for the root node. At the next split, 3 features are again drawn at random from all 10, so feature “a” is not dropped and can be selected again. Repeating this at every node generates a tree.
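The per-split feature draw can be sketched as follows. The sketch assumes, as in scikit-learn and in Breiman's formulation of the algorithm, that the candidates are drawn afresh from all 10 features at every split, so a feature used higher in the tree can appear again. The feature names and the helper draw_candidates are hypothetical.

```python
import random

def draw_candidates(features, n_candidates, rng):
    """Pick the features that one split is allowed to consider."""
    return rng.sample(features, n_candidates)

rng = random.Random(1)           # fixed seed, only for repeatability
features = list("abcdefghij")    # 10 hypothetical feature names
n_candidates = 3                 # features considered at each split

draws = []
for split in range(1, 4):
    candidates = draw_candidates(features, n_candidates, rng)
    draws.append(candidates)
    print(f"split {split}: candidates {candidates}")
```

Only the candidates drawn for a split compete for that split; the tree then uses whichever of them separates the data best.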

Step 3:

Repeat step 1 and step 2: create a new bootstrap dataset and build a tree. After repeating these steps many times (often a hundred or more), we get many trees, and together they are called a forest.

Note: each tree is grown on its own bootstrapped dataset, so an individual tree has low bias but high variance.

Step 4:

Getting a prediction: take a sample (row) without the target variable and feed it to all the trees. Each tree returns its opinion (class). The class with the highest count is the prediction of the random forest. In the case of regression, each tree returns a continuous value, and the final result is the mean (or, less commonly, the median) of all the outcomes.
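The aggregation step is small enough to sketch directly. The function names and the per-tree outputs below are made up for illustration; the sketch shows plain majority voting and plain averaging.

```python
from collections import Counter
from statistics import mean

def forest_classify(tree_votes):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_outputs):
    """Return the average of the per-tree predictions."""
    return mean(tree_outputs)

# hypothetical outputs of five classification trees for one row
votes = ["setosa", "versicolor", "setosa", "setosa", "virginica"]
# hypothetical outputs of four regression trees for one row
outputs = [10.0, 12.0, 11.0, 13.0]

print(forest_classify(votes))    # setosa
print(forest_regress(outputs))   # 11.5
```

Note that scikit-learn's RandomForestClassifier averages the class probabilities predicted by the trees rather than counting hard votes, which usually, but not always, picks the same class.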

Step 5:

Measuring the accuracy of the random forest:
Each bootstrap leaves out some samples, so a given sample is typically out-of-bag for some of the trees. Those trees have never seen the sample. Each sample is therefore fed only to the trees for which it is out-of-bag, and their aggregated result is taken as its prediction. The accuracy of the random forest is then measured by the proportion of “out-of-bag” samples that are correctly classified. Inversely, the proportion of “out-of-bag” samples that are incorrectly classified is called the out-of-bag error.
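In scikit-learn this estimate is available by passing oob_score=True. In that implementation, each row is scored only by the trees whose bootstrap sample did not contain it. The sketch below uses the iris data bundled with scikit-learn instead of a CSV file, which is an assumption made to keep it self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# oob_score=True requires bootstrap=True (the default)
clf = RandomForestClassifier(n_estimators=100, bootstrap=True,
                             oob_score=True, random_state=0)
clf.fit(X, y)

oob_accuracy = clf.oob_score_      # share of rows classified correctly out-of-bag
oob_error = 1.0 - oob_accuracy     # the out-of-bag error
print(f"OOB accuracy: {oob_accuracy:.3f}, OOB error: {oob_error:.3f}")
```

No separate test set is needed for this estimate, because every row is only ever judged by trees that were not trained on it.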

Step 6:

To adjust the result of the model, we can change the number of features considered at each split in “step 2” and compare the resulting out-of-bag errors.
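One way to try this with scikit-learn is to vary max_features, the parameter that sets how many features each split may consider, and record the out-of-bag accuracy for each value. The dataset and the range of values below are assumptions chosen only to keep the sketch short.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)   # 4 features, so max_features can be 1 to 4

oob_by_max_features = {}
for m in (1, 2, 3, 4):
    clf = RandomForestClassifier(n_estimators=200, max_features=m,
                                 oob_score=True, random_state=0)
    clf.fit(X, y)
    oob_by_max_features[m] = clf.oob_score_

for m, score in oob_by_max_features.items():
    print(f"max_features={m}: OOB accuracy {score:.3f}")
```

The setting with the highest out-of-bag accuracy (lowest out-of-bag error) is a reasonable candidate, though on a dataset this small the differences may be within noise.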

Random Forest and Overfitting:

The random forest consists of many decision trees, each trained on its own bootstrapped dataset. A single decision tree is very sensitive to its training data: a small change in the data can produce a quite different tree and quite different predictions. In other words, a decision tree has low bias and high variance. Aggregating the results of many trees in the random forest reduces this variance.
A decision tree is very prone to overfitting because, left unrestricted, it keeps splitting until it fits every training sample. In a random forest, overfitting can be countered further by limiting the depth of the trees.
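The gap between training and test accuracy makes this visible. The sketch below is one possible experiment, not a benchmark: the synthetic dataset, the amount of label noise, and the model settings are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 assigns a random class to about 20% of the rows (label noise)
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "single tree (unrestricted)": DecisionTreeClassifier(random_state=0),
    "forest (unrestricted trees)": RandomForestClassifier(
        n_estimators=200, random_state=0),
    "forest (max_depth=5)": RandomForestClassifier(
        n_estimators=200, max_depth=5, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train {scores[name][0]:.3f}, test {scores[name][1]:.3f}")
```

A large drop from training accuracy to test accuracy is the usual sign of overfitting; comparing the three rows shows how aggregation and a depth limit change that gap on this particular dataset.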

Important Hyperparameters:

Some of the important hyperparameters of RandomForestClassifier are as follows:

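n_estimators: Number of trees in the forest.

criterion: The measure used to choose the split at a node. For the classifier, scikit-learn offers gini and entropy (chi-square is used by some other tree algorithms but is not an option in scikit-learn). For the regressor, the measures are mean squared error (variance reduction) and mean absolute error, named squared_error and absolute_error in current scikit-learn (mse and mae in older versions).

max_depth: The maximum depth of each tree. If the value is not set (None), the tree expands until all leaves are pure or contain fewer than min_samples_split samples.

bootstrap: If True, each tree is trained on a bootstrapped dataset; if False, the whole dataset is used to train each tree.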
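These hyperparameters are passed as keyword arguments when the model is created. The values below are arbitrary choices for illustration, not recommendations.

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    criterion="entropy",   # measure used to choose each split
    max_depth=None,        # no depth limit on the trees
    bootstrap=True,        # train each tree on a bootstrapped dataset
)

# get_params() returns the settings the model will use when fitted
params = clf.get_params()
for name in ("n_estimators", "criterion", "max_depth", "bootstrap"):
    print(name, "=", params[name])
```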

Advantages and Disadvantages of the Random Forest Algorithm:

1. One of the biggest advantages of a random forest is that it can be used for both classification and regression.
2. It often returns high accuracy on unseen data.
3. It can handle high dimensional data.
4. It can still predict well when a lot of data is missing, provided the missing values are handled (for example, imputed) first.
5. Even with its default hyperparameter settings, random forest often predicts as well as or better than many other machine learning algorithms.
6. One of the disadvantages of random forest is that it becomes computationally expensive as the number of trees grows.
7. Individual trees grown to full depth can overfit. The max_depth hyperparameter can be used to counter this.

Feature Importance/selection by Random Forest:

Another important aspect of random forest is feature selection. The feature selection property of RF falls under embedded methods, that is, algorithms that have their own built-in feature selection mechanism.

Random forest is nothing but hundreds of decision trees. Each tree is trained on a random set of samples and considers a random set of features at each split. So, every tree is trained on a different mini-dataset (bootstrapped dataset), which makes every tree unique. Splitting a node based on some measure is the core logic of a decision tree. The measure says how impure a node is. If splits on feature “x” reduce the impurity a lot, averaged over all the trees, then that feature is important. Inversely, if splits on feature “x” reduce the impurity only slightly, or the feature is rarely chosen for a split, then that feature is weak.
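A small worked example of the impurity idea, using Gini impurity (one minus the sum of squared class proportions). The helper names and the toy labels are assumptions for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

parent = [0, 0, 0, 1, 1, 1]                              # evenly mixed node
good = impurity_decrease(parent, [0, 0, 0], [1, 1, 1])   # split separates the classes
weak = impurity_decrease(parent, [0, 0, 1], [0, 1, 1])   # split barely helps

print(f"{good:.3f}")   # 0.500
print(f"{weak:.3f}")   # 0.056
```

A feature whose splits look like the first case would accumulate a large total impurity decrease and rank as important; one whose splits look like the second would rank low.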

The measures are different for RF classification and regression. For classification, common measures are the Gini index and entropy (chi-square is used by some other tree algorithms). Random forest regression uses MSE (variance reduction) for splitting a node.
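In scikit-learn, the impurity-based importance of each feature is exposed after fitting as the feature_importances_ attribute, normalized so that the values sum to 1. The sketch below uses the bundled iris data, which is an assumption made to keep it self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(data.data, data.target)

# pair each feature name with its importance and sort, most important first
ranked = sorted(zip(data.feature_names, clf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Features near the bottom of this ranking are candidates for removal, although impurity-based importances can be biased toward features with many distinct values, so it is worth checking the effect on accuracy before dropping anything.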

Extremely randomized trees (ERT):

Extremely randomized trees are very similar to normal random forests. However, there are two differences. Firstly, ERT does not use bagging by default: the whole dataset is used to train every tree. Secondly, the split points are random: for each candidate feature a threshold is drawn at random, and the best of these random splits is chosen. The decision trees become less correlated because the split points are random. Each individual tree is therefore a noisier model, but averaging over a large number of trees reduces the variance.

In practice, there are three main hyperparameters to tune in ERT: the number of trees in the forest, the number of features considered at each split, and the minimum number of samples (rows) required to split a node.
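scikit-learn provides this variant as ExtraTreesClassifier (and ExtraTreesRegressor). In the sketch below, the three hyperparameters map to n_estimators, max_features and min_samples_split; the values and the dataset are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

ert = ExtraTreesClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features=2,        # features considered at each split
    min_samples_split=2,   # minimum rows required to split a node
    random_state=0,
)

# 5-fold cross-validated accuracy
scores = cross_val_score(ert, X, y, cv=5)
print("bootstrap used:", ert.bootstrap)   # False by default: whole dataset per tree
print("mean accuracy:", round(scores.mean(), 3))
```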

RandomForest classification Example:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

# read the iris dataset (the file has no header row) into a pandas DataFrame
df = pd.read_csv("IRIS.CSV", header=None)

# shuffle the rows of the DataFrame
df_shuffle = shuffle(df)

# convert the DataFrame to a NumPy array
df_numpy = np.array(df_shuffle)

# y holds only the target variable (column 4)
y = df_numpy[:, 4]

# X holds all the independent variables (columns 0 to 3)
X = df_numpy[:, 0:4]

# split X and y into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# build the random forest classifier model
clf = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=5, bootstrap=True)

# train the model
clf.fit(X_train, y_train)

# y_pred holds the predicted results for X_test
y_pred = clf.predict(X_test)

# print the mean accuracy of the model on the test data
print(clf.score(X_test, y_test))

RandomForestRegressor Example:

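import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

# read the dataset (first row is the header) and drop the Date column
df = pd.read_csv("google.csv", header=0)
df = df.drop(columns='Date')

# shuffle the rows and look at the first few
df_shuffle = shuffle(df)
print(df_shuffle.head())

# convert the DataFrame to a NumPy array
df_numpy = np.array(df_shuffle)

# y holds the target variable (column 4), X the independent variables (columns 0 to 3)
y = df_numpy[:, 4]
X = df_numpy[:, 0:4]

# split X and y into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# build the regressor; 'squared_error' was called 'mse' in scikit-learn versions before 1.0
RFR = RandomForestRegressor(n_estimators=10, criterion='squared_error', max_depth=40)

# train the model
RFR.fit(X_train, y_train)

# y_pred holds the predicted values for X_test
y_pred = RFR.predict(X_test)

# print the R^2 score of the model on the test data
print(RFR.score(X_test, y_test))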
