Entropy and Information Gain
Entropy: In machine learning, entropy measures the amount of uncertainty or randomness in a set of data, and it is the basis for judging how useful a split is in the decision-making process. In other words, entropy measures the impurity of a dataset: the higher the entropy, the more impure the data.
Mathematically, entropy can be calculated as,
\( E = - \sum_{i = 1}^{C} p_i \log_2 p_i \)
Where C is the number of classes and \( p_i \) is the probability of selecting the ith class. Each probability lies between 0 and 1, and the entropy ranges from 0 (a completely pure dataset) to \( \log_2 C \). For binary classification problems, where we have two classes, entropy therefore ranges from 0 to 1, and the entropy equation can be written as
\( E = - p_1 \log_2 p_1 - p_2 \log_2 p_2 \)
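To make the formula concrete, here is a minimal Python sketch of the entropy calculation; the helper name `entropy` and its list-of-probabilities interface are choices made for illustration, not part of any particular library:

```python
import math

def entropy(probabilities):
    """Entropy (in bits) of a class distribution.

    `probabilities` holds the class probabilities p_i, which should sum to 1.
    Terms with p_i = 0 are skipped, following the convention 0 * log2(0) = 0.
    """
    return sum(-p * math.log2(p) for p in probabilities if p > 0)
```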
Now, let’s calculate the entropy for the following two example boxes of balls.
Entropy (Box 1): Number of classes = 2 (red and green)
Red balls = 3
Green balls = 3
Probability of red ball, p1 = 3/6 = 0.5
Probability of green ball, p2 = 3/6 = 0.5
So, Entropy \( E = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1 \)
This is an example of maximum impurity with the highest entropy value.
Entropy (Box 2): Number of classes = 1 (green)
Green balls = 7
Probability of green ball, p1 = 7/7 = 1
So, Entropy \( E = -1 \log_2 1 = 0 \)
This is an example of minimum impurity with the lowest entropy value.
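Both box results can be reproduced with the `entropy` sketch shown earlier, using the same probabilities derived above:

```python
# Box 1: 3 red and 3 green balls -> maximum impurity for two classes
print(entropy([0.5, 0.5]))  # 1.0

# Box 2: 7 green balls only -> a pure node
print(entropy([1.0]))       # 0.0
```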
Information Gain: Information gain is a metric used to assess the effectiveness of a feature in splitting a dataset. It measures the reduction in entropy or the gain in information achieved by splitting the data based on a particular feature.
To calculate information gain, we first calculate the entropy of the original dataset, and then the weighted average entropy of the branch nodes produced by splitting on a specific feature. Information gain is then calculated as,
Information gain = Entropy (original dataset) − Weighted entropy (branch nodes)
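The calculation can be sketched in Python as follows, building on the `entropy` helper from earlier; the function name `information_gain` and its counts-based interface are assumptions made for illustration:

```python
def information_gain(parent_counts, child_counts_list):
    """Information gain of a split, in bits.

    `parent_counts` lists the class counts before the split;
    `child_counts_list` holds one list of class counts per child node.
    Assumes the entropy() helper sketched earlier is in scope.
    """
    def node_entropy(counts):
        total = sum(counts)
        return entropy([c / total for c in counts])

    n = sum(parent_counts)
    weighted = sum((sum(c) / n) * node_entropy(c) for c in child_counts_list)
    return node_entropy(parent_counts) - weighted
```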
Example:
Total balls = 17
Red balls = 10
Green balls = 7
Entropy, E = \( -\frac{10}{17} \log_2 \frac{10}{17} - \frac{7}{17} \log_2 \frac{7}{17} \)
≈ 0.98
Now suppose a feature splits this dataset into two nodes: node 1 containing 8 red and 2 green balls, and node 2 containing 2 red and 5 green balls.
For node 1, entropy = \( -\frac{8}{10} \log_2 \frac{8}{10} - \frac{2}{10} \log_2 \frac{2}{10} \)
≈ 0.72
For node 2, entropy = \( -\frac{2}{7} \log_2 \frac{2}{7} - \frac{5}{7} \log_2 \frac{5}{7} \)
≈ 0.86
Weighted entropy of the branch nodes = \( \frac{10}{17} \times 0.72 + \frac{7}{17} \times 0.86 \)
≈ 0.78
So, information gain for this split = 0.98 − 0.78
= 0.20
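The worked example above can be verified with the `information_gain` sketch, feeding in the same class counts for the original dataset and the two branch nodes:

```python
gain = information_gain(
    parent_counts=[10, 7],               # 10 red, 7 green in the original dataset
    child_counts_list=[[8, 2], [2, 5]],  # node 1: 8 red, 2 green; node 2: 2 red, 5 green
)
print(round(gain, 2))  # ~0.2
```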