Information gain is the reduction in entropy achieved by the split at a given node, i.e. the entropy before the split minus the entropy after it:

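\[\text{Information Gain} = \text{Entropy(before)} - \sum_{j=1}^K \frac{N_j}{N} \, \text{Entropy(j, after)}\]

Where "before" is the dataset before the split, $K$ is the number of subsets generated by the split, $\text{(j, after)}$ is subset $j$ after the split, $N$ is the number of examples before the split, and $N_j$ is the number of examples in subset $j$. Each subset's entropy is weighted by its share of the examples, $N_j / N$, so that small subsets do not count as much as large ones.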
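As a rough illustration, the calculation can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the helper names `entropy` and `information_gain` and the toy labels are assumptions made for this example, and each subset's entropy is weighted by its share of the examples, which is the usual convention in ID3-style definitions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(before, subsets):
    """Entropy of `before` minus the size-weighted entropy of `subsets`."""
    n = len(before)
    weighted_after = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(before) - weighted_after

# Toy node with two classes in equal proportion (hypothetical data).
before = ["yes", "yes", "no", "no"]

# A split that separates the classes perfectly.
perfect = information_gain(before, [["yes", "yes"], ["no", "no"]])

# A split that leaves each subset as mixed as the parent.
useless = information_gain(before, [["yes", "no"], ["yes", "no"]])

print(perfect, useless)
```

The perfect split recovers all of the parent's entropy as gain, while the second split leaves the entropy unchanged and so gains nothing.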

So this method measures how well a given attribute separates the training examples according to their target classification.

The compromise of this method is that it is locally greedy: it optimizes only at the level of the current node, maximizing information gain and thereby minimizing the entropy that remains after the split.

The "greedy approach" is a heuristic: it makes the locally optimal choice at each node and relies on this sequence of local choices to approximate the globally optimal tree. The result is not guaranteed to be the best tree overall.
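A small XOR-style example can make the limitation concrete. The sketch below is illustrative only: the attributes `a` and `b`, the four-row dataset, and the helper functions are all hypothetical. At the root, neither attribute shows any gain on its own, so a greedy chooser has nothing to prefer, yet splitting on `a` and then on `b` separates the classes completely.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain from splitting on the distinct values of `attr`."""
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[row[attr]].append(y)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical data: the label is the XOR of two binary attributes.
rows = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
labels = [r["a"] ^ r["b"] for r in rows]

# Greedy step at the root: score every attribute in isolation.
root_gains = {attr: information_gain(rows, labels, attr) for attr in ("a", "b")}

# One level down, inside the branch a == 0, attribute b is fully informative.
branch = [(r, y) for r, y in zip(rows, labels) if r["a"] == 0]
branch_rows = [r for r, _ in branch]
branch_labels = [y for _, y in branch]
gain_b_after_a = information_gain(branch_rows, branch_labels, "b")

print(root_gains, gain_b_after_a)
```

Because the greedy criterion only looks one split ahead, it cannot see that a split with no immediate gain enables a perfect split at the next level.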

Information gain (IG) is also biased toward variables with a large number of distinct values (not toward variables whose observations have large values). A variable with many distinct values can divide the data into many small chunks, and the fewer observations a chunk contains, the less likely it is to contain a mix of classes. Such chunks look nearly pure, which inflates the gain even when the variable has little real predictive value.
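The bias is easiest to see with an identifier-like column. In the sketch below, which uses made-up data and hypothetical names (`row_id`, `weather`), a column with one distinct value per row puts every observation in its own chunk, so every chunk is trivially pure and the gain equals the full entropy of the parent node.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain from splitting `labels` by the distinct entries of `values`."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy dataset with eight observations.
labels = ["yes", "no", "yes", "no", "yes", "no", "yes", "no"]
row_id = list(range(8))  # one distinct value per row, no predictive meaning
weather = ["sun", "sun", "sun", "rain", "rain", "rain", "sun", "rain"]

gain_id = information_gain(row_id, labels)
gain_weather = information_gain(weather, labels)

# The identifier wins the comparison despite being useless on new data.
print(gain_id, gain_weather, gain_id > gain_weather)
```

An attribute like `row_id` cannot generalize to unseen rows, yet plain information gain would pick it first. Normalized criteria such as gain ratio are commonly used to counter this effect.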