We often run into imbalanced data in classification tasks. The hitch with imbalanced datasets is that standard classification algorithms are biased towards the majority class (the "negative" class), which leads to a higher misclassification rate for the minority class (the "positive" class). Many researchers have proposed approaches to deal with this issue, both for traditional machine learning and for ensemble techniques. They are basically divided into 3 major groups:
1. Data Sampling: the dataset is modified to produce a more or less balanced class distribution, so that the classification algorithm can learn without being skewed towards the majority class.
2. Ensemble Methods: the base learning methods are adapted so that they are better suited to imbalanced class problems.
3. Cost-sensitive learning: misclassification costs are taken into account, through data-level manipulation such as resampling, algorithmic modification, or a combination of both.
DATA SAMPLING
Over-sampling and under-sampling are techniques used in data analytics to modify unequal class distributions and create balanced data sets; together they are widely known as re-sampling. The question most frequently asked is: when should we use over-sampling rather than under-sampling, or vice versa? Under-sampling is appropriate when the data set is large enough, but in most cases over-sampling is preferred, because under-sampling can discard important data. We can use the imblearn package in Python to resample, and both types of resampling can also be effective when used together, as sketched below.
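Here is a minimal sketch of both re-sampling directions with imblearn; the toy dataset built with make_classification and the 90:10 class ratio are assumptions chosen only for illustration:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Toy imbalanced dataset: roughly 90% majority (class 0) and 10% minority (class 1).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))

# Over-sampling: replicate minority samples until the classes are balanced.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print(Counter(y_over))

# Under-sampling: discard majority samples until the classes are balanced.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(Counter(y_under))
```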
Re-sampling techniques are divided into 3 categories:
1. Under-sampling the majority class(es)
2. Over-sampling the minority class
3. Combining over- and under-sampling
Below is a list of some methods available in the imblearn module. For each re-sampling method we need to pay attention to whether it can handle numerical features, categorical features, or both.
- Under-sampling
This method removes instances from the dataset in order to reduce the imbalance ratio. The simplest way to do so is to randomly discard some elements from the majority class. Below are some under-sampling methods available in imblearn.under_sampling (a usage sketch follows the list):
— ClusterCentroids: makes use of K-means to reduce the number of samples. Each class is then represented by the centroids produced by K-means instead of the original samples.
— RandomUnderSampler: is a fast and easy way to balance the data by randomly selecting a subset of data for the targeted classes.
— EditedNearestNeighbours: applies a nearest-neighbors algorithm and "edits" the dataset by removing samples which do not agree "enough" with their neighborhood.
— RepeatedEditedNearestNeighbours: extends EditedNearestNeighbours by repeating the algorithm multiple times.
— AllKNN: differs from RepeatedEditedNearestNeighbours in that the number of neighbors of the internal nearest-neighbors algorithm is increased at each iteration.
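A short sketch of the under-sampling methods above; the toy dataset is again an assumed make_classification example, not data from the original article:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import (
    ClusterCentroids, EditedNearestNeighbours,
    RepeatedEditedNearestNeighbours, AllKNN,
)

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

for sampler in (ClusterCentroids(random_state=42),
                EditedNearestNeighbours(),
                RepeatedEditedNearestNeighbours(),
                AllKNN()):
    X_res, y_res = sampler.fit_resample(X, y)
    # The ENN-based methods clean ambiguous samples rather than force exact balance.
    print(type(sampler).__name__, Counter(y_res))
```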
- Over-Sampling
These techniques take the opposite approach: instead of reducing the majority class, the minority class is increased. New minority elements are added to the training set as duplicates or perturbed variants of the original instances.
— RandomOverSampler: similar to random under-sampling, this approach generates new samples by randomly sampling, with replacement, from the currently available samples (it can handle heterogeneous data, e.g. data containing strings).
— SMOTE and ADASYN: Synthetic Minority Over-sampling Technique (SMOTE) and Adaptive Synthetic (ADASYN) sampling are two popular methods for over-sampling minority classes. SMOTE might connect inliers and outliers, while ADASYN might focus solely on outliers; in both cases this might lead to a sub-optimal decision function. Neither method can handle categorical features, so they cannot be applied directly to mixed continuous and categorical data.
— SMOTENC: is an extension of the SMOTE algorithm in which categorical features are treated differently. SMOTENC only works when the data is a mix of categorical and numerical features.
— SMOTEN: if the dataset consists only of categorical features, we can use the SMOTEN variant instead.
Here X is the matrix of input variables (SMOTEN expects the data to resample to be made only of categorical features) and y is the output variable.
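A minimal sketch of SMOTEN on purely categorical data; the color values and the 90:10 class counts are made-up illustrative assumptions:

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTEN

# X: input variables made of a single categorical feature; y: output variable.
X = np.array(["green"] * 75 + ["red"] * 15 + ["blue"] * 10,
             dtype=object).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_res, y_res = SMOTEN(random_state=42).fit_resample(X, y)
print(Counter(y_res))  # both classes now have 90 samples

# For a mix of categorical and numerical columns, SMOTENC would be used instead,
# with its categorical_features parameter pointing at the categorical columns.
```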
- Combination of Over- and Under-sampling/ Hybrid Methods
imbalanced-learn implements two ready-to-use classes that combine over- and under-sampling: SMOTETomek and SMOTEENN.
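A brief sketch of the two hybrid samplers, again on an assumed toy dataset:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN, SMOTETomek

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# SMOTE over-sampling followed by Edited-Nearest-Neighbours cleaning.
X_enn, y_enn = SMOTEENN(random_state=42).fit_resample(X, y)

# SMOTE over-sampling followed by removal of Tomek links.
X_tl, y_tl = SMOTETomek(random_state=42).fit_resample(X, y)

print(Counter(y_enn), Counter(y_tl))
```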
ENSEMBLE METHODS
Ensemble-based classifiers, also known as multiple-classifier systems, try to improve performance by combining several classifiers into a new classifier that outperforms each of them individually. The fundamental idea of ensemble methods is to construct several classifiers from the original data and then aggregate their predictions when unknown instances arrive. A complete taxonomy of ensemble methods for imbalanced classes can be found in the review by Galar et al. (2011).
— BaggingClassifier: in ensemble learning, bagging methods build several estimators on different randomly selected subsets of the data. This method is available in scikit-learn, but the classifier does not balance each subset of data. Therefore, when trained on imbalanced data, it will tend to favor the majority classes.
— BalancedBaggingClassifier: in this approach, each bootstrap sample is further resampled to achieve the desired sampling_strategy. BalancedBaggingClassifier therefore takes the same parameters as the scikit-learn BaggingClassifier, and the resampling is controlled by the parameter sampler or by the two parameters sampling_strategy and replacement.
— BalancedRandomForestClassifier: is another ensemble method in which each tree of the forest will be provided a balanced bootstrap sample.
— Boosting: RUSBoostClassifier randomly under-samples the data set before each boosting iteration. In addition, using AdaBoostClassifier as the learners in a bagging classifier is the approach called "EasyEnsemble": the EasyEnsembleClassifier bags AdaBoost learners that are trained on balanced bootstrap samples. A usage sketch of these ensemble classifiers follows.
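A sketch of these ensemble classifiers on an assumed toy dataset; the default hyperparameters and the balanced-accuracy evaluation are illustrative choices, not recommendations from the article:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
from imblearn.ensemble import (
    BalancedBaggingClassifier, BalancedRandomForestClassifier,
    RUSBoostClassifier, EasyEnsembleClassifier,
)

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

for clf in (BalancedBaggingClassifier(random_state=42),
            BalancedRandomForestClassifier(random_state=42),
            RUSBoostClassifier(random_state=42),
            EasyEnsembleClassifier(random_state=42)):
    clf.fit(X_train, y_train)
    score = balanced_accuracy_score(y_test, clf.predict(X_test))
    print(type(clf).__name__, round(score, 3))
```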
COST-SENSITIVE LEARNING
Cost-sensitive learning takes into account the variable cost of misclassification with respect to the different classes. In general, 2 approaches have been proposed to deal with cost-sensitive issues (a small illustration follows this list):
1. Direct methods: misclassification costs are introduced and used directly inside the learning algorithm. For example, in a decision tree the cost information is used to choose the best attribute for splitting the data and to determine whether a sub-tree should be pruned.
2. Meta-learning: involves a "preprocessing" step for the training data or a "postprocessing" of the output, so that the original learning algorithm is not modified. Cost-sensitive meta-learning can be further divided into 2 big groups:
— Thresholding: a typical decision tree for a binary classification problem assigns the class label of a leaf node according to the majority class of the training samples that reach that node. A cost-sensitive algorithm instead assigns to the node the class label that minimizes the classification cost.
— Sampling: is based on modifying the training dataset (e.g. by over-sampling or under-sampling).
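As one simple illustration of passing misclassification costs to a standard classifier, scikit-learn estimators accept a class_weight parameter; the 1:10 cost ratio below is an assumption for the example, not a value from the article:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Treat misclassifying a minority (positive) sample as 10x more costly than
# misclassifying a majority (negative) one; the ratio here is only assumed.
clf = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000).fit(X, y)

# class_weight="balanced" derives the weights from the class frequencies instead.
clf_balanced = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```

Weighting the loss this way leaves the training data untouched, in contrast to the sampling-based approaches described earlier.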
REFERENCES
- Galar, M., et al. (2011). A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews.
- López, V., et al. (2013). An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Elsevier Inc.
- Vluymans, S. (2019). Dealing with Imbalanced and Weakly Labelled Data in Machine Learning using Fuzzy and Rough Set Methods. Springer International Publishing.
- Agusta, Z. P., and Adiwijaya. (2019). Modified balanced random forest for improving imbalanced data prediction. International Journal of Advances in Intelligent Informatics.