Data Preparation
Data Balancing
This endpoint allows users to upload a file and balance the dataset based on a specified target column. Users can choose from various balancing methods, including oversampling, undersampling, hybrid methods, and class weighting.
Endpoint: POST /data-balancing
Request Parameters
File Upload
- file (required): The file to be processed. Supported formats include CSV and Excel.
Form Parameters
- target_column (required): The target column to balance the dataset on.
- method (optional, default: oversampling): The balancing method to apply. Supported values:
  - oversampling: Increases the number of samples in the minority class.
  - undersampling: Decreases the number of samples in the majority class.
  - hybrid: Combines oversampling and undersampling techniques.
  - class_weighting: Adjusts class weights to handle imbalance.
- oversampling_method (optional, default: SMOTE): The oversampling method to use if method is oversampling. Supported values:
  - SMOTE: Synthetic Minority Over-sampling Technique.
  - ADASYN: Adaptive Synthetic Sampling.
  - RandomOverSampler: Random oversampling.
- k_neighbors (optional, default: 5): The number of neighbors to use for oversampling methods like SMOTE and ADASYN.
- random_state (optional, default: 42): Random seed for reproducibility.
- undersampling_method (optional, default: RandomUnderSampler): The undersampling method to use if method is undersampling. Supported values:
  - RandomUnderSampler: Random undersampling.
  - ClusterCentroids: Undersampling using cluster centroids.
  - TomekLinks: Removes Tomek links to clean the dataset.
- sampling_strategy (optional, default: 0.8): The sampling strategy for balancing. Can be a float or a dictionary specifying the desired ratio of classes.
- class_weights (optional, default: balanced): The class weighting strategy if method is class_weighting. Supported values:
  - balanced: Automatically adjusts weights inversely proportional to class frequencies.
  - custom_weights: Allows custom weights (not implemented in this version).
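As a rough illustration of how a client might assemble these form parameters, the sketch below fills in the documented defaults and checks values against the documented supported values before anything is sent. The helper name build_form_fields and the column name churned are hypothetical, and encoding a dictionary sampling_strategy as JSON is an assumption, since the expected wire format is not stated here.

```python
import json

# Supported values as listed in the parameter descriptions above.
SUPPORTED = {
    "method": {"oversampling", "undersampling", "hybrid", "class_weighting"},
    "oversampling_method": {"SMOTE", "ADASYN", "RandomOverSampler"},
    "undersampling_method": {"RandomUnderSampler", "ClusterCentroids", "TomekLinks"},
    # custom_weights is listed but documented as not implemented in this version.
    "class_weights": {"balanced", "custom_weights"},
}

# Documented defaults for the optional form parameters.
DEFAULTS = {
    "method": "oversampling",
    "oversampling_method": "SMOTE",
    "k_neighbors": 5,
    "random_state": 42,
    "undersampling_method": "RandomUnderSampler",
    "sampling_strategy": 0.8,
    "class_weights": "balanced",
}


def build_form_fields(target_column, **overrides):
    """Hypothetical client-side helper: merge overrides into the defaults and validate."""
    if not target_column:
        raise ValueError("target_column is required")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")

    fields = {**DEFAULTS, **overrides, "target_column": target_column}
    for name, allowed in SUPPORTED.items():
        if fields[name] not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}")

    # Form fields travel as strings. Sending a dict as JSON is an assumption.
    return {
        name: json.dumps(value) if isinstance(value, dict) else str(value)
        for name, value in fields.items()
    }


fields = build_form_fields(
    "churned", method="undersampling", undersampling_method="TomekLinks"
)
```

The resulting dictionary could then be sent as the form data of a multipart POST to /data-balancing, alongside the uploaded file.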
Processing Logic
- Read the uploaded file: The file is converted into a Pandas DataFrame.
- Validate the target column: Checks if the specified target column exists in the DataFrame.
- Encode categorical columns: Converts categorical columns into numerical values using label encoding.
- Apply the selected balancing method:
  - Oversampling:
    - SMOTE: Generates synthetic samples for the minority class.
    - ADASYN: Generates synthetic samples adaptively.
    - RandomOverSampler: Randomly duplicates samples from the minority class.
  - Undersampling:
    - RandomUnderSampler: Randomly removes samples from the majority class.
    - ClusterCentroids: Reduces the majority class by generating centroids.
    - TomekLinks: Removes Tomek links to clean the dataset.
  - Hybrid: Combines oversampling and undersampling using SMOTEENN.
  - Class Weighting: Adjusts class weights to handle imbalance without modifying the dataset.
- Decode categorical columns: Converts numerical values back to their original categorical values.
- Return the balanced dataset: The balanced dataset is provided for download, with _{method}_balanced appended to the filename.
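The encode, balance, decode, and rename steps above can be sketched without any dependencies. This is not the service's implementation, which reads the file into a Pandas DataFrame. It is a minimal stand-in that uses plain random oversampling in place of SMOTE and balances classes to equal counts rather than the documented 0.8 default. All function names are hypothetical, and placing the suffix before the file extension is an assumption.

```python
import random
from collections import Counter
from pathlib import Path


def label_encode(values):
    """Map each distinct value to an integer code; return the codes and the class list."""
    classes = sorted(set(values))
    to_code = {value: code for code, value in enumerate(classes)}
    return [to_code[value] for value in values], classes


def random_oversample(rows, target_index, seed=42):
    """Duplicate randomly chosen rows of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(row[target_index] for row in rows)
    largest = max(counts.values())
    balanced = list(rows)
    for label, count in counts.items():
        pool = [row for row in rows if row[target_index] == label]
        balanced.extend(rng.choice(pool) for _ in range(largest - count))
    return balanced


def output_filename(original_name, method):
    """Append _{method}_balanced to the name (before the extension, by assumption)."""
    path = Path(original_name)
    return f"{path.stem}_{method}_balanced{path.suffix}"


# Toy dataset: one feature column (color) and one target column (answer).
rows = [("red", "yes"), ("blue", "no"), ("red", "no"), ("green", "no")]

# Encode categorical columns.
colors, color_classes = label_encode([row[0] for row in rows])
answers, answer_classes = label_encode([row[1] for row in rows])
encoded = list(zip(colors, answers))

# Balance on the target column, then decode back to the original values.
balanced = random_oversample(encoded, target_index=1)
decoded = [(color_classes[c], answer_classes[a]) for c, a in balanced]

name = output_filename("customers.csv", "oversampling")
```

In this toy run the single "yes" row is duplicated until both classes have three rows, and the decoded rows carry the original string values again.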