Data Preparation
Train-Test Splitting
This endpoint processes an uploaded dataset and splits it into training and testing sets based on the specified splitting method.
Endpoint: POST /prepTrainTest
Request Parameters
File Upload
file
(required): CSV file containing the dataset to be split.
Form Parameters
splitting_method
(optional, default:stratified
)- Method for splitting the dataset.
- Supported values:
stratified
,time-based
,group-based
,shuffle
.
target_column
(optional, default:""
)- The name of the target column used for stratification, grouping, or time-based splitting.
group_column
(optional, default:""
)- The column to be used for
group-based
splitting.
- The column to be used for
Splitting Methods
-
Stratified (
stratified
):- Ensures that the training and testing sets maintain the same proportion of target classes.
- Requires
target_column
to be specified.
-
Time-Based (
time-based
):- Splits data chronologically based on the
target_column
. - Uses the 80th percentile as the threshold for splitting.
- Requires
target_column
to contain valid datetime values.
- Splits data chronologically based on the
-
Group-Based (
group-based
):- Splits data while maintaining group integrity, ensuring all data points in a group remain in the same split.
- Requires both
target_column
andgroup_column
to be specified.
-
Shuffle (
shuffle
):- Randomly splits the dataset into training and testing sets.
- Requires
target_column
to be specified.
Response
- Returns downloadable files containing the split datasets.
- Depending on the splitting method, the returned files will include:
X_train.csv
,X_test.csv
,y_train.csv
,y_test.csv
(forstratified
,group-based
, andshuffle
methods).train.csv
,test.csv
(fortime-based
method).
Errors
- If
target_column
is not found in the dataset. - If
group_column
is specified but does not exist in the dataset (forgroup-based
splitting). - If
time-based
splitting is used buttarget_column
is not a valid datetime column.