Data Preprocessing
Detect Duplicates
This endpoint allows users to upload a CSV file and detect duplicate rows based on specified criteria. Users can choose how duplicates should be handled.
Endpoint: POST /detect-duplicates
Request Parameters
File Upload
- file (required): CSV file to be processed.
Form Parameters
- case_sensitive (optional, default: false) - Whether string comparisons should be case-sensitive.
- consider_nulls_equal (optional, default: false) - Whether NULL values should be considered equal when detecting duplicates.
- handling_method (optional, default: first) - Method for handling detected duplicates.
  - Supported values: first, last, min, max, mean, sum.
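As a rough illustration of how a client might call this endpoint, the sketch below builds the form fields and posts the file with the third-party requests library. Several details are assumptions rather than documented behavior: the base URL is a placeholder, booleans are sent as the strings "true" and "false", and the response body is taken to be the processed CSV itself. The helper names build_form_data and detect_duplicates are hypothetical.

```python
from pathlib import Path

# Values listed under "Supported values" for handling_method.
SUPPORTED_METHODS = ("first", "last", "min", "max", "mean", "sum")


def build_form_data(case_sensitive=False, consider_nulls_equal=False,
                    handling_method="first"):
    """Validate the options and return them as form fields.

    Assumption: the server accepts booleans as "true"/"false" strings.
    """
    if handling_method not in SUPPORTED_METHODS:
        raise ValueError(f"handling_method must be one of {SUPPORTED_METHODS}")
    return {
        "case_sensitive": str(bool(case_sensitive)).lower(),
        "consider_nulls_equal": str(bool(consider_nulls_equal)).lower(),
        "handling_method": handling_method,
    }


def detect_duplicates(base_url, csv_path, output_path, **options):
    """Upload csv_path and save the response to output_path.

    Assumption: the response body is the processed CSV file.
    """
    import requests  # third-party; imported here so build_form_data has no dependencies

    with open(csv_path, "rb") as handle:
        response = requests.post(
            f"{base_url}/detect-duplicates",
            files={"file": (Path(csv_path).name, handle, "text/csv")},
            data=build_form_data(**options),
            timeout=30,
        )
    response.raise_for_status()
    Path(output_path).write_bytes(response.content)
    return output_path


if __name__ == "__main__":
    # Shows the form fields that would accompany the upload.
    print(build_form_data(handling_method="last"))
    # Example call (placeholder URL, requires a running server):
    # detect_duplicates("http://localhost:8000", "input.csv", "deduplicated.csv",
    #                   handling_method="last")
```

Validating handling_method on the client side is optional; it simply surfaces typos before the upload is sent.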
Processing Logic
- Read the uploaded file: The file is converted into a Pandas DataFrame.
- Normalize case (if applicable): Converts string values to lowercase if case sensitivity is disabled.
- Handle null values (if applicable): Replaces NULL values with a placeholder if consider_nulls_equal is enabled.
- Detect duplicates: Identifies duplicate rows based on all columns or a subset.
- Apply handling method:
  - first: Keeps the first occurrence of duplicates.
  - last: Keeps the last occurrence of duplicates.
  - min: Keeps the row with the smallest value in the subset.
  - max: Keeps the row with the largest value in the subset.
  - mean: Averages numerical values in duplicates.
  - sum: Sums numerical values in duplicates.
- Return the modified file: The processed file is provided for download.
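The steps above can be sketched in Pandas roughly as follows. This is not the service's actual implementation, and it rests on several assumptions: the function name handle_duplicates and its subset and value_column arguments are hypothetical, case normalization is applied only to the comparison key (the returned values keep their original case), min and max pick rows by a caller-supplied value_column, and mean and sum aggregate the numeric columns outside the key while taking the first value for every other column. Because Pandas' own duplicated() already treats missing values as equal, the sketch gives each null a per-row token when consider_nulls_equal is off so that such rows never match.

```python
import io

import pandas as pd

NULL_PLACEHOLDER = "__NULL__"  # hypothetical sentinel; assumed never to occur in real data


def _group_ids(keys, case_sensitive, consider_nulls_equal):
    """Return one integer id per row; rows sharing an id count as duplicates."""
    seen = {}
    ids = []
    for pos, row in enumerate(keys.itertuples(index=False, name=None)):
        normalized = []
        for value in row:
            if pd.isna(value):
                # A shared placeholder makes nulls match; a per-row token keeps them distinct.
                value = NULL_PLACEHOLDER if consider_nulls_equal else (NULL_PLACEHOLDER, pos)
            elif isinstance(value, str) and not case_sensitive:
                value = value.lower()
            normalized.append(value)
        ids.append(seen.setdefault(tuple(normalized), len(seen)))
    return pd.Series(ids, index=keys.index)


def handle_duplicates(df, case_sensitive=False, consider_nulls_equal=False,
                      handling_method="first", subset=None, value_column=None):
    """Detect duplicate rows on the key columns and resolve them."""
    key_cols = list(subset) if subset else list(df.columns)
    groups = _group_ids(df[key_cols], case_sensitive, consider_nulls_equal)

    if handling_method in ("first", "last"):
        return df[~groups.duplicated(keep=handling_method)].reset_index(drop=True)

    if handling_method in ("min", "max"):
        if value_column is None:
            raise ValueError("this sketch needs value_column for min/max")
        grouped = df.groupby(groups)[value_column]
        labels = grouped.idxmin() if handling_method == "min" else grouped.idxmax()
        # Sort the selected row labels to preserve the original row order.
        return df.loc[labels.sort_values()].reset_index(drop=True)

    if handling_method in ("mean", "sum"):
        agg = {
            col: handling_method
            if col not in key_cols and pd.api.types.is_numeric_dtype(df[col])
            else "first"
            for col in df.columns
        }
        return df.groupby(groups).agg(agg).reset_index(drop=True)

    raise ValueError(f"unsupported handling_method: {handling_method!r}")


if __name__ == "__main__":
    sample = "name,city,amount\nAlice,Paris,10\nalice,Paris,30\nBob,,5\nBob,,7\n"
    df = pd.read_csv(io.StringIO(sample))  # step 1: read the upload into a DataFrame
    result = handle_duplicates(df, consider_nulls_equal=True,
                               handling_method="sum", subset=["name", "city"])
    print(result.to_csv(index=False))  # final step: serialize the result for download
```

Keeping the normalized key separate from the data means the output preserves the original casing and nulls; a service that lowercases the data itself would return different string values.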