Data Preparation
Dimensionality Reduction
This endpoint allows users to upload a file and apply dimensionality reduction techniques to the dataset. Supported algorithms include PCA, t-SNE, UMAP, and Autoencoder.
Endpoint: POST /dimensionality-reduction
Request Parameters
File Upload
file
(required): The file to be processed. Supported formats include CSV, Excel, etc.
Form Parameters
columns
(optional): A comma-separated list of columns to include in the dimensionality reduction. If not provided, all numeric columns are used.auto_detect_numeric
(optional, default:True
): Whether to auto-detect numeric columns ifcolumns
is not provided.algorithm
(required): The dimensionality reduction algorithm to apply. Supported values:pca
: Principal Component Analysis (PCA).tsne
: t-Distributed Stochastic Neighbor Embedding (t-SNE).umap
: Uniform Manifold Approximation and Projection (UMAP).autoencoder
: Autoencoder-based dimensionality reduction.
n_components
(required): The number of components (dimensions) to reduce the data to.normalize
(optional, default:True
): Whether to normalize the data before applying dimensionality reduction.check_missing_values
(optional, default:False
): Whether to check for missing values in the dataset.check_numeric
(optional, default:False
): Whether to ensure all selected columns are numeric.random_seed
(optional): Random seed for reproducibility.
Algorithm-Specific Parameters
- PCA:
pca_whiten
(optional, default:False
): Whether to whiten the data.pca_svd_solver
(optional, default:auto
): The SVD solver to use. Supported values:auto
,full
,arpack
,randomized
.
- t-SNE:
tsne_perplexity
(optional, default:30.0
): The perplexity parameter for t-SNE.tsne_learning_rate
(optional, default:200.0
): The learning rate for t-SNE.
- UMAP:
umap_neighbors
(optional, default:15
): The number of neighbors for UMAP.umap_min_dist
(optional, default:0.1
): The minimum distance for UMAP.
- Autoencoder:
autoencoder_layers
(optional): A comma-separated list of layer sizes for the autoencoder (e.g.,64,32
).autoencoder_epochs
(optional, default:50
): The number of training epochs for the autoencoder.autoencoder_learning_rate
(optional, default:0.001
): The learning rate for the autoencoder.
Processing Logic
- Read the uploaded file: The file is converted into a Pandas DataFrame.
- Validate inputs:
- Checks if the dataset is empty.
- Validates the selected columns (if provided).
- Auto-detects numeric columns if
auto_detect_numeric
isTrue
. - Checks for missing values if
check_missing_values
isTrue
. - Ensures all selected columns are numeric if
check_numeric
isTrue
. - Validates the number of components (
n_components
).
- Normalize data: If
normalize
isTrue
, the data is standardized usingStandardScaler
. - Apply dimensionality reduction:
- PCA: Applies Principal Component Analysis.
- t-SNE: Applies t-Distributed Stochastic Neighbor Embedding.
- UMAP: Applies Uniform Manifold Approximation and Projection.
- Autoencoder: Trains an autoencoder and uses the bottleneck layer for dimensionality reduction.
- Return the transformed data: The reduced dataset is provided for download with the filename appended with
_dimensionality_reduction
.