Data Analysis
Distribution Analysis
This endpoint performs a distribution analysis on both numerical and categorical columns in the provided dataset, and optionally generates graphs: histograms, KDE plots, box plots, ECDF plots, bar charts, pie charts, and treemaps.
Endpoint: POST /distribution-analysis
Request Parameters
File Upload
file
(required): The file to be processed. Supported formats include CSV, Excel, etc.
Numerical Columns
numerical_columns
(optional): Comma-separated list of numerical columns to include in the distribution analysis. If omitted, all numerical columns are used.
Histogram Bins
histogram_bins
(optional): The number of bins to use for histograms. Can be specified as an integer or one of the following string values:
- "auto": Automatically determines the optimal number of bins (default).
- "fd": Freedman-Diaconis rule.
- "doane", "scott", "rice", "sturges", "sqrt": Specific binning strategies.
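To make a few of these strategies concrete, here is a minimal sketch of the textbook formulas behind "sturges", "rice", "sqrt", and "fd". The function names are hypothetical, and the source does not say which library the endpoint uses to compute bins, so an actual implementation may differ in rounding and edge cases.

```python
import math


def sturges_bins(n: int) -> int:
    """Sturges' rule: ceil(log2(n)) + 1 bins for n observations."""
    return math.ceil(math.log2(n)) + 1


def rice_bins(n: int) -> int:
    """Rice rule: ceil(2 * n^(1/3)) bins for n observations."""
    return math.ceil(2 * n ** (1 / 3))


def sqrt_bins(n: int) -> int:
    """Square-root rule: ceil(sqrt(n)) bins for n observations."""
    return math.ceil(math.sqrt(n))


def fd_bin_width(iqr: float, n: int) -> float:
    """Freedman-Diaconis rule: bin width of 2 * IQR / n^(1/3)."""
    return 2 * iqr / n ** (1 / 3)
```

The first three rules depend only on the number of observations, while Freedman-Diaconis also uses the interquartile range, which makes it less sensitive to outliers.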
Density
density
(optional): Boolean specifying whether the histogram is normalized to form a probability density. Default is false.
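As an illustration of what density normalization usually means, the sketch below divides each bin count by the total count times the bin width, so that the area under the histogram is 1. The helper name is hypothetical, and this is the standard definition rather than a confirmed description of the endpoint's internals.

```python
def histogram_density(counts, bin_edges):
    """Convert raw bin counts to a probability density.

    Each density value is count / (total * bin_width), so the sum of
    density * bin_width over all bins equals 1.
    """
    total = sum(counts)
    widths = [hi - lo for lo, hi in zip(bin_edges, bin_edges[1:])]
    return [c / (total * w) for c, w in zip(counts, widths)]
```

For example, counts of 2, 6, and 2 over bins of width 5 give densities of 0.04, 0.12, and 0.04.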
Categorical Columns
categorical_columns
(optional): Comma-separated list of categorical columns to include in the distribution analysis. If omitted, all categorical columns are used.
Top Categories
top_categories
(optional): An integer specifying how many of the most frequent categories to include in the analysis. Default is 0, which means all categories are included.
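The effect of top_categories can be sketched with the standard library's Counter. This is an assumption about the behavior (frequency counts sorted from most to least common, with 0 meaning no limit), and the function name is hypothetical.

```python
from collections import Counter


def category_frequencies(values, top_categories=0):
    """Return (category, count) pairs, most frequent first.

    A top_categories value of 0 keeps every category, mirroring the
    documented default.
    """
    counts = Counter(values)
    limit = top_categories if top_categories > 0 else None
    return counts.most_common(limit)
```

With the values a, b, a, c, a, b and top_categories set to 2, only the pairs for a (3) and b (2) are returned.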
Include Graphs
include_graphs
(optional): Boolean specifying whether graphs are included in the response. Default is false.
Graph Types
graph_types
(optional): Comma-separated list of graph types to generate. Default is ["histogram"]. Available options:
- histogram: Displays histograms of the numerical columns.
- kde: Displays Kernel Density Estimation (KDE) plots.
- boxplot: Displays boxplots for the numerical columns.
- ecdf: Displays the Empirical Cumulative Distribution Function (ECDF) for the numerical columns.
- bar: Displays bar charts for categorical data.
- pie: Displays pie charts for categorical data (normalized counts).
- treemap: Displays a treemap for categorical data.
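A client may want to check a graph_types string before sending it. The sketch below splits the comma-separated value and rejects names that are not among the documented options. The helper and constant names are hypothetical, and how the server itself handles unknown names is not stated in the source.

```python
# Documented graph types, grouped by the kind of column they apply to.
NUMERICAL_GRAPHS = {"histogram", "kde", "boxplot", "ecdf"}
CATEGORICAL_GRAPHS = {"bar", "pie", "treemap"}


def parse_graph_types(raw="histogram"):
    """Split a comma-separated graph_types value and validate each name."""
    names = [part.strip() for part in raw.split(",") if part.strip()]
    allowed = NUMERICAL_GRAPHS | CATEGORICAL_GRAPHS
    unknown = [name for name in names if name not in allowed]
    if unknown:
        raise ValueError(f"Unsupported graph types: {', '.join(unknown)}")
    return names
```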
Example Request
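A minimal sketch of a client call, assuming the parameters are sent as multipart form fields alongside the uploaded file and that the response body is JSON. The base URL, the column names, and both helper functions are hypothetical, and the third-party requests library is used only for illustration.

```python
def build_form_fields(numerical_columns=None, histogram_bins="auto",
                      density=False, categorical_columns=None,
                      top_categories=0, include_graphs=False,
                      graph_types=("histogram",)):
    """Assemble the documented parameters as string form fields."""
    fields = {
        "histogram_bins": str(histogram_bins),
        "density": str(density).lower(),
        "top_categories": str(top_categories),
        "include_graphs": str(include_graphs).lower(),
        "graph_types": ",".join(graph_types),
    }
    # Omitted column lists let the server fall back to all columns.
    if numerical_columns:
        fields["numerical_columns"] = ",".join(numerical_columns)
    if categorical_columns:
        fields["categorical_columns"] = ",".join(categorical_columns)
    return fields


def post_distribution_analysis(path, fields,
                               base_url="http://localhost:8000"):
    """Upload the file at `path` with the given fields (assumed multipart)."""
    import requests  # third-party; imported lazily so the builder works alone

    with open(path, "rb") as fh:
        response = requests.post(
            f"{base_url}/distribution-analysis",
            files={"file": fh},
            data=fields,
            timeout=30,
        )
    response.raise_for_status()
    return response.json()  # assumes a JSON response body


# Hypothetical column names, for illustration only.
example_fields = build_form_fields(
    numerical_columns=["age", "income"],
    histogram_bins="fd",
    include_graphs=True,
    graph_types=["histogram", "boxplot"],
)
```

Calling post_distribution_analysis("data.csv", example_fields) would then send the request to a server running at the assumed base URL.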
Analysis Components
- Numerical Distribution Analysis: Computes the following statistical measures for each numerical column:
  - Mean: The average value of the column.
  - Standard Deviation: Measures the spread of the values.
  - Min/Max: The minimum and maximum values of the column.
  - Percentiles: The 25th, 50th (median), and 75th percentiles.
- Categorical Distribution Analysis: Computes the frequency of each category for the specified categorical columns. If top_categories is set to a positive integer N, only the N most frequent categories are included.
- Graphs: Optionally generates visualizations based on the distribution analysis:
  - Histogram: Displays histograms for numerical columns.
  - KDE: Displays Kernel Density Estimation plots.
  - Boxplot: Displays boxplots to show the distribution of numerical columns.
  - ECDF: Displays the Empirical Cumulative Distribution Function for numerical data.
  - Bar: Displays bar charts for categorical data.
  - Pie: Displays pie charts showing the proportions of each category in categorical columns.
  - Treemap: Displays a treemap of categorical data for hierarchical visualization.
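The measures listed under Numerical Distribution Analysis can be reproduced for a single column with Python's standard library. This is a sketch under two assumptions that the source does not confirm: the standard deviation is the sample version (n - 1 denominator), and percentiles use linear interpolation between data points. The function name and the keys of the returned dictionary are hypothetical.

```python
import statistics


def numerical_summary(values):
    """Compute mean, standard deviation, min/max, and quartiles."""
    # "inclusive" interpolates linearly between the sorted data points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),  # sample standard deviation (assumed)
        "min": min(values),
        "max": max(values),
        "25%": q1,
        "50%": median,
        "75%": q3,
    }
```

For the values 1 through 5, this yields a mean of 3.0 and quartiles of 2.0, 3.0, and 4.0.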