dqm.representativeness package
Submodules
dqm.representativeness.metric module
This module provides tools for analyzing data distributions using chi-square goodness-of-fit tests, Kolmogorov-Smirnov tests, Shannon entropy, and confidence intervals.
- Authors:
Faouzi ADJED, Anani DJATO
- Dependencies:
numpy, pandas, matplotlib, scipy, seaborn, dqm.utils.twe_logger
- Classes:
DistributionAnalyzer: Class for analyzing data distribution
Functions: None
Usage: Import this module and use the DistributionAnalyzer class for distribution analysis.
- class dqm.representativeness.metric.DistributionAnalyzer(data, bins, distribution)
Bases:
object
Class for analyzing data distribution.
Args:
data (pd.DataFrame): The data to be analyzed.
bins (int): The number of bins for analysis.
distribution (str): The distribution type ('normal' or 'uniform').
- chisquare_test()
Perform the chi-square test on the provided data. Returns the p-value and confidence intervals.
- kolmogorov()
Calculate the Kolmogorov-Smirnov test for the chosen distribution. Returns the KS test p-value.
- shannon_entropy()
Calculate Shannon entropy for the provided intervals. Returns the Shannon entropy.
- grte()
Calculate the Granular Relative and Theoretical Entropy (GRTE) for the given data. Returns the calculated GRTE value and the data discretized into intervals.
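The Kolmogorov-Smirnov test listed above compares the empirical cumulative distribution of the data with the theoretical one. The following is a minimal standard-library sketch of the KS statistic only, not the package's implementation; the helper name ks_statistic and the sample values are hypothetical. In practice, scipy.stats.kstest also returns the p-value.

```python
from statistics import NormalDist


def ks_statistic(sample, cdf):
    """Return the one-sample Kolmogorov-Smirnov statistic D of `sample` against `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i + 1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Hypothetical sample compared with a standard normal distribution.
sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
d_normal = ks_statistic(sample, NormalDist(mu=0.0, sigma=1.0).cdf)
print(f"D = {d_normal:.3f}")
```

A small D means the empirical and theoretical cumulative distributions stay close everywhere; the p-value returned by kolmogorov() is derived from this kind of statistic.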
- chisquare_test(*par_dist)
Perform a chi-square test for goodness of fit.
This method analyzes the distribution of data using a chi-square test for goodness of fit. It supports normal and uniform distributions.
- Parameters:
*par_dist (float) – Parameters for the specified distribution.
- Returns:
p-value (float): The p-value from the chi-square test.
intervals_frequencies (pd.DataFrame): The DataFrame containing observed and expected frequencies.
- Return type:
tuple of (float, pd.DataFrame)
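To make the goodness-of-fit computation concrete, here is a minimal standard-library sketch that turns observed and expected interval counts into a chi-square statistic and p-value. It is an illustration, not the code behind chisquare_test: the names chi2_sf and chisquare_gof, the sample counts, and the choice of degrees of freedom (number of intervals minus one) are assumptions. scipy.stats.chisquare performs the same computation in practice.

```python
import math


def chi2_sf(x, dof):
    """Upper-tail probability P(X >= x) for a chi-square variable with `dof` >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form starting points of the regularized upper incomplete gamma Q(a, h):
    # Q(1, h) = exp(-h) for even dof, Q(1/2, h) = erfc(sqrt(h)) for odd dof.
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-half)
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))
    # Recurrence: Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1).
    while a < dof / 2.0:
        q += half**a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chisquare_gof(observed, expected):
    """Return (statistic, p_value) for observed versus expected interval counts."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    dof = len(observed) - 1  # assumption: no distribution parameter estimated from the data
    return stat, chi2_sf(stat, dof)


observed = [18, 22, 27, 33]          # hypothetical counts per interval
expected = [sum(observed) / 4] * 4   # uniform hypothesis: equal expected counts
stat, p_value = chisquare_gof(observed, expected)
print(f"statistic={stat:.3f}, p-value={p_value:.3f}")
```

A large p-value means the observed frequencies are compatible with the hypothesized distribution; a small one suggests the sample is not representative of it.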
- grte(*args)
Calculates the Granular Relative and Theoretical Entropy (GRTE) for given data.
- Parameters:
*args (float) – Optional arguments. For 'uniform', provide start and end; for 'normal', provide mean and std.
- Returns:
grte_res (float): The calculated GRTE value.
intervals_discretized (pd.Series): The data discretized into intervals.
- Return type:
tuple of (float, pd.Series)
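The entropy-based methods of DistributionAnalyzer work on data that has been discretized into intervals. As a minimal sketch of the underlying quantity, the block below computes Shannon entropy from interval counts. It does not reproduce the GRTE formula, which this page does not state; the function name, the sample counts, and the use of the natural logarithm are assumptions.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (in nats) of the distribution given by non-negative counts."""
    total = sum(counts)
    # Empty intervals contribute nothing, since p * log(p) tends to 0 as p tends to 0.
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)


uniform_counts = [25, 25, 25, 25]   # hypothetical: four equally filled intervals
skewed_counts = [70, 20, 5, 5]      # hypothetical: mass concentrated in one interval
print(shannon_entropy(uniform_counts))  # equals log(4)
print(shannon_entropy(skewed_counts))   # strictly smaller than log(4)
```

Entropy is maximal when all intervals are equally populated, which is why comparing empirical and theoretical entropies can indicate how closely a sample follows the target distribution.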
dqm.representativeness.utils module
This module implements two classes, DiscretisationParams and VariableAnalysis, providing functionality for variable counting, countplot visualization, and discretization of variables using normal or uniform distributions. It also includes functions for processing data for chi-square tests, calculating expected values, and generating histograms for observed and expected values.
- Authors:
Faouzi ADJED, Anani DJATO
- Dependencies:
numpy, pandas, matplotlib.pyplot, scipy.stats, dqm.utils.twe_logger, seaborn
Functions: None
- Classes:
DiscretisationParams: Class for defining discretization parameters
VariableAnalysis: Class for analyzing data distribution
Example:
import pandas as pd
import matplotlib.pyplot as plt
from dqm.representativeness.utils import VariableAnalysis, DiscretisationParams

# Example of using the VariableAnalysis class
variable_analyzer = VariableAnalysis()

# Example of using the variable_counting method
my_variable = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 4, 4])
counts = variable_analyzer.variable_counting(my_variable)
print("Counts of unique values:")
print(counts)

# Example of using the countplot method
variable_analyzer.countplot(my_variable)
plt.show()

# Instantiate the DiscretisationParams class (parameters are passed as a dictionary)
discretisation_params = DiscretisationParams(
    data=my_variable,
    distribution_params={"theory": "normal", "empirical": [-1.0, 0.0, 1.0, 2.0], "mean": 0.0, "std": 1.0},
)
- class dqm.representativeness.utils.DiscretisationParams(data, distribution_params)
Bases:
object
Parameters for discretization.
- Parameters:
data – Input data.
distribution_params – Dictionary containing distribution parameters:
'theory': Distribution theory ('normal' or 'uniform').
'empirical': Empirical distribution used for discretization.
'mean': Mean parameter for the distribution theory.
'std': Standard deviation for the distribution theory.
- __init__()
Initializes an instance of the DiscretisationParams class.
- Parameters:
data – Input data.
distribution_params – Dictionary containing distribution parameters.
- Returns:
None
- class dqm.representativeness.utils.VariableAnalysis
Bases:
object
This class provides functions for variable counting, countplot visualization, and discretization of variables using normal or uniform distributions. It includes functions for processing data for chi-square tests, calculating expected values, and generating histograms for observed and expected values.
Args: None
- expected_hist()
- countplot(variable)
Show the counts of observations for every category. Note: this function may be removed from the final package (to be decided).
- Parameters:
variable (DataFrame)
- Return type:
None
- Returns:
None (shows a bar plot of the counts of the variable)
- data_processing_for_chisqure_test(data)
This function preprocesses the input data for chi-square tests. If the data type is object ('O'), the data is assumed to be categorical and is converted into value counts. This step matters because chi-square tests operate on frequency distributions.
- Parameters:
data (pd.DataFrame) – Input data.
- Returns:
Processed data suitable for chi-square tests.
- Return type:
data (pd.DataFrame)
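The conversion of categorical data into value counts can be pictured with a minimal standard-library sketch. The function name to_frequencies and the sample labels are hypothetical; with pandas, Series.value_counts() gives the equivalent result.

```python
from collections import Counter


def to_frequencies(values):
    """Count occurrences of each category, most frequent first."""
    return dict(Counter(values).most_common())


labels = ["cat", "dog", "cat", "bird", "cat", "dog"]  # hypothetical categorical data
print(to_frequencies(labels))
```

The resulting frequencies are the observed counts that a chi-square test compares against expected counts.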
- delete_na(data)
Remove missing values (NaN) from the input data.
- Parameters:
data (pd.DataFrame) – The input data containing missing values.
- Returns:
The input data with missing values removed. If the input is a Series, the output will also be a Series. If the input is a DataFrame, the output will be a DataFrame.
- Return type:
data (pd.DataFrame)
- discretisation(variable, distribution, bins)
Discretise a variable into bins.
- Parameters:
distribution (string) – 'normal' or 'uniform'
variable (Series)
bins (int)
- Returns:
The variable discretised into bins.
- Return type:
interval (array)
- discretisation_intervals(params)
This function discretizes a given set of data into intervals based on empirical distribution and calculates observed and expected frequencies for each interval. It supports both normal and uniform distribution theories.
- Parameters:
params (DiscretisationParams) – Parameters for discretization.
- Returns:
Intervals and counts of each interval. Returns None if an unsupported distribution theory is provided.
- Return type:
intervals (Optional[DataFrame])
Note: The function may issue a warning if there are missing values in the data.
Example:
interval_data = discretisation_intervals(
    DiscretisationParams(
        data,
        {
            "theory": "normal",
            "empirical": distribution_empirical,
            "mean": mean,
            "std": std,
        },
    )
)
if interval_data is not None:
    logger.info(interval_data)
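The observed-frequency part of this discretization can be sketched with the standard library alone. The block below counts how many values fall into each interval defined by a sorted list of edges. The name observed_counts, the half-open interval convention [a, b), and the skipping of NaN values are assumptions for illustration, not the package's documented behaviour.

```python
import math
from bisect import bisect_right


def observed_counts(data, edges):
    """Count the values falling in each interval [edges[i], edges[i + 1])."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        if math.isnan(x):
            continue  # missing values are skipped rather than binned
        # bisect_right gives the index of the first edge strictly greater than x.
        counts[bisect_right(edges, x) - 1] += 1
    return counts


edges = [-math.inf, -1.0, 0.0, 1.0, math.inf]          # hypothetical interval edges
data = [-1.7, -0.2, 0.0, 0.4, 0.8, 2.3, float("nan")]  # hypothetical finite data plus one NaN
print(observed_counts(data, edges))  # [1, 1, 3, 1]
```

Using infinite outer edges guarantees that every finite value lands in some interval, so the counts always sum to the number of non-missing observations.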
- expected(distribution, data, *argv)
Calculate the expected values of the distribution.
- Parameters:
distribution (str) – 'normal' or 'uniform'
data (List[float]) – Input data.
*argv (float) – Parameters of the distribution.
- Returns:
Expected values for every distribution.
- Return type:
n or u (List[float])
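As an illustration of what expected values mean here, the sketch below computes expected counts per interval under a normal distribution as the sample size times the probability mass of each interval. The name expected_normal and the edge values are hypothetical, and the actual signature and return format of expected() may differ.

```python
import math
from statistics import NormalDist


def expected_normal(edges, n, mean, std):
    """Expected count per interval for `n` draws from Normal(mean, std)."""
    dist = NormalDist(mu=mean, sigma=std)
    # Probability mass of each interval is the difference of the CDF at its two edges.
    return [n * (dist.cdf(b) - dist.cdf(a)) for a, b in zip(edges, edges[1:])]


edges = [-math.inf, -1.0, 0.0, 1.0, math.inf]  # hypothetical interval edges
exp_counts = expected_normal(edges, n=100, mean=0.0, std=1.0)
print([round(e, 2) for e in exp_counts])
```

Because the outer edges are infinite, the interval probabilities sum to one and the expected counts sum to the sample size.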
- normal_discretization(bins, mean, std)
Discretise a variable into bins under a normal distribution.
- Parameters:
bins (int) – Number of bins.
mean (float) – Mean of the Gaussian distribution (its first parameter).
std (float) – Standard deviation of the Gaussian distribution.
- Return type:
List[float]
- Returns:
interval (array): The variable discretised into bins.
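One common way to discretise under a normal distribution is to choose edges at equally spaced quantiles, so that each bin carries the same probability. The sketch below assumes that approach and uses only the standard library. The name normal_edges and the convention of infinite outer edges are assumptions, not a statement of how normal_discretization is implemented.

```python
import math
from statistics import NormalDist


def normal_edges(bins, mean, std):
    """Edges of `bins` equal-probability intervals under Normal(mean, std)."""
    dist = NormalDist(mu=mean, sigma=std)
    # Interior edges are the quantiles at 1/bins, 2/bins, ..., (bins - 1)/bins.
    inner = [dist.inv_cdf(k / bins) for k in range(1, bins)]
    return [-math.inf, *inner, math.inf]


quartile_edges = normal_edges(4, mean=0.0, std=1.0)
print(quartile_edges)
```

With four bins and a standard normal distribution, the interior edges are the quartiles, roughly -0.674, 0 and 0.674.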
- observed_hist(variable)
Plot the observed values of the distribution.
- Parameters:
variable (pd.Series) – Input variable.
- Return type:
None
- Returns:
None (plots histogram)
- uniform_discretization(bins, min_value, max_value)
This function discretizes a variable with a uniform distribution into the specified number of bins. It uses the inverse transform method with the scipy.stats.uniform.ppf function.
- Parameters:
bins (int) – Number of bins.
min_value (float) – Minimum value for the uniform distribution.
max_value (float) – Maximum value for the uniform distribution.
- Returns:
The variable discretized into bins. The list contains interval bounds, with the first element representing negative infinity and the last element representing positive infinity.
- Return type:
interval (list)
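The inverse transform idea described above can be sketched without scipy, since the quantile function of a uniform distribution on [min_value, max_value] is simply min_value + p * (max_value - min_value). The function name uniform_edges and the exact number of returned edges are assumptions; only the infinite first and last elements come from the description above.

```python
import math


def uniform_edges(bins, min_value, max_value):
    """Edges of `bins` equal-probability intervals under Uniform(min_value, max_value)."""
    width = max_value - min_value
    # Interior edges are the quantiles at 1/bins, 2/bins, ..., (bins - 1)/bins.
    inner = [min_value + (k / bins) * width for k in range(1, bins)]
    # The outer edges are opened to infinity so that no observation falls outside.
    return [-math.inf, *inner, math.inf]


print(uniform_edges(4, 0.0, 1.0))  # [-inf, 0.25, 0.5, 0.75, inf]
```

Opening the outer bins to infinity means values below min_value or above max_value are still counted, in the first and last interval respectively.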