dqm.representativeness package

Submodules

dqm.representativeness.metric module

This module provides tools for analyzing data distributions using chi-square goodness-of-fit tests, Kolmogorov-Smirnov tests, Shannon entropy, and confidence intervals.

Authors: Faouzi ADJED, Anani DJATO
Dependencies: numpy, pandas, matplotlib, scipy, seaborn, dqm.utils.twe_logger
Classes: DistributionAnalyzer: Class for analyzing data distribution

Functions: None

Usage: Import this module and use the DistributionAnalyzer class for distribution analysis.

class dqm.representativeness.metric.DistributionAnalyzer(data, bins, distribution)[source]

Bases: object

Class for analyzing data distribution.

Args:

data (pd.DataFrame): The data to be analyzed.
bins (int): The number of bins for analysis.
distribution (str): The distribution type (‘normal’ or ‘uniform’).

chisquare_test()[source]: Perform the chi-square test on the provided data. Returns p-value and confidence intervals

kolmogorov()[source]: Calculate the Kolmogorov-Smirnov test for the chosen distribution. Returns the KS test p-value.

shannon_entropy()[source]: Calculate Shannon entropy for the provided intervals. Returns Shannon entropy.

grte()[source]: Calculates the Granular Relative and Theoretical Entropy (GRTE) for given data Returns The calculated GRTE value and the intervals discretized data

chisquare_test(*par_dist)[source]

Perform a chi-square test for goodness of fit.

This method analyzes the distribution of data using a chi-square test for goodness of fit. It supports normal and uniform distributions.

Parameters:

*par_dist (float) – Parameters for the specified distribution.

Returns:

p-value (float) – The p-value from the chi-square test.
intervals_frequencies (pd.DataFrame) – The DataFrame containing observed and expected frequencies.
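
As an illustration of the statistic behind a chi-square goodness-of-fit test, the following minimal sketch computes the Pearson chi-square statistic from observed and expected bin counts using only the Python standard library. It is not the dqm implementation: the function name chi_square_statistic and the sample counts are hypothetical, and the degrees of freedom assume that no distribution parameter was estimated from the data.

```python
def chi_square_statistic(observed, expected):
    """Return the Pearson chi-square statistic for paired bin counts."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    # Sum of squared deviations, each scaled by the expected count.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for 4 bins; a uniform fit over 100 points expects 25 per bin.
observed = [18, 30, 27, 25]
expected = [25.0, 25.0, 25.0, 25.0]

statistic = chi_square_statistic(observed, expected)
dof = len(observed) - 1  # assumption: no parameters estimated from the data
print(statistic, dof)
```

With SciPy installed, the corresponding p-value could be obtained from scipy.stats.chi2.sf(statistic, dof).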

grte(*args)[source]

Calculates the Granular Relative and Theoretical Entropy (GRTE) for given data.

Parameters:

*args (float) – Optional arguments. For ‘uniform’, provide start and end; for ‘normal’, provide mean and std.

Returns:

grte_res (float) – The calculated GRTE value.
intervals_discretized (pd.Series) – The discretized intervals.

kolmogorov(*par_dist)[source]

Perform the Kolmogorov-Smirnov test for the chosen distribution.

Parameters: *par_dist (float) – Parameters of the specified distribution, passed as positional arguments; they must be numeric.
Returns: p-value (float) – The KS test p-value.
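
To make the quantity behind the Kolmogorov-Smirnov test concrete, here is a minimal standard-library sketch of the one-sample KS statistic: the largest gap between the empirical CDF of a sample and a theoretical CDF. It is an illustration under assumptions, not the dqm code: ks_statistic and the sample values are hypothetical, and the p-value step is omitted.

```python
from statistics import NormalDist


def ks_statistic(sample, cdf):
    """Largest absolute gap between the empirical CDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x,
        # so both sides of the jump must be compared with the theory.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Hypothetical sample compared against a standard normal distribution.
sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
d_normal = ks_statistic(sample, NormalDist(mu=0.0, sigma=1.0).cdf)
print(d_normal)
```

With SciPy installed, scipy.stats.kstest(sample, "norm", args=(0.0, 1.0)) returns both the statistic and the p-value.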

shannon_entropy()[source]

Calculate the Shannon entropy.

Args: None

Returns: Shannon entropy (float)
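
For reference, Shannon entropy over interval frequencies can be sketched as follows. This is a minimal illustration, not the dqm implementation: the name entropy_from_counts is hypothetical, and the base-2 logarithm (entropy in bits) is an assumption, since the logarithm base used by the package is not stated here.

```python
import math


def entropy_from_counts(counts):
    """Shannon entropy in bits of the distribution given by bin counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    entropy = 0.0
    for c in counts:
        if c > 0:  # empty bins contribute nothing (0 * log 0 is taken as 0)
            p = c / total
            entropy -= p * math.log2(p)
    return entropy


# Four equally filled bins give the maximum entropy for 4 bins: log2(4) = 2 bits.
balanced = entropy_from_counts([5, 5, 5, 5])
# All observations in one bin give zero entropy.
concentrated = entropy_from_counts([20, 0, 0, 0])
print(balanced, concentrated)
```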

dqm.representativeness.utils module

This module implements two classes, DiscretisationParams and VariableAnalysis, providing functionality for variable counting, countplot visualization, and discretization of variables using normal or uniform distributions. It also includes functions for processing data for chi-square tests, calculating expected values, and generating histograms for observed and expected values.

Authors: Faouzi ADJED, Anani DJATO
Dependencies: numpy, pandas, matplotlib.pyplot, scipy.stats, dqm.utils.twe_logger, seaborn

Functions: None

Classes:

DiscretisationParams: Class for defining discretization parameters
VariableAnalysis: Class for analyzing data distribution

Example:

    import matplotlib.pyplot as plt
    import pandas as pd

    from dqm.representativeness.utils import VariableAnalysis, DiscretisationParams

    # Example of using VariableAnalysis class
    variable_analyzer = VariableAnalysis()

    # Example of using the variable_counting method
    my_variable = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 4, 4])
    counts = variable_analyzer.variable_counting(my_variable)
    print("Counts of unique values:")
    print(counts)

    # Example of using the countplot method
    variable_analyzer.countplot(my_variable)
    plt.show()

    # Instantiate the DiscretisationParams class; the distribution
    # settings are passed as a dictionary, matching the class signature
    discretisation_params = DiscretisationParams(
        data=my_variable,
        distribution_params={
            "theory": "normal",
            "empirical": [-1.0, 0.0, 1.0, 2.0],
            "mean": 0.0,
            "std": 1.0,
        },
    )

class dqm.representativeness.utils.DiscretisationParams(data, distribution_params)[source]

Bases: object

Parameters for discretization.

Parameters:

data – Input data.
distribution_params – Dictionary containing distribution parameters:
‘theory’: Distribution theory (‘normal’ or ‘uniform’).
‘empirical’: Empirical distribution used for discretization.
‘mean’: Mean parameter for the distribution theory.
‘std’: Standard deviation for the distribution theory.

__init__()[source]

Initializes an instance of the DiscretisationParams class.

Parameters:

data – Input data.
distribution_params – Dictionary containing distribution parameters.

Returns:

None

to_dict()[source]

Converts the parameters to a dictionary.

Returns: A dictionary representation of the parameters.
Return type: dict

Note

This method is not necessary. It was created solely to have at least 2 methods as recommended in a class.

get_data()[source]

Gets the input data.

Returns: The input data.
Return type: Any

class dqm.representativeness.utils.VariableAnalysis[source]

Bases: object

This class provides functions for variable counting, countplot visualization, and discretization of variables using normal or uniform distributions. It includes functions for processing data for chi-square tests, calculating expected values, and generating histograms for observed and expected values.

Args: None

variable_counting()[source]

countplot()[source]

discretisation()[source]

normal_discretization()[source]

data_processing_for_chisqure_test()[source]

uniform_discretization()[source]

discretisation_intervals()[source]

delete_na()[source]

expected()[source]

expected_hist()

observed_hist()[source]

countplot(variable)[source]

Show the counts of observations for each category as a bar plot.

Note: This function is not currently used and may be removed from the final package (to be decided).

Parameters: variable (DataFrame)
Returns: None (displays a bar plot of the counts of variable)
Return type: None

data_processing_for_chisqure_test(data)[source]

This function is designed to preprocess the input data for chi-square tests. If the data type is object (‘O’), it is assumed to be categorical, and the function converts it into value counts. This step is crucial for chi-square tests, which require frequency distributions.

Parameters: data (pd.DataFrame) – Input data.
Returns: data (pd.DataFrame) – Processed data suitable for chi-square tests.
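
The conversion of categorical values into frequencies can be illustrated with a short standard-library sketch. This is not the dqm code: the helper name to_frequencies and the sample data are hypothetical, and dropping None entries is an assumption made here for the illustration.

```python
from collections import Counter


def to_frequencies(values):
    """Count occurrences of each category, skipping None entries."""
    return Counter(v for v in values if v is not None)


# Hypothetical categorical observations.
colors = ["red", "blue", "red", "green", None, "red", "blue"]
frequencies = to_frequencies(colors)
print(frequencies.most_common())
```

With pandas, the same frequency table is typically obtained with Series.value_counts(), which also skips missing values by default.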

delete_na(data)[source]

Remove missing values (NaN) from the input data.

Parameters: data (pd.Series or pd.DataFrame) – The input data containing missing values.
Returns: The input data with missing values removed. If the input is a Series, the output will also be a Series. If the input is a DataFrame, the output will be a DataFrame.
Return type: pd.Series or pd.DataFrame

discretisation(variable, distribution, bins)[source]

Discretise a variable into bins.

Parameters:

variable (Series) – The variable to discretise.
distribution (string) – ‘normal’ or ‘uniform’
bins (int) – Number of bins.

Returns:

interval (array) – The variable discretised into bins.

discretisation_intervals(params)[source]

This function discretizes a given set of data into intervals based on empirical distribution and calculates observed and expected frequencies for each interval. It supports both normal and uniform distribution theories.

Parameters:

params (DiscretisationParams) – Parameters for discretization.

Returns:

intervals (Optional[DataFrame]) – Intervals and counts of each interval. Returns None if an unsupported distribution theory is provided.

Note: The function may issue a warning if there are missing values in the data.

Example

    interval_data = discretisation_intervals(
        DiscretisationParams(
            data,
            {
                'theory': 'normal',
                'empirical': distribution_empirical,
                'mean': mean,
                'std': std,
            },
        )
    )
    if interval_data is not None:
        logger.info(interval_data)
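
The observed-frequency half of this computation can be sketched with the standard library. The example below counts how many values fall into each interval defined by a sorted list of edges. It is only an illustration under assumptions: observed_counts, the edges, and the data are hypothetical, and the half-open interval convention [lower, upper) is a choice made here, not a documented dqm behavior.

```python
from bisect import bisect_right


def observed_counts(data, edges):
    """Count values in each half-open interval [edges[i], edges[i + 1])."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        idx = bisect_right(edges, x) - 1
        # Clamp so that a value equal to the last edge (or outside the
        # outer edges) lands in the nearest end interval.
        idx = min(max(idx, 0), len(counts) - 1)
        counts[idx] += 1
    return counts


# Hypothetical edges with open-ended outer intervals.
edges = [float("-inf"), -1.0, 0.0, 1.0, float("inf")]
data = [-2.3, -0.5, -0.1, 0.2, 0.4, 0.7, 1.8]
counts = observed_counts(data, edges)
print(counts)
```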

expected(distribution, data, *argv)[source]

Calculate the expected values of the distribution

Parameters:

distribution (str) – ‘normal’ or ‘uniform’
data (List[float]) – Input data.
*argv (float) – Parameters of the distribution.

Returns:

Expected values for the chosen distribution.

Return type:

List[float]
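
One common way to obtain expected values per interval is to multiply the sample size by the probability mass the theoretical distribution assigns to each interval. The sketch below shows this for a normal distribution using only the standard library. It is an assumption about the general technique, not a description of the dqm code, and the name expected_counts and the edges are hypothetical.

```python
from statistics import NormalDist


def expected_counts(edges, n, cdf):
    """Expected count per interval: n * (F(upper) - F(lower))."""
    return [n * (cdf(hi) - cdf(lo)) for lo, hi in zip(edges, edges[1:])]


# Hypothetical edges with open-ended outer intervals and 100 observations.
edges = [float("-inf"), -1.0, 0.0, 1.0, float("inf")]
expected_normal = expected_counts(edges, 100, NormalDist(mu=0.0, sigma=1.0).cdf)
print(expected_normal)
```

Because the outer edges are infinite, the interval probabilities sum to one and the expected counts sum to the sample size.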

normal_discretization(bins, mean, std)[source]

Discretise a normally distributed variable into bins.

Parameters:

bins (int) – Number of bins.
mean (float) – Mean of the Gaussian distribution.
std (float) – Standard deviation of the Gaussian distribution.

Returns:

interval (array) – The variable discretised into bins.

Return type:

List[float]
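
A plausible way to build such bins is to take equal-probability quantiles of the normal distribution. The following standard-library sketch does this with the inverse CDF. It is a hypothetical illustration: normal_bin_edges is not a dqm function, and both the equal-probability choice and the infinite outer edges are assumptions.

```python
from statistics import NormalDist


def normal_bin_edges(bins, mean, std):
    """Edges of `bins` equal-probability intervals under N(mean, std)."""
    dist = NormalDist(mu=mean, sigma=std)
    # Inner edges are the quantiles at 1/bins, 2/bins, ..., (bins - 1)/bins.
    inner = [dist.inv_cdf(i / bins) for i in range(1, bins)]
    return [float("-inf"), *inner, float("inf")]


# Quartile edges of the standard normal distribution.
normal_edges = normal_bin_edges(4, 0.0, 1.0)
print(normal_edges)
```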

observed_hist(variable)[source]

Plot the observed values of the distribution

Parameters: variable (pd.Series) – Input variable.
Returns: None (plots histogram)
Return type: None

uniform_discretization(bins, min_value, max_value)[source]

This function discretizes a variable with a uniform distribution into specified bins. It uses the inverse transform method with the scipy.stats.uniform.ppf function.

Parameters:

bins (int) – Number of bins.
min_value (float) – Minimum value for the uniform distribution.
max_value (float) – Maximum value for the uniform distribution.

Returns:

interval (list) – The variable discretized into bins. The list includes intervals with the first element representing negative infinity and the last element representing positive infinity.
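
The inverse transform step can be illustrated without SciPy, because the quantile function of a uniform distribution on [min_value, max_value] is linear: the quantile at q is min_value + q * (max_value - min_value). The sketch below is an assumption-laden illustration rather than the dqm implementation. The name uniform_bin_edges is hypothetical, and the evenly spaced quantile levels are a choice made here.

```python
def uniform_bin_edges(bins, min_value, max_value):
    """Edges of `bins` equal-probability intervals under U(min_value, max_value)."""
    width = max_value - min_value
    # Quantiles at 1/bins, ..., (bins - 1)/bins of the uniform distribution.
    inner = [min_value + width * i / bins for i in range(1, bins)]
    # Outer edges are infinite, as described for the returned list.
    return [float("-inf"), *inner, float("inf")]


uniform_edges = uniform_bin_edges(4, 0.0, 10.0)
print(uniform_edges)
```

With SciPy, the same inner edges correspond to scipy.stats.uniform.ppf(q, loc=min_value, scale=max_value - min_value).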

variable_counting(variable)[source]

Count unique values. This works only for integer values and categorical modalities; it cannot be used for float values.

Parameters: variable (pandas.Series)
Returns: variable_count (DataFrame) – Counts of unique values.
