dqm.representativeness package
Submodules
dqm.representativeness.main module
dqm.representativeness.metric module
This script provides functions for analyzing data distribution using chi-square tests, goodness-of-fit tests, Kolmogorov-Smirnov tests, Shannon entropy, and confidence intervals.
- Authors:
Faouzi ADJED, Anani DJATO
- Dependencies:
numpy, pandas, matplotlib, scipy, seaborn, dqm.representativeness.twe_logger
- Classes:
DistributionAnalyzer: Class for analyzing data distribution
Functions: None
Usage: Import this module and use the DistributionAnalyzer class for distribution analysis.
- class dqm.representativeness.metric.DistributionAnalyzer(data, bins, distribution)[source]
Bases:
object
Class for analyzing data distribution.
- Parameters:
data (pd.DataFrame) – The data to be analyzed.
bins (int) – The number of bins for analysis.
distribution (str) – The distribution type (‘normal’ or ‘uniform’).
- chisquare_test()[source]
Perform the chi-square test on the provided data. Returns the p-value and confidence intervals.
- kolmogorov()[source]
Calculate the Kolmogorov-Smirnov test for the chosen distribution. Returns the KS test p-value.
- shannon_entropy()[source]
Calculate Shannon entropy for the provided intervals. Returns Shannon entropy.
- grte()[source]
Calculate the Granular Relative and Theoretical Entropy (GRTE) for the given data. Returns the calculated GRTE value and the discretized intervals.
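The Kolmogorov-Smirnov check listed above can be approximated directly with scipy.stats.kstest. The following is a minimal sketch and not the code behind kolmogorov(): the sample, the seed, and the parameter values are made up, and mapping the ‘uniform’ start and end values onto scipy's loc and scale is an assumption about how the class parameterises that distribution.

```python
import numpy as np
from scipy import stats

# Illustrative data only: 300 draws from a normal distribution N(5, 2).
rng = np.random.default_rng(1)
normal_sample = rng.normal(loc=5.0, scale=2.0, size=300)

# KS test against the normal distribution the data was drawn from.
mean, std = 5.0, 2.0
ks_normal = stats.kstest(normal_sample, "norm", args=(mean, std))

# KS test against a uniform distribution on [start, end].
# scipy parameterises it as loc=start, scale=end - start.
start, end = 0.0, 10.0
ks_uniform = stats.kstest(normal_sample, "uniform", args=(start, end - start))

print("normal  p-value:", ks_normal.pvalue)
print("uniform p-value:", ks_uniform.pvalue)
```

A small p-value means the sample is unlikely under the tested distribution, so the uniform hypothesis should be rejected for this sample.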
- chisquare_test(*par_dist)[source]
Perform a chi-square test for goodness of fit.
This method analyzes the distribution of data using a chi-square test for goodness of fit. It supports normal and uniform distributions.
- Parameters:
*par_dist (float) – Parameters for the specified distribution.
- Returns:
p-value (float) – The p-value from the chi-square test.
intervals_frequencies (pd.DataFrame) – The DataFrame containing observed and expected frequencies.
- Return type:
float, pd.DataFrame
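To make the goodness-of-fit idea concrete, here is a minimal sketch built on scipy.stats.chisquare. It is not the implementation of chisquare_test(): the bin edges, the sample, and the use of open-ended outer bins are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Illustrative data only: 500 draws from a standard normal distribution.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=500)

# Bin edges; the outer bins extend to -inf and +inf so every value is counted.
edges = np.array([-np.inf, -1.0, 0.0, 1.0, np.inf])

# Observed counts per bin (lower edge exclusive, upper edge inclusive).
observed = np.array([
    np.sum((sample > lo) & (sample <= hi))
    for lo, hi in zip(edges[:-1], edges[1:])
])

# Expected counts: sample size times the probability mass of each bin under N(0, 1).
cdf = stats.norm.cdf(edges, loc=0.0, scale=1.0)
expected = sample.size * np.diff(cdf)

statistic, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print("chi-square statistic:", statistic)
print("p-value:", p_value)
```

For a uniform hypothesis, only the expected counts change: the normal CDF would be replaced by the uniform CDF over the chosen range.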
- grte(*args)[source]
Calculates the Granular Relative and Theoretical Entropy (GRTE) for given data.
- Parameters:
*args (float) – Optional arguments. For ‘uniform’, provide start and end; for ‘normal’, provide mean and std.
- Returns:
grte_res (float) – The calculated GRTE value.
intervals_discretized (pd.Series) – The intervals discretized data.
- Return type:
float, pd.Series
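As its name suggests, GRTE relies on entropy computed over discretized intervals. The exact GRTE formula is not given in this reference, so the sketch below shows only plain Shannon entropy of interval labels. The function name shannon_entropy_bits and the base-2 logarithm are assumptions; the library may use a different base.

```python
import math
from collections import Counter


def shannon_entropy_bits(labels):
    """Shannon entropy, in bits, of the empirical distribution of `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    # H = -sum(p * log2(p)) over the observed categories.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Four equally populated intervals reach the maximum for four categories.
balanced = ["a", "b", "c", "d"] * 5
# A sample concentrated in one interval has much lower entropy.
skewed = ["a"] * 17 + ["b", "c", "d"]

print(shannon_entropy_bits(balanced))
print(shannon_entropy_bits(skewed))
```

Higher entropy indicates that observations are spread more evenly across intervals, which is the intuition behind using entropy as a representativeness signal.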
dqm.representativeness.twe_logger module
The twe_logger module provides a preconfigured logger for logging messages with specified formatting and output control. It can log messages to the standard output, to a specified file, or both.
- Usage:
- Import the module and get the default logger:
import twe_logger
logger = twe_logger.get_logger()
If you need a logger with different parameters, call get_logger with the desired parameters:
logger = twe_logger.get_logger(filename="my_logs.log")
logger = twe_logger.get_logger(name="my_logger", level='debug', filename='my_logs.log', output="both")
Then, use the logger within your code:
logger.info("This is an info message")
logger.error("This is an error message")
- dqm.representativeness.twe_logger.get_logger(name='twe_logger', level='debug', filename=None, output=None)[source]
Creates and returns a logger.
- Parameters:
name (str, optional) – The name of the logger.
level (int or str, optional) – The logging level.
filename (str, optional) – The name of the file where the logger should write.
output (str, optional) – Where should the logger write. Can be ‘stdout’, ‘file’, or ‘both’.
- Returns:
The logger.
- Return type:
logging.Logger
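The behaviour described for get_logger can be pictured with the standard logging module. The sketch below is a hypothetical stand-in named make_logger, not the module's own code: the message format, the default output, and the handling of repeated calls are all assumptions.

```python
import logging
import sys


def make_logger(name="demo_logger", level="debug", filename=None, output="stdout"):
    """Hypothetical stand-in for a get_logger-style factory."""
    logger = logging.getLogger(name)
    # Accept either a level name ('debug') or a numeric level (logging.DEBUG).
    logger.setLevel(level.upper() if isinstance(level, str) else level)
    # Drop handlers from earlier calls so messages are not duplicated.
    logger.handlers.clear()

    formatter = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    handlers = []
    if output in ("stdout", "both"):
        handlers.append(logging.StreamHandler(sys.stdout))
    if output in ("file", "both") and filename is not None:
        handlers.append(logging.FileHandler(filename))
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


logger = make_logger(level="info")
logger.info("This is an info message")
logger.debug("This debug message is filtered out at level 'info'")
```

Clearing existing handlers is a design choice for the sketch: logging.getLogger returns the same object for the same name, so adding handlers on every call would otherwise repeat each message.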
dqm.representativeness.utils module
This module implements two classes, DiscretisationParams and VariableAnalysis, providing functionality for variable counting, countplot visualization, and discretization of variables using normal or uniform distributions. It also includes functions for processing data for chi-square tests, calculating expected values, and generating histograms for observed and expected values.
- Authors:
Faouzi ADJED, Anani DJATO
- Dependencies:
numpy, pandas, matplotlib.pyplot, scipy.stats, dqm.representativeness.twe_logger, seaborn
Functions : None
- Classes:
DiscretisationParams: Class for defining discretization parameters
VariableAnalysis: Class for analyzing data distribution
Example:
import pandas as pd
import matplotlib.pyplot as plt
from utils import VariableAnalysis, DiscretisationParams

# Example of using VariableAnalysis class
variable_analyzer = VariableAnalysis()

# Example of using the variable_counting method
my_variable = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 4, 4])
counts = variable_analyzer.variable_counting(my_variable)
print("Counts of unique values:")
print(counts)

# Example of using the countplot method
variable_analyzer.countplot(my_variable)
plt.show()

# Instantiate the DiscretisationParams class (data plus a dictionary of distribution parameters)
discretisation_params = DiscretisationParams(
    my_variable,
    {'theory': 'normal', 'empirical': [-1.0, 0.0, 1.0, 2.0], 'mean': 0.0, 'std': 1.0},
)
- class dqm.representativeness.utils.DiscretisationParams(data, distribution_params)[source]
Bases:
object
Parameters for discretization.
- Parameters:
data – Input data.
distribution_params – Dictionary containing distribution parameters:
‘theory’: Distribution theory (‘normal’ or ‘uniform’).
‘empirical’: Empirical distribution used for discretization.
‘mean’: Mean parameter for the distribution theory.
‘std’: Standard deviation for the distribution theory.
- __init__()[source]
Initializes an instance of the DiscretisationParams class.
- Parameters:
data – Input data.
distribution_params – Dictionary containing distribution parameters.
- Returns:
None
- class dqm.representativeness.utils.VariableAnalysis[source]
Bases:
object
This class provides functions for variable counting, countplot visualization, and discretization of variables using normal or uniform distributions. It includes functions for processing data for chi-square tests, calculating expected values, and generating histograms for observed and expected values.
Args: None
- expected_hist()
- countplot(variable)[source]
Show the counts of observations in each category. Note: this function is not expected to be used and may be removed from the final package (to be decided).
- Parameters:
variable (DataFrame)
- Return type:
None
- Returns:
countplot (show the bar plot of counts of variable)
- data_processing_for_chisqure_test(data)[source]
This function is designed to preprocess the input data for chi-square tests. If the data type is object (‘O’), it is assumed to be categorical, and the function converts it into value counts. This step is crucial for chi-square tests, which require frequency distributions.
- Parameters:
data (pd.DataFrame) – Input data.
- Returns:
Processed data suitable for chi-square tests.
- Return type:
data (pd.DataFrame)
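The categorical preprocessing step described above can be sketched with pandas value_counts. This is an illustration under assumptions, not the library's code: the helper name to_frequencies is hypothetical, it works on a Series rather than a DataFrame, and the example data is made up.

```python
import pandas as pd


def to_frequencies(data):
    """Turn object-dtype (categorical) data into counts; return other data unchanged."""
    if data.dtype == "O":
        # value_counts returns one count per distinct category.
        return data.value_counts()
    return data


# dtype=object is set explicitly so the check above applies on any pandas version.
colours = pd.Series(["red", "blue", "red", "green", "red", "blue"], dtype=object)
freq = to_frequencies(colours)
print(freq)

# Numeric data is passed through untouched.
numbers = pd.Series([1.0, 2.5, 3.0])
print(to_frequencies(numbers) is numbers)
```

The resulting counts are the observed frequencies that a chi-square test compares against expected frequencies.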
- delete_na(data)[source]
Remove missing values (NaN) from the input data.
- Parameters:
data (pd.DataFrame) – The input data containing missing values.
- Returns:
The input data with missing values removed. If the input is a Series, the output will also be a Series. If the input is a DataFrame, the output will be a DataFrame.
- Return type:
data (pd.DataFrame)
- discretisation(variable, distribution, bins)[source]
Discretisation of variable into bins
- Parameters:
distribution (string) – ‘normal’ or ‘uniform’
variable (Series)
bins (int)
- Returns:
discretised variable into bins
- Return type:
interval (array)
- discretisation_intervals(params)[source]
This function discretizes a given set of data into intervals based on empirical distribution and calculates observed and expected frequencies for each interval. It supports both normal and uniform distribution theories.
- Parameters:
params (DiscretisationParams) – Parameters for discretization.
- Returns:
- Intervals and counts of each interval.
Returns None if an unsupported distribution theory is provided.
- Return type:
intervals (Optional[DataFrame])
Note: The function may issue a warning if there are missing values in the data.
Example:
interval_data = discretisation_intervals(
    DiscretisationParams(
        data,
        {'theory': 'normal', 'empirical': distribution_empirical, 'mean': mean, 'std': std},
    )
)
if interval_data is not None:
    logger.info(interval_data)
- expected(distribution, data, *argv)[source]
Calculate the expected values of the distribution
- Parameters:
distribution (str) – ‘normal’ or ‘uniform’
data (List[float]) – Input data.
*argv (float) – Parameters of the distribution.
- Returns:
Expected values for every distribution.
- Return type:
n or u (List[float])
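One common way to obtain expected values for a binned goodness-of-fit test is to multiply the sample size by the probability mass of each interval under the theoretical distribution. The sketch below uses only the standard library. Whether expected() computes its result this way is an assumption, and the helper names normal_cdf, uniform_cdf and expected_counts are hypothetical.

```python
import math
from statistics import NormalDist


def normal_cdf(x, mean, std):
    """CDF of N(mean, std), with explicit handling of infinite bin edges."""
    if math.isinf(x):
        return 0.0 if x < 0 else 1.0
    return NormalDist(mean, std).cdf(x)


def uniform_cdf(x, start, end):
    """CDF of the uniform distribution on [start, end], clipped to [0, 1]."""
    return min(1.0, max(0.0, (x - start) / (end - start)))


def expected_counts(edges, n, cdf):
    """Expected count per interval: n * (F(upper edge) - F(lower edge))."""
    return [n * (cdf(hi) - cdf(lo)) for lo, hi in zip(edges[:-1], edges[1:])]


normal_edges = [-math.inf, -1.0, 0.0, 1.0, math.inf]
normal_expected = expected_counts(normal_edges, 1000, lambda x: normal_cdf(x, 0.0, 1.0))

uniform_edges = [0.0, 2.5, 5.0, 7.5, 10.0]
uniform_expected = expected_counts(uniform_edges, 1000, lambda x: uniform_cdf(x, 0.0, 10.0))

print(normal_expected)
print(uniform_expected)
```

Because the intervals cover the whole support, the expected counts sum to the sample size, which is what a chi-square comparison against observed counts requires.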
- normal_discretization(bins, mean, std)[source]
Discretise a normally distributed variable into bins.
- Parameters:
bins (int) – Number of bins.
mean (float) – Mean of the Gaussian distribution (its first parameter).
std (float) – Standard deviation of the Gaussian distribution.
- Returns:
interval (array) – Discretised variable into bins.
- Return type:
List[float]
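A plausible reading of normal discretisation is to cut the Gaussian into equal-probability bins with the inverse CDF. The sketch below follows that reading using statistics.NormalDist from the standard library. Equal-probability bins, the infinite outer edges, and the helper name normal_bin_edges are assumptions, not confirmed behaviour of normal_discretization().

```python
import math
from statistics import NormalDist


def normal_bin_edges(bins, mean, std):
    """Edges of `bins` equal-probability intervals under N(mean, std)."""
    dist = NormalDist(mean, std)
    # Interior edges sit at the quantiles 1/bins, 2/bins, ..., (bins - 1)/bins.
    inner = [dist.inv_cdf(k / bins) for k in range(1, bins)]
    # The outer bins are left open-ended.
    return [-math.inf] + inner + [math.inf]


edges = normal_bin_edges(4, 0.0, 1.0)
print(edges)
```

With four bins on a standard normal, the interior edges are the quartiles, so each interval holds a quarter of the probability mass.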
- observed_hist(variable)[source]
Plot the observed values of the distribution
- Parameters:
variable (pd.Series) – Input variable.
- Return type:
None
- Returns:
None (plots histogram)
- uniform_discretization(bins, min_value, max_value)[source]
This function discretizes a variable with a uniform distribution into specified bins. It uses the inverse transform method with the scipy.stats.uniform.ppf function.
- Parameters:
bins (int) – Number of bins.
min_value (float) – Minimum value for the uniform distribution.
max_value (float) – Maximum value for the uniform distribution.
- Returns:
- Discretized variable into bins.
The list includes intervals with the first element representing negative infinity and the last element representing positive infinity.
- Return type:
interval (list)
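The inverse transform step can be sketched without scipy, since the uniform quantile function is linear. The helper names uniform_ppf and uniform_bin_edges are hypothetical, and adding infinite edges around the interior quantiles is one interpretation of the return description above, not confirmed behaviour.

```python
import math


def uniform_ppf(q, min_value, max_value):
    """Quantile function of the uniform distribution on [min_value, max_value].

    Gives the same value as
    scipy.stats.uniform.ppf(q, loc=min_value, scale=max_value - min_value).
    """
    return min_value + q * (max_value - min_value)


def uniform_bin_edges(bins, min_value, max_value):
    """Edges of `bins` equal-width intervals, with open-ended outer bins."""
    inner = [uniform_ppf(k / bins, min_value, max_value) for k in range(1, bins)]
    return [-math.inf] + inner + [math.inf]


edges = uniform_bin_edges(4, 0.0, 10.0)
print(edges)
```

Open-ended outer bins ensure that values falling slightly outside the nominal range are still counted in the first or last interval.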