Coverage for dqm/completeness/metric.py: 100%

14 statements  

coverage.py v7.10.6, created at 2025-09-05 14:00 +0000

"""
Data Completeness Evaluation Module

This module provides tools to assess the completeness of tabular data. It is
especially useful in data preprocessing and cleaning stages of a data analysis
workflow. The module includes a class, DataCompleteness, with methods to
calculate completeness scores for dataframes and individual columns.
These methods help in identifying columns with missing data and quantifying
the extent of missingness.

Authors:
    Faouzi ADJED
    Anani DJATO

Classes:
    DataCompleteness: A class that encapsulates the methods for evaluating data completeness.

Methods:
    completeness_tabular: Calculates the average completeness score for a dataframe.
    data_completion: Calculates the completeness score for an individual data column.

Dependencies:
    pandas

Usage:
    The DataCompleteness class can be used as follows:

        import pandas as pd
        from dqm.completeness.metric import DataCompleteness

        # Create an instance of the class
        completeness_evaluator = DataCompleteness()

        # Load your data into a pandas DataFrame
        df = pd.read_csv('your_data_path.csv')

        # Calculate the overall completeness score for the DataFrame
        overall_score = completeness_evaluator.completeness_tabular(df)

        # Calculate the completeness score for a single column
        column_score = completeness_evaluator.data_completion(df['your_column'])

        # Print the results
        print(f'Overall Data Completeness Score: {overall_score}')
        print(f'Completeness Score for Column: {column_score}')
"""

import pandas as pd


class DataCompleteness:
    """
    This class provides methods to evaluate the completeness of tabular data.

    It includes methods to calculate completeness scores for individual columns and
    for entire dataframes by assessing the presence of non-null data.

    Methods:
        completeness_tabular: Calculate the average completeness score of a dataframe.
        data_completion: Calculate the completeness score of a single data column.
    """

    def completeness_tabular(self, data: pd.DataFrame) -> float:
        """
        Calculate the average completeness score of the entire dataframe.

        Args:
            data (pd.DataFrame): The dataframe to be evaluated for completeness.

        Returns:
            score_total (float): The average completeness score of
                all columns in the dataframe.
        """
        score_total = 0
        for column in data.columns:
            score_total += self.data_completion(data[column])
        score_total = score_total / len(data.columns)
        return score_total

    def data_completion(self, data: pd.Series) -> float:
        """
        Calculate the completeness score of a single data column.

        Args:
            data (pd.Series): The data column to be evaluated for completeness.

        Returns:
            completeness_score (float): The completeness score of the column,
                calculated as the ratio of non-null entries to total entries.
        """
        processed_data = data.dropna()
        if len(data) == len(processed_data):
            completeness_score = 1
        else:
            completeness_score = len(processed_data) / len(data)
        return completeness_score

100 return completeness_score