Guideline
Set up a clean virtual environment
Linux setting:
pip install virtualenv
virtualenv myenv
source myenv/bin/activate
Windows setting:
pip install virtualenv
virtualenv myenv
.\myenv\Scripts\activate
Install the library
You can install it directly from PyPI using the command:
pip install dqm-ml
Or you can install it from the source code by running the following command from the root folder of the repository:
pip install .
Usage
There are two ways to use the dqm library:
Import the dqm package and call the dqm functions within your Python code
In standalone mode, using the command line directly from a terminal, or by running the DQM-ML container
Standalone mode
You can use dqm-ml directly to evaluate your dataset by running the "dqm-ml" command from your terminal.
The command line has the following form:
dqm-ml --pipeline_config_path path_to_your_pipeline_file --result_file_path path_to_your_result_file
This mode requires two user parameters:
pipeline_config_path: The path to a YAML file that defines the evaluation pipeline you want to apply to your datasets
result_file_path: The path to the YAML file where the scores computed for each metric defined in your pipeline will be written
For example, if your pipeline file is located at examples/pipeline_example.yaml and you want your result file to be stored at examples/results_pipeline_example.yaml, you will type in your terminal:
dqm-ml --pipeline_config_path "examples/pipeline_example.yaml" --result_file_path "examples/results_pipeline_example.yaml"
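If you prefer to launch the standalone mode from a Python script, the same command line can be assembled as an argument list. This is only a minimal sketch: the helper name build_dqm_ml_command is hypothetical and not part of dqm-ml; only the two flags documented above are used.

```python
import shlex


def build_dqm_ml_command(pipeline_config_path, result_file_path):
    """Return the dqm-ml invocation as an argument list for subprocess.run.

    Hypothetical helper, not part of dqm-ml: it only assembles the two
    documented command line flags.
    """
    return [
        "dqm-ml",
        "--pipeline_config_path", str(pipeline_config_path),
        "--result_file_path", str(result_file_path),
    ]


command = build_dqm_ml_command(
    "examples/pipeline_example.yaml",
    "examples/results_pipeline_example.yaml",
)
# Show the command as it would be typed in a terminal.
print(shlex.join(command))

# To actually run it (requires dqm-ml to be installed and on your PATH):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.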
Pipeline definition
A dqm-ml pipeline is a YAML file that contains the list of datasets you want to evaluate and the list of metrics you want to compute on each one. This file has a primary key pipeline_definition containing a list of items, where each item has the following required fields:
dataset: The path to the dataset you want to evaluate.
domain: The category of metrics you want to apply.
metrics: The list of metrics to compute on the dataset. (For the completeness domain only, this field is not used.)
For the representativeness domain only, the following additional fields are required:
bins :
distribution :
You can use an optional field:
columns_names: The list of columns of your dataset to which you want to restrict the computation of metrics. If this field is missing, the metrics are applied to all columns of the given dataset by default.
The dataset field can be a path to a single file or a path to a folder. If the path points to a single file, the file content is loaded directly and used as the final dataset to evaluate. Supported file extensions are csv, txt, xls, xlsx, pq and parquet. For a csv or txt file, you can set a separator field to indicate the separator to use when parsing the file.
If the path is a folder, all files within the folder are automatically concatenated along the row axis to build the final dataset used for the evaluation. For folders, you can use an additional extension field to concatenate only the files with the specified extension in the target folder. By default, dqm-ml attempts to concatenate all files present in the folder.
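The file-or-folder behaviour described above can be illustrated with a small standard-library sketch. This is not the dqm-ml implementation: the function load_rows is hypothetical, it handles only delimited text files (csv or txt), and the file names in the demonstration are invented.

```python
import csv
import tempfile
from pathlib import Path


def load_rows(path, extension=None, separator=","):
    """Load one delimited text file, or concatenate all matching files of a folder.

    Hypothetical illustration of the dataset, extension and separator fields;
    it does not cover xls, xlsx, pq or parquet files.
    """
    path = Path(path)
    if path.is_dir():
        files = sorted(p for p in path.iterdir() if p.is_file())
        if extension is not None:
            # Keep only the files with the requested extension.
            files = [p for p in files if p.suffix.lstrip(".") == extension]
    else:
        files = [path]

    rows = []
    for file in files:
        with file.open(newline="") as handle:
            # Rows of successive files are appended along the row axis.
            rows.extend(csv.DictReader(handle, delimiter=separator))
    return rows


# Demonstration on a temporary folder with two txt files and one csv file.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "a.txt").write_text("col_1;col_5\n1;2\n")
    Path(folder, "b.txt").write_text("col_1;col_5\n3;4\n")
    Path(folder, "notes.csv").write_text("col_1;col_5\n9;9\n")
    rows = load_rows(folder, extension="txt", separator=";")

print(rows)
```

With extension set to "txt", the csv file is ignored and the rows of the two txt files end up in a single dataset.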
For example:
- domain : "representativeness"
  extension: "txt"
  metrics: ["chi-square","GRTE"]
  bins : 10
  distribution : "normal"
  dataset: "tdata/my_data_folder"
  columns_names : ["col_1", "col_5","col_9"]
For the domain gap, because the metrics apply only to image datasets, the definition is quite different. Each item has the following fields:
domain: The name of the domain, here "domain_gap".
metrics: The list of metrics you want to compute. Each item of this list has two fields:
  metric_name: The name of the metric to compute.
  method_config: The user configuration of the metric. In this part you define the source and target datasets, the chosen models, and other user parameters.
An example of pipeline file defining the computations of many metrics from the four domains is given below:
pipeline_definition:
  - domain : "completeness"
    dataset : "tests/sample_data/completeness_sample_data.csv"
    columns_names : ["column_1","column_3","column_6","column_9"]
  - domain : "representativeness"
    metrics: ["chi-square","GRTE"]
    bins : 10
    distribution : normal
    dataset: "tests/sample_data/SMD_test_ds_sample.csv"
    columns_names : ["column_2","column_4", "column_6"]
  - domain : "diversity"
    metrics: ["simpson","gini"]
    dataset: "tests/sample_data/SMD_test_ds_sample.csv"
    columns_names : ["column_2","column_4", "column_6"]
  - domain: "domain_gap"
    metrics:
      - metric_name: wasserstein
        method_config:
          DATA:
            batch_size: 32
            height: 299
            width: 299
            norm_mean: [0.485,0.456,0.406]
            norm_std: [0.229,0.224,0.225]
            source: "tests/sample_data/image_test_ds/c20"
            target: "tests/sample_data/image_test_ds/c33"
          MODEL:
            arch: "resnet18"
            device: "cpu"
            n_layer_feature: -2
          METHOD:
            name: "fid"
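Before running a pipeline, it can help to check that each item carries the fields listed in this section. The sketch below works on the pipeline once it has been loaded into a Python dictionary (for instance with a YAML parser). The function check_pipeline_item is hypothetical and only encodes the rules stated above; dqm-ml may apply different or additional checks.

```python
def check_pipeline_item(item):
    """Return a list of problems found in one pipeline_definition item.

    Hypothetical helper encoding the field rules described in this section.
    """
    domain = item.get("domain")
    if domain is None:
        return ["missing field: domain"]

    problems = []
    if domain == "domain_gap":
        # Domain gap items define their datasets inside each method_config.
        metrics = item.get("metrics") or []
        if not metrics:
            problems.append("missing field: metrics")
        for index, metric in enumerate(metrics):
            for field in ("metric_name", "method_config"):
                if field not in metric:
                    problems.append(f"metrics[{index}]: missing field: {field}")
        return problems

    required = ["dataset"]
    if domain != "completeness":
        required.append("metrics")  # not used for the completeness domain
    if domain == "representativeness":
        required += ["bins", "distribution"]
    problems += [f"missing field: {field}" for field in required if field not in item]
    return problems


pipeline = {
    "pipeline_definition": [
        {
            "domain": "completeness",
            "dataset": "tests/sample_data/completeness_sample_data.csv",
        },
        {
            # bins and distribution are deliberately left out here.
            "domain": "representativeness",
            "metrics": ["chi-square", "GRTE"],
            "dataset": "tests/sample_data/SMD_test_ds_sample.csv",
        },
    ]
}

report = [check_pipeline_item(item) for item in pipeline["pipeline_definition"]]
for item, problems in zip(pipeline["pipeline_definition"], report):
    print(item["domain"], problems)
```

The first item passes, while the second is reported as missing the bins and distribution fields required by the representativeness domain.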
The result file produced at the end of this pipeline is a YAML file containing the content of the pipeline configuration file, augmented with a "scores" field in each item that contains the computed metric scores.
Example of result file:
pipeline_definition:
  - domain: completeness
    dataset: tests/sample_data/completeness_sample_data.csv
    columns_names:
      - column_1
      - column_3
      - column_6
      - column_9
    scores:
      overall_score: 0.61825
      column_1: 1
      column_3: 0.782
      column_6: 0.48
      column_9: 0.211
  - domain: representativeness
    metrics:
      - chi-square
      - GRTE
    bins: 10
    distribution: normal
    dataset: tests/sample_data/SMD_test_ds_sample.csv
    columns_names:
      - column_2
      - column_4
      - column_6
    scores:
      chi-square:
        column_2: 1.8740034461104008e-34
        column_4: 2.7573644464553625e-86
        column_6: 3.469236770038776e-64
      GRTE:
        column_2: 0.8421470393366073
        column_4: 0.7615162001699769
        column_6: 0.6955152215780268
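Once the result file is loaded into a dictionary, the scores can be post-processed like any other Python data. The sketch below copies the completeness scores of the example above. In this example the overall_score happens to equal the mean of the per-column scores; whether dqm-ml always defines it that way is an assumption, and the 0.5 threshold is purely illustrative.

```python
import math
from statistics import mean

# Completeness scores copied from the example result file above.
completeness_scores = {
    "overall_score": 0.61825,
    "column_1": 1,
    "column_3": 0.782,
    "column_6": 0.48,
    "column_9": 0.211,
}

# Separate the per-column scores from the aggregated score.
column_scores = {
    name: score
    for name, score in completeness_scores.items()
    if name != "overall_score"
}

# (1 + 0.782 + 0.48 + 0.211) / 4
average = mean(column_scores.values())
matches_overall = math.isclose(average, completeness_scores["overall_score"])
print(average, matches_overall)

# Flag the columns below an arbitrary, illustrative threshold.
threshold = 0.5
low_columns = sorted(name for name, score in column_scores.items() if score < threshold)
print(low_columns)
```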
To create your own pipeline definition, it is advised to start from one of the existing pipeline examples in the examples/ folder.
Use the dockerized version
To build the docker image locally, run the following command from the root folder of the repository:
docker build . -f dockerfile -t your_image_name:tag
The command line to run the dqm container has the following form:
docker run -e PIPELINE_CONFIG_PATH="path_to_your_pipeline_file" -e RESULT_FILE_PATH="path_to_the_result_file" irtsystemx/dqm-ml:1.1.1
You need to mount the PIPELINE_CONFIG_PATH path to /tmp/in/$PIPELINE_CONFIG_PATH and the RESULT_FILE_PATH path to /tmp/out/$RESULT_FILE_PATH.
Moreover, all dataset directories referenced in your pipeline file must be mounted into the container.
For example, if your pipeline file is stored at examples/pipeline_example_docker.yaml, you want your result file to be stored at results_docker/result_file.yaml, and all the datasets used in your pipeline are stored locally in the /tests folder and referenced as data_storage/.. in your pipeline file, the command would be:
docker run -e PIPELINE_CONFIG_PATH="pipeline_example_docker.yaml" -e RESULT_FILE_PATH="result_file.yaml" -v ${PWD}/examples:/tmp/in -v ${PWD}/tests/:/data_storage/ -v ${PWD}/results_docker:/tmp/out irtsystemx/dqm-ml:1.1.1
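The mounting rules above can be summarised in a short sketch that assembles the docker run arguments. The function build_docker_command and its parameters are hypothetical; only the environment variables, mount points and image tag shown in this section are used. Host folders are resolved to absolute paths, since docker interprets a relative name given to -v as a named volume.

```python
import shlex
from pathlib import Path


def build_docker_command(config_dir, config_name, result_dir, result_name,
                         data_mounts, image="irtsystemx/dqm-ml:1.1.1"):
    """Assemble a docker run command following the mounting rules above.

    Hypothetical helper: data_mounts maps each host dataset folder to the
    location where the pipeline file expects it inside the container.
    """
    command = [
        "docker", "run",
        "-e", f"PIPELINE_CONFIG_PATH={config_name}",
        "-e", f"RESULT_FILE_PATH={result_name}",
        # The folder holding the pipeline file is mounted on /tmp/in.
        "-v", f"{Path(config_dir).resolve()}:/tmp/in",
    ]
    for host_dir, container_dir in data_mounts.items():
        command += ["-v", f"{Path(host_dir).resolve()}:{container_dir}"]
    # The folder receiving the result file is mounted on /tmp/out.
    command += ["-v", f"{Path(result_dir).resolve()}:/tmp/out", image]
    return command


command = build_docker_command(
    config_dir="examples",
    config_name="pipeline_example_docker.yaml",
    result_dir="results_docker",
    result_name="result_file.yaml",
    data_mounts={"tests": "/data_storage/"},
)
print(shlex.join(command))
```

The printed command can be pasted into a terminal, or the list can be passed to subprocess.run once docker is available.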
Users behind a proxy server
The computation of domain gap metrics requires the use of pretrained models that are automatically downloaded by PyTorch into a local cache directory during the first call of those metrics.
For users behind a proxy server, this download may fail. To work around this issue, you can get those pretrained models manually by downloading the zip archive from this link and extracting it into the following folder: your_user_directory/.cache/torch/hub/checkpoints/
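After extracting the archive, you may want to confirm that the files landed in the expected cache folder. The sketch below simply lists that folder; the function list_cached_checkpoints is hypothetical, and the default path assumes the standard location given above (it will differ if your cache has been relocated, for example through the TORCH_HOME environment variable).

```python
from pathlib import Path


def list_cached_checkpoints(cache_dir=None):
    """Return the sorted names of the files found in the checkpoint cache folder.

    Hypothetical helper: it only inspects the folder and downloads nothing.
    """
    if cache_dir is None:
        # Default location described above: <user directory>/.cache/torch/hub/checkpoints/
        cache_dir = Path.home() / ".cache" / "torch" / "hub" / "checkpoints"
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    return sorted(p.name for p in cache_dir.iterdir() if p.is_file())


checkpoints = list_cached_checkpoints()
print(checkpoints if checkpoints else "No cached checkpoint found")
```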
Use the library within your Python code
Each metric is used by importing the corresponding module and class into your code.
For more information about each metric, refer to the specific README.md in the dqm/<metric_name> subfolders.
Available examples
Many examples of DQM-ML applications are available in the /examples folder.
You will find:
2 Jupyter notebooks:
multiple_metrics_tests.ipynb : A notebook applying completeness, diversity and representativeness metrics on an example dataset.
domain_gap.ipynb : A notebook demonstrating an example of applying domain_gap metrics to a generated synthetic dataset.
4 Python scripts:
These scripts, named main_X.py, each give an example of computing the approaches implemented for a metric.
The main_domain_gap.py script must be called with a config file passed as an argument using --cfg.
For example:
python examples/main_domain_gap.py --cfg examples/domain_gap_cfg/cmd/cmd.json
We provide in the /examples/domain_gap_cfg folder a set of config files for each domain_gap approach.
For some domain_gap examples, the 200_bird_dataset is required. It can be downloaded from this link. The zip archive must be extracted into the examples/datasets/ folder.
1 pipeline example, named pipeline_example.yaml, that instantiates every metric implemented in dqm-ml, and its corresponding results file, results_pipeline_example.yaml.
1 pipeline example similar to the previous one but with different dataset paths, as used in the example showing how to use the containerized version.