πŸ’‘ Guideline

Set up a clean virtual environment

On Linux:

pip install virtualenv
virtualenv myenv
source myenv/bin/activate

On Windows:

pip install virtualenv
virtualenv myenv
.\myenv\Scripts\activate

Install the library

You can install it directly from PyPI using the command:

pip install dqm-ml

Or you can install it from the source code by running the following command from the repository root:

pip install .

Usage

There are two ways to use the dqm-ml library:

  • Import dqm package and call the dqm functions within your python code

  • In standalone mode using direct command line from a terminal, or run the DQm-ML container

Standalone mode

You can use dqm-ml directly to evaluate your dataset by running the β€œdqm-ml” command from your terminal.

The command line has the following form:

dqm-ml --pipeline_config_path path_to_your_pipeline_file --result_file_path path_to_your_result_file

This mode requires two user parameters:

  • pipeline_config_path : A path to a yaml file that will define the pipeline of evaluation you want to apply on your datasests

  • result_file_path : A yaml file containing the set of computed scores for each defined metric in your pipeline

For example, if your pipeline file is located at examples/pipeline_example.yaml and you want your result file to be stored at examples/results_pipeline_example.yaml, type in your terminal:

dqm-ml --pipeline_config_path "examples/pipeline_example.yaml" --result_file_path "examples/results_pipeline_example.yaml"

Pipeline definition

A dqm-ml pipeline is a yaml file that contains the list of datasets you want to evaluate and the list of metrics you want to compute on each one. This file has a primary key pipeline_definition containing a list of items, where each item has the following required fields:

  • dataset : The path to the dataset you want to evaluate .

  • domain : The category of metric you want to apply

  • metrics : The list of metrics to compute on the dataset . (For completeness only this field is not used)

For the representativeness domain only, the following additional parameter fields are required:

  • bins :

  • distribution :

You can also use an optional field:

  • columns : The list of columns from your dataset on which you want to restrict the computations of metrics. If this field is missing, by default the metrics are applied on all columns of the given dataset

The dataset field can be a path to a single file or to a folder. If the path points to a single file, the file content is directly loaded and considered as the final dataset to evaluate. Supported file extensions are csv, txt, xls, xlsx, pq and parquet. For csv or txt files, you can set a separator field to indicate the separator used to parse the file (a single-file sketch is given after the example below).

If the defined path is a folder, all files within the folder are automatically concatenated along the row axis to build the final dataset used for the evaluation. For folders, you can use an additional extension field to concatenate only the files with the specified extension in the target folder. By default, concatenation is attempted on all the files present.

For example:

  - domain : "representativeness"
    extension: "txt"
    metrics: ["chi-square","GRTE"]
    bins : 10
    distribution : "normal"
    dataset: "tdata/my_data_folder"
    columns_names : ["col_1", "col_5","col_9"]
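
For a single csv or txt file, the separator field mentioned above is set at the same level as the other fields. The following item is an illustrative sketch (the file path and separator value are placeholders):

  - domain : "completeness"
    dataset : "data/my_single_file.csv"
    separator : ";"
    columns_names : ["col_1", "col_2"]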

For the domain_gap domain, because the metrics apply only to image datasets, the definition is quite different; each item has the following fields:

  • domain: defining the name of the domain thus here β€œdomain_gap”

  • metrics : The list of metrics you want to compute, and for each item you have two fields

    • metrics_name : The name of metric to compute

    • method_config : The user configuration of the metric. In this part you define the source and target datasets, the chosen models, and other user parameters

An example of a pipeline file defining the computation of several metrics from the four domains is given below:

pipeline_definition:
  - domain : "completeness"
    dataset : "tests/sample_data/completeness_sample_data.csv"
    columns_names : ["column_1","column_3","column_6","column_9"]

  - domain : "representativeness"
    metrics: ["chi-square","GRTE"]
    bins : 10
    distribution : normal
    dataset: "tests/sample_data/SMD_test_ds_sample.csv"
    columns_names : ["column_2","column_4", "column_6"]

  - domain : "diversity"
    metrics: ["simpson","gini"]
    dataset: "tests/sample_data/SMD_test_ds_sample.csv"
    columns_names : ["column_2","column_4", "column_6"]

  - domain: "domain_gap"
    metrics:
      - metric_name: wasserstein
        method_config:
            DATA:
                batch_size: 32
                height: 299
                width: 299
                norm_mean: [0.485,0.456,0.406]
                norm_std: [0.229,0.224,0.225]
                source: "tests/sample_data/image_test_ds/c20"
                target: "tests/sample_data/image_test_ds/c33"
            MODEL:
                arch: "resnet18"
                device: "cpu"
                n_layer_feature: -2
            METHOD:
                name: "fid"

The result file produced at the end of this pipeline is a yaml file containing the content of the pipeline configuration file, augmented in each item with a β€œscores” field that contains the computed metric scores.

Example of result file:

pipeline_definition:
- domain: completeness
  dataset: tests/sample_data/completeness_sample_data.csv
  columns_names:
  - column_1
  - column_3
  - column_6
  - column_9
  scores:
    overall_score: 0.61825
    column_1: 1
    column_3: 0.782
    column_6: 0.48
    column_9: 0.211
- domain: representativeness
  metrics:
  - chi-square
  - GRTE
  bins: 10
  distribution: normal
  dataset: tests/sample_data/SMD_test_ds_sample.csv
  columns_names:
  - column_2
  - column_4
  - column_6
  scores:
    chi-square:
      column_2: 1.8740034461104008e-34
      column_4: 2.7573644464553625e-86
      column_6: 3.469236770038776e-64
    GRTE:
      column_2: 0.8421470393366073
      column_4: 0.7615162001699769
      column_6: 0.6955152215780268

To create your own pipeline definition, it is advised to start from one of the existing pipeline examples present in the examples/ folder.

Use the dockerized version

To build the docker image locally, use the following command from the root folder of the repository:

docker build . -f dockerfile -t your_image_name:tag

The command line to run the dqm container has the following form:

docker run -e PIPELINE_CONFIG_PATH="path_to_your_pipeline_file" -e RESULT_FILE_PATH="path_to_the_result_file" irtsystemx/dqm-ml:1.1.1

You need to mount the $PIPELINE_CONFIG_PATH path to /tmp/in/$PIPELINE_CONFIG_PATH and the $RESULT_FILE_PATH to /tmp/out/$RESULT_FILE_PATH. Moreover, all dataset directories referenced in your pipeline file must be mounted in the container.

For example, suppose your pipeline file is stored at examples/pipeline_example_docker.yaml, you want your result file to be stored at results_docker/result_file.yaml, and all the datasets used in your pipeline are stored locally in the tests/ folder while being referenced under data_storage/.. in your pipeline file.

The command would be:

docker run -e PIPELINE_CONFIG_PATH="pipeline_example_docker.yaml" -e RESULT_FILE_PATH="result_file.yaml" -v ${PWD}/examples:/tmp/in -v ${PWD}/tests/:/data_storage/ -v ${PWD}/results_docker:/tmp/out irtsystemx/dqm-ml:1.1.1

Users behind a proxy server

The computation of domain gap metrics requires pretrained models that are automatically downloaded by pytorch into a local cache directory during the first call to those metrics.

For users behind a proxy server, this download could fail. To overcome this issue, you can manually get those pretrained models by downloading the zip archive from this link and extracting it into the following folder: your_user_directory/.cache/torch/hub/checkpoints/
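
If you prefer to script this step, here is a minimal sketch, assuming the archive has already been downloaded manually (the file name pretrained_models.zip is a placeholder):

import zipfile
from pathlib import Path

# Target folder used by pytorch to cache pretrained weights.
checkpoints_dir = Path.home() / ".cache" / "torch" / "hub" / "checkpoints"
checkpoints_dir.mkdir(parents=True, exist_ok=True)

# "pretrained_models.zip" is a placeholder for the archive downloaded from the link above.
with zipfile.ZipFile("pretrained_models.zip") as archive:
    archive.extractall(checkpoints_dir)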

Use the library within your python code

Each metric is used by importing the corresponding module and class into your code. For more information about each metric, refer to the specific README.md in the dqm/<metric_name> subfolders.
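
As an illustrative sketch only (the module, class and method names below are placeholders, not the actual dqm-ml API; refer to the per-metric README.md for the exact imports), a typical usage looks like:

import pandas as pd

# Placeholder import: replace with the module and class documented in
# dqm/<metric_name>/README.md for the metric you want to compute.
from dqm.completeness.metric import CompletenessMetric  # hypothetical name

# Load the dataset to evaluate as a pandas DataFrame.
df = pd.read_csv("tests/sample_data/completeness_sample_data.csv")

# Instantiate the metric and compute scores on the chosen columns.
metric = CompletenessMetric()                          # hypothetical constructor
scores = metric.compute(df[["column_1", "column_3"]])  # hypothetical method
print(scores)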

Available examples

Many examples of DQM-ML applications are available in the folder /examples

You will find:

2 Jupyter notebooks:

  • multiple_metrics_tests.ipynb : A notebook applying completeness, diversity and representativeness metrics on an example dataset.

  • domain_gap.ipynb : A notebook demonstrating an example of applying domain_gap metrics to a generated synthetic dataset.

4 Python scripts:

These scripts, named main_X.py, give examples of computing the approaches implemented for the metrics on sample data.

The main_domain_gap.py script must be called with a config file passed as an argument using --cfg.

For example:

python examples/main_domain_gap.py --cfg examples/domain_gap_cfg/cmd/cmd.json

We provide in the folder /examples/domain_gap_cfg a set of config files for each domain_gap approach.

For some domain_gap examples, the 200_bird_dataset will be required. It can be downloaded from this link. The zip archive shall be extracted into the examples/datasets/ folder.

1 pipeline example, named pipeline_example.yaml, that instantiates every metric implemented in dqm-ml, and its corresponding results file results_pipeline_example.yaml.

1 pipeline example similar to the previous one, but with different dataset paths, as used in the example showing how to use the containerized version.