Experimental Data for the Paper "Using Constraints to Discover Sparse and Alternative Subgroup Descriptions" (Version 2)
These are the experimental data for the second version (v2) of the paper: Bach, Jakob. "Using Constraints to Discover Sparse and Alternative Subgroup Descriptions", published on arXiv in 2025.
If we create further versions of this paper in the future, these experimental data may cover them as well.
Check our GitHub repository for the code and instructions to reproduce the experiments.
We obtained the experimental results on a server with an AMD EPYC 7551 CPU (32 physical cores, base clock of 2.0 GHz) and 160 GB RAM.
The operating system was Ubuntu 20.04.6 LTS.
The Python version was 3.8 for the main experiments and 3.9 for the competitor-runtime experiments.
With this configuration, running the main experimental pipeline (main_experiments/run_experiments.py) took about 34 hours.
Note that the experimental data originate from multiple pipeline runs, as we have two experimental pipelines, and we reran one of them to include additional subgroup-discovery methods:
- The commit hash for the last run of the competitor-runtime experimental pipeline (competitor_runtime_experiments/run_competitor_runtime_experiments.py) is 1a026326b3 (tag: competitor-runtime-2025-01-12).
- The commit hash for the last run of the main experimental pipeline (main_experiments/run_experiments.py), except for the subgroup-discovery methods BSD and SD-Map, is 0a57bcd529 (tag: run-2024-05-13). These data are identical to the experimental data for v1 of the paper.
- The commit hash for the last run of the main experimental pipeline (main_experiments/run_experiments.py) for the (new) subgroup-discovery methods BSD and SD-Map is 50dd82e0fc (tag: run-2025-01-21-arXiv-v2).
The main experimental pipeline did not change between the last two mentioned commits (except for refactorings and the addition of the two new subgroup-discovery methods), and the competitor-runtime experimental pipeline did not change either.
Thus, using the tag run-2025-01-21-arXiv-v2 to reproduce all experiments should yield the same results (except for runtimes and timeout-affected results).
The commit hash for the last run of the two evaluation pipelines (main_experiments/run_evaluation_arxiv.py and competitor_runtime_experiments/run_competitor_runtime_evaluation_arxiv.py) is bc3aafc904 (tag: evaluation-2025-02-16-arXiv-v2).
The experimental data are stored in five folders, i.e., competitor-runtime-datasets/, competitor-runtime-results/, datasets/, plots/, and results/.
Further, the console output of main_experiments/run_evaluation_arxiv.py is stored in Evaluation_console_output_main.txt, and the console output of competitor_runtime_experiments/run_competitor_runtime_evaluation_arxiv.py is stored in Evaluation_console_output_competitor_runtimes.txt.
We manually copied both evaluation outputs from the console to a file.
In the following, we describe the structure and content of each data file.
competitor-runtime-datasets/
These are the input data for the competitor-runtime experimental pipeline competitor_runtime_experiments/run_competitor_runtime_experiments.py.
They were obtained with the script competitor_runtime_experiments/prepare_competitor_runtime_datasets.py.
The folder structure of competitor-runtime-datasets/ is similar to that of the dataset folder of the main experiments (datasets/), so please consult the corresponding section of this document for more details on the contained file types.
The main difference is that only five of the 27 PMLB datasets are included, plus the iris dataset provided by scikit-learn.
competitor-runtime-results/
These are the output data of the competitor-runtime experimental pipeline in the form of CSVs, produced by the script competitor_runtime_experiments/run_competitor_runtime_experiments.py.
_results.csv contains all results merged into one file and acts as input for the script competitor_runtime_experiments/run_competitor_runtime_evaluation_arxiv.py.
The remaining 325 files are subsets of the results.
The competitor-runtime experimental pipeline parallelizes over 6 datasets, 5 cross-validation folds, and 17 subgroup-discovery methods.
Thus, a full cross-product would yield 6 * 5 * 17 = 510 files containing subsets of the results, but some subgroup-discovery methods timed out on some datasets, so the corresponding result files are missing.
Each row in a result file corresponds to one subgroup.
One can identify individual subgroup-discovery runs with a combination of multiple columns, i.e.:
- dataset: dataset_name
- cross-validation fold: split_idx
- subgroup-discovery method: sd_name
- feature-cardinality threshold: param.k
The remaining column, fitting_time, represents the evaluation metric.
In detail, all result files for the competitor-runtime experiments contain the following columns (whose names are consistent with the main experiments):
fitting_time (non-negative float): The runtime (in seconds) of the subgroup-discovery method.
dataset_name (string, 6 different values): The name of the dataset used for subgroup discovery.
split_idx (int in [0, 4]): The index of the cross-validation fold of the dataset used for subgroup discovery.
sd_name (string, 17 different values): The name of the subgroup-discovery method, consisting of the package name and the algorithm name (e.g., sd4py.Beam).
param.k (int in [1, 5]): The feature-cardinality threshold for subgroup descriptions.
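For example, a minimal sketch (our own illustration, assuming pandas is installed; file and column names as described above) that loads the merged competitor-runtime results and aggregates the runtimes:
import pandas as pd
results = pd.read_csv('competitor-runtime-results/_results.csv')
# Median runtime (in seconds) per subgroup-discovery method and
# feature-cardinality threshold, aggregated over datasets and folds:
print(results.groupby(['sd_name', 'param.k'])['fitting_time'].median())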
datasets/
These are the input data for the main experimental pipeline main_experiments/run_experiments.py, i.e., the prediction datasets.
The folder contains one overview file, one license file, and two files for each of the 27 datasets.
The original datasets were downloaded from PMLB with the script main_experiments/prepare_datasets.py.
Note that we do not own the copyright for these datasets.
However, the GitHub repository of PMLB, which stores the original datasets, is MIT-licensed ((c) 2016 Epistasis Lab at UPenn).
Thus, we include the file LICENSE from that repository.
After downloading from PMLB, we split each dataset into the feature part (_X.csv) and the target part (_y.csv), which we save separately.
Both file types are CSVs that only contain numeric values (categorical features are ordinally encoded in PMLB) except for the column names.
There are no missing values.
Each row corresponds to a data object (= instance, sample), and each column corresponds to either a feature (in _X) or the target (in _y).
The first line in each _X file contains the names of the features as strings; _y files contain only one column, which is always named target.
For the prediction target, we ensured that the minority (i.e., less frequent) class is the positive class (i.e., has the class label 1), so the labeling may differ from PMLB.
_dataset_overview.csv contains meta-data for the datasets, like the number of instances and features.
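For example, a minimal sketch (with a placeholder dataset name, and assuming the files follow the <dataset>_X.csv / <dataset>_y.csv naming scheme implied above) to load one dataset's features and target with pandas:
import pandas as pd
dataset_name = 'some_dataset'  # placeholder; see _dataset_overview.csv for the actual names
X = pd.read_csv(f'datasets/{dataset_name}_X.csv')  # features, one column per feature
y = pd.read_csv(f'datasets/{dataset_name}_y.csv')['target']  # binary target (1 = minority class)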
plots/
These are the output files of the main evaluation pipeline main_experiments/run_evaluation_arxiv.py.
We include these plots in our paper.
results/
These are the output data of the main experimental pipeline in the form of CSVs, produced by the script main_experiments/run_experiments.py.
_results.csv contains all results merged into one file and acts as input for the script main_experiments/run_evaluation_arxiv.py.
The remaining files are subsets of the results, as the main experimental pipeline parallelizes over 27 datasets, 5 cross-validation folds, and 8 subgroup-discovery methods.
Thus, there are 27 * 5 * 8 = 1080 files containing subsets of the results.
Each row in a result file corresponds to one subgroup.
One can identify individual subgroup-discovery runs with a combination of multiple columns, i.e.:
- dataset: dataset_name
- cross-validation fold: split_idx
- subgroup-discovery method: sd_name
- feature-cardinality threshold: param.k (missing value if no feature-cardinality constraint employed)
- solver timeout: param.timeout (missing value if not solver-based search)
- number of alternatives: param.a (missing value if only original subgroup searched)
- dissimilarity threshold: param.tau_abs (missing value if only original subgroup searched)
For each value combination of these seven columns, there is either one subgroup (search for original subgroups) or six subgroups (search for alternative subgroup descriptions, in which case the column alt.number identifies individual subgroups within a search run).
Further, note that the last four mentioned columns contain missing values, which should be treated as a category of their own.
In particular, if you use groupby() from pandas for analyzing the results and want to include any of these four columns in the grouping, you should either fill in the missing values with an (arbitrary) placeholder value or use the parameter dropna=False, because the grouping otherwise (by default) drops rows with missing values in the grouping columns.
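For example, a minimal sketch (our own illustration) that averages an evaluation metric over all seven identifying columns while keeping groups whose keys contain missing values:
import pandas as pd
results = pd.read_csv('results/_results.csv')
group_cols = ['dataset_name', 'split_idx', 'sd_name', 'param.k',
              'param.timeout', 'param.a', 'param.tau_abs']
# dropna=False retains groups whose key contains NaN (e.g., no solver timeout):
print(results.groupby(group_cols, dropna=False)['train_nwracc'].mean())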
The remaining columns represent results and evaluation metrics.
In detail, all result files contain the following columns:
objective_value (float >= -0.25 + missing values): Objective value of the subgroup-discovery method on the training set.
Usually quantifies WRAcc (in [-0.25, 0.25]) when searching original subgroups and normalized Hamming similarity (in [0, 1]) when searching alternative subgroup descriptions.
First exception: The subgroup-discovery method MORS has missing values since MORS does not explicitly compute an objective when searching for subgroups.
Second exception: The subgroup-discovery methods BSD and SD-Map use WRAcc times the number of data objects (a dataset-dependent constant) as objective, so the objective value is higher than for the remaining subgroup-discovery methods.
optimization_status (string, 2 different values + missing values): For SMT, sat if an optimal solution was found and unknown in case of a solver timeout.
Missing value for all other subgroup-discovery methods (which do not use solver timeouts).
optimization_time (non-negative float): The optimization runtime (in seconds) inside the subgroup-discovery method, i.e., without pre- and post-processing steps.
fitting_time (non-negative float): The complete runtime (in seconds) of the subgroup-discovery method (as reported in the paper), i.e., including pre- and post-processing steps.
Very similar to optimization_time except for SMT as the subgroup-discovery method, which may spend a considerable amount of time formulating the optimization problem.
train_wracc (float in [-0.25, 0.25]): The weighted relative accuracy (WRAcc) of the subgroup description on the training set.
test_wracc (float in [-0.25, 0.25]): The weighted relative accuracy (WRAcc) of the subgroup description on the test set.
train_nwracc (float in [-1, 1]): The normalized weighted relative accuracy (WRAcc divided by its dataset-dependent maximum) of the subgroup description on the training set.
test_nwracc (float in [-1, 1]): The normalized weighted relative accuracy (WRAcc divided by its dataset-dependent maximum) of the subgroup description on the test set.
box_lbs (list of floats, e.g., [-inf, 0, -inf, -2, 8]): The lower bounds for each feature in the subgroup description.
Negative infinity if a feature's lower bound did not exclude any data objects from the subgroup.
box_ubs (list of floats, e.g., [inf, 10, inf, 5, 9]): The upper bounds for each feature in the subgroup description.
Positive infinity if a feature's upper bound did not exclude any data objects from the subgroup.
selected_feature_idxs (list of non-negative ints, e.g., [0, 4, 5]): The indices (starting from 0) of the features selected (= restricted) in the subgroup description.
Is an empty list, i.e., [], if no feature was restricted (so the subgroup contains all data objects).
dataset_name (string, 27 different values): The name of the PMLB dataset used for subgroup discovery.
split_idx (int in [0, 4]): The index of the cross-validation fold of the dataset used for subgroup discovery.
sd_name (string, 8 different values): The name of the subgroup-discovery method (Beam, BI, BSD, MORS, PRIM, Random, SD-Map, or SMT).
param.k (int in [1, 5] + missing values): The feature-cardinality threshold for subgroup descriptions.
Missing value if unconstrained subgroup discovery.
Always 3 if alternative subgroup descriptions searched.
param.timeout (int in [1, 2048] + missing values): For SMT, the solver timeout (in seconds) for optimization (not including the time for formulating the optimization problem).
Missing value for all other subgroup-discovery methods.
alt.hamming (float in [0, 1] + missing values): Normalized Hamming similarity between the current subgroup (original or alternative) and the original subgroup if alternative subgroup descriptions searched.
Missing value if only original subgroup searched.
alt.jaccard (float in [0, 1] + missing values): Jaccard similarity between the current subgroup (original or alternative) and the original subgroup if alternative subgroup descriptions searched.
Missing value if only original subgroup searched.
alt.number (int in [0, 5] + missing values): The number of the current alternative if alternative subgroup descriptions searched.
Missing value if only original subgroup searched.
Thus, original subgroups have either 0 or a missing value in this column (i.e., for experimental settings where alternative subgroup descriptions are searched, there is no separate search for an original subgroup, only a joint sequential search for the original and its alternatives).
param.a (int with value 5 + missing values): The number of desired alternative subgroup descriptions, not counting the original (zeroth) subgroup description.
Missing value if only original subgroup searched.
param.tau_abs (int in [1, 3] + missing values): The dissimilarity threshold for alternatives, corresponding to the absolute number of features that have to be deselected from the original subgroup description and each prior alternative.
Missing value if only original subgroup searched.
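To make the metric and box columns concrete, the following sketch (our own illustration on synthetic data, not the paper's code) recomputes WRAcc in its common form, i.e., the subgroup's coverage times its lift in the positive rate, and checks box membership; the normalization assumes that the dataset-dependent maximum p * (1 - p) is attained by a subgroup containing exactly all positive data objects:
import numpy as np
rng = np.random.default_rng(25)
X = rng.normal(size=(100, 3))            # synthetic features (illustration only)
y = (rng.random(100) < 0.3).astype(int)  # synthetic binary target (1 = positive)
lbs = np.array([-np.inf, 0.0, -np.inf])  # bounds as in box_lbs/box_ubs
ubs = np.array([np.inf, 1.5, np.inf])
in_subgroup = ((X >= lbs) & (X <= ubs)).all(axis=1)  # box membership per data object
p_dataset = y.mean()                                 # positive rate in the dataset
coverage = in_subgroup.mean()                        # relative subgroup size
p_subgroup = y[in_subgroup].mean() if in_subgroup.any() else 0.0
wracc = coverage * (p_subgroup - p_dataset)          # in [-0.25, 0.25]
nwracc = wracc / (p_dataset * (1 - p_dataset))       # in [-1, 1] (assumed normalization)
print(wracc, nwracc)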
You can easily read in any of the result files with pandas:
import pandas as pd
results = pd.read_csv('results/_results.csv')
All result files are comma-separated and contain plain numbers and unquoted strings, apart from the columns box_lbs, box_ubs, and selected_feature_idxs (which represent lists and whose values are quoted except for empty lists).
The first line in each result file contains the column names.
You can use the following code to make sure that the lists of feature indices are treated as lists (rather than plain strings):
import ast
results['selected_feature_idxs'] = results['selected_feature_idxs'].apply(ast.literal_eval)
Note that this conversion does not work for box_lbs and box_ubs, whose lists contain not only ordinary numbers but also -inf and inf; see this Stack Overflow post for potential alternatives.
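One such alternative (a minimal sketch of our own, not taken from the evaluation pipeline): split the list strings manually and rely on Python's float(), which parses 'inf' and '-inf' natively:
def parse_float_list(value):
    # Turns a string like '[-inf, 0.0, inf]' into a list of floats;
    # returns an empty list for '[]'.
    stripped = value.strip('[]')
    return [float(item) for item in stripped.split(',')] if stripped else []

results['box_lbs'] = results['box_lbs'].apply(parse_float_list)
results['box_ubs'] = results['box_ubs'].apply(parse_float_list)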