Description:
(Abstract) These are the experimental data for the paper

> Bach, Jakob, et al. "An Empirical Evaluation of Constrained Feature Selection"

published at the journal [*SN Computer Science*](https://www.springer.com/journal/42979).
You can find the paper [here](https://doi.org/10.1007/s42979-022-01338-z) and the code [here](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection).
See the `README` for details.
Some of the datasets used in our study (which we also provide here) originate from [OpenML](https://www.openml.org) and are CC-BY-licensed.
Please see the paragraph `Licensing` in the `README` for details, e.g., on the authors of these datasets.
(Technical Remarks) # Experimental Data for the Paper "An Empirical Evaluation of Constrained Feature Selection"

These are the experimental data for the paper

> Bach, Jakob, et al. "An Empirical Evaluation of Constrained Feature Selection"

accepted at the journal [*SN Computer Science*](https://www.springer.com/journal/42979).
Check our [GitHub repository](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection) for the code and instructions to reproduce the experiments.
The data were obtained on a server with an `AMD EPYC 7551` [CPU](https://www.amd.com/en/products/cpu/amd-epyc-7551) (32 physical cores, base clock of 2.0 GHz) and 128 GB RAM.
The Python version was `3.8`.
Our paper contains two studies, and we provide data for both of them.
Running the experimental pipeline for the study with synthetic constraints (`syn_pipeline.py`) took several hours.
The commit hash for the last run of this pipeline is [`acc34cf5d2`](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection/tree/acc34cf5d22b0a8427852a01288bb8b34f5d8c98).
The commit hash for the last run of the corresponding evaluation (`syn_evaluation.py`) is [`c1a7e7e99e`](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection/tree/c1a7e7e99e56c1a178a602596c13641d7771df0a).
Running the experimental pipeline for the case study in materials science (`ms_pipeline.py`) took less than one hour.
The commit hash for the last run of this pipeline is [`ba30bf9f11`](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection/tree/ba30bf9f11703e2a8a942425e2cd4b9f36ead513).
The commit hash for the last run of the corresponding evaluation (`ms_evaluation.py`) is [`c1a7e7e99e`](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection/tree/c1a7e7e99e56c1a178a602596c13641d7771df0a).
All these commits are also tagged.
In the following, we describe the structure/content of each data file.
All files are plain CSVs, so you can read them with `pandas.read_csv()`.
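As a minimal sketch of this, the following snippet reads a CSV with `pandas.read_csv()`; `io.StringIO` (with made-up toy content) stands in for a path to one of the provided files:

```python
import io

import pandas as pd

# Toy stand-in for a file path such as 'openml-results/results.csv';
# the two column names here are real columns described below.
demo_file = io.StringIO("num_selected,objective_value\n10,3.2\n5,1.7\n")
df = pd.read_csv(demo_file)
print(df.shape)  # (2, 2)
```

For the actual data, pass the file path instead, e.g., `pd.read_csv('ms-results/results.csv')`.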
## `ms/`
The input data for the case study in materials science (`ms_pipeline.py`).
Output of the script `prepare_ms_dataset.py`.
As the raw simulation dataset is quite large, we only provide a pre-processed version of it
(i.e., we do not provide the input to `prepare_ms_dataset.py`).
In this pre-processed version, the feature and target parts of the data are already separated into two files: `voxel_data_predict_glissile_X.csv` and `voxel_data_predict_glissile_y.csv`.
In `voxel_data_predict_glissile_X.csv`, each column is a numeric feature.
`voxel_data_predict_glissile_y.csv` only contains one column, the numeric prediction target (reaction density of glissile reactions).
## `ms-results/`
Contains only one result file (`results.csv`) for the case study in materials science.
Output of the script `ms_pipeline.py`, input to the script `ms_evaluation.py`.
The columns of the file mostly correspond to evaluation metrics used in the paper;
see Appendix A.1 there for their definitions.
- `objective_value` (float): Objective `Q(s, X, y)`, the sum of the qualities of the selected features.
- `num_selected` (int): `n_{se}`, the number of selected features.
- `selected` (string, but actually a list of strings): Names of the selected features.
- `num_variables` (int): `n`, the total number of features in the dataset.
- `num_constrained_variables` (int): `n_{cf}`, the number of features involved in constraints.
- `num_unique_constrained_variables` (int): `n_{ucf}`, the number of unique features involved in constraints.
- `num_constraints` (int): `n_{co}`, the number of constraints.
- `frac_solutions` (float): `n_{so}^{norm}`, the number of feature sets that satisfy the constraints, relative to the total number of feature sets.
- `linear-regression_train_r2` (float): `R^2` (coefficient of determination) for linear-regression models, trained with the selected features, predicting on the training set.
- `linear-regression_test_r2` (float): `R^2` (coefficient of determination) for linear-regression models, trained with the selected features, predicting on the test set.
- `regression-tree_train_r2` (float): `R^2` (coefficient of determination) for regression-tree models, trained with the selected features, predicting on the training set.
- `regression-tree_test_r2` (float): `R^2` (coefficient of determination) for regression-tree models, trained with the selected features, predicting on the test set.
- `xgb-linear_train_r2` (float): `R^2` (coefficient of determination) for linear XGBoost models, trained with the selected features, predicting on the training set.
- `xgb-linear_test_r2` (float): `R^2` (coefficient of determination) for linear XGBoost models, trained with the selected features, predicting on the test set.
- `xgb-tree_train_r2` (float): `R^2` (coefficient of determination) for tree-based XGBoost models, trained with the selected features, predicting on the training set.
- `xgb-tree_test_r2` (float): `R^2` (coefficient of determination) for tree-based XGBoost models, trained with the selected features, predicting on the test set.
- `evaluation_time` (float): Runtime (in seconds) for evaluating one set of constraints.
- `split_idx` (int): Index of the cross-validation fold.
- `quality_name` (string): Measure for feature quality (absolute correlation or mutual information).
- `constraint_name` (string): Name of the constraint type (see paper).
- `dataset_name` (string): Name of the dataset.
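As a minimal sketch of working with these columns (using made-up toy rows and only two of the columns above), one could average a model's train and test `R^2` over the cross-validation folds like this:

```python
import io

import pandas as pd

# Toy stand-in for ms-results/results.csv with two folds; the real file
# contains all columns listed above and many more rows.
csv_content = io.StringIO(
    "split_idx,xgb-tree_train_r2,xgb-tree_test_r2\n"
    "0,0.95,0.80\n"
    "1,0.93,0.78\n"
)
results = pd.read_csv(csv_content)
# Mean R^2 per column over the cross-validation folds:
fold_means = results[["xgb-tree_train_r2", "xgb-tree_test_r2"]].mean()
print(fold_means)
```

The same pattern applies to the other `*_train_r2` / `*_test_r2` column pairs.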
## `openml/`
The input data for the study with synthetic constraints (`syn_pipeline.py`).
Output of the script `prepare_openml_datasets.py`.
We downloaded 35 datasets from [OpenML](https://www.openml.org) and removed non-numeric columns.
Also, we separated the feature part (`*_X.csv`) and the target part (`*_y.csv`) of each dataset.
`_data_overview.csv` contains meta-data for the datasets, including dataset id, dataset version, and uploader.
**Licensing**
Please consult each dataset's website on [OpenML](https://www.openml.org) for licensing information and citation requests.
According to OpenML's [terms](https://www.openml.org/terms), OpenML datasets fall under the [CC-BY](https://creativecommons.org/licenses/by/4.0/) license.
The datasets used in our study were uploaded by:
- Jan van Rijn (user id: 1)
- Joaquin Vanschoren (user id: 2)
- Rafael Gomes Mantovani (user id: 64)
- Tobias Kuehn (user id: 94)
- Richard Ooms (user id: 8684)
- R P (user id: 15317)
See `_data_overview.csv` to match each dataset to its uploader.
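As a minimal sketch of such a lookup (with toy rows; the actual column names in `_data_overview.csv` may differ from the illustrative `name` and `uploader` used here), one could retrieve a dataset's uploader id like this:

```python
import io

import pandas as pd

# Toy stand-in for openml/_data_overview.csv; column names are illustrative.
overview_file = io.StringIO(
    "name,data_id,version,uploader\n"
    "credit-g,31,1,1\n"
    "semeion,1501,1,2\n"
)
meta = pd.read_csv(overview_file)
# Uploader (OpenML user id) of one particular dataset:
uploader = int(meta.loc[meta["name"] == "credit-g", "uploader"].iloc[0])
print(uploader)  # 1
```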
## `openml-results/`
Result files for the study with synthetic constraints.
Output of the script `syn_pipeline.py`, input to the script `syn_evaluation.py`.
One result file for each combination of the 10 constraint generators and the 35 datasets, plus one overall (merged) file, `results.csv`.
The columns of the result files are those of `ms-results/results.csv`, minus `selected` and `evaluation_time`;
see above for detailed descriptions.
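Since the overall `results.csv` is simply the row-wise concatenation of the per-combination files, it can be recreated as follows (a sketch using toy files written to a temporary directory in place of the real `openml-results/` layout):

```python
import pathlib
import tempfile

import pandas as pd

# Toy stand-in for openml-results/: two per-combination files.
tmp_dir = pathlib.Path(tempfile.mkdtemp())
pd.DataFrame({"dataset_name": ["credit-g"], "num_selected": [3]}).to_csv(
    tmp_dir / "generator1_credit-g.csv", index=False)
pd.DataFrame({"dataset_name": ["semeion"], "num_selected": [5]}).to_csv(
    tmp_dir / "generator1_semeion.csv", index=False)

# Merge all per-combination files, skipping the overall file if present:
files = sorted(p for p in tmp_dir.glob("*.csv") if p.name != "results.csv")
merged = pd.concat((pd.read_csv(p) for p in files), ignore_index=True)
print(len(merged))  # 2
```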