Alternative identifier:
(KITopen-DOI) 10.5445/IR/1000148891
Related identifier:
Creator(s):
Bach, Jakob https://orcid.org/0000-0003-0301-2798 [Institut für Programmstrukturen und Datenorganisation (IPD), Karlsruher Institut für Technologie (KIT)]

Zoller, Kolja [Computational Materials Science (IAM-CMS), Karlsruher Institut für Technologie (KIT)]

Schulz, Katrin [Computational Materials Science (IAM-CMS), Karlsruher Institut für Technologie (KIT)]
Contributors:
-
Title:
Experimental Data for the Paper "An Empirical Evaluation of Constrained Feature Selection"
Additional titles:
-
Description:
(Abstract) These are the experimental data for the paper Bach, Jakob, et al.: "An Empirical Evaluation of Constrained Feature Selection", published in the journal [*SN Computer Science*](https://www.springer.com/journal/42979). You can find the paper [here](https://doi.org/10.1007/s42979-022-01338-z) and the code [here](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection). See the `README` for details. Some of the datasets used in our study (which we also provide here) originate from [OpenML](https://www.openml.org) and are CC-BY-licensed. Please see the paragraph `Licensing` in the `README` for details, e.g., on the authors of these datasets.
(Technical Remarks)

# Experimental Data for the Paper "An Empirical Evaluation of Constrained Feature Selection"

These are the experimental data for the paper

> Bach, Jakob, et al. "An Empirical Evaluation of Constrained Feature Selection"

accepted for publication in the journal [*SN Computer Science*](https://www.springer.com/journal/42979).
Check our [GitHub repository](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection) for the code and instructions to reproduce the experiments.
The data were obtained on a server with an `AMD EPYC 7551` [CPU](https://www.amd.com/en/products/cpu/amd-epyc-7551) (32 physical cores, base clock of 2.0 GHz) and 128 GB RAM.
The Python version was `3.8`.

Our paper contains two studies, and we provide data for both of them.
Running the experimental pipeline for the study with synthetic constraints (`syn_pipeline.py`) took several hours.
The commit hash for the last run of this pipeline is [`acc34cf5d2`](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection/tree/acc34cf5d22b0a8427852a01288bb8b34f5d8c98).
The commit hash for the last run of the corresponding evaluation (`syn_evaluation.py`) is [`c1a7e7e99e`](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection/tree/c1a7e7e99e56c1a178a602596c13641d7771df0a).
Running the experimental pipeline for the case study in materials science (`ms_pipeline.py`) took less than one hour.
The commit hash for the last run of this pipeline is [`ba30bf9f11`](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection/tree/ba30bf9f11703e2a8a942425e2cd4b9f36ead513).
The commit hash for the last run of the corresponding evaluation (`ms_evaluation.py`) is [`c1a7e7e99e`](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection/tree/c1a7e7e99e56c1a178a602596c13641d7771df0a).
All these commits are also tagged.

In the following, we describe the structure and content of each data file.
All files are plain CSVs, so you can read them with `pandas.read_csv()`.

## `ms/`

The input data for the case study in materials science (`ms_pipeline.py`).
Output of the script `prepare_ms_dataset.py`.
As the raw simulation dataset is quite large, we only provide a pre-processed version of it (we do not provide the input to `prepare_ms_dataset.py`).
In this pre-processed version, the feature part and the target part of the data are already separated into two files: `voxel_data_predict_glissile_X.csv` and `voxel_data_predict_glissile_y.csv`.
In `voxel_data_predict_glissile_X.csv`, each column is a numeric feature.
`voxel_data_predict_glissile_y.csv` contains only one column, the numeric prediction target (reaction density of glissile reactions).
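For example, the case-study data can be loaded roughly as follows (a minimal sketch, assuming the archive has been extracted so that `ms/` lies in the working directory; the variable names are only illustrative):

```python
import pandas as pd

# Features: each column of the file is one numeric feature.
X = pd.read_csv('ms/voxel_data_predict_glissile_X.csv')

# Target: the file has a single numeric column (reaction density of glissile
# reactions); squeeze() turns the one-column DataFrame into a Series.
y = pd.read_csv('ms/voxel_data_predict_glissile_y.csv').squeeze(axis='columns')

print(X.shape, y.shape)
```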
## `ms-results/`

Contains only one result file (`results.csv`) for the case study in materials science.
Output of the script `ms_pipeline.py`, input to the script `ms_evaluation.py`.
The columns of the file mostly correspond to evaluation metrics used in the paper; see Appendix A.1 of the paper for definitions of the latter.

- `objective_value` (float): Objective `Q(s, X, y)`, the sum of the qualities of the selected features.
- `num_selected` (int): `n_{se}`, the number of selected features.
- `selected` (string, but actually a list of strings): Names of the selected features.
- `num_variables` (int): `n`, the total number of features in the dataset.
- `num_constrained_variables` (int): `n_{cf}`, the number of features involved in constraints.
- `num_unique_constrained_variables` (int): `n_{ucf}`, the number of unique features involved in constraints.
- `num_constraints` (int): `n_{co}`, the number of constraints.
- `frac_solutions` (float): `n_{so}^{norm}`, the number of valid (regarding constraints) feature sets relative to the total number of feature sets.
- `linear-regression_train_r2` (float): `R^2` (coefficient of determination) of linear-regression models, trained with the selected features, predicting on the training set.
- `linear-regression_test_r2` (float): `R^2` of linear-regression models, trained with the selected features, predicting on the test set.
- `regression-tree_train_r2` (float): `R^2` of regression-tree models, trained with the selected features, predicting on the training set.
- `regression-tree_test_r2` (float): `R^2` of regression-tree models, trained with the selected features, predicting on the test set.
- `xgb-linear_train_r2` (float): `R^2` of linear XGBoost models, trained with the selected features, predicting on the training set.
- `xgb-linear_test_r2` (float): `R^2` of linear XGBoost models, trained with the selected features, predicting on the test set.
- `xgb-tree_train_r2` (float): `R^2` of tree-based XGBoost models, trained with the selected features, predicting on the training set.
- `xgb-tree_test_r2` (float): `R^2` of tree-based XGBoost models, trained with the selected features, predicting on the test set.
- `evaluation_time` (float): Runtime (in s) for evaluating one set of constraints.
- `split_idx` (int): Index of the cross-validation fold.
- `quality_name` (string): Measure of feature quality (absolute correlation or mutual information).
- `constraint_name` (string): Name of the constraint type (see paper).
- `dataset_name` (string): Name of the dataset.

## `openml/`

The input data for the study with synthetic constraints (`syn_pipeline.py`).
Output of the script `prepare_openml_datasets.py`.
We downloaded 35 datasets from [OpenML](https://www.openml.org) and removed non-numeric columns.
Also, we separated the feature part (`*_X.csv`) and the target part (`*_y.csv`) of each dataset.
`_data_overview.csv` contains meta-data for the datasets, including dataset id, dataset version, and uploader.

**Licensing**

Please consult each dataset's website on [OpenML](https://www.openml.org) for licensing information and citation requests.
According to OpenML's [terms](https://www.openml.org/terms), OpenML datasets fall under the [CC-BY](https://creativecommons.org/licenses/by/4.0/) license.
The datasets used in our study were uploaded by:

- Jan van Rijn (user id: 1)
- Joaquin Vanschoren (user id: 2)
- Rafael Gomes Mantovani (user id: 64)
- Tobias Kuehn (user id: 94)
- Richard Ooms (user id: 8684)
- R P (user id: 15317)

See `_data_overview.csv` to match each dataset to its uploader.

## `openml-results/`

Result files for the study with synthetic constraints.
Output of the script `syn_pipeline.py`, input to the script `syn_evaluation.py`.
One result file for each combination of the 10 constraint generators and the 35 datasets, plus one overall (merged) file, `results.csv`.
The columns of the result files are those of `ms-results/results.csv`, minus `selected` and `evaluation_time`; see above for detailed descriptions.
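As an illustrative sketch of how the merged result file can be analyzed (not part of the data package itself; it assumes `openml-results/results.csv` has been extracted into the working directory and uses only the columns documented above):

```python
import pandas as pd

# Merged results of the study with synthetic constraints.
results = pd.read_csv('openml-results/results.csv')

# Example: aggregate per constraint type, averaging over datasets,
# cross-validation folds, and feature-quality measures.
summary = results.groupby('constraint_name')[
    ['frac_solutions', 'num_selected',
     'linear-regression_test_r2', 'xgb-tree_test_r2']
].mean()
print(summary.round(2))
```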
Keywords:
Feature selection
Constraints
Domain knowledge
Theory-guided data science
Related information:
-
Language:
-
Year of creation:
Subject area:
Computer Science
Object type:
Dataset
Data source:
-
Software used:
-
Data processing:
-
Year of publication:
Rights holder(s):

Zoller, Kolja

Schulz, Katrin
Funding:
-
Status:
Published
Submitted by:
kitopen
Created on:
Archiving date:
2023-06-21
Archive size:
266.9 MB
Archive creator:
kitopen
Archive checksum:
213185fcdd4b34111aa2319a3848f4eb (MD5)
Embargo period:
-