Description:
(Abstract) These are the experimental data for the paper

> Bach, Jakob, et al. "An Empirical Evaluation of Constrained Feature Selection"

published at the journal [*SN Computer Science*](https://www.springer.com/journal/42979).
You can find the paper [here](https://doi.org/10.1007/s42979-022-01338-z) and the code [here](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection).
See the `README` for details.
Some of the datasets used in our study (which we also provide here) originate from [OpenML](https://www.openml.org) and are CC-BY-licensed.
Please see the paragraph `Licensing` in the `README` for details, e.g., on the authors of these datasets.
(Technical Remarks) # Experimental Data for the Paper "An Empirical Evaluation of Constrained Feature Selection"

These are the experimental data for the paper

> Bach, Jakob, et al. "An Empirical Evaluation of Constrained Feature Selection"

accepted at the journal [*SN Computer Science*](https://www.springer.com/journal/42979).
Check our [GitHub repository](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection) for the code and instructions to reproduce the experiments.
The data were obtained on a server with an `AMD EPYC 7551` [CPU](https://www.amd.com/en/products/cpu/amd-epyc-7551) (32 physical cores, base clock of 2.0 GHz) and 128 GB RAM.
The Python version was `3.8`.
Our paper contains two studies, and we provide data for both of them.
Running the experimental pipeline for the study with synthetic constraints (`syn_pipeline.py`) took several hours.
The commit hash for the last run of this pipeline is [`acc34cf5d2`](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection/tree/acc34cf5d22b0a8427852a01288bb8b34f5d8c98).
The commit hash for the last run of the corresponding evaluation (`syn_evaluation.py`) is [`c1a7e7e99e`](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection/tree/c1a7e7e99e56c1a178a602596c13641d7771df0a).
Running the experimental pipeline for the case study in materials science (`ms_pipeline.py`) took less than one hour.
The commit hash for the last run of this pipeline is [`ba30bf9f11`](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection/tree/ba30bf9f11703e2a8a942425e2cd4b9f36ead513).
The commit hash for the last run of the corresponding evaluation (`ms_evaluation.py`) is [`c1a7e7e99e`](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection/tree/c1a7e7e99e56c1a178a602596c13641d7771df0a).
All these commits are also tagged.
In the following, we describe the structure/content of each data file.
All files are plain CSVs, so you can read them with `pandas.read_csv()`.
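As a minimal sketch of this, the following snippet reads a CSV with `pandas.read_csv()`; `io.StringIO` (with made-up toy content) stands in for a path to one of the provided files:

```python
import io

import pandas as pd

# Toy stand-in for a file path such as 'openml-results/results.csv';
# the two column names here are real columns described below.
demo_file = io.StringIO("num_selected,objective_value\n10,3.2\n5,1.7\n")
df = pd.read_csv(demo_file)
print(df.shape)  # (2, 2)
```

For the actual data, pass the file path instead, e.g., `pd.read_csv('ms-results/results.csv')`.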
## `ms/`
The input data for the case study in materials science (`ms_pipeline.py`).
Output of the script `prepare_ms_dataset.py`.
As the raw simulation dataset is quite large, we only provide a pre-processed version of it
(i.e., we do not provide the input to `prepare_ms_dataset.py`).
In this pre-processed version, the feature and target parts of the data are already separated into two files: `voxel_data_predict_glissile_X.csv` and `voxel_data_predict_glissile_y.csv`.
In `voxel_data_predict_glissile_X.csv`, each column is a numeric feature.
`voxel_data_predict_glissile_y.csv` only contains one column, the numeric prediction target (reaction density of glissile reactions).
## `ms-results/`
Contains only one result file (`results.csv`) for the case study in materials science.
Output of the script `ms_pipeline.py`, input to the script `ms_evaluation.py`.
The columns of the file mostly correspond to evaluation metrics used in the paper;
see Appendix A.1 there for their definitions.
- `objective_value` (float): Objective `Q(s, X, y)`, the sum of the qualities of the selected features.
- `num_selected` (int): `n_{se}`, the number of selected features.
- `selected` (string, but actually a list of strings): Names of the selected features.
- `num_variables` (int): `n`, the total number of features in the dataset.
- `num_constrained_variables` (int): `n_{cf}`, the number of features involved in constraints.
- `num_unique_constrained_variables` (int): `n_{ucf}`, the number of unique features involved in constraints.
- `num_constraints` (int): `n_{co}`, the number of constraints.
- `frac_solutions` (float): `n_{so}^{norm}`, the number of feature sets that satisfy the constraints, relative to the total number of feature sets.
- `linear-regression_train_r2` (float): `R^2` (coefficient of determination) for linear-regression models, trained with the selected features, predicting on the training set.
- `linear-regression_test_r2` (float): `R^2` (coefficient of determination) for linear-regression models, trained with the selected features, predicting on the test set.
- `regression-tree_train_r2` (float): `R^2` (coefficient of determination) for regression-tree models, trained with the selected features, predicting on the training set.
- `regression-tree_test_r2` (float): `R^2` (coefficient of determination) for regression-tree models, trained with the selected features, predicting on the test set.
- `xgb-linear_train_r2` (float): `R^2` (coefficient of determination) for linear XGBoost models, trained with the selected features, predicting on the training set.
- `xgb-linear_test_r2` (float): `R^2` (coefficient of determination) for linear XGBoost models, trained with the selected features, predicting on the test set.
- `xgb-tree_train_r2` (float): `R^2` (coefficient of determination) for tree-based XGBoost models, trained with the selected features, predicting on the training set.
- `xgb-tree_test_r2` (float): `R^2` (coefficient of determination) for tree-based XGBoost models, trained with the selected features, predicting on the test set.
- `evaluation_time` (float): Runtime (in seconds) for evaluating one set of constraints.
- `split_idx` (int): Index of the cross-validation fold.
- `quality_name` (string): Measure for feature quality (absolute correlation or mutual information).
- `constraint_name` (string): Name of the constraint type (see paper).
- `dataset_name` (string): Name of the dataset.
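As a minimal sketch of working with these columns (using made-up toy rows and only two of the columns above), one could average a model's train and test `R^2` over the cross-validation folds like this:

```python
import io

import pandas as pd

# Toy stand-in for ms-results/results.csv with two folds; the real file
# contains all columns listed above and many more rows.
csv_content = io.StringIO(
    "split_idx,xgb-tree_train_r2,xgb-tree_test_r2\n"
    "0,0.95,0.80\n"
    "1,0.93,0.78\n"
)
results = pd.read_csv(csv_content)
# Mean R^2 per column over the cross-validation folds:
fold_means = results[["xgb-tree_train_r2", "xgb-tree_test_r2"]].mean()
print(fold_means)
```

The same pattern applies to the other `*_train_r2` / `*_test_r2` column pairs.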
## `openml/`
The input data for the study with synthetic constraints (`syn_pipeline.py`).
Output of the script `prepare_openml_datasets.py`.
We downloaded 35 datasets from [OpenML](https://www.openml.org) and removed non-numeric columns.
Also, we separated the feature part (`*_X.csv`) and the target part (`*_y.csv`) of each dataset.
`_data_overview.csv` contains meta-data for the datasets, including dataset id, dataset version, and uploader.
**Licensing**
Please consult each dataset's website on [OpenML](https://www.openml.org) for licensing information and citation requests.
According to OpenML's [terms](https://www.openml.org/terms), OpenML datasets fall under the [CC-BY](https://creativecommons.org/licenses/by/4.0/) license.
The datasets used in our study were uploaded by:
- Jan van Rijn (user id: 1)
- Joaquin Vanschoren (user id: 2)
- Rafael Gomes Mantovani (user id: 64)
- Tobias Kuehn (user id: 94)
- Richard Ooms (user id: 8684)
- R P (user id: 15317)
See `_data_overview.csv` to match each dataset to its uploader.
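As a minimal sketch of such a lookup (with toy rows; the actual column names in `_data_overview.csv` may differ from the illustrative `name` and `uploader` used here), one could retrieve a dataset's uploader id like this:

```python
import io

import pandas as pd

# Toy stand-in for openml/_data_overview.csv; column names are illustrative.
overview_file = io.StringIO(
    "name,data_id,version,uploader\n"
    "credit-g,31,1,1\n"
    "semeion,1501,1,2\n"
)
meta = pd.read_csv(overview_file)
# Uploader (OpenML user id) of one particular dataset:
uploader = int(meta.loc[meta["name"] == "credit-g", "uploader"].iloc[0])
print(uploader)  # 1
```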
## `openml-results/`
Result files for the study with synthetic constraints.
Output of the script `syn_pipeline.py`, input to the script `syn_evaluation.py`.
One result file for each combination of the 10 constraint generators and the 35 datasets, plus one overall (merged) file, `results.csv`.
The columns of the result files are those of `ms-results/results.csv`, minus `selected` and `evaluation_time`;
see above for detailed descriptions.
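Since the overall `results.csv` is simply the row-wise concatenation of the per-combination files, it can be recreated as follows (a sketch using toy files written to a temporary directory in place of the real `openml-results/` layout):

```python
import pathlib
import tempfile

import pandas as pd

# Toy stand-in for openml-results/: two per-combination files.
tmp_dir = pathlib.Path(tempfile.mkdtemp())
pd.DataFrame({"dataset_name": ["credit-g"], "num_selected": [3]}).to_csv(
    tmp_dir / "generator1_credit-g.csv", index=False)
pd.DataFrame({"dataset_name": ["semeion"], "num_selected": [5]}).to_csv(
    tmp_dir / "generator1_semeion.csv", index=False)

# Merge all per-combination files, skipping the overall file if present:
files = sorted(p for p in tmp_dir.glob("*.csv") if p.name != "results.csv")
merged = pd.concat((pd.read_csv(p) for p in files), ignore_index=True)
print(len(merged))  # 2
```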