Default XGBoost with default features#
Feature set description#
What features are used?#
We use all features that are available in the dataset, this includes:
statistics of persistence diagrams
histograms of persistence diagrams
average minimum distance
geometric properties (surface area, density)
This feature set is high-dimensional (2387) features.
Why are the features informative?#
One would expect that the features are redundant, but informative. From chemical insight, key properties for the Henry coefficient are the porosity and the chemistry. Both scopes are captured with the feature set.
However, the default set also includes the unit cell volume, which does not have a relation with the Henry coefficient.
Why do the features not leak information?#
We do not expect direct leakage (if the energy grid histogram or even the Henry coefficient were used as features this would be a clear case of leakage).
Will the features be accessible in real-world test applications?#
All features can be computed from the crystal structure.
Describe preprocessing steps and how data leakage is avoided#
We did not preprocess the data.
Describe the feature selection steps and how data leakage is avoided#
We did not perform feature selection.
Describe the model selection steps and how data leakage is avoided#
We did not perform model selection.
Regression Metrics: default-xgboost-default-feat RdedeStm-20220809154539