# Datasets

## Structure datasets

CoRE Dataset.

class CoREDataset(version='v0.0.1', drop_basename_duplicates=True, drop_graph_duplicates=True, subset=None, drop_nan=True)

Dataset of gas uptake related features for a subset of CoRE MOFs.

The labels were computed by Moosavi et al. (2020) [Moosavi2020]. The raw labels and structures can also be accessed on MaterialsCloud.

To reduce the risk of data leakage, we (by default) keep only one representative structure per “base refcode” (i.e. the first six letters of a refcode). For instance, the base refcode for IGAHED001 is IGAHED. Structures with the same base refcode but different refcodes are often different refinements or measurements at different temperatures, and hence chemically quite similar. For instance, the base refcode UMODEH would appear 21 times, KEDJAG 17 times, and UMOYOM 17 times in the CoRE dataset used by Moosavi et al. Additionally, we (by default) keep only one structure per “structure hash”, which is an approximate graph-isomorphism check that assumes the VESTA bond thresholds for the derivation of the structure graph (e.g. the structure graph of ULOMAL occurs 59 times in the CoRE database used by Moosavi et al.).
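The base-refcode filter described above can be illustrated with a minimal sketch. It is not the library's implementation: the helper names are hypothetical, and it assumes the base refcode is simply the leading six characters, in line with the IGAHED001 → IGAHED example.

```python
from typing import Dict, Iterable, List


def base_refcode(refcode: str, n_letters: int = 6) -> str:
    """Return the leading letters of a refcode (assumption: six characters)."""
    return refcode[:n_letters].upper()


def keep_one_per_base_refcode(refcodes: Iterable[str]) -> List[str]:
    """Keep the first refcode seen for every base refcode, preserving order."""
    representatives: Dict[str, str] = {}
    for refcode in refcodes:
        # setdefault only stores the refcode if the base refcode is new.
        representatives.setdefault(base_refcode(refcode), refcode)
    return list(representatives.values())


# Hypothetical list of refcodes with repeated base refcodes.
refcodes = ["IGAHED", "IGAHED001", "UMODEH", "UMODEH01", "KEDJAG"]
unique = keep_one_per_base_refcode(refcodes)
print(unique)
```

Which representative is kept (here, the first one encountered) is a design choice of this sketch, not something the documentation specifies.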

Warning

Even though we performed some basic sanity checks, there are currently still some structures that might not be chemically reasonable. Also, even though we only keep one structure per base refcode, there is still potential for data leakage. We urge users to still drop duplicates (or close neighbors) after featurization.

If this set is used as test set, make sure to drop all overlapping entries in your training set.

The years refer to the publication dates of the paper crossreferenced in the CSD entry of the structure. We excluded structures that are not deposited in the CSD.
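Publication years make a time-based split possible. The following is a minimal sketch under stated assumptions: the `years` list and the cutoff are hypothetical, and the function is not part of the library. The resulting index lists are the kind of `Collection[int]` that `get_subset` accepts.

```python
from typing import List, Sequence, Tuple


def time_based_split(years: Sequence[int], cutoff_year: int) -> Tuple[List[int], List[int]]:
    """Return (train, test) indices: train up to and including the cutoff, test after it."""
    train = [i for i, year in enumerate(years) if year <= cutoff_year]
    test = [i for i, year in enumerate(years) if year > cutoff_year]
    return train, test


# Hypothetical publication years, one per structure.
years = [2009, 2015, 2012, 2018, 2016]
train_idx, test_idx = time_based_split(years, cutoff_year=2015)
print(train_idx, test_idx)
```

Training on older structures and testing on newer ones mimics how a model would be used on structures published after it was trained.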

The available labels are:

• ‘pure_CO2_kH’: Henry coefficient of CO2 obtained by Widom method in mol kg-1 Pa-1

• ‘pure_CO2_widomHOA’: Heat of adsorption of CO2 obtained by Widom method in

• ‘pure_methane_kH’: Henry coefficient of methane obtained by Widom method in mol kg-1 Pa-1

• ‘pure_methane_widomHOA’: Heat of adsorption of methane obtained by Widom method

• ‘pure_uptake_CO2_298.00_15000’: Pure CO2 uptake at 298.00 K and 15000 Pa in mol kg-1

• ‘pure_uptake_CO2_298.00_1600000’: Pure CO2 uptake at 298.00 K and 1600000 Pa in mol kg-1

• ‘pure_uptake_methane_298.00_580000’: Pure methane uptake at 298.00 K and 580000 Pa in mol kg-1

• ‘pure_uptake_methane_298.00_6500000’: Pure methane uptake at 298.00 K and 6500000 Pa in mol kg-1

• ‘logKH_CO2’: Logarithm of Henry coefficient of CO2 obtained by Widom method

• ‘logKH_CH4’: Logarithm of Henry coefficient of methane obtained by Widom method

• ‘CH4DC’: CH4 deliverable capacity in vSTP/v

• ‘CH4HPSTP’: CH4 high pressure uptake in standard temperature and pressure in vSTP/v

• ‘CH4LPSTP’: CH4 low pressure uptake in standard temperature and pressure in vSTP/v

References

[Moosavi2020]

Moosavi, S. M.; Nandy, A.; Jablonka, K. M.; Ongari, D.; Janet, J. P.; Boyd, P. G.; Lee, Y.; Smit, B.; Kulik, H. J. Understanding the Diversity of the Metal-Organic Framework Ecosystem. Nature Communications 2020, 11 (1), 4068. https://doi.org/10.1038/s41467-020-17755-8.

Construct an instance of the CoRE dataset.

Parameters:
• version (str) – version number to use. Defaults to “v0.0.1”.

• drop_basename_duplicates (bool) – If True, keep only one structure per CSD basename. Defaults to True.

• drop_graph_duplicates (bool) – If True, keep only one structure per decorated graph hash. Defaults to True.

• subset (Collection[int], optional) – indices of the structures to include. Defaults to None.

• drop_nan (bool) – If True, drop rows with NaN values in features or hashes. Defaults to True.

Raises:

ValueError – If the provided version number is not available.

get_subset(indices)

Get a subset of the dataset.

Parameters:

indices (Collection[int]) – indices of the structures to include.

Returns:

a new dataset containing only the structures specified by the indices.

Return type:

AbstractStructureDataset

Subset of the QMOF dataset.

class QMOFDataset(version='v0.0.1', flavor='all', drop_basename_duplicates=True, drop_graph_duplicates=True, subset=None, drop_nan=False)

Exposes the QMOF dataset by Rosen et al. [Rosen2021] [Rosen2022].

Currently based on v14 of the QMOF dataset.

To reduce the risk of data leakage, we (by default) keep only one representative structure per “base refcode” (i.e. the first six letters of a refcode). For instance, the base refcode for IGAHED001 is IGAHED. Structures with the same base refcode but different refcodes are often different refinements or measurements at different temperatures, and hence chemically quite similar. For instance, in the QMOF dataset the base refcode BOJKAM appears four times. Additionally, we (by default) keep only one structure per “structure hash”, which is an approximate graph-isomorphism check that assumes the VESTA bond thresholds for the derivation of the structure graph.

Note that Rosen et al. already performed some deduplication using the pymatgen StructureMatcher. Our deduplication is somewhat more aggressive and might be too aggressive in some cases.

Warning

Even though we performed some basic sanity checks and Rosen et al. included checks to ensure high-fidelity structures, there might still be some structures that are not chemically reasonable. Also, even though we only keep one structure per base refcode, there is still potential for data leakage. We urge users to still drop duplicates (or close neighbors) after featurization.

This dataset is available in different flavors:

• "all": the full dataset, all original QMOF structures for which we could compute features and hashes

• "csd": the subset which comes from the CSD and for which we could retrieve publication years.

• "gcmc": the subset for which we performed grand canonical Monte Carlo simulations

• "gcmc-csd": the subset for which we performed grand canonical Monte Carlo simulations and for which we could retrieve publication years.

Currently, we expose the following labels:

• outputs.pbe.energy_total

• outputs.pbe.energy_vdw

• outputs.pbe.energy_elec

• outputs.pbe.net_magmom

• outputs.pbe.bandgap

• outputs.pbe.cbm

• outputs.pbe.vbm

• outputs.pbe.directgap

• outputs.pbe.bandgap_spins

• outputs.pbe.cbm_spins

• outputs.pbe.vbm_spins

• outputs.pbe.directgap_spins

• outputs.hle17.energy_total

• outputs.hle17.energy_vdw

• outputs.hle17.energy_elec

• outputs.hle17.net_magmom

• outputs.hle17.bandgap

• outputs.hle17.cbm

• outputs.hle17.vbm

• outputs.hle17.directgap

• outputs.hle17.bandgap_spins

• outputs.hle17.cbm_spins

• outputs.hle17.vbm_spins

• outputs.hle17.directgap_spins

• outputs.hse06_10hf.energy_total

• outputs.hse06_10hf.energy_vdw

• outputs.hse06_10hf.energy_elec

• outputs.hse06_10hf.net_magmom

• outputs.hse06_10hf.bandgap

• outputs.hse06_10hf.cbm

• outputs.hse06_10hf.vbm

• outputs.hse06_10hf.directgap

• outputs.hse06_10hf.bandgap_spins

• outputs.hse06_10hf.cbm_spins

• outputs.hse06_10hf.vbm_spins

• outputs.hse06_10hf.directgap_spins

• outputs.hse06.energy_total

• outputs.hse06.energy_vdw

• outputs.hse06.energy_elec

• outputs.hse06.net_magmom

• outputs.hse06.bandgap

• outputs.hse06.cbm

• outputs.hse06.vbm

• outputs.hse06.directgap

• outputs.hse06.bandgap_spins

• outputs.hse06.cbm_spins

• outputs.hse06.vbm_spins

• outputs.hse06.directgap_spins

• outputs.CO2_Henry_coefficient

• outputs.N2_Henry_coefficient

• outputs.CO2_parasitic_energy_(coal)

• outputs.Gravimetric_working_capacity_(coal)

• outputs.Volumetric_working_capacity_(coal)

• outputs.CO2_parasitic_energy_(nat_gas)

• outputs.Gravimetric_working_capacity_(nat_gas)

• outputs.Volumetric_working_capacity_(nat_gas)

• outputs.Final_CO2_purity_(nat_gas)

• outputs.CH4_Henry_coefficient

• outputs.Working_capacity_vol_(58-65bar–298K)

• outputs.Working_capacity_mol_(58-65bar–298K)

• outputs.Working_capacity_fract_(58-65bar–298K)

• outputs.Working_capacity_wt%_(58-65bar–298K)

• outputs.O2_Henry_coefficient

• outputs.Working_capacity_vol_(5-140bar–298K)

• outputs.Working_capacity_mol_(5-140bar–298K)

• outputs.Working_capacity_fract_(5-140bar–298K)

• outputs.Working_capacity_wt%_(5-140bar–298K)

• outputs.Xe_Henry_coefficient

• outputs.Kr_Henry_coefficient

• outputs.Xe–Kr_selectivity__at__298K

• outputs.Working_capacity_g–L_(5-100bar–298-198K)

• outputs.Working_capacity_g–L_(5-100bar–77K)

• outputs.Working_capacity_g–L_(1-100bar–77K)

• outputs.Working_capacity_wt%_(5-100bar–298-198K)

• outputs.Working_capacity_wt%_(5-100bar–77K)

• outputs.Working_capacity_wt%_(1-100bar–77K)

• outputs.H2S_Henry_coefficient

• outputs.H2O_Henry_coefficient

• outputs.H2S–H2O_selectivity__at__298K

• outputs.CH4–N2_selectivity__at__298K

Note that many of the gas adsorption data are numpy.nan because the pores are not accessible to the guest molecules. Depending on your application you might want to fill them with zeros or drop them.
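The two options mentioned above (filling with zeros or dropping) can be sketched in plain Python. The label values are hypothetical and the helper functions are illustrative only, not part of the library.

```python
import math
from typing import List, Sequence, Tuple


def fill_nan_with_zero(values: Sequence[float]) -> List[float]:
    """Replace every NaN by 0.0, i.e. treat inaccessible pores as zero uptake."""
    return [0.0 if math.isnan(value) else value for value in values]


def drop_nan(values: Sequence[float]) -> Tuple[List[int], List[float]]:
    """Return the indices and values of all entries that are not NaN."""
    kept = [(index, value) for index, value in enumerate(values) if not math.isnan(value)]
    return [index for index, _ in kept], [value for _, value in kept]


# Hypothetical label column; NaN marks a structure without a value.
labels = [1.5, float("nan"), 0.25, float("nan")]

filled = fill_nan_with_zero(labels)
kept_indices, kept_values = drop_nan(labels)
print(filled)
print(kept_indices, kept_values)
```

Filling with zeros keeps every structure but encodes a physical assumption (no uptake), while dropping shrinks the dataset. The indices returned by `drop_nan` could be passed to `get_subset` to keep structures and labels aligned.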

Warning

The class will load almost 1GB of data into memory.

Warning

By default, the values will be sorted by the PBE total energy.
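Because of this ordering, a contiguous slice of indices would give an energy-biased sample. A minimal sketch of drawing reproducible random indices instead, which could then be passed as `subset` or to `get_subset`; the dataset size used here is a hypothetical placeholder, not the real number of entries.

```python
import random
from typing import List


def sample_indices(n_total: int, n_sample: int, seed: int = 0) -> List[int]:
    """Draw n_sample distinct indices from range(n_total), reproducibly."""
    rng = random.Random(seed)  # local generator, does not touch global state
    return sorted(rng.sample(range(n_total), n_sample))


# n_total is a placeholder; in practice it would be the length of the dataset.
indices = sample_indices(n_total=1000, n_sample=5, seed=42)
print(indices)

# Hypothetical usage (not executed here):
# dataset = QMOFDataset(subset=indices)
```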

References

Construct an instance of the QMOF dataset.

Parameters:
• version (str) – version number to use. Defaults to “v0.0.1”.

• flavor (str) – flavor of the dataset to use. Accepted values are “all”, “csd”, “gcmc”, and “csd-gcmc”. Defaults to “all”.

• drop_basename_duplicates (bool) – If True, keep only one structure per CSD basename. Defaults to True.

• drop_graph_duplicates (bool) – If True, keep only one structure per decorated graph hash. Defaults to True.

• subset (Optional[Collection[int]]) – indices of the structures to include. This is useful for subsampling the dataset. Defaults to None.

• drop_nan (bool) – If True, drop rows with NaN values in features or hashes. Defaults to False.

Raises:

ValueError – If the provided version number is not available.

get_subset(indices)

Get a subset of the dataset.

Parameters:

indices (Collection[int]) – indices of the structures to include.

Returns:

a new dataset containing only the structures specified by the indices.

Return type:

AbstractStructureDataset

Structures from the Boyd-Woo database and labels from Moosavi et al.

class BWDataset(version='v0.0.1', drop_graph_duplicates=True, subset=None)

Exposes the BW20K dataset used in [Moosavi2020].

The raw labels and structures can also be accessed on MaterialsCloud.

It is a subset of the BW database [Boyd2019] [Boyd2016] with labels computed by Moosavi et al. Those labels deviate in value and computational approach from the original labels in [Boyd2019] but are consistent with the labels for the other databases in [Moosavi2020].

The available labels are:
• ‘pure_CO2_kH’: Henry coefficient of CO2 obtained by Widom method in mol kg-1 Pa-1

• ‘pure_CO2_widomHOA’: Heat of adsorption of CO2 obtained by Widom method in

• ‘pure_methane_kH’: Henry coefficient of methane obtained by Widom method in mol kg-1 Pa-1

• ‘pure_methane_widomHOA’: Heat of adsorption of methane obtained by Widom method

• ‘pure_uptake_CO2_298.00_15000’: Pure CO2 uptake at 298.00 K and 15000 Pa in mol kg-1

• ‘pure_uptake_CO2_298.00_1600000’: Pure CO2 uptake at 298.00 K and 1600000 Pa in mol kg-1

• ‘pure_uptake_methane_298.00_580000’: Pure methane uptake at 298.00 K and 580000 Pa in mol kg-1

• ‘pure_uptake_methane_298.00_6500000’: Pure methane uptake at 298.00 K and 6500000 Pa in mol kg-1

• ‘logKH_CO2’: Logarithm of Henry coefficient of CO2 obtained by Widom method

• ‘logKH_CH4’: Logarithm of Henry coefficient of methane obtained by Widom method

• ‘CH4DC’: CH4 deliverable capacity in vSTP/v

• ‘CH4HPSTP’: CH4 high pressure uptake in standard temperature and pressure in vSTP/v

• ‘CH4LPSTP’: CH4 low pressure uptake in standard temperature and pressure in vSTP/v

Note

The BW structures are hypothetical MOFs, therefore the following caveats apply:

• It is well known that the data distribution can be quite different from experimental structures

• The structures were only optimized using the UFF force field [UFF]

• A time-based split cannot be used for hypothetical structures

This dataset exposes information about the building blocks of the MOFs. You might find this useful for grouped cross-validation (as MOFs with the same building blocks and/or net are not really independent).

You can find this information in the info.rcsr_code, info.metal_bb, info.organic_bb, and info.functional_group columns.

Warning

Danger of data leakage

Cross validation for MOFs with same building-blocks and/or net is notoriously difficult. Since all combinations of building-blocks and/or net are considered, it is not trivial to find completely independent groups.
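A grouped split keeps all structures that share a group label (for example, the same net) in the same fold. The following is a minimal, dependency-free sketch with hypothetical group labels; it is not the library's method. scikit-learn's `GroupKFold` offers comparable functionality.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def grouped_folds(groups: Sequence[Hashable], n_folds: int) -> List[Tuple[List[int], List[int]]]:
    """Build (train, test) index pairs such that no group spans train and test."""
    members: Dict[Hashable, List[int]] = defaultdict(list)
    for index, group in enumerate(groups):
        members[group].append(index)

    # Greedy balancing: place the largest groups first into the smallest fold.
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for group in sorted(members, key=lambda g: len(members[g]), reverse=True):
        min(folds, key=len).extend(members[group])

    splits = []
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(groups)) if i not in held_out]
        splits.append((train, sorted(test)))
    return splits


# Hypothetical net labels, one per structure.
nets = ["pcu", "pcu", "dia", "dia", "fcu", "pcu"]
for train, test in grouped_folds(nets, n_folds=3):
    print(train, test)
```

Grouping by a single column only removes leakage along that column: two structures in different folds can still share a metal or organic building block, which is the difficulty the warning above describes.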

References

Construct an instance of the BW dataset.

Parameters:
• version (str) – version number to use. Defaults to “v0.0.1”.

• drop_graph_duplicates (bool) – If True, keep only one structure per decorated graph hash. Defaults to True.

• subset (Collection[int], optional) – indices of the structures to include. Defaults to None.

Raises:

ValueError – If the provided version number is not available.

get_subset(indices)

Get a subset of the dataset.

Parameters:

indices (Collection[int]) – indices of the structures to include.

Returns:

a new dataset containing only the structures specified by the indices.

Return type:

AbstractStructureDataset

Structures from the ARABG database and labels from Moosavi et al.

class ARABGDataset(version='v0.0.1', drop_graph_duplicates=True, subset=None)

Exposes the ARABG dataset used in [Moosavi2020].

The raw labels and structures can also be accessed on MaterialsCloud.

It is a subset of the ARABG database [Anderson2018] with labels computed by Moosavi et al. Those labels deviate in value and computational approach from the original labels in [Anderson2018] but are consistent with the labels for the other databases in [Moosavi2020].

The available labels are:
• ‘pure_CO2_kH’: Henry coefficient of CO2 obtained by Widom method in mol kg-1 Pa-1

• ‘pure_CO2_widomHOA’: Heat of adsorption of CO2 obtained by Widom method in

• ‘pure_methane_kH’: Henry coefficient of methane obtained by Widom method in mol kg-1 Pa-1

• ‘pure_methane_widomHOA’: Heat of adsorption of methane obtained by Widom method

• ‘pure_uptake_CO2_298.00_15000’: Pure CO2 uptake at 298.00 K and 15000 Pa in mol kg-1

• ‘pure_uptake_CO2_298.00_1600000’: Pure CO2 uptake at 298.00 K and 1600000 Pa in mol kg-1

• ‘pure_uptake_methane_298.00_580000’: Pure methane uptake at 298.00 K and 580000 Pa in mol kg-1

• ‘pure_uptake_methane_298.00_6500000’: Pure methane uptake at 298.00 K and 6500000 Pa in mol kg-1

• ‘logKH_CO2’: Logarithm of Henry coefficient of CO2 obtained by Widom method

• ‘logKH_CH4’: Logarithm of Henry coefficient of methane obtained by Widom method

• ‘CH4DC’: CH4 deliverable capacity in vSTP/v

• ‘CH4HPSTP’: CH4 high pressure uptake in standard temperature and pressure in vSTP/v

• ‘CH4LPSTP’: CH4 low pressure uptake in standard temperature and pressure in vSTP/v

Note

The ARABG structures are hypothetical MOFs, therefore the following caveats apply:

• It is well known that the data distribution can be quite different from experimental structures

• The structures were only optimized using the UFF force field [UFF]

• A time-based split cannot be used for hypothetical structures

This dataset exposes information about the building blocks of the MOFs. You might find this useful for grouped cross-validation (as MOFs with the same building blocks and/or net are not really independent).

You can find this information in the info.rcsr_code, info.metal_bb, info.organic_bb, and info.functional_group columns.

Warning

Danger of data leakage

Cross validation for MOFs with same building-blocks and/or net is notoriously difficult. Since all combinations of building-blocks and/or net are considered, it is not trivial to find completely independent groups.

References

[Anderson2018]

Anderson, R.; Rodgers, J.; Argueta, E.; Biong, A.; Gómez-Gualdrón, D. A. Role of Pore Chemistry and Topology in the CO2 Capture Capabilities of MOFs: From Molecular Simulation to Machine Learning. Chemistry of Materials 2018, 30 (18), 6325–6337. https://doi.org/10.1021/acs.chemmater.8b02257.

Construct an instance of the ARABG dataset.

Parameters:
• version (str) – version number to use. Defaults to “v0.0.1”.

• drop_graph_duplicates (bool) – If True, keep only one structure per decorated graph hash. Defaults to True.

• subset (Collection[int], optional) – indices of the structures to include. Defaults to None.

Raises:

ValueError – If the provided version number is not available.

get_subset(indices)

Get a subset of the dataset.

Parameters:

indices (Collection[int]) – indices of the structures to include.

Returns:

a new dataset containing only the structures specified by the indices.

Return type:

AbstractStructureDataset

Structures from the ARC-MOF dataset.

class ARCMOFDataset(version='v0.0.1', drop_graph_duplicates=True, subset=None)

Implements access to a subset of structures and labels of the ARC-MOF dataset [Burner2022].

The subset consists of the structures for which the authors reported process properties and for which we could compute graph hashes and features.

Warning

ARC-MOF is “a database of ~280,000 MOFs which have been either experimentally characterized or computationally generated, spanning all publicly available MOF databases” [Burner2022].

Therefore, there will be significant overlap with the other datasets.
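When combining this dataset with another one (for example, training on one and testing on the other), overlapping structures should be removed from the training side. A minimal sketch, assuming each structure is identified by a hashable key such as a structure-graph hash; the hash strings below are made up for illustration.

```python
from typing import Hashable, List, Sequence


def drop_overlap(train_hashes: Sequence[Hashable], test_hashes: Sequence[Hashable]) -> List[int]:
    """Return the training indices whose hash does not occur in the test set."""
    test_set = set(test_hashes)  # set lookup keeps the filter linear in size
    return [i for i, h in enumerate(train_hashes) if h not in test_set]


# Hypothetical hashes for a training and a test dataset.
train_hashes = ["a1", "b2", "c3", "d4"]
test_hashes = ["c3", "e5", "a1"]

keep = drop_overlap(train_hashes, test_hashes)
print(keep)
```

The returned indices could be passed to `get_subset` on the training dataset. Exact hash matching only catches identical graphs, so near-duplicates may still remain.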

References

Construct an instance of the ARC-MOF dataset.

Parameters:
• version (str) – version number to use. Defaults to “v0.0.1”.

• drop_graph_duplicates (bool) – If True, keep only one structure per decorated graph hash. Defaults to True.

• subset (Collection[int], optional) – indices of the structures to include. Defaults to None.

Raises:

ValueError – If the provided version number is not available.

get_subset(indices)

Get a subset of the dataset.

Parameters:

indices (Collection[int]) – indices of the structures to include.

Returns:

a new dataset containing only the structures specified by the indices.

Return type:

AbstractStructureDataset

Thermal Stability Dataset.

class ThermalStabilityDataset(version='v0.0.1', drop_basename_duplicates=True, drop_graph_duplicates=True, subset=None, drop_nan=False)

Thermal stability for a subset of CoRE MOFs.

Reproduced from [Nandy2022]. Nandy et al. (2022) digitized traces from thermogravimetric analysis. The decomposition temperature they determined in this way is reported in outputs.assigned_T_decomp.

To reduce the risk of data leakage, we (by default) keep only one representative structure per “base refcode” (i.e. the first six letters of a refcode). For instance, the base refcode for IGAHED001 is IGAHED. Structures with the same base refcode but different refcodes are often different refinements or measurements at different temperatures, and hence chemically quite similar. For instance, the base refcode UMODEH would appear 21 times, KEDJAG 17 times, and UMOYOM 17 times in the CoRE dataset used by Moosavi et al. Additionally, we (by default) keep only one structure per “structure hash”, which is an approximate graph-isomorphism check that assumes the VESTA bond thresholds for the derivation of the structure graph (e.g. the structure graph of ULOMAL occurs 59 times in the CoRE database used by Moosavi et al.).

The years refer to the publication dates of the paper crossreferenced in the CSD entry of the structure.

The available labels are:

• outputs.assigned_T_decomp: Decomposition temperature in Kelvin.

References