Datasets

Structure datasets

CoRE Dataset.

class CoREDataset(version='v0.0.1', drop_basename_duplicates=True, drop_graph_duplicates=True, subset=None, drop_nan=True)

Dataset of gas uptake related features for a subset of CoRE MOFs.

The labels were computed by Moosavi et al. (2020) [Moosavi2020]. The raw labels and structures can also be accessed on MaterialsCloud.

To reduce the risk of data leakage, we (by default) keep only one representative structure per “base refcode” (i.e., the first six letters of a refcode). For instance, the base refcode for IGAHED001 is IGAHED. Structures with the same base refcode but different refcodes are often different refinements or measurements at different temperatures, and hence chemically quite similar. For instance, the base refcode UMODEH appears 21 times, KEDJAG 17 times, and UMOYOM 17 times in the CoRE dataset used by Moosavi et al. Additionally, we (by default) keep only one structure per “structure hash”, which is an approximate graph-isomorphism check that assumes the VESTA bond thresholds for deriving the structure graph (e.g., the structure graph of ULOMAL occurs 59 times in the CoRE database used by Moosavi et al.).
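The base-refcode filter described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper names `base_refcode` and `keep_one_per_base_refcode` are hypothetical, and the sketch assumes the base refcode is simply the first six characters of the refcode (as in IGAHED001 → IGAHED).

```python
def base_refcode(refcode: str) -> str:
    """Return the six-letter base of a CSD-style refcode (assumed convention)."""
    return refcode[:6].upper()


def keep_one_per_base_refcode(refcodes):
    """Keep the first refcode seen for each base refcode, preserving order."""
    seen = set()
    kept = []
    for refcode in refcodes:
        base = base_refcode(refcode)
        if base not in seen:
            seen.add(base)
            kept.append(refcode)
    return kept


# Hypothetical list of refcodes with repeated base refcodes.
names = ["IGAHED001", "IGAHED", "UMODEH", "UMODEH02", "KEDJAG"]
kept_names = keep_one_per_base_refcode(names)
print(kept_names)
```

Which representative is kept (here simply the first one encountered) is a design choice of this sketch; the dataset classes may pick representatives differently.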

Warning

Even though we performed some basic sanity checks, some structures might still not be chemically reasonable. Also, even though we only keep one structure per base refcode, there is still potential for data leakage. We urge users to still drop duplicates (or close neighbors) after featurization.

If this set is used as a test set, make sure to drop all overlapping entries from your training set.
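One way to remove overlapping entries is to compare an identifier that is shared between both sets, for example a base refcode or a graph hash. The following is a minimal sketch under that assumption; the function name `drop_overlap` and the key values are hypothetical placeholders.

```python
def drop_overlap(train_keys, test_keys):
    """Return the training keys that do not occur in the test set."""
    test_set = set(test_keys)
    return [key for key in train_keys if key not in test_set]


# Placeholder identifiers standing in for graph hashes or base refcodes.
train = ["hash_a", "hash_b", "hash_c", "hash_d"]
test = ["hash_c", "hash_x"]
clean_train = drop_overlap(train, test)
print(clean_train)
```

Exact matching only catches identical keys. Close neighbors in feature space would need an additional, distance-based check after featurization.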

The years refer to the publication dates of the papers cross-referenced in the CSD entries of the structures. We excluded structures that are not deposited in the CSD.

The available labels are:

‘pure_CO2_kH’: Henry coefficient of CO2 obtained by Widom method in mol kg-1 Pa-1

‘pure_CO2_widomHOA’: Heat of adsorption of CO2 obtained by Widom method in

‘pure_methane_kH’: Henry coefficient of methane obtained by Widom method in mol kg-1 Pa-1

‘pure_methane_widomHOA’: Heat of adsorption of methane obtained by Widom method

‘pure_uptake_CO2_298.00_15000’: Pure CO2 uptake at 298.00 K and 15000 Pa in mol kg-1

‘pure_uptake_CO2_298.00_1600000’: Pure CO2 uptake at 298.00 K and 1600000 Pa in mol kg-1

‘pure_uptake_methane_298.00_580000’: Pure methane uptake at 298.00 K and 580000 Pa in mol kg-1

‘pure_uptake_methane_298.00_6500000’: Pure methane uptake at 298.00 K and 6500000 Pa in mol kg-1

‘logKH_CO2’: Logarithm of Henry coefficient of CO2 obtained by Widom method

‘logKH_CH4’: Logarithm of Henry coefficient of methane obtained by Widom method

‘CH4DC’: CH4 deliverable capacity in vSTP/v

‘CH4HPSTP’: CH4 high pressure uptake in standard temperature and pressure in vSTP/v

‘CH4LPSTP’: CH4 low pressure uptake in standard temperature and pressure in vSTP/v

References

[Moosavi2020]

Moosavi, S. M.; Nandy, A.; Jablonka, K. M.; Ongari, D.; Janet, J. P.; Boyd, P. G.; Lee, Y.; Smit, B.; Kulik, H. J. Understanding the Diversity of the Metal-Organic Framework Ecosystem. Nature Communications 2020, 11 (1), 4068. https://doi.org/10.1038/s41467-020-17755-8.

Construct an instance of the CoRE dataset.

Parameters:

version (str) – version number to use. Defaults to “v0.0.1”.
drop_basename_duplicates (bool) – If True, keep only one structure per CSD basename. Defaults to True.
drop_graph_duplicates (bool) – If True, keep only one structure per decorated graph hash. Defaults to True.
subset (Collection[int], optional) – indices of the structures to include. Defaults to None.
drop_nan (bool) – If True, drop rows with NaN values in features or hashes. Defaults to True.

Raises:

ValueError – If the provided version number is not available.

get_subset(indices)

Get a subset of the dataset.

Parameters:

indices (Collection[int]) – indices of the structures to include.

Returns:

a new dataset containing only the structures specified by the indices.

Return type:

AbstractStructureDataset
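A hedged usage sketch for the constructor and `get_subset` documented above. The import path `mofdscribe.datasets` is an assumption (it is not stated on this page), and the wrapper function `load_core_subset` is hypothetical. The import is deferred into the function so that defining it has no side effects.

```python
def load_core_subset(indices, version="v0.0.1"):
    """Load the CoRE dataset with default de-duplication and return a subset.

    The import path below is an assumption; adjust it to your installation.
    """
    from mofdscribe.datasets import CoREDataset  # assumed import path

    dataset = CoREDataset(
        version=version,
        drop_basename_duplicates=True,
        drop_graph_duplicates=True,
        drop_nan=True,
    )
    # get_subset returns a new dataset restricted to the given indices.
    return dataset.get_subset(list(indices))
```

Calling `load_core_subset(range(100))` would then return a dataset with the first 100 structures that remain after de-duplication, assuming the requested version is available.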

Subset of the QMOF dataset.

class QMOFDataset(version='v0.0.1', flavor='all', drop_basename_duplicates=True, drop_graph_duplicates=True, subset=None, drop_nan=False)

Exposes the QMOF dataset by Rosen et al. [Rosen2021] [Rosen2022].

Currently based on v14 of the QMOF dataset.

To reduce the risk of data leakage, we (by default) keep only one representative structure per “base refcode” (i.e., the first six letters of a refcode). For instance, the base refcode for IGAHED001 is IGAHED. Structures with the same base refcode but different refcodes are often different refinements or measurements at different temperatures, and hence chemically quite similar. For instance, in the QMOF dataset the base refcode BOJKAM appears four times. Additionally, we (by default) keep only one structure per “structure hash”, which is an approximate graph-isomorphism check that assumes the VESTA bond thresholds for deriving the structure graph.

Note that Rosen et al. already performed some deduplication using the pymatgen StructureMatcher. Our de-duplication is a bit more aggressive, and might be too aggressive in some cases.

Warning

Even though we performed some basic sanity checks and Rosen et al. included checks to ensure high-fidelity structures, there might still be some structures that are not chemically reasonable. Also, even though we only keep one structure per base refcode, there is still potential for data leakage. We urge users to still drop duplicates (or close neighbors) after featurization.

This dataset is available in different flavors:

"all": the full dataset, i.e., all original QMOF structures for which we could compute features and hashes.
"csd": the subset that comes from the CSD and for which we could retrieve publication years.
"gcmc": the subset for which we performed grand canonical Monte Carlo simulations.
"gcmc-csd": the subset for which we performed grand canonical Monte Carlo simulations and for which we could retrieve publication years.

Currently, we expose the following labels:

outputs.pbe.energy_total

outputs.pbe.energy_vdw

outputs.pbe.energy_elec

outputs.pbe.net_magmom

outputs.pbe.bandgap

outputs.pbe.cbm

outputs.pbe.vbm

outputs.pbe.directgap

outputs.pbe.bandgap_spins

outputs.pbe.cbm_spins

outputs.pbe.vbm_spins

outputs.pbe.directgap_spins

outputs.hle17.energy_total

outputs.hle17.energy_vdw

outputs.hle17.energy_elec

outputs.hle17.net_magmom

outputs.hle17.bandgap

outputs.hle17.cbm

outputs.hle17.vbm

outputs.hle17.directgap

outputs.hle17.bandgap_spins

outputs.hle17.cbm_spins

outputs.hle17.vbm_spins

outputs.hle17.directgap_spins

outputs.hse06_10hf.energy_total

outputs.hse06_10hf.energy_vdw

outputs.hse06_10hf.energy_elec

outputs.hse06_10hf.net_magmom

outputs.hse06_10hf.bandgap

outputs.hse06_10hf.cbm

outputs.hse06_10hf.vbm

outputs.hse06_10hf.directgap

outputs.hse06_10hf.bandgap_spins

outputs.hse06_10hf.cbm_spins

outputs.hse06_10hf.vbm_spins

outputs.hse06_10hf.directgap_spins

outputs.hse06.energy_total

outputs.hse06.energy_vdw

outputs.hse06.energy_elec

outputs.hse06.net_magmom

outputs.hse06.bandgap

outputs.hse06.cbm

outputs.hse06.vbm

outputs.hse06.directgap

outputs.hse06.bandgap_spins

outputs.hse06.cbm_spins

outputs.hse06.vbm_spins

outputs.hse06.directgap_spins

outputs.CO2_Henry_coefficient

outputs.CO2_adsorption_energy

outputs.N2_Henry_coefficient

outputs.N2_adsorption_energy

outputs.CO2_parasitic_energy_(coal)

outputs.Gravimetric_working_capacity_(coal)

outputs.Volumetric_working_capacity_(coal)

outputs.CO2_parasitic_energy_(nat_gas)

outputs.Gravimetric_working_capacity_(nat_gas)

outputs.Volumetric_working_capacity_(nat_gas)

outputs.Final_CO2_purity_(nat_gas)

outputs.CH4_Henry_coefficient

outputs.CH4_adsorption_energy

outputs.Enthalphy_of_Adsorption__at__58_bar,_298K

outputs.Enthalphy_of_Adsorption__at__65bar–298K

outputs.Working_capacity_vol_(58-65bar–298K)

outputs.Working_capacity_mol_(58-65bar–298K)

outputs.Working_capacity_fract_(58-65bar–298K)

outputs.Working_capacity_wt%_(58-65bar–298K)

outputs.O2_Henry_coefficient

outputs.O2_adsorption_energy

outputs.Enthalphy_of_Adsorption__at__5_bar,_298K

outputs.Enthalphy_of_Adsorption__at__140bar–298K

outputs.Working_capacity_vol_(5-140bar–298K)

outputs.Working_capacity_mol_(5-140bar–298K)

outputs.Working_capacity_fract_(5-140bar–298K)

outputs.Working_capacity_wt%_(5-140bar–298K)

outputs.Xe_Henry_coefficient

outputs.Xe_adsorption_energy

outputs.Kr_Henry_coefficient

outputs.Kr_adsorption_energy

outputs.Xe–Kr_selectivity__at__298K

outputs.Working_capacity_g–L_(5-100bar–298-198K)

outputs.Working_capacity_g–L_(5-100bar–77K)

outputs.Working_capacity_g–L_(1-100bar–77K)

outputs.Working_capacity_wt%_(5-100bar–298-198K)

outputs.Working_capacity_wt%_(5-100bar–77K)

outputs.Working_capacity_wt%_(1-100bar–77K)

outputs.H2S_Henry_coefficient

outputs.H2S_adsorption_energy

outputs.H2O_Henry_coefficient

outputs.H2O_adsorption_energy

outputs.H2S–H2O_selectivity__at__298K

outputs.CH4–N2_selectivity__at__298K

Note that many of the gas adsorption data are numpy.nan because the pores are not accessible to the guest molecules. Depending on your application you might want to fill them with zeros or drop them.
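The two options mentioned above (filling with zeros or dropping) can be sketched with the standard library alone. The helper names `fill_nan` and `drop_nan_values` are hypothetical and the numbers are placeholders, not values from the dataset.

```python
import math


def fill_nan(values, fill_value=0.0):
    """Replace NaN entries with fill_value."""
    return [fill_value if math.isnan(v) else v for v in values]


def drop_nan_values(values):
    """Remove NaN entries."""
    return [v for v in values if not math.isnan(v)]


# Placeholder label column with missing entries for inaccessible pores.
henry = [1.2e-5, math.nan, 3.4e-6, math.nan]
filled = fill_nan(henry)
dropped = drop_nan_values(henry)
print(filled)
print(dropped)
```

Filling with zero encodes "no uptake" as a physical statement, whereas dropping removes the structure from training. Which is appropriate depends on whether inaccessible structures should be part of the prediction task.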

Warning

The class will load almost 1 GB of data into memory.

Warning

By default, the values will be sorted by the PBE total energy.
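Because the rows are ordered by energy, taking a contiguous slice as a hold-out set would give a biased sample. A minimal sketch of drawing shuffled index sets instead is shown below; the function name `shuffled_split` and its parameters are hypothetical.

```python
import random


def shuffled_split(n_items, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) drawn from a seeded shuffle."""
    indices = list(range(n_items))
    # A dedicated Random instance keeps the split reproducible.
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_items * test_fraction))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = shuffled_split(10, test_fraction=0.3)
print(len(train_idx), len(test_idx))
```

The resulting index lists can be passed to `get_subset` to obtain the corresponding datasets.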

References

[Rosen2021]

Rosen, A. S.; Iyer, S. M.; Ray, D.; Yao, Z.; Aspuru-Guzik, A.; Gagliardi, L.; Notestein, J. M.; Snurr, R. Q. Machine Learning the Quantum-Chemical Properties of Metal–Organic Frameworks for Accelerated Materials Discovery. Matter 2021, 4 (5), 1578–1597.

[Rosen2022]

Rosen, A. S.; Fung, V.; Huck, P.; O’Donnell, C. T.; Horton, M. K.; Truhlar, D. G.; Persson, K. A.; Notestein, J. M.; Snurr, R. Q. High-Throughput Predictions of Metal–Organic Framework Electronic Properties: Theoretical Challenges, Graph Neural Networks, and Data Exploration. npj Computational Materials 2022, 8, 112.

Construct an instance of the QMOF dataset.

Parameters:

version (str) – version number to use. Defaults to “v0.0.1”.
flavor (str) – flavor of the dataset to use. Accepted values are “all”, “csd”, “gcmc”, and “csd-gcmc”. Defaults to “all”.
drop_basename_duplicates (bool) – If True, keep only one structure per CSD basename. Defaults to True.
drop_graph_duplicates (bool) – If True, keep only one structure per decorated graph hash. Defaults to True.
subset (Optional[Collection[int]]) – indices of the structures to include. This is useful for subsampling the dataset. Defaults to None.
drop_nan (bool) – If True, drop rows with NaN values in features or hashes. Defaults to False.

Raises:

ValueError – If the provided version number is not available.

get_subset(indices)

Get a subset of the dataset.

Parameters:

indices (Collection[int]) – indices of the structures to include.

Returns:

a new dataset containing only the structures specified by the indices.

Return type:

AbstractStructureDataset

Structures from the Boyd-Woo database and labels from Moosavi et al.

class BWDataset(version='v0.0.1', drop_graph_duplicates=True, subset=None)

Exposes the BW20K dataset used in [Moosavi2020].

The raw labels and structures can also be accessed on MaterialsCloud.

It is a subset of the BW database [Boyd2019] [Boyd2016] with labels computed by Moosavi et al. Those labels deviate in value and computational approach from the original labels in [Boyd2019] but are consistent with the labels for the other databases in [Moosavi2020].

The available labels are:

‘pure_CO2_kH’: Henry coefficient of CO2 obtained by Widom method in mol kg-1 Pa-1
‘pure_CO2_widomHOA’: Heat of adsorption of CO2 obtained by Widom method in
‘pure_methane_kH’: Henry coefficient of methane obtained by Widom method in mol kg-1 Pa-1
‘pure_methane_widomHOA’: Heat of adsorption of methane obtained by Widom method
‘pure_uptake_CO2_298.00_15000’: Pure CO2 uptake at 298.00 K and 15000 Pa in mol kg-1
‘pure_uptake_CO2_298.00_1600000’: Pure CO2 uptake at 298.00 K and 1600000 Pa in mol kg-1
‘pure_uptake_methane_298.00_580000’: Pure methane uptake at 298.00 K and 580000 Pa in mol kg-1
‘pure_uptake_methane_298.00_6500000’: Pure methane uptake at 298.00 K and 6500000 Pa in mol kg-1
‘logKH_CO2’: Logarithm of Henry coefficient of CO2 obtained by Widom method
‘logKH_CH4’: Logarithm of Henry coefficient of methane obtained by Widom method
‘CH4DC’: CH4 deliverable capacity in vSTP/v
‘CH4HPSTP’: CH4 high pressure uptake in standard temperature and pressure in vSTP/v
‘CH4LPSTP’: CH4 low pressure uptake in standard temperature and pressure in vSTP/v

Note

The BW structures are hypothetical MOFs, therefore the following caveats apply:

It is well known that the data distribution can be quite different from experimental structures
The structures were only optimized using the UFF force field [UFF]
A time-based split cannot be used for hypothetical structures

Information about building blocks

This dataset exposes information about the building blocks of the MOFs. You might find this useful for grouped cross-validation (as MOFs with the same building blocks and/or net are not really independent).

You can find this information in the info.rcsr_code, info.metal_bb, info.organic_bb, and info.functional_group columns.

Warning

Danger of data leakage

Cross-validation for MOFs with the same building blocks and/or net is notoriously difficult. Since all combinations of building blocks and/or nets are considered, it is not trivial to find completely independent groups.
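A grouped split keeps all structures that share a group label in the same fold. The sketch below groups by a single label, such as the net code from info.rcsr_code. The function name `grouped_folds` is hypothetical and the net labels are illustrative only.

```python
from collections import defaultdict


def grouped_folds(groups, n_folds=3):
    """Assign whole groups to folds so that no group is split across folds.

    groups: one group label per structure (e.g. a net or building-block code).
    Returns a list of folds, each a list of structure indices.
    """
    members = defaultdict(list)
    for index, group in enumerate(groups):
        members[group].append(index)

    folds = [[] for _ in range(n_folds)]
    # Place the largest groups first, always into the currently smallest fold.
    for group in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(members[group])
    return folds


# Illustrative net labels, one per structure.
nets = ["pcu", "pcu", "dia", "sql", "dia", "pcu", "sql", "fcu"]
folds = grouped_folds(nets, n_folds=3)
print(folds)
```

Grouping by one column does not make the folds fully independent: two structures in different folds can still share a metal or organic building block, which is exactly the difficulty described in the warning above.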

References

[Moosavi2020]

[Boyd2019]

Boyd, P. G.; Chidambaram, A.; García-Díez, E.; Ireland, C. P.; Daff, T. D.; Bounds, R.; Gładysiak, A.; Schouwink, P.; Moosavi, S. M.; Maroto-Valer, M. M.; Reimer, J. A.; Navarro, J. A. R.; Woo, T. K.; Garcia, S.; Stylianou, K. C.; Smit, B. Data-Driven Design of Metal–Organic Frameworks for Wet Flue Gas CO2 Capture. Nature 2019, 576 (7786), 253–256.

[Boyd2016]

Boyd, P. G.; Woo, T. K. A Generalized Method for Constructing Hypothetical Nanoporous Materials of Any Net Topology from Graph Theory. CrystEngComm 2016, 18 (21), 3777–3792.

Construct an instance of the BW dataset.

Parameters:

version (str) – version number to use. Defaults to “v0.0.1”.
drop_graph_duplicates (bool) – If True, keep only one structure per decorated graph hash. Defaults to True.
subset (Collection[int], optional) – indices of the structures to include. Defaults to None.

Raises:

ValueError – If the provided version number is not available.

get_subset(indices)

Get a subset of the dataset.

Parameters:

indices (Collection[int]) – indices of the structures to include.

Returns:

a new dataset containing only the structures specified by the indices.

Return type:

AbstractStructureDataset

Structures from the ARABG database and labels from Moosavi et al.

class ARABGDataset(version='v0.0.1', drop_graph_duplicates=True, subset=None)

Exposes the ARABG dataset used in [Moosavi2020].

The raw labels and structures can also be accessed on MaterialsCloud.

It is a subset of the ARABG database [Anderson2018] with labels computed by Moosavi et al. Those labels deviate in value and computational approach from the original labels in [Anderson2018] but are consistent with the labels for the other databases in [Moosavi2020].

The available labels are:

‘pure_CO2_kH’: Henry coefficient of CO2 obtained by Widom method in mol kg-1 Pa-1
‘pure_CO2_widomHOA’: Heat of adsorption of CO2 obtained by Widom method in
‘pure_methane_kH’: Henry coefficient of methane obtained by Widom method in mol kg-1 Pa-1
‘pure_methane_widomHOA’: Heat of adsorption of methane obtained by Widom method
‘pure_uptake_CO2_298.00_15000’: Pure CO2 uptake at 298.00 K and 15000 Pa in mol kg-1
‘pure_uptake_CO2_298.00_1600000’: Pure CO2 uptake at 298.00 K and 1600000 Pa in mol kg-1
‘pure_uptake_methane_298.00_580000’: Pure methane uptake at 298.00 K and 580000 Pa in mol kg-1
‘pure_uptake_methane_298.00_6500000’: Pure methane uptake at 298.00 K and 6500000 Pa in mol kg-1
‘logKH_CO2’: Logarithm of Henry coefficient of CO2 obtained by Widom method
‘logKH_CH4’: Logarithm of Henry coefficient of methane obtained by Widom method
‘CH4DC’: CH4 deliverable capacity in vSTP/v
‘CH4HPSTP’: CH4 high pressure uptake in standard temperature and pressure in vSTP/v
‘CH4LPSTP’: CH4 low pressure uptake in standard temperature and pressure in vSTP/v

Note

The ARABG structures are hypothetical MOFs, therefore the following caveats apply:

It is well known that the data distribution can be quite different from experimental structures
The structures were only optimized using the UFF force field [UFF]
A time-based split cannot be used for hypothetical structures

Information about building blocks

You can find this information in the info.rcsr_code, info.metal_bb, info.organic_bb, and info.functional_group columns.

Warning

Danger of data leakage

References

[Moosavi2020]

[Anderson2018]

Anderson, R.; Rodgers, J.; Argueta, E.; Biong, A.; Gómez-Gualdrón, D. A. Role of Pore Chemistry and Topology in the CO2 Capture Capabilities of MOFs: From Molecular Simulation to Machine Learning. Chemistry of Materials 2018, 30 (18), 6325–6337. https://doi.org/10.1021/acs.chemmater.8b02257.

Construct an instance of the ARABG dataset.

Parameters:

version (str) – version number to use. Defaults to “v0.0.1”.
drop_graph_duplicates (bool) – If True, keep only one structure per decorated graph hash. Defaults to True.
subset (Collection[int], optional) – indices of the structures to include. Defaults to None.

Raises:

ValueError – If the provided version number is not available.

get_subset(indices)

Get a subset of the dataset.

Parameters:

indices (Collection[int]) – indices of the structures to include.

Returns:

a new dataset containing only the structures specified by the indices.

Return type:

AbstractStructureDataset

Structures from the ARC-MOF dataset.

class ARCMOFDataset(version='v0.0.1', drop_graph_duplicates=True, subset=None)

Implements access to a subset of structures and labels of the ARC-MOF dataset [Burner2022].

The subset consists of the structures for which the authors reported process properties and for which we could compute graph hashes and features.

Warning

ARC-MOF is “a database of ~280,000 MOFs which have been either experimentally characterized or computationally generated, spanning all publicly available MOF databases” [Burner2022].

Therefore, there will be significant overlap with the other datasets.

References

[Burner2022]

Burner, J.; Luo, J.; White, A.; Mirmiran, A.; Kwon, O.; Boyd, P. G.; Maley, S.; Gibaldi, M.; Simrod, S.; Ogden, V.; Woo, T. K. ChemRxiv 2022

Construct an instance of the ARC-MOF dataset.

Parameters:

version (str) – version number to use. Defaults to “v0.0.1”.
drop_graph_duplicates (bool) – If True, keep only one structure per decorated graph hash. Defaults to True.
subset (Collection[int], optional) – indices of the structures to include. Defaults to None.

Raises:

ValueError – If the provided version number is not available.

get_subset(indices)

Get a subset of the dataset.

Parameters:

indices (Collection[int]) – indices of the structures to include.

Returns:

a new dataset containing only the structures specified by the indices.

Return type:

AbstractStructureDataset

Thermal Stability Dataset.

class ThermalStabilityDataset(version='v0.0.1', drop_basename_duplicates=True, drop_graph_duplicates=True, subset=None, drop_nan=False)

Thermal stability for a subset of CoRE MOFs.

Reproduced from [Nandy2022]. Nandy et al. (2022) digitized traces from thermogravimetric analysis. The decomposition temperature they determined in this way is reported in outputs.assigned_T_decomp.

The years refer to the publication dates of the papers cross-referenced in the CSD entries of the structures.
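Publication years make a time-based split possible: train on structures published up to a cutoff year and test on later ones. The following is a minimal sketch of that idea; the function name `time_split` and the example years are hypothetical.

```python
def time_split(years, cutoff_year):
    """Split indices into (train, test): train up to and including cutoff_year."""
    train, test = [], []
    for index, year in enumerate(years):
        if year is None:
            continue  # skip entries without a known publication year
        (train if year <= cutoff_year else test).append(index)
    return train, test


# Placeholder publication years, one per structure.
years = [2009, 2015, None, 2012, 2018, 2016]
train_idx, test_idx = time_split(years, cutoff_year=2015)
print(train_idx, test_idx)
```

The index lists can then be passed to `get_subset` to build the two datasets.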

The available labels are:

outputs.assigned_T_decomp: Decomposition temperature in Kelvin.

References

[Nandy2022]

Nandy, A.; Terrones, G.; Arunachalam, N.; Duan, C.; Kastner, D. W.; Kulik, H. J. MOFSimplify, Machine Learning Models with Extracted Stability Data of Three Thousand Metal–Organic Frameworks. Scientific Data 2022, 9 (1).

Construct an instance of the ThermalStabilityDataset.

Parameters:

version (str) – version number to use. Defaults to “v0.0.1”.
drop_basename_duplicates (bool) – If True, keep only one structure per CSD basename. Defaults to True.
drop_graph_duplicates (bool) – If True, keep only one structure per decorated graph hash. Defaults to True.
subset (Collection[int], optional) – indices of the structures to include. Defaults to None.
drop_nan (bool) – If True, drop rows with NaN values in features or hashes. Defaults to False.

Raises:

ValueError – If the provided version number is not available.

get_subset(indices)

Get a subset of the dataset.

Parameters:

indices (Collection[int]) – indices of the structures to include.

Returns:

a new dataset containing only the structures specified by the indices.

Return type:

AbstractStructureDataset

Interface for creating a custom StructureDataset.

class StructureDataset(files, df=None, structure_name_column=None, year_column=None, label_columns=None, feature_columns=None, decorated_graph_hash_column=None, undecorated_graph_hash_column=None, decorated_scaffold_hash_column=None, undecorated_scaffold_hash_column=None, density_column=None)

Custom dataset class for loading structures from a list of files.

Initialize a dataset.

Parameters:

files (Collection[PathType]) – List of files to load structures from.
df (Optional[pd.DataFrame], optional) – Dataframe containing the structures. Defaults to None.
structure_name_column (str, optional) – Name of the column containing the structure names. Defaults to None.
year_column (str, optional) – Name of the column containing the year of the structure. Defaults to None.
label_columns (Optional[List[str]], optional) – List of columns containing the labels. Defaults to None.
feature_columns (Optional[List[str]], optional) – List of columns containing the features. Defaults to None.
decorated_graph_hash_column (str, optional) – Name of the column containing the decorated graph hash. Defaults to None.
undecorated_graph_hash_column (str, optional) – Name of the column containing the undecorated graph hash. Defaults to None.
decorated_scaffold_hash_column (str, optional) – Name of the column containing the decorated scaffold hash. Defaults to None.
undecorated_scaffold_hash_column (str, optional) – Name of the column containing the undecorated scaffold hash. Defaults to None.
density_column (str, optional) – Name of the column containing the density of the structure. Defaults to None.

classmethod from_folder_and_dataframe(folder, extension='cif', dataframe=None, structure_name_column=None, year_column=None, label_columns=None, decorated_graph_hash_column=None, undecorated_graph_hash_column=None, decorated_scaffold_hash_column=None, undecorated_scaffold_hash_column=None, density_column=None)

Create a dataset from a folder and a dataframe.

Parameters:

folder (PathType) – Path to the folder containing the structures.
extension (str) – Extension of the files. Defaults to ‘cif’.
dataframe (Optional[pd.DataFrame], optional) – Dataframe containing the structures. Defaults to None.
structure_name_column (str, optional) – Name of the column containing the structure names. Defaults to None.
year_column (str, optional) – Name of the column containing the year of the structure. Defaults to None.
label_columns (Optional[List[str]], optional) – List of columns containing the labels. Defaults to None.
decorated_graph_hash_column (str, optional) – Name of the column containing the decorated graph hash. Defaults to None.
undecorated_graph_hash_column (str, optional) – Name of the column containing the undecorated graph hash. Defaults to None.
decorated_scaffold_hash_column (str, optional) – Name of the column containing the decorated scaffold hash. Defaults to None.
undecorated_scaffold_hash_column (str, optional) – Name of the column containing the undecorated scaffold hash. Defaults to None.
density_column (str, optional) – Name of the column containing the density of the structure. Defaults to None.

Returns:

Dataset containing the structures.

Return type:

StructureDataset
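Before building a dataset from a folder and a table, it can help to check which structure names actually have a file on disk. The sketch below uses only the standard library. The helper names `collect_structure_files` and `match_names` are hypothetical, and matching names against file stems is an assumption of this sketch, not a documented behavior of the class.

```python
import tempfile
from pathlib import Path


def collect_structure_files(folder, extension="cif"):
    """Return a sorted list of files in folder with the given extension."""
    return sorted(Path(folder).glob(f"*.{extension}"))


def match_names(files, names):
    """Map each name to its file by file stem and list the names without a file."""
    by_stem = {path.stem: path for path in files}
    matched = {name: by_stem[name] for name in names if name in by_stem}
    missing = [name for name in names if name not in by_stem]
    return matched, missing


# Demonstration on a temporary folder with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for filename in ["a.cif", "b.cif", "notes.txt"]:
        (Path(tmp) / filename).write_text("")
    files = collect_structure_files(tmp)
    found_names = [path.name for path in files]
    matched, missing = match_names(files, ["a", "z"])
    matched_names = sorted(matched)

print(found_names, matched_names, missing)
```

A list of paths collected this way has the shape expected by the `files` parameter of `StructureDataset`, while `missing` points at rows in the table that have no structure file.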

class FrameDataset(df, structure_name_column, year_column=None, label_columns=None, decorated_graph_hash_column=None, undecorated_graph_hash_column=None, decorated_scaffold_hash_column=None, undecorated_scaffold_hash_column=None, density_column=None)

Dataset containing structure information read from a dataframe.

Initialize the dataset.

Parameters:

df (pd.DataFrame) – Dataframe containing the structures.
structure_name_column (str) – Name of the column containing the structure names.
year_column (str, optional) – Name of the column containing the year of the structure. Defaults to None.
label_columns (Optional[List[str]], optional) – List of columns containing the labels. Defaults to None.
decorated_graph_hash_column (str, optional) – Name of the column containing the decorated graph hash. Defaults to None.
undecorated_graph_hash_column (str, optional) – Name of the column containing the undecorated graph hash. Defaults to None.
decorated_scaffold_hash_column (str, optional) – Name of the column containing the decorated scaffold hash. Defaults to None.
undecorated_scaffold_hash_column (str, optional) – Name of the column containing the undecorated scaffold hash. Defaults to None.
density_column (str, optional) – Name of the column containing the density of the structure. Defaults to None.

get_subset(indices)

Get a subset of the dataset.

Parameters:

indices (Collection[int]) – indices of the structures to include.

Returns:

a new dataset containing only the structures specified by the indices.

Return type:

FrameDataset