Featurizers#

Many of the descriptors implemented in mofdscribe have been discussed in our 2020 Chem. Rev. article.

| ID | Considers Geometry | Considers Structure Graph | Encodes Chemistry | Scalar | Scope |
| --- | --- | --- | --- | --- | --- |
| AccessibleVolume | True | False | False | False | global |
| AcidGroupCounter | False | True | False | True | bu |
| AMD | True | False | optionally | False | local |
| APRDF | True | False | optionally | False | global |
| Asphericity | True | True | False | True | bu |
| AtomCenteredPH | True | False | optionally | False | global |
| BaseGroupCounter | False | True | False | True | bu |
| BUMatch | True | True | False | True | bu |
| DiskLikeness | True | True | False | True | bu |
| Eccentricity | True | True | False | True | bu |
| EnergyGridHistogram | True | False | True | False | global |
| GuestCenteredAPRDF | True | False | optionally | False | global |
| Henry | True | False | True | False | global |
| InertialShapeFactor | True | True | False | True | bu |
| LSOP | True | False | False | False | bu |
| MOFDescriber | True | True | True | False | global |
| NConf20 | True | True | False | True | bu |
| NPR1 | True | True | False | True | bu |
| NPR2 | True | True | False | True | bu |
| PairwiseDistanceHist | True | False | False | False | bu |
| PairwiseDistanceStats | True | False | False | False | bu |
| PartialChargeHistogram | True | False | True | False | global |
| PartialChargeStats | True | False | True | False | global |
| PHHist | True | False | optionally | False | global |
| PHImage | True | False | optionally | False | global |
| PHStats | True | False | optionally | False | global |
| PHVect | True | False | optionally | False | global |
| PMI1 | True | True | False | True | bu |
| PMI2 | True | True | False | True | bu |
| PMI3 | True | True | False | True | bu |
| PoreDiameters | True | False | False | False | global |
| PoreSizeDistribution | True | False | False | False | global |
| PriceLowerBound | False | False | True | True | global |
| RACS | False | True | optionally | False | local |
| RadiusOfGyration | True | True | False | True | bu |
| RayTracingHistogram | True | False | False | False | global |
| RodLikeness | True | True | False | True | bu |
| SmartsMatchCounter | False | True | False | True | bu |
| SphereLikeness | True | True | False | True | bu |
| SpherocityIndex | True | True | False | True | bu |
| SurfaceArea | True | False | False | False | global |
| VoxelGrid | True | False | optionally | False | global |

Warning

Featurizers differ in how they treat solvent molecules. The RACs and SBU-centered features ignore floating solvent molecules. Other featurizers, e.g., the APRDF or the pore geometry descriptors, treat floating solvent molecules in the same way as the framework.

If you do not want solvent molecules to affect the featurization, you will have to remove them from the structure.

Bound solvent is not suppressed by any of the featurizers.

To identify bound and unbound solvents, you might find the moffragmentor package useful.

Note

You might be surprised to see new folders appear and disappear in your working directory when you run certain featurizers. This is expected: some featurizers need to write temporary files, and they create temporary folders for them. You can ignore these folders; they are cleaned up automatically.

Atom-centered featurizers#

A key approximation for machine learning in chemistry is the locality approximation. Effectively, it allows models to be trained on small fragments and then (hopefully) used to predict the properties of larger structures.
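To make the locality idea concrete, here is a minimal sketch of an atom-centered descriptor: a neighbor count within a cutoff radius, pooled into one number per structure. It is a toy illustration, not one of the featurizers from the table; the function names, the cutoff, and the non-periodic toy fragment are assumptions made for this example.

```python
import math


def coordination_numbers(coords, cutoff):
    """For every atom, count the neighbors closer than `cutoff` (no periodicity)."""
    counts = []
    for i, a in enumerate(coords):
        n = sum(
            1
            for j, b in enumerate(coords)
            if i != j and math.dist(a, b) < cutoff
        )
        counts.append(n)
    return counts


def mean_pool(values):
    """Aggregate per-atom values into a single structure-level feature."""
    return sum(values) / len(values)


# Toy fragment (assumed): a linear chain of four atoms spaced 1.5 Å apart.
fragment = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (4.5, 0.0, 0.0)]

local = coordination_numbers(fragment, cutoff=2.0)
structure_feature = mean_pool(local)
```

Because each per-atom value depends only on atoms inside the cutoff, the same function can be applied to a fragment or to a much larger structure.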

Global featurizers#

For porous materials in particular, some properties are not local. For instance, the pore geometry (key for gas adsorption) cannot be captured by a descriptor that only considers the local environment (of, e.g., 3 atoms). For this reason, it can make sense to also use features that describe the structure as a whole.

BU-centered featurizers#

Reticular chemistry describes materials built via a tinker-toy approach. Hence, a natural approach is to focus on the building blocks.

mofdscribe can compute descriptors that are BU-centered, for instance, using RDKit descriptors on the building blocks. However, you are not limited to descriptors operating on molecules: you can convert any featurizer into a BU-centered featurizer:

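from matminer.featurizers.site import SOAP
from matminer.featurizers.structure import SiteStatsFingerprint
from pymatgen.core import Structure
from mofdscribe.featurizers.bu import BUFeaturizer

# Load the MOF to featurize; "HKUST-1.cif" is a placeholder path.
hkust_structure = Structure.from_file("HKUST-1.cif")

# Wrap a site featurizer so that it yields structure-level statistics.
base_feat = SiteStatsFingerprint(SOAP.from_preset("formation_energy"))
base_feat.fit([hkust_structure])

# Apply the base featurizer to every building block and average the results.
featurizer = BUFeaturizer(base_feat, aggregations=("mean",))
features = featurizer.featurize(structure=hkust_structure)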

For this, you can either provide building blocks that you extracted with any of the available tools, or use our integration with the moffragmentor package. In the latter case, we fragment the MOF into its building blocks, compute the features for each building block, and let you choose how to aggregate them.
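The aggregation step can be pictured with a small, library-independent sketch. It assumes the per-building-block features are already available as rows of numbers; the `aggregate` helper, the `AGGREGATORS` mapping, and the output naming scheme are hypothetical and are not mofdscribe's API.

```python
from statistics import mean

# Hypothetical per-building-block feature vectors (one row per building block).
bu_features = [
    [1.0, 4.0],
    [3.0, 2.0],
    [2.0, 6.0],
]

# Names of supported aggregations mapped to the functions implementing them.
AGGREGATORS = {"mean": mean, "min": min, "max": max}


def aggregate(rows, aggregations=("mean",)):
    """Collapse per-building-block rows into one value per feature and aggregation."""
    columns = list(zip(*rows))  # transpose: one tuple per feature
    out = {}
    for name in aggregations:
        func = AGGREGATORS[name]
        for idx, col in enumerate(columns):
            out[f"feature_{idx}_{name}"] = func(col)
    return out


pooled = aggregate(bu_features, aggregations=("mean", "max"))
```

Aggregating makes the length of the final feature vector independent of how many building blocks a structure contains, at the cost of discarding which block contributed which value.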

Host-Guest Featurization#

If you have structures loaded with guest molecules as input, you might find the HostGuestFeaturizer convenient. This featurizer will automatically extract the host and guest structures from the input structures and then featurize them separately. If there are multiple guests in the input structures, the featurizer will featurize each guest separately and then aggregate the features.

from matminer.featurizers.structure.sites import SiteStatsFingerprint

from mofdscribe.featurizers.hostguest import HostGuestFeaturizer

# `structure` is assumed to be a pymatgen Structure of the guest-loaded framework.
featurizer = HostGuestFeaturizer(
    featurizer=SiteStatsFingerprint.from_preset("SOAP_formation_energy"),
    aggregations=("mean",),
)
featurizer.fit([structure])
features = featurizer.featurize(structure)

If you are interested in surface chemistry features, you might also find suitable featurizers in the matminer package.

mofdscribe also implements some featurizers that are specifically designed for host-guest systems.

Encoding of chemistry#

Many featurizers that traditionally do not capture the chemistry of a structure are implemented in mofdscribe in a way that still allows them to capture it. One example is the topology-based descriptors.

In those cases, we encode the chemistry by computing the descriptor for substructures made up of subsets of element types; for instance, for the metal or the organic substructure.
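The following minimal sketch illustrates this idea with a purely geometric toy descriptor (the mean pairwise distance) that is evaluated once per element subset. The `METALS` set, the helper names, and the toy coordinates are assumptions made for this example; the actual featurizers use their own descriptors and their own definitions of the substructures.

```python
import math
from itertools import combinations

# Illustrative set of metal symbols (assumption, not an exhaustive list).
METALS = {"Cu", "Zn", "Zr"}


def mean_pairwise_distance(coords):
    """Chemistry-agnostic toy descriptor: mean distance over all atom pairs."""
    pairs = list(combinations(coords, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def featurize_by_subset(atoms):
    """Evaluate the same descriptor on the metal, organic, and full substructure."""
    subsets = {
        "metal": [xyz for el, xyz in atoms if el in METALS],
        "organic": [xyz for el, xyz in atoms if el not in METALS],
        "all": [xyz for _, xyz in atoms],
    }
    return {name: mean_pairwise_distance(c) for name, c in subsets.items()}


# Toy structure (assumed): element symbol and Cartesian coordinates in Å.
atoms = [
    ("Cu", (0.0, 0.0, 0.0)),
    ("Cu", (3.0, 0.0, 0.0)),
    ("C", (0.0, 4.0, 0.0)),
    ("O", (3.0, 4.0, 0.0)),
]

features = featurize_by_subset(atoms)
```

The descriptor itself never looks at element symbols; the chemistry enters only through the choice of which atoms are passed to it.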

[Figure: Chemistry encoding]