Splitters#

Classes that help perform cross-validation.

Our splitters attempt to reduce the potential for data leakage by using grouping by default, prioritizing grouping over stratification or exactly matching the requested train/test ratio.

See also the sklearn docs.

Warning

Due to the grouping operations, the train/test ratios the methods produce will not exactly match the ones you requested. For this reason, please check the lengths of the train/test/valid indices the methods return.
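For example, a quick sanity check of the realized ratio might look like this (a minimal sketch; splitter stands for any of the splitter instances constructed below):

    train_idx, test_idx = splitter.train_test_split(frac_train=0.7)

    # The realized fraction can deviate from the requested 0.7 because
    # whole groups are kept together; always check the actual lengths.
    realized = len(train_idx) / (len(train_idx) + len(test_idx))
    print(f"requested 0.70, realized {realized:.2f}")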

class DensitySplitter(ds, density_q=None, shuffle=True, random_state=None, sample_frac=1.0, stratification_col=None, center=<function median>, q=(0, 0.25, 0.5, 0.75, 1), sort_by_len=True)[source]#

Splitter that uses the density of the structures to split the data.

For this, we sort structures according to their density and then group them based on the density. You can modify the number of groups using the density_q parameter; its values indicate the quantiles we use for the grouping.

This ensures that the validation is quite stringent as the different folds will have different densities.

The motivations for doing this are:

  • density is often one of the most important descriptors for gas uptake properties.

  • there is often a very large difference in the density distribution between hypothetical and experimental databases.

Initialize the DensitySplitter class.

Parameters:
  • ds (AbstractStructureDataset) – A structure dataset. The BaseSplitter only requires the length magic method to be implemented. However, other splitters might require additional methods.

  • density_q (Collection[float], optional) – List of quantiles used for quantile binning for the density. Defaults to None. If None, then we use two bins for test/train split, three for validation/train/test split and k for k-fold.

  • shuffle (bool) – If True, perform a shuffled split. Defaults to True.

  • random_state (Union[int, np.random.RandomState], optional) – Random state for the shuffling. Defaults to None.

  • sample_frac (float, optional) – This can be used for downsampling. It will randomly select a subset of indices from all indices before splitting. For instance, sample_frac=0.8 will randomly select 80% of the indices before splitting. Defaults to 1.0.

  • stratification_col (Union[str, np.typing.ArrayLike], optional) – Data used for stratification. If it is categorical (see mofdscribe.splitters.utils.is_categorical()) then we directly use it for stratification. Otherwise, we use quantile binning. Defaults to None.

  • center (callable) – Aggregation function to compute a measure of centrality of all the points in a group such that this can then be used for stratification. This is only used for continuous inputs. For categorical inputs, we always use the mode. Defaults to np.median.

  • q (Collection[float], optional) – List of quantiles used for quantile binning. Defaults to (0, 0.25, 0.5, 0.75, 1).

  • sort_by_len (bool) – If True, sort the splits by length. (Applies to the train/test/valid and train/test splits). Defaults to True.
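A minimal usage sketch (the CoREDataset class and the exact import paths are assumptions; any AbstractStructureDataset exposing densities should work):

    from mofdscribe.datasets import CoREDataset  # assumed dataset class
    from mofdscribe.splitters import DensitySplitter  # assumed import path

    ds = CoREDataset()
    # Two quantile bins (below/above the median density) act as groups,
    # so train and test come from different density regimes.
    splitter = DensitySplitter(ds, density_q=(0, 0.5, 1), random_state=42)
    train_idx, test_idx = splitter.train_test_split(frac_train=0.7)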

class HashSplitter(ds, hash_type='undecorated_scaffold_hash', shuffle=True, random_state=None, sample_frac=1.0, stratification_col=None, center=<function median>, q=(0, 0.25, 0.5, 0.75, 1), sort_by_len=True)[source]#

Splitter that uses Weisfeiler-Lehman graph hashes [WL] to split the data in more stringent ways.

Note that the hashes we use do not allow for a meaningful measure of similarity. That is, there is no way to measure a distance between two hash strings; the only meaningful comparison is whether they are identical or not.

Note

Weisfeiler-Lehman graph hashes do not give a guarantee for graph isomorphism. That is, there might be identical hashes that do not correspond to isomorphic graphs.

Note

There are certain graphs that a Weisfeiler-Lehman test cannot distinguish [Bouritsas].

Note

We speak about Weisfeiler-Lehman hashes as they are the defaults for the mofdscribe datasets. However, you can also override this method with a custom hashing function.

Initialize a HashSplitter.

Parameters:
  • ds (AbstractStructureDataset) – A structure dataset. The BaseSplitter only requires the length magic method to be implemented. However, other splitters might require additional methods.

  • hash_type (str) – Hash type to use. Must be one of undecorated_scaffold_hash, decorated_graph_hash, decorated_scaffold_hash, or undecorated_graph_hash. Defaults to “undecorated_scaffold_hash”.

  • shuffle (bool) – If True, perform a shuffled split. Defaults to True.

  • random_state (Union[int, np.random.RandomState], optional) – Random state for the shuffling. Defaults to None.

  • sample_frac (float, optional) – This can be used for downsampling. It will randomly select a subset of indices from all indices before splitting. For instance, sample_frac=0.8 will randomly select 80% of the indices before splitting. Defaults to 1.0.

  • stratification_col (Union[str, np.typing.ArrayLike], optional) – Data used for stratification. If it is categorical (see mofdscribe.splitters.utils.is_categorical()) then we directly use it for stratification. Otherwise, we use quantile binning. Defaults to None.

  • center (callable, optional) – Aggregation function to compute a measure of centrality of all the points in a group such that this can then be used for stratification. This is only used for continuous inputs. For categorical inputs, we always use the mode. Defaults to np.median.

  • q (Collection[float], optional) – List of quantiles used for quantile binning. Defaults to (0, 0.25, 0.5, 0.75, 1).

  • sort_by_len (bool) – If True, sort the splits by length. (Applies to the train/test/valid and train/test splits). Defaults to True.
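A hedged usage sketch (the import path is an assumption; ds is any AbstractStructureDataset that provides the hashes):

    from mofdscribe.splitters import HashSplitter  # assumed import path

    # Grouping by graph hash keeps (near-)duplicate structures on one side
    # of the split, which makes the evaluation more stringent.
    splitter = HashSplitter(ds, hash_type="decorated_graph_hash", random_state=42)
    train_idx, valid_idx, test_idx = splitter.train_valid_test_split(
        frac_train=0.7, frac_valid=0.1
    )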

class TimeSplitter(ds, year_q=None, shuffle=True, random_state=None, sample_frac=1.0, stratification_col=None, center=<function median>, q=(0, 0.25, 0.5, 0.75, 1), sort_by_len=True)[source]#

This splitter sorts structures according to their publication date.

That is, the training set will contain structures that are “older” (have been discovered earlier) than the ones in the test set. This can mimic real-life model development conditions [MoleculeNet].

It has, for instance, also been used with ICSD data in [Palizhati] and was the focus of [Sheridan].

Initialize the TimeSplitter class.

Parameters:
  • ds (AbstractStructureDataset) – A structure dataset. The BaseSplitter only requires the length magic method to be implemented. However, other splitters might require additional methods.

  • year_q (Collection[float]) – List of quantiles used for quantile binning on the years. Defaults to None. If None, then we use two bins for test/train split, three for validation/train/test split and k for k-fold.

  • shuffle (bool) – If True, perform a shuffled split. Defaults to True.

  • random_state (Union[int, np.random.RandomState], optional) – Random state for the shuffling. Defaults to None.

  • sample_frac (float, optional) – This can be used for downsampling. It will randomly select a subset of indices from all indices before splitting. For instance, sample_frac=0.8 will randomly select 80% of the indices before splitting. Defaults to 1.0.

  • stratification_col (Union[str, np.typing.ArrayLike], optional) – Data used for stratification. If it is categorical (see mofdscribe.splitters.utils.is_categorical()) then we directly use it for stratification. Otherwise, we use quantile binning. Defaults to None.

  • center (callable) – Aggregation function to compute a measure of centrality of all the points in a group such that this can then be used for stratification. This is only used for continuous inputs. For categorical inputs, we always use the mode. Defaults to np.median.

  • q (Collection[float], optional) – List of quantiles used for quantile binning. Defaults to (0, 0.25, 0.5, 0.75, 1).

  • sort_by_len (bool) – If True, sort the splits by length. (Applies to the train/test/valid and train/test splits). Defaults to True.

class BaseSplitter(ds, shuffle=True, random_state=None, sample_frac=1.0, stratification_col=None, center=<function median>, q=(0, 0.25, 0.5, 0.75, 1), sort_by_len=True)[source]#

A BaseSplitter implements the basic logic for dataset partition as well as k-fold cross-validation.

Classes that inherit from this class typically implement the following methods:

  • _get_stratification_col: Should return an ArrayLike object of floats, categories, or ints. If it is categorical data, the BaseSplitter will handle the discretization.

  • _get_groups: Should return an ArrayLike object of categories (integers or strings).

Internally, the BaseSplitter uses those to group and/or stratify the splits.
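For illustration, a toy subclass might look as follows (a sketch only: the import path and the _ds attribute name are assumptions, and the grouping here is deliberately meaningless):

    import numpy as np

    from mofdscribe.splitters import BaseSplitter  # assumed import path

    class ParitySplitter(BaseSplitter):
        """Toy splitter that groups structures by the parity of their index."""

        def _get_groups(self):
            # One categorical label per structure; the BaseSplitter keeps all
            # members of a group on the same side of every split.
            return np.array([i % 2 for i in range(len(self._ds))])  # `_ds` assumed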

Initialize a BaseSplitter.

Parameters:
  • ds (AbstractStructureDataset) – A structure dataset. The BaseSplitter only requires the length magic method to be implemented. However, other splitters might require additional methods.

  • shuffle (bool) – If True, perform a shuffled split. Defaults to True.

  • random_state (Optional[Union[int, np.random.RandomState]], optional) – Random state for the shuffling. Defaults to None.

  • sample_frac (Optional[float], optional) – This can be used for downsampling. It will randomly select a subset of indices from all indices before splitting. For instance, sample_frac=0.8 will randomly select 80% of the indices before splitting. Defaults to 1.0.

  • stratification_col (Union[str, np.typing.ArrayLike], optional) – Data used for stratification. If it is categorical (see mofdscribe.splitters.utils.is_categorical()) then we directly use it for stratification. Otherwise, we use quantile binning. Defaults to None.

  • center (callable) – Aggregation function to compute a measure of centrality of all the points in a group such that this can then be used for stratification. This is only used for continuous inputs. For categorical inputs, we always use the mode. Defaults to np.median.

  • q (Collection[float], optional) – List of quantiles used for quantile binning. Defaults to (0, 0.25, 0.5, 0.75, 1).

  • sort_by_len (bool) – If True, sort the splits by length. (Applies to the train/test/valid and train/test splits). Defaults to True.

train_test_split(frac_train=0.7)[source]#

Perform a train/test partition.

Parameters:

frac_train (float) – Fraction of the data to use for the training set. Defaults to 0.7.

Returns:

Train indices, test indices

Return type:

Tuple[Collection[int], Collection[int]]

train_valid_test_split(frac_train=0.7, frac_valid=0.1)[source]#

Perform a train/valid/test partition.

Parameters:
  • frac_train (float) – Fraction of data to use for the training set. Defaults to 0.7.

  • frac_valid (float) – Fraction of data to use for the validation set. Defaults to 0.1.

Returns:

Training, validation, test set.

Return type:

Tuple[Collection[int], Collection[int], Collection[int]]

k_fold(k=5)[source]#

Perform k-fold cross-validation.

Parameters:

k (int) – Number of folds. Defaults to 5.

Yields:

Tuple[Collection[int], Collection[int]] – Train indices, test indices.

Return type:

Tuple[Collection[int], Collection[int]]
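For example (a sketch, assuming splitter is any BaseSplitter instance):

    for fold, (train_idx, test_idx) in enumerate(splitter.k_fold(k=5)):
        # Fold sizes vary because whole groups stay together.
        print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test")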

class KennardStoneSplitter(ds, feature_names, shuffle=True, random_state=None, sample_frac=1.0, scale=True, centrality_measure='mean', metric='euclidean', ascending=False)[source]#

Run the Kennard-Stone sampling algorithm [KennardStone].

The algorithm selects samples with uniform coverage. The initial samples are biased towards the boundaries of the dataset. Hence, it might be biased by outliers.

This algorithm ensures a flat coverage of the dataset. It is also known as the CADEX algorithm and was later refined into the DUPLEX algorithm [Snee].

Warning

This splitter can be slow for large datasets as it requires computing distance matrices N times for a dataset with N structures.

Warning

Stratification is not supported for this splitter.

Warning

I couldn’t find a good reference for the k-fold version of this algorithm.

Construct a KennardStoneSplitter.

Parameters:
  • ds (AbstractStructureDataset) – A structure dataset. The BaseSplitter only requires the length magic method to be implemented. However, other splitters might require additional methods.

  • feature_names (List[str]) – Names of features to consider.

  • shuffle (bool) – If True, perform a shuffled split. Defaults to True.

  • random_state (Union[int, np.random.RandomState], optional) – Random state for the shuffling. Defaults to None.

  • sample_frac (float, optional) – This can be used for downsampling. It will randomly select a subset of indices from all indices before splitting. For instance, sample_frac=0.8 will randomly select 80% of the indices before splitting. Defaults to 1.0.

  • scale (bool) – If True, apply z-score normalization prior to running the sampling. Defaults to True.

  • centrality_measure (str) – The first sample is selected to be maximally distant from this value. It can be one of “mean”, “median”, “random”. In the case of “random”, we simply select a random point. In the case of “mean” and “median”, the initial point is maximally distant from the mean and median of the feature matrix, respectively. Defaults to “mean”.

  • metric (Union[Callable, str]) – The distance metric to use. If a string, the distance function can be ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulsinski’, ‘kulczynski1’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’. Defaults to “euclidean”.

  • ascending (bool) – If True, sort samples in ascending distance to the center. That is, the first samples (maximally distant from the center) would be sampled last. Defaults to False.
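A hedged usage sketch (the import path and the feature names are assumptions; the names must exist in the dataset):

    from mofdscribe.splitters import KennardStoneSplitter  # assumed import path

    splitter = KennardStoneSplitter(
        ds,
        feature_names=["density", "surface_area"],  # hypothetical feature columns
        centrality_measure="mean",
        metric="euclidean",
    )
    train_idx, test_idx = splitter.train_test_split(frac_train=0.7)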

get_sorted_indices(ds)[source]#

Return a list of indices, sorted by similarity using the Kennard-Stone algorithm.

The first sample will be maximally distant from the center.

Parameters:

ds (AbstractStructureDataset) – A mofdscribe AbstractStructureDataset

Returns:

Sorted indices.

Return type:

Collection[int]

train_test_split(frac_train=0.7)[source]#

Perform a train/test partition.

Parameters:

frac_train (float) – Fraction of the data to use for the training set. Defaults to 0.7.

Returns:

Train indices, test indices

Return type:

Tuple[Collection[int], Collection[int]]

train_valid_test_split(frac_train=0.7, frac_valid=0.1)[source]#

Perform a train/valid/test partition.

Parameters:
  • frac_train (float) – Fraction of data to use for the training set. Defaults to 0.7.

  • frac_valid (float) – Fraction of data to use for the validation set. Defaults to 0.1.

Returns:

Training, validation, test set.

Return type:

Tuple[Collection[int], Collection[int], Collection[int]]

k_fold(k=5)[source]#

Perform k-fold cross-validation.

Parameters:

k (int) – Number of folds. Defaults to 5.

Yields:

Tuple[Collection[int], Collection[int]] – Train indices, test indices.

Return type:

Tuple[Collection[int], Collection[int]]

class ClusterSplitter(ds, feature_names, shuffle=True, random_state=None, sample_frac=1.0, stratification_col=None, center=<function median>, q=(0, 0.25, 0.5, 0.75, 1), sort_by_len=False, scaled=True, n_pca_components='mle', n_clusters=4, pca_kwargs=None, kmeans_kwargs=None)[source]#

Split the data into clusters and use the clusters as groups.

The approach has been proposed on Kaggle. In principle, we perform the following steps:

  1. Scale the data (optional).

  2. Perform PCA for de-correlation.

  3. Perform k-means clustering.
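Conceptually, these three steps map onto plain sklearn as follows (an illustrative sketch of the idea on toy data, not the splitter’s exact internals):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X = np.random.default_rng(0).normal(size=(200, 10))  # toy feature matrix
    X_scaled = StandardScaler().fit_transform(X)  # 1. scale (optional)
    X_pca = PCA(n_components="mle").fit_transform(X_scaled)  # 2. de-correlate
    groups = KMeans(n_clusters=4, random_state=42).fit_predict(X_pca)  # 3. cluster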

Construct a ClusterSplitter.

Parameters:
  • ds (AbstractStructureDataset) – A structure dataset. The BaseSplitter only requires the length magic method to be implemented. However, other splitters might require additional methods.

  • feature_names (List[str]) – Names of features to consider.

  • shuffle (bool) – If True, perform a shuffled split. Defaults to True.

  • random_state (int) – Random state for the shuffling. Defaults to None.

  • sample_frac (float, optional) – This can be used for downsampling. It will randomly select a subset of indices from all indices before splitting. For instance, sample_frac=0.8 will randomly select 80% of the indices before splitting. Defaults to 1.0.

  • stratification_col (Union[str, np.typing.ArrayLike], optional) – Data used for stratification. If it is categorical (see mofdscribe.splitters.utils.is_categorical()) then we directly use it for stratification. Otherwise, we use quantile binning. Defaults to None.

  • center (callable) – Aggregation function to compute a measure of centrality of all the points in a group such that this can then be used for stratification. This is only used for continuous inputs. For categorical inputs, we always use the mode. Defaults to np.median.

  • q (Collection[float], optional) – List of quantiles used for quantile binning. Defaults to (0, 0.25, 0.5, 0.75, 1).

  • sort_by_len (bool) – If True, sort the splits by length. (Applies to the train/test/valid and train/test splits). Defaults to False.

  • scaled (bool) – If True, scale the data before clustering. Defaults to True.

  • n_pca_components (Union[int, str]) – Number of components to use for PCA. If “mle”, the number of components is estimated using Minka’s MLE (as in sklearn’s PCA). Defaults to “mle”.

  • n_clusters (int) – Number of clusters to use. Defaults to 4.

  • pca_kwargs (Dict[str, Any]) – Keyword arguments to pass to PCA. Defaults to None.

  • kmeans_kwargs (Dict[str, Any]) – Keyword arguments to pass to k-means. Defaults to None.

class LOCOCV(ds, feature_names, shuffle=True, random_state=None, sample_frac=1.0, scaled=True, n_pca_components='mle', pca_kwargs=None, kmeans_kwargs=None)[source]#

Leave-one-cluster-out cross-validation.

The general idea has been discussed before, e.g. in [Kramer]. Perhaps more widely used in the materials community is [Meredig]. Here, we perform PCA, followed by k-means clustering.

  • Where k = 2 for a train/test split

  • Where k = 3 for a train/valid/test split

  • Where k = k for k-fold cross-validation

By default, we will sort outputs such that the cluster sizes are train >= test >= valid.

Construct a LOCOCV.

Parameters:
  • ds (AbstractStructureDataset) – A structure dataset. The BaseSplitter only requires the length magic method to be implemented. However, other splitters might require additional methods.

  • feature_names (List[str]) – Names of features to consider.

  • shuffle (bool) – If True, perform a shuffled split. Defaults to True.

  • random_state (int) – Random state for the shuffling. Defaults to None.

  • sample_frac (float, optional) – This can be used for downsampling. It will randomly select a subset of indices from all indices before splitting. For instance, sample_frac=0.8 will randomly select 80% of the indices before splitting. Defaults to 1.0.

  • scaled (bool) – If True, scale the data before clustering. Defaults to True.

  • n_pca_components (Union[int, str]) – Number of components to use for PCA. If “mle”, the number of components is estimated using Minka’s MLE (as in sklearn’s PCA). Defaults to “mle”.

  • pca_kwargs (Dict[str, Any], optional) – Additional keyword arguments for sklearn’s sklearn.decomposition.PCA. Defaults to None.

  • kmeans_kwargs (Dict[str, Any], optional) – Additional keyword arguments for sklearn’s sklearn.cluster.KMeans. Defaults to None.
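A minimal usage sketch (import path and feature names are assumptions):

    from mofdscribe.splitters import LOCOCV  # assumed import path

    splitter = LOCOCV(ds, feature_names=["density", "surface_area"])  # hypothetical features
    # Internally clusters the data into k = 2 groups; by default the larger
    # cluster becomes the training set.
    train_idx, test_idx = splitter.train_test_split()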

train_test_split()[source]#

Perform a train/test partition.

Returns:

Train indices, test indices

Return type:

Tuple[Collection[int], Collection[int]]

train_valid_test_split()[source]#

Perform a train/valid/test partition.

Returns:

Training, validation, test set.

Return type:

Tuple[Collection[int], Collection[int], Collection[int]]

k_fold(k)[source]#

Perform k-fold cross-validation.

Parameters:

k (int) – Number of folds.

Yields:

Iterator[Tuple[Collection[int], Collection[int]]] – Train indices, test indices.

Return type:

Tuple[Collection[int], Collection[int]]

class ClusterStratifiedSplitter(ds, feature_names, shuffle=True, random_state=None, sample_frac=1.0, scaled=True, n_pca_components='mle', n_clusters=4, pca_kwargs=None, kmeans_kwargs=None)[source]#

Split the data into clusters and stratify on those clusters.

The approach has been proposed on Kaggle. In principle, we perform the following steps:

  1. Scale the data (optional).

  2. Perform PCA for de-correlation.

  3. Perform k-means clustering.

Construct a ClusterStratifiedSplitter.

Parameters:
  • ds (AbstractStructureDataset) – A structure dataset. The BaseSplitter only requires the length magic method to be implemented. However, other splitters might require additional methods.

  • feature_names (List[str]) – Names of features to consider.

  • shuffle (bool) – If True, perform a shuffled split. Defaults to True.

  • random_state (int) – Random state for the shuffling. Defaults to None.

  • sample_frac (float, optional) – This can be used for downsampling. It will randomly select a subset of indices from all indices before splitting. For instance, sample_frac=0.8 will randomly select 80% of the indices before splitting. Defaults to 1.0.

  • scaled (bool) – If True, scale the data before clustering. Defaults to True.

  • n_pca_components (Union[int, str]) – Number of components to use for PCA. If “mle”, the number of components is estimated using Minka’s MLE (as in sklearn’s PCA). Defaults to “mle”.

  • n_clusters (int) – Number of clusters to use. Defaults to 4.

  • pca_kwargs (Dict[str, Any]) – Keyword arguments to pass to PCA. Defaults to None.

  • kmeans_kwargs (Dict[str, Any]) – Keyword arguments to pass to k-means. Defaults to None.

Helper functions for the splitters.

Some of these methods might also be useful for constructing nested cross-validation loops.

kennard_stone_sampling(X, scale=True, centrality_measure='mean', metric='euclidean')[source]#

Run the Kennard-Stone sampling algorithm [KennardStone].

The algorithm selects samples with uniform coverage. The initial samples are biased towards the boundaries of the dataset.

Note

You might also find this algorithm useful for creating a “diverse” sample of points, e.g., to initialize an active learning loop.

Warning

This algorithm has a high computational complexity. It is not recommended for large datasets.

Parameters:
  • X (ArrayLike) – Input feature matrix.

  • scale (bool) – If True, apply z-score normalization prior to running the sampling. Defaults to True.

  • centrality_measure (str) – The first sample is selected to be maximally distant from this value. It can be one of “mean”, “median”, “random”. In the case of “random”, we simply select a random point. In the case of “mean” and “median”, the initial point is maximally distant from the mean and median of the feature matrix, respectively. Defaults to “mean”.

  • metric (Union[Callable, str]) –

    The distance metric to use. If a string, the distance function can be ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulsinski’, ‘kulczynski1’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’. See scipy.spatial.distance.cdist().

    Defaults to “euclidean”.

Raises:

ValueError – If a non-implemented centrality measure is used.

Returns:

Indices sorted by their max-min distance.

Return type:

List[int]
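For example (a self-contained sketch on random data):

    import numpy as np

    from mofdscribe.splitters.utils import kennard_stone_sampling

    X = np.random.default_rng(0).normal(size=(100, 5))
    sorted_idx = kennard_stone_sampling(X, scale=True, centrality_measure="mean")

    # The first indices form a maximally spread subset, e.g. to seed
    # an active learning loop.
    seed_points = sorted_idx[:10]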

pca_kmeans(X, scaled, n_pca_components, n_clusters, random_state=None, pca_kwargs=None, kmeans_kwargs=None)[source]#

Run principal component analysis (PCA) followed by K-means clustering on the data.

Uses sklearn’s implementations of PCA and k-means.

Parameters:
  • X (np.ndarray) – Input data.

  • scaled (bool) – If True, use standard scaling for clustering.

  • n_pca_components (Union[int, str]) – Number of principal components to keep.

  • n_clusters (int) – Number of clusters.

  • random_state (Optional[Union[int, np.random.RandomState]], optional) – Random state for sklearn. Defaults to None.

  • pca_kwargs (Dict[str, Any], optional) – Additional keyword arguments for sklearn’s sklearn.decomposition.PCA. Defaults to None.

  • kmeans_kwargs (Dict[str, Any], optional) – Additional keyword arguments for sklearn’s sklearn.cluster.KMeans. Defaults to None.

Returns:

Cluster indices.

Return type:

np.ndarray
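For example (a sketch on random data):

    import numpy as np

    from mofdscribe.splitters.utils import pca_kmeans

    X = np.random.default_rng(0).normal(size=(200, 10))
    labels = pca_kmeans(
        X, scaled=True, n_pca_components=5, n_clusters=4, random_state=42
    )
    # `labels` assigns each row to one of four clusters; these can serve
    # as groups for a grouped split.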

is_categorical(x)[source]#

Return True if x is categorical or composed of integers.

Return type:

bool

stratified_train_test_partition(idxs, stratification_col, train_size, valid_size, test_size, shuffle=True, random_state=None, q=(0, 0.25, 0.5, 0.75, 1))[source]#

Perform a stratified train/valid/test partition.

Parameters:
  • idxs (Sequence[int]) – Indices of points to split

  • stratification_col (np.typing.ArrayLike) – Data used for stratification. If it is categorical (see mofdscribe.splitters.utils.is_categorical()) then we directly use it for stratification. Otherwise, we use quantile binning.

  • train_size (float) – Size of the training set as fraction.

  • valid_size (float) – Size of the validation set as fraction.

  • test_size (float) – Size of the test set as fraction.

  • shuffle (bool) – If True, perform a shuffled split. Defaults to True.

  • random_state (Union[int, np.random.RandomState], optional) – Random state for the shuffler. Defaults to None.

  • q (Sequence[float], optional) – List of quantiles used for quantile binning. Defaults to (0, 0.25, 0.5, 0.75, 1).

Returns:

Train, validation, test indices.

Return type:

Tuple[np.array, np.array, np.array]
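For example (a sketch on random data):

    import numpy as np

    from mofdscribe.splitters.utils import stratified_train_test_partition

    idxs = np.arange(100)
    values = np.random.default_rng(0).normal(size=100)  # continuous -> quantile binning

    train, valid, test = stratified_train_test_partition(
        idxs, values, train_size=0.7, valid_size=0.1, test_size=0.2, random_state=42
    )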

grouped_stratified_train_test_partition(stratification_col, group_col, train_size, valid_size, test_size, shuffle=True, random_state=None, q=(0, 0.25, 0.5, 0.75, 1), center=<function median>)[source]#

Return a grouped stratified train/valid/test partition.

First, we compute the most common stratification category / centrality measure of the stratification column for every group. Then, we perform a stratified train/test partition on the groups. We then “expand” by concatenating the indices belonging to each group.

Warning

Note that this won’t work well if the number of groups and datapoints is small. It will also cause issues if the number of datapoints in the groups is very imbalanced.

Parameters:
  • stratification_col (np.typing.ArrayLike) – Data used for stratification. If it is categorical (see mofdscribe.splitters.utils.is_categorical()) then we directly use it for stratification. Otherwise, we use quantile binning.

  • group_col (np.typing.ArrayLike) – Data used for grouping.

  • train_size (float) – Size of the training set as fraction.

  • valid_size (float) – Size of the validation set as fraction.

  • test_size (float) – Size of the test set as fraction.

  • shuffle (bool) – If True, perform a shuffled split. Defaults to True.

  • random_state (Union[int, np.random.RandomState], optional) – Random state for the shuffler. Defaults to None.

  • q (Sequence[float], optional) – List of quantiles used for quantile binning. Defaults to (0, 0.25, 0.5, 0.75, 1).

  • center (Callable) – Aggregation function to compute a measure of centrality of all the points in a group such that this can then be used for stratification. This is only used for continuous inputs. For categorical inputs, we always use the mode.

Returns:

Train, validation, test indices.

Return type:

Tuple[np.array, np.array, np.array]

get_train_valid_test_sizes(size, train_size, valid_size, test_size)[source]#

Compute the number of points in every split.

Return type:

Tuple[int, int, int]

grouped_train_valid_test_partition(groups, train_size, valid_size, test_size, shuffle=True, random_state=None)[source]#

Perform a grouped train/valid/test partition without stratification.

Parameters:
  • groups (np.typing.ArrayLike) – Data used for grouping.

  • train_size (float) – Size of the training set as fraction.

  • valid_size (float) – Size of the validation set as fraction.

  • test_size (float) – Size of the test set as fraction.

  • shuffle (bool) – If True, perform a shuffled split. Defaults to True.

  • random_state (Union[int, np.random.RandomState], optional) – Random state for the shuffler. Defaults to None.

Returns:

Train, validation, test indices.

Return type:

Tuple[np.array, Optional[np.array], np.array]

quantile_binning(values, q)[source]#

Use pandas.qcut() to bin the values based on quantiles.

Return type:

array
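For example (a sketch):

    import numpy as np

    from mofdscribe.splitters.utils import quantile_binning

    values = np.random.default_rng(0).normal(size=100)
    # Bins the values into four quantile-based categories.
    bins = quantile_binning(values, q=(0, 0.25, 0.5, 0.75, 1))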

check_fraction(train_fraction, valid_fraction, test_fraction)[source]#

Check that the fractions are all between 0 and 1 and that they sum up to 1.

Return type:

None

no_group_warn(groups)[source]#

Raise a warning if groups is None.

Return type:

None