Splitters
Classes that help perform cross-validation.
Our splitters attempt to reduce the potential for data leakage by grouping by default, and by prioritizing grouping over stratification or exactly matching the requested train/test ratio.
See also the sklearn docs.
Warning
Due to the grouping operations, the train/test ratios the methods produce will generally not exactly match the ones you requested. For this reason, check the lengths of the train/valid/test indices the methods return.
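To see why the realized ratio can deviate from the requested one, consider the following minimal sketch of a grouped split. It is not the mofdscribe implementation: the function name grouped_train_test_split, the greedy assignment rule, and the group labels are assumptions made purely for illustration.

```python
import random
from collections import defaultdict


def grouped_train_test_split(groups, frac_train=0.7, seed=0):
    """Assign whole groups to the train set as long as they fit the target size."""
    # Collect the indices that belong to each group label.
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)

    # Shuffle the group labels reproducibly.
    labels = sorted(members)
    random.Random(seed).shuffle(labels)

    target = frac_train * len(groups)
    train, test = [], []
    for label in labels:
        # A group is never divided between the two sets.
        if len(train) + len(members[label]) <= target:
            train.extend(members[label])
        else:
            test.extend(members[label])
    return train, test


# Hypothetical group labels: three groups of different sizes.
groups = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
train_idx, test_idx = grouped_train_test_split(groups, frac_train=0.7)

# The realized ratio depends on which groups end up in the train set.
print(len(train_idx) / len(groups))
```

Because groups are kept intact, the only reachable train sizes are sums of group sizes, which is why the lengths of the returned index collections should always be inspected.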
- class DensitySplitter(ds, density_q=None, shuffle=True, random_state=None, sample_frac=1.0, stratification_col=None, center=<function median>, q=(0, 0.25, 0.5, 0.75, 1), sort_by_len=True)
Splitter that uses the density of the structures to split the data.
For this, we sort structures according to their density and then group them based on the density. You can modify the number of groups using the density_q parameter; those values indicate the quantiles that we use for the grouping. This ensures that the validation is quite stringent, as the different folds will have different densities.
The motivations for doing this are:
- Density is often one of the most important descriptors for gas uptake properties.
- There is often a very large difference in density distribution between hypothetical and experimental databases.
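The density-based grouping can be illustrated with a small rank-based sketch. It only mimics the idea of quantile binning; it is not the library code and does not reproduce the edge handling of pandas.qcut(). The helper name quantile_groups and the density values are hypothetical.

```python
from bisect import bisect_right


def quantile_groups(values, quantiles=(0, 0.5, 1)):
    """Return one integer group label per value, using rank-based quantile bins."""
    n = len(values)
    # Indices of the values, sorted from lowest to highest value.
    order = sorted(range(n), key=lambda i: values[i])
    # The interior quantiles translate into rank cut-offs.
    cuts = [q * n for q in quantiles[1:-1]]
    labels = [0] * n
    for rank, idx in enumerate(order):
        labels[idx] = bisect_right(cuts, rank)
    return labels


# Hypothetical densities; two bins as one would use for a train/test split.
densities = [0.4, 1.8, 0.9, 2.5, 0.6, 1.2]
groups = quantile_groups(densities, quantiles=(0, 0.5, 1))
print(groups)  # [0, 1, 0, 1, 0, 1]
```

All low-density structures end up in group 0 and all high-density structures in group 1, so a split along these groups tests extrapolation in density.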
Initialize the DensitySplitter class.
- Parameters:
ds (AbstractStructureDataset) – A structure dataset. The BaseSplitter only requires the length magic method to be implemented. However, other splitters might require additional methods.
density_q (Collection[float], optional) – List of quantiles used for quantile binning of the density. Defaults to None. If None, we use two bins for a train/test split, three for a train/valid/test split, and k for k-fold.
shuffle (bool) – If True, perform a shuffled split. Defaults to True.
random_state (Union[int, np.random.RandomState], optional) – Random state for the shuffling. Defaults to None.
sample_frac (float, optional) – This can be used for downsampling. It randomly selects a subset of all indices before splitting. For instance, sample_frac=0.8 will randomly select 80% of the indices before splitting. Defaults to 1.0.
stratification_col (Union[str, np.typing.ArrayLike], optional) – Data used for stratification. If it is categorical (see mofdscribe.splitters.utils.is_categorical()), we use it directly for stratification. Otherwise, we use quantile binning. Defaults to None.
center (callable) – Aggregation function to compute a measure of centrality of all the points in a group such that this can then be used for stratification. This is only used for continuous inputs. For categorical inputs, we always use the mode. Defaults to np.median.
q (Collection[float], optional) – List of quantiles used for quantile binning. Defaults to (0, 0.25, 0.5, 0.75, 1).
sort_by_len (bool) – If True, sort the splits by length. (Applies to the train/test/valid and train/test splits). Defaults to True.
- class HashSplitter(ds, hash_type='undecorated_scaffold_hash', shuffle=True, random_state=None, sample_frac=1.0, stratification_col=None, center=<function median>, q=(0, 0.25, 0.5, 0.75, 1), sort_by_len=True)
Splitter that uses Weisfeiler-Lehman graph hashes [WL] to split the data in more stringent ways.
Note that the hashes we use do not allow for a meaningful measure of similarity. That is, there is no way to measure the distance between two hash strings. The only meaningful comparison is whether they are identical or not.
Note
Weisfeiler-Lehman graph hashes do not guarantee graph isomorphism. That is, there might be identical hashes that do not correspond to isomorphic graphs.
Note
There are certain graphs that a Weisfeiler-Lehman test cannot distinguish [Bouritsas].
Note
We speak about Weisfeiler-Lehman hashes as they are the defaults for the mofdscribe datasets. However, you can also overwrite this method with a custom hashing function.
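Since hashes can only be compared for equality, grouping reduces to collecting the indices that share the same hash string. The following sketch illustrates this with made-up hash strings; it does not call any mofdscribe or graph-hashing code.

```python
from collections import defaultdict

# Hypothetical hash strings, one per structure in the dataset.
hashes = ["3f2a", "9bc1", "3f2a", "77d0", "9bc1"]

groups = defaultdict(list)
for idx, graph_hash in enumerate(hashes):
    # Equality is the only comparison that carries meaning for hashes.
    groups[graph_hash].append(idx)

group_members = list(groups.values())
print(group_members)  # [[0, 2], [1, 4], [3]]
```

A grouped split then keeps each of these index lists entirely within one fold, so structures with identical hashes never appear on both sides of a split.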
Initialize a HashSplitter.
- Parameters:
ds (AbstractStructureDataset) – A structure dataset. The BaseSplitter only requires the length magic method to be implemented. However, other splitters might require additional methods.
hash_type (str) – Hash type to use. Must be one of undecorated_scaffold_hash, decorated_graph_hash, decorated_scaffold_hash, or undecorated_graph_hash. Defaults to “undecorated_scaffold_hash”.
shuffle (bool) – If True, perform a shuffled split. Defaults to True.
random_state (Union[int, np.random.RandomState], optional) – Random state for the shuffling. Defaults to None.
sample_frac (float, optional) – This can be used for downsampling. It randomly selects a subset of all indices before splitting. For instance, sample_frac=0.8 will randomly select 80% of the indices before splitting. Defaults to 1.0.
stratification_col (Union[str, np.typing.ArrayLike], optional) – Data used for stratification. If it is categorical (see mofdscribe.splitters.utils.is_categorical()), we use it directly for stratification. Otherwise, we use quantile binning. Defaults to None.
center (callable, optional) – Aggregation function to compute a measure of centrality of all the points in a group such that this can then be used for stratification. This is only used for continuous inputs. For categorical inputs, we always use the mode. Defaults to np.median.
q (Collection[float], optional) – List of quantiles used for quantile binning. Defaults to (0, 0.25, 0.5, 0.75, 1).
sort_by_len (bool) – If True, sort the splits by length. (Applies to the train/test/valid and train/test splits). Defaults to True.
- class TimeSplitter(ds, year_q=None, shuffle=True, random_state=None, sample_frac=1.0, stratification_col=None, center=<function median>, q=(0, 0.25, 0.5, 0.75, 1), sort_by_len=True)
This splitter sorts structures according to their publication date.
That is, the training set will contain structures that are “older” (have been discovered earlier) than the ones in the test set. This can mimic real-life model development conditions [MoleculeNet].
It has, for instance, also been used with ICSD data in [Palizhati] and has been the focus of [Sheridan].
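The idea of a time-ordered split can be sketched in a few lines. This is a simplified illustration and not the TimeSplitter implementation (which bins the years into quantile groups): the function name time_ordered_split and the publication years are hypothetical, and the sketch assumes there are no tied years at the cut-off.

```python
def time_ordered_split(years, frac_train=0.7):
    """Put the oldest entries into the train set and the newest into the test set."""
    # Indices sorted from the earliest to the latest year.
    order = sorted(range(len(years)), key=lambda i: years[i])
    n_train = round(frac_train * len(years))
    return order[:n_train], order[n_train:]


# Hypothetical publication years, one per structure.
years = [2015, 2003, 2020, 2011, 2018, 1999, 2008, 2021, 2013, 2006]
train_idx, test_idx = time_ordered_split(years, frac_train=0.7)

newest_train = max(years[i] for i in train_idx)
oldest_test = min(years[i] for i in test_idx)
print(newest_train, oldest_test)  # 2015 2018
```

Every structure in the test set is newer than every structure in the train set, which is the property that makes this kind of split resemble prospective model use.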
Initialize the TimeSplitter class.
- Parameters:
ds (AbstractStructureDataset) – A structure dataset. The BaseSplitter only requires the length magic method to be implemented. However, other splitters might require additional methods.
year_q (Collection[float]) – List of quantiles used for quantile binning of the years. Defaults to None. If None, we use two bins for a train/test split, three for a train/valid/test split, and k for k-fold.
shuffle (bool) – If True, perform a shuffled split. Defaults to True.
random_state (Union[int, np.random.RandomState], optional) – Random state for the shuffling. Defaults to None.
sample_frac (float, optional) – This can be used for downsampling. It randomly selects a subset of all indices before splitting. For instance, sample_frac=0.8 will randomly select 80% of the indices before splitting. Defaults to 1.0.
stratification_col (Union[str, np.typing.ArrayLike], optional) – Data used for stratification. If it is categorical (see mofdscribe.splitters.utils.is_categorical()), we use it directly for stratification. Otherwise, we use quantile binning. Defaults to None.
center (callable) – Aggregation function to compute a measure of centrality of all the points in a group such that this can then be used for stratification. This is only used for continuous inputs. For categorical inputs, we always use the mode. Defaults to np.median.
q (Collection[float], optional) – List of quantiles used for quantile binning. Defaults to (0, 0.25, 0.5, 0.75, 1).
sort_by_len (bool) – If True, sort the splits by length. (Applies to the train/test/valid and train/test splits). Defaults to True.
- class BaseSplitter(ds, shuffle=True, random_state=None, sample_frac=1.0, stratification_col=None, center=<function median>, q=(0, 0.25, 0.5, 0.75, 1), sort_by_len=True)
A BaseSplitter implements the basic logic for dataset partitioning as well as k-fold cross-validation.
Classes that inherit from this class typically implement the following methods:
- _get_stratification_col: Should return an ArrayLike object of floats, categories, or ints. If the data is not categorical, the BaseSplitter will handle the discretization.
- _get_groups: Should return an ArrayLike object of categories (integers or strings).
Internally, the BaseSplitter uses those to group and/or stratify the splits.
Initialize a BaseSplitter.
- Parameters:
ds (AbstractStructureDataset) – A structure dataset. The BaseSplitter only requires the length magic method to be implemented. However, other splitters might require additional methods.
shuffle (bool) – If True, perform a shuffled split. Defaults to True.
random_state (Optional[Union[int, np.random.RandomState]], optional) – Random state for the shuffling. Defaults to None.
sample_frac (Optional[float], optional) – This can be used for downsampling. It randomly selects a subset of all indices before splitting. For instance, sample_frac=0.8 will randomly select 80% of the indices before splitting. Defaults to 1.0.
stratification_col (Union[str, np.typing.ArrayLike], optional) – Data used for stratification. If it is categorical (see mofdscribe.splitters.utils.is_categorical()), we use it directly for stratification. Otherwise, we use quantile binning. Defaults to None.
center (callable) – Aggregation function to compute a measure of centrality of all the points in a group such that this can then be used for stratification. This is only used for continuous inputs. For categorical inputs, we always use the mode. Defaults to np.median.
q (Collection[float], optional) – List of quantiles used for quantile binning. Defaults to (0, 0.25, 0.5, 0.75, 1).
sort_by_len (bool) – If True, sort the splits by length. (Applies to the train/test/valid and train/test splits). Defaults to True.
- train_test_split(frac_train=0.7)
Perform a train/test partition.
- Parameters:
frac_train (float) – Fraction of the data to use for the training set. Defaults to 0.7.
- Returns:
Train indices, test indices
- Return type:
Tuple[Collection[int], Collection[int]]
- train_valid_test_split(frac_train=0.7, frac_valid=0.1)
Perform a train/valid/test partition.
- Parameters:
frac_train (float) – Fraction of data to use for the training set. Defaults to 0.7.
frac_valid (float) – Fraction of data to use for the validation set. Defaults to 0.1.
- Returns:
Training, validation, test set.
- Return type:
Tuple[Collection[int], Collection[int], Collection[int]]
- class KennardStoneSplitter(ds, feature_names, shuffle=True, random_state=None, sample_frac=1.0, scale=True, centrality_measure='mean', metric='euclidean', ascending=False)
Run the Kennard-Stone sampling algorithm [KennardStone].
The algorithm selects samples with uniform coverage. The initial samples are biased towards the boundaries of the dataset. Hence, it might be biased by outliers.
This algorithm ensures a flat coverage of the dataset. It is also known as the CADEX algorithm and was later refined in the DUPLEX algorithm [Snee].
Warning
This splitter can be slow for large datasets as it requires us to compute distance matrices N times for a dataset with N structures.
Warning
Stratification is not supported for this splitter.
Warning
I couldn’t find a good reference for the k-fold version of this algorithm.
Construct a KennardStoneSplitter.
- Parameters:
ds (AbstractStructureDataset) – A structure dataset. The BaseSplitter only requires the length magic method to be implemented. However, other splitters might require additional methods.
feature_names (List[str]) – Names of features to consider.
shuffle (bool) – If True, perform a shuffled split. Defaults to True.
random_state (Union[int, np.random.RandomState], optional) – Random state for the shuffling. Defaults to None.
sample_frac (float, optional) – This can be used for downsampling. It randomly selects a subset of all indices before splitting. For instance, sample_frac=0.8 will randomly select 80% of the indices before splitting. Defaults to 1.0.
scale (bool) – If True, apply z-score normalization prior to running the sampling. Defaults to True.
centrality_measure (str) – The first sample is selected to be maximally distant from this value. It can be one of “mean”, “median”, “random”. In the case of “random”, we simply select a random point. In the case of “mean” and “median”, the initial point is maximally distant from the mean and median of the feature matrix, respectively. Defaults to “mean”.
metric (Union[Callable, str]) – The distance metric to use. If a string, the distance function can be ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulsinski’, ‘kulczynski1’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’. Defaults to “euclidean”.
ascending (bool) – If True, sort samples in ascending distance to the center. That is, the first samples (maximally distant to center) would be sampled last. Defaults to False.
- get_sorted_indices(ds)
Return a list of indices, sorted by similarity using the Kennard-Stone algorithm.
The first sample will be maximally distant from the center.
- Parameters:
ds (AbstractStructureDataset) – A mofdscribe AbstractStructureDataset
- Returns:
Sorted indices.
- Return type:
Collection[int]
- train_test_split(frac_train=0.7)
Perform a train/test partition.
- Parameters:
frac_train (float) – Fraction of the data to use for the training set. Defaults to 0.7.
- Returns:
Train indices, test indices
- Return type:
Tuple[Collection[int], Collection[int]]
- train_valid_test_split(frac_train=0.7, frac_valid=0.1)
Perform a train/valid/test partition.
- Parameters:
frac_train (float) – Fraction of data to use for the training set. Defaults to 0.7.
frac_valid (float) – Fraction of data to use for the validation set. Defaults to 0.1.
- Returns:
Training, validation, test set.
- Return type:
Tuple[Collection[int], Collection[int], Collection[int]]
- class ClusterSplitter(ds, feature_names, shuffle=True, random_state=None, sample_frac=1.0, stratification_col=None, center=<function median>, q=(0, 0.25, 0.5, 0.75, 1), sort_by_len=False, scaled=True, n_pca_components='mle', n_clusters=4, pca_kwargs=None, kmeans_kwargs=None)
Split the data into clusters and use the clusters as groups.
The approach has been proposed on Kaggle. In principle, we perform the following steps:
Scale the data (optional).
Perform PCA for de-correlation.
Perform k-means clustering.
Construct a ClusterSplitter.
- Parameters:
ds (AbstractStructureDataset) – A structure dataset. The BaseSplitter only requires the length magic method to be implemented. However, other splitters might require additional methods.
feature_names (List[str]) – Names of features to consider.
shuffle (bool) – If True, perform a shuffled split. Defaults to True.
random_state (int) – Random state for the shuffling. Defaults to None.
sample_frac (float, optional) – This can be used for downsampling. It randomly selects a subset of all indices before splitting. For instance, sample_frac=0.8 will randomly select 80% of the indices before splitting. Defaults to 1.0.
stratification_col (Union[str, np.typing.ArrayLike], optional) – Data used for stratification. If it is categorical (see mofdscribe.splitters.utils.is_categorical()), we use it directly for stratification. Otherwise, we use quantile binning. Defaults to None.
center (callable) – Aggregation function to compute a measure of centrality of all the points in a group such that this can then be used for stratification. This is only used for continuous inputs. For categorical inputs, we always use the mode. Defaults to np.median.
q (Collection[float], optional) – List of quantiles used for quantile binning. Defaults to (0, 0.25, 0.5, 0.75, 1).
sort_by_len (bool) – If True, sort the splits by length. (Applies to the train/test/valid and train/test splits). Defaults to False.
scaled (bool) – If True, scale the data before clustering. Defaults to True.
n_pca_components (Union[int, str]) – Number of components to use for PCA. If “mle”, the number of components is estimated automatically (Minka’s MLE, as implemented in sklearn’s PCA). Defaults to “mle”.
n_clusters (int) – Number of clusters to use. Defaults to 4.
pca_kwargs (Dict[str, Any]) – Keyword arguments to pass to PCA. Defaults to None.
kmeans_kwargs (Dict[str, Any]) – Keyword arguments to pass to k-means. Defaults to None.
- class LOCOCV(ds, feature_names, shuffle=True, random_state=None, sample_frac=1.0, scaled=True, n_pca_components='mle', pca_kwargs=None, kmeans_kwargs=None)
Leave-one-cluster-out cross-validation.
The general idea has been discussed before, e.g., in [Kramer]. Perhaps more widely used in the materials community is [Meredig]. Here, we perform PCA, followed by k-means clustering with
- k = 2 for a train/test split,
- k = 3 for a train/valid/test split,
- k equal to the number of folds for k-fold cross-validation.
By default, we will sort outputs such that the cluster sizes are train >= test >= valid.
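The size ordering can be illustrated with a sketch that starts from precomputed cluster labels (the PCA and k-means steps are omitted). The function name split_by_cluster and the labels are hypothetical, and the code is only an illustration of the ordering train >= test >= valid, not the LOCOCV implementation.

```python
from collections import Counter


def split_by_cluster(labels):
    """Map three clusters to train/valid/test so that train >= test >= valid in size."""
    sizes = Counter(labels)
    if len(sizes) != 3:
        raise ValueError("expected exactly three clusters")
    # Cluster labels ordered from the largest to the smallest cluster.
    ordered = [label for label, _ in sizes.most_common()]
    train, test, valid = (
        [i for i, lab in enumerate(labels) if lab == cluster] for cluster in ordered
    )
    return train, valid, test


# Hypothetical cluster labels, e.g. as produced by k-means with k = 3.
cluster_labels = [0, 1, 1, 2, 1, 0, 1, 2, 0]
train_idx, valid_idx, test_idx = split_by_cluster(cluster_labels)
print(train_idx, valid_idx, test_idx)  # [1, 2, 4, 6] [3, 7] [0, 5, 8]
```

Each split consists of exactly one cluster, so the split sizes are dictated by the clustering rather than by a requested fraction.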
Construct a LOCOCV.
- Parameters:
ds (AbstractStructureDataset) – A structure dataset. The BaseSplitter only requires the length magic method to be implemented. However, other splitters might require additional methods.
feature_names (List[str]) – Names of features to consider.
shuffle (bool) – If True, perform a shuffled split. Defaults to True.
random_state (int) – Random state for the shuffling. Defaults to None.
sample_frac (float, optional) – This can be used for downsampling. It randomly selects a subset of all indices before splitting. For instance, sample_frac=0.8 will randomly select 80% of the indices before splitting. Defaults to 1.0.
scaled (bool) – If True, scale the data before clustering. Defaults to True.
n_pca_components (Union[int, str]) – Number of components to use for PCA. If “mle”, the number of components is estimated automatically (Minka’s MLE, as implemented in sklearn’s PCA). Defaults to “mle”.
pca_kwargs (Dict[str, Any], optional) – Additional keyword arguments for sklearn’s sklearn.decomposition.PCA. Defaults to None.
kmeans_kwargs (Dict[str, Any], optional) – Additional keyword arguments for sklearn’s sklearn.cluster.KMeans. Defaults to None.
- train_test_split()
Perform a train/test partition.
- Returns:
Train indices, test indices
- Return type:
Tuple[Collection[int], Collection[int]]
- class ClusterStratifiedSplitter(ds, feature_names, shuffle=True, random_state=None, sample_frac=1.0, scaled=True, n_pca_components='mle', n_clusters=4, pca_kwargs=None, kmeans_kwargs=None)
Split the data into clusters and stratify on those clusters.
The approach has been proposed on Kaggle. In principle, we perform the following steps:
Scale the data (optional).
Perform PCA for de-correlation.
Perform k-means clustering.
Construct a ClusterStratifiedSplitter.
- Parameters:
ds (AbstractStructureDataset) – A structure dataset. The BaseSplitter only requires the length magic method to be implemented. However, other splitters might require additional methods.
feature_names (List[str]) – Names of features to consider.
shuffle (bool) – If True, perform a shuffled split. Defaults to True.
random_state (int) – Random state for the shuffling. Defaults to None.
sample_frac (float, optional) – This can be used for downsampling. It randomly selects a subset of all indices before splitting. For instance, sample_frac=0.8 will randomly select 80% of the indices before splitting. Defaults to 1.0.
scaled (bool) – If True, scale the data before clustering. Defaults to True.
n_pca_components (Union[int, str]) – Number of components to use for PCA. If “mle”, the number of components is estimated automatically (Minka’s MLE, as implemented in sklearn’s PCA). Defaults to “mle”.
n_clusters (int) – Number of clusters to use. Defaults to 4.
pca_kwargs (Dict[str, Any]) – Keyword arguments to pass to PCA. Defaults to None.
kmeans_kwargs (Dict[str, Any]) – Keyword arguments to pass to k-means. Defaults to None.
Helper functions for the splitters.
Some of these methods might also be useful for constructing nested cross-validation loops.
- kennard_stone_sampling(X, scale=True, centrality_measure='mean', metric='euclidean')
Run the Kennard-Stone sampling algorithm [KennardStone].
The algorithm selects samples with uniform coverage. The initial samples are biased towards the boundaries of the dataset.
Note
You might also find this algorithm useful for creating a “diverse” sample of points, e.g., to initialize an active learning loop.
Warning
This algorithm has a high computational complexity. It is not recommended for large datasets.
- Parameters:
X (ArrayLike) – Input feature matrix.
scale (bool) – If True, apply z-score normalization prior to running the sampling. Defaults to True.
centrality_measure (str) – The first sample is selected to be maximally distant from this value. It can be one of “mean”, “median”, “random”. In the case of “random”, we simply select a random point. In the case of “mean” and “median”, the initial point is maximally distant from the mean and median of the feature matrix, respectively. Defaults to “mean”.
metric (Union[Callable, str]) – The distance metric to use. If a string, the distance function can be ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulsinski’, ‘kulczynski1’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’. See scipy.spatial.distance.cdist(). Defaults to “euclidean”.
- Raises:
ValueError – If non-implemented centrality measure is used.
- Returns:
Indices sorted by their max-min distance.
- Return type:
List[int]
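The max-min selection described for kennard_stone_sampling can be sketched in pure Python as follows. This is an illustrative sketch under simplifying assumptions (Euclidean distance, the feature-wise mean as center, no z-score scaling), not the library code; the function name kennard_stone_order and the points are hypothetical.

```python
import math


def kennard_stone_order(points):
    """Return indices ordered by a max-min (Kennard-Stone style) selection."""
    n = len(points)
    dim = len(points[0])
    # Center of the data: the feature-wise mean.
    center = [sum(p[d] for p in points) / n for d in range(dim)]

    # Start with the point that is farthest from the center.
    first = max(range(n), key=lambda i: math.dist(points[i], center))
    selected = [first]
    # Distance of every point to its nearest already selected point.
    min_dist = [math.dist(p, points[first]) for p in points]

    while len(selected) < n:
        remaining = [i for i in range(n) if i not in selected]
        # Pick the point whose nearest selected neighbor is farthest away.
        nxt = max(remaining, key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            min_dist[i] = min(min_dist[i], math.dist(points[i], points[nxt]))
    return selected


# Hypothetical 2D feature matrix with one clear outlier at (5, 5).
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (0.5, 0.5)]
order = kennard_stone_order(X)
print(order)  # [3, 0, 1, 2, 4]
```

The outlier is selected first and the point closest to the bulk of the data last, which illustrates why the early samples are biased towards the boundaries of the dataset.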
- pca_kmeans(X, scaled, n_pca_components, n_clusters, random_state=None, pca_kwargs=None, kmeans_kwargs=None)
Run principal component analysis (PCA) followed by k-means clustering on the data.
Uses sklearn’s implementations of PCA and k-means.
- Parameters:
X (np.ndarray) – Input data
scaled (bool) – If True, use standard scaling for clustering
n_pca_components (Union[int, str]) – number of principal components to keep
n_clusters (int) – number of clusters
random_state (Optional[Union[int, np.random.RandomState]], optional) – Random state for sklearn. Defaults to None.
pca_kwargs (Dict[str, Any], optional) – Additional keyword arguments for sklearn’s sklearn.decomposition.PCA. Defaults to None.
kmeans_kwargs (Dict[str, Any], optional) – Additional keyword arguments for sklearn’s sklearn.cluster.KMeans. Defaults to None.
- Returns:
Cluster indices.
- Return type:
np.ndarray
- is_categorical(x)
Return True if x is categorical or composed of integers.
- Return type:
bool
- stratified_train_test_partition(idxs, stratification_col, train_size, valid_size, test_size, shuffle=True, random_state=None, q=(0, 0.25, 0.5, 0.75, 1))
Perform a stratified train/test split.
See also
mofdscribe.splitters.utils.grouped_stratified_train_test_partition(): performs a grouped stratified train/test split
mofdscribe.splitters.utils.grouped_train_valid_test_partition(): performs a grouped un-stratified train/test split
- Parameters:
idxs (Sequence[int]) – Indices of points to split
stratification_col (np.typing.ArrayLike) – Data used for stratification. If it is categorical (see mofdscribe.splitters.utils.is_categorical()), we use it directly for stratification. Otherwise, we use quantile binning.
train_size (float) – Size of the training set as fraction.
valid_size (float) – Size of the validation set as fraction.
test_size (float) – Size of the test set as fraction.
shuffle (bool) – If True, perform a shuffled split. Defaults to True.
random_state (Union[int, np.random.RandomState], optional) – Random state for the shuffler. Defaults to None.
q (Sequence[float], optional) – List of quantiles used for quantile binning. Defaults to (0, 0.25, 0.5, 0.75, 1).
- Returns:
Train, validation, test indices.
- Return type:
Tuple[np.array, np.array, np.array]
- grouped_stratified_train_test_partition(stratification_col, group_col, train_size, valid_size, test_size, shuffle=True, random_state=None, q=(0, 0.25, 0.5, 0.75, 1), center=<function median>)
Return grouped stratified train-test partition.
First, we compute the most common stratification category / centrality measure of the stratification column for every group. Then, we perform a stratified train/test partition on the groups. We then “expand” by concatenating the indices belonging to each group.
Warning
Note that this won’t work well if the number of groups and datapoints is small. It will also cause issues if the number of datapoints in the groups is very imbalanced.
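The first and last steps of this procedure (one centrality value per group, then expanding selected groups back to point indices) can be sketched as follows. The stratified split of the groups themselves is left out. The helper names group_centers and expand_groups, as well as the data, are hypothetical and not part of mofdscribe.

```python
from collections import defaultdict
from statistics import median


def group_centers(values, groups, center=median):
    """Compute one centrality value of the stratification data per group."""
    per_group = defaultdict(list)
    for value, group in zip(values, groups):
        per_group[group].append(value)
    return {group: center(vals) for group, vals in per_group.items()}


def expand_groups(selected_groups, groups):
    """Return the indices of all points that belong to the selected groups."""
    chosen = set(selected_groups)
    return [i for i, group in enumerate(groups) if group in chosen]


# Hypothetical continuous stratification values and group labels.
values = [1.0, 1.5, 3.0, 4.0, 0.5, 5.0]
groups = ["a", "a", "b", "b", "a", "c"]

centers = group_centers(values, groups)
print(centers)  # {'a': 1.0, 'b': 3.5, 'c': 5.0}

# Suppose a stratified split over the groups picked "a" and "c" for training.
train_idx = expand_groups(["a", "c"], groups)
print(train_idx)  # [0, 1, 4, 5]
```

With only three groups, as here, there is little room to balance the strata, which illustrates why the approach needs a reasonably large number of groups of similar size.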
See also
mofdscribe.splitters.utils.stratified_train_test_partition(): performs an un-grouped stratified train/test split
mofdscribe.splitters.utils.grouped_train_valid_test_partition(): performs a grouped un-stratified train/test split
- Parameters:
stratification_col (np.typing.ArrayLike) – Data used for stratification. If it is categorical (see mofdscribe.splitters.utils.is_categorical()), we use it directly for stratification. Otherwise, we use quantile binning.
group_col (np.typing.ArrayLike) – Data used for grouping.
train_size (float) – Size of the training set as fraction.
valid_size (float) – Size of the validation set as fraction.
test_size (float) – Size of the test set as fraction.
shuffle (bool) – If True, perform a shuffled split. Defaults to True.
random_state (Union[int, np.random.RandomState], optional) – Random state for the shuffler. Defaults to None.
q (Sequence[float], optional) – List of quantiles used for quantile binning. Defaults to (0, 0.25, 0.5, 0.75, 1).
center (Callable) – Aggregation function to compute a measure of centrality of all the points in a group such that this can then be used for stratification. This is only used for continuous inputs. For categorical inputs, we always use the mode.
- Returns:
Train, validation, test indices.
- Return type:
Tuple[np.array, np.array, np.array]
- get_train_valid_test_sizes(size, train_size, valid_size, test_size)
Compute the number of points in every split.
- Return type:
Tuple[int, int, int]
- grouped_train_valid_test_partition(groups, train_size, valid_size, test_size, shuffle=True, random_state=None)
Perform a grouped train/test split without stratification.
See also
mofdscribe.splitters.utils.stratified_train_test_partition(): performs an un-grouped stratified train/test split
mofdscribe.splitters.utils.grouped_stratified_train_test_partition(): performs a grouped stratified train/test split
- Parameters:
groups (np.typing.ArrayLike) – Data used for grouping.
train_size (float) – Size of the training set as fraction.
valid_size (float) – Size of the validation set as fraction.
test_size (float) – Size of the test set as fraction.
shuffle (bool) – If True, perform a shuffled split. Defaults to True.
random_state (Union[int, np.random.RandomState], optional) – Random state for the shuffler. Defaults to None.
- Returns:
Train, validation, test indices.
- Return type:
Tuple[np.array, Optional[np.array], np.array]
- quantile_binning(values, q)
Use pandas.qcut() to bin the values based on quantiles.
- Return type:
array