# Difference between revisions of "Stratified K-Fold Cross-Validation Algorithm"


## Latest revision as of 00:40, 15 February 2020

A Stratified K-Fold Cross-Validation Algorithm is a K-Fold Cross Validation Algorithm in which the class distribution remains approximately the same across all folds.

- **AKA:** Stratified Cross Validation Algorithm.
- **Context:**
  - It can be implemented by a Stratified K-Fold Cross-Validation System to solve a Stratified K-Fold Cross-Validation Task.
- **Example(s):**
  - `sklearn.model_selection.StratifiedKFold` (SciKit-Learn, 2020a):
    - Example 1: Test with permutations the significance of a classification score,
    - Example 2: Recursive feature elimination with cross-validation,
    - Example 3: GMM covariances,
    - Example 4: Receiver Operating Characteristic (ROC) with cross validation,
    - Example 5: Visualizing cross-validation behavior in scikit-learn,
  - `sklearn.model_selection.RepeatedStratifiedKFold` (SciKit-Learn, 2020b).
- **Counter-Example(s):**
  - a 10-fold Cross-Validation Algorithm.
- **See:** Cross-Validation Task, Exhaustive Cross-Validation Task, Non-exhaustive Cross-validation Task, Nested Cross-validation Task.
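The definition above can be illustrated with a minimal from-scratch sketch. It assumes a simple strategy (shuffle the indices of each class, then deal them round-robin over the folds); the helper name `stratified_k_fold` and the toy labels are hypothetical, and this is not the scikit-learn implementation.

```python
import random
from collections import Counter, defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Assign every sample index to one of k folds, one class at a time.

    Hypothetical helper for illustration; not taken from any library.
    """
    rng = random.Random(seed)

    # Group the sample indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal this class round-robin over the folds, so the per-class
        # count differs by at most one between any two folds.
        # Simplification: dealing always starts at fold 0, so leftover
        # samples of every class land in the earliest folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Toy imbalanced label vector (assumed for illustration): 20% "pos", 80% "neg".
labels = ["pos"] * 10 + ["neg"] * 40
folds = stratified_k_fold(labels, k=5)
fold_class_counts = [Counter(labels[i] for i in fold) for fold in folds]
for number, counts in enumerate(fold_class_counts):
    print(number, dict(counts))
```

With these toy labels every fold holds 2 "pos" and 8 "neg" samples, which is the same 20/80 ratio as the full label vector.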

## References

### 2020a

- (Wikipedia, 2020) ⇒ https://www.wikiwand.com/en/Cross-validation_(statistics)#/Non-exhaustive_cross-validation Retrieved:2020-2-14.
- In *k*-fold cross-validation, the original sample is randomly partitioned into *k* equal sized subsamples. Of the *k* subsamples, a single subsample is retained as the validation data for testing the model, and the remaining *k* − 1 subsamples are used as training data. The cross-validation process is then repeated *k* times, with each of the *k* subsamples used exactly once as the validation data. The *k* results can then be averaged to produce a single estimation. The advantage of this method over repeated random sub-sampling (see below) is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used,[1] but in general *k* remains an unfixed parameter.

  For example, setting *k* = 2 results in 2-fold cross-validation. In 2-fold cross-validation, we randomly shuffle the dataset into two sets $d_0$ and $d_1$, so that both sets are equal size (this is usually implemented by shuffling the data array and then splitting it in two). We then train on $d_0$ and validate on $d_1$, followed by training on $d_1$ and validating on $d_0$.

  When *k* = *n* (the number of observations), *k*-fold cross-validation is equivalent to leave-one-out cross-validation.[2] In *stratified* *k*-fold cross-validation, the partitions are selected so that the mean response value is approximately equal in all the partitions. In the case of binary classification, this means that each partition contains roughly the same proportions of the two types of class labels. In *repeated* cross-validation the data is randomly split into *k* partitions several times. The performance of the model can thereby be averaged over several runs, but this is rarely desirable in practice.

- [1] McLachlan, Geoffrey J.; Do, Kim-Anh; Ambroise, Christophe (2004). Analyzing microarray gene expression data. Wiley.
- [2] "Elements of Statistical Learning: data mining, inference, and prediction. 2nd Edition". web.stanford.edu. Retrieved 2019-04-04.
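The plain (non-stratified) *k*-fold procedure described in the quote can be sketched as follows. This is an illustrative sketch only; the function name `k_fold_splits` and the sample counts are assumptions, and the folds are built by shuffling indices and slicing, which is one of several reasonable ways to do it.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) once for each of the k folds.

    Hypothetical helper for illustration; not taken from any library.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold j takes every k-th shuffled index starting at offset j,
    # so fold sizes differ by at most one.
    folds = [indices[j::k] for j in range(k)]
    for j in range(k):
        validation = folds[j]
        # Training data is the union of the other k - 1 folds.
        train = [i for m, fold in enumerate(folds) if m != j for i in fold]
        yield train, validation


# k = 2: train on d0 and validate on d1, then swap the roles.
splits = list(k_fold_splits(n_samples=10, k=2))
for train, validation in splits:
    print("train:", sorted(train), "validation:", sorted(validation))

# k = n: every validation set holds a single observation (leave-one-out).
leave_one_out = list(k_fold_splits(n_samples=4, k=4))
```

Each observation appears in exactly one validation set, which is the property the quote highlights over repeated random sub-sampling.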

### 2020b

- (SciKit-Learn, 2020a) ⇒ https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html Retrieved:2020-2-14.
- QUOTE: Stratified K-Folds Cross-validator

  Provides train/test indices to split data in train/test sets.

  This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.

  Read more in the User Guide.
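A short usage sketch of the class quoted above, assuming scikit-learn and NumPy are installed. The toy arrays `X` and `y` are made up for illustration; the class and its `n_splits`, `shuffle`, and `random_state` parameters are part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (an assumption for illustration): 12 samples,
# 8 of class 0 and 4 of class 1.
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 8 + [1] * 4)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

test_class_counts = []
for train_index, test_index in skf.split(X, y):
    # np.bincount reports how many samples of each class are in this test fold.
    counts = np.bincount(y[test_index], minlength=2)
    test_class_counts.append(counts.tolist())
    print("train:", train_index, "test:", test_index, "class counts:", counts)
```

Because the class sizes here divide evenly by the number of splits, every test fold keeps the 2:1 class ratio of the full label vector.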

### 2020c

- (SciKit-Learn, 2020b) ⇒ https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RepeatedStratifiedKFold.html
- QUOTE: Repeats Stratified K-Fold n times with different randomization in each repetition.
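The repetition described in the quote can be sketched with the scikit-learn class itself, again assuming scikit-learn and NumPy are installed. The toy labels and the choice of 4 splits with 3 repeats are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy data (an assumption for illustration); the feature values do not
# influence how the folds are formed, only the labels do.
X = np.zeros((12, 1))
y = np.array([0] * 8 + [1] * 4)

rskf = RepeatedStratifiedKFold(n_splits=4, n_repeats=3, random_state=0)
splits = list(rskf.split(X, y))

# Total number of train/test pairs is n_splits * n_repeats.
n_total = rskf.get_n_splits()
print("number of train/test pairs:", n_total)

for repetition in range(3):
    # Each consecutive block of 4 pairs is one full stratified 4-fold pass.
    block = splits[4 * repetition: 4 * repetition + 4]
    print("repetition", repetition, [test.tolist() for _, test in block])
```

Averaging a model's score over all 12 pairs, rather than over a single 4-fold pass, is the usual reason to reach for the repeated variant.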

### 2020d

- (Bey et al., 2020) ⇒ R. Bey, R. Goussault, M. Benchoufi, and R. Porcher (2020). "Stratified Cross-Validation For Unbiased And Privacy-Preserving Federated Learning". ArXiv:2001.08090
- QUOTE: Stratified cross-validation complements cross-validation with an initial stratification of EHR in folds containing similar patients, thus ensuring that duplicates of a record are jointly present either in training or in validation folds. Monte Carlo simulations are performed to investigate the properties of stratified cross-validation in the case of a model data analysis.

### 2017

- (Sammut & Webb, 2017) ⇒ Claude Sammut, and Geoffrey I. Webb. (2017). "Stratified Cross Validation". In: (Sammut & Webb, 2017). DOI:10.1007/978-1-4899-7687-1_788
- QUOTE: Stratified Cross Validation is a form of cross validation in which the class distribution is kept as close as possible to being the same across all folds.

### 2013

- (del Pozo et al., 2013) ⇒ Juan A. Fernandez del Pozo, Pedro Larranaga, and Concha Bielza (2013). "Stratified Cross-Validation in Multi-Label Classification Using Genetic Algorithms". In: TIN2010-20900-C04. Computational Intelligence Group Universidad Politecnica de Madrid.
- QUOTE: Stratified cross-validation reduces the variance of the estimates and improves the estimation of the generalization performance of classifier algorithms.

  However, how to stratify a data set in a multi-label supervised classification setting is a hard problem, since each fold should try to mimic the joint probability distribution of the whole set of class variables.

### 2011

- (Purushotham & Tripathy, 2011) ⇒ Swarnalatha Purushotham, and B. K. Tripathy (2011, December). "Evaluation of Classifier Models Using Stratified Tenfold Cross Validation Techniques". In: International Conference on Computing and Communication Systems. DOI:10.1007/978-3-642-29216-3_74

### 1995

- (Kohavi, 1995) ⇒ Ron Kohavi (1995). "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection". In: Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI 95).
- QUOTE: Formally, let $\mathcal{D}_{(i)}$ be the test set that includes instance $x_i = \langle v_i, y_i\rangle$, then the cross-validation estimate of accuracy

  $$acc_{cv} = \dfrac{1}{n} \sum_{\langle v_i, y_i\rangle \in \mathcal{D}} \delta\left(\mathcal{I}(\mathcal{D}\backslash\mathcal{D}_{(i)}, v_i), y_i\right)$$

  The cross-validation estimate is a random number that depends on the division into folds. **Complete cross-validation** is the average of all $\binom{m}{m/k}$ possibilities for choosing $m/k$ instances out of $m$, but it is usually too expensive. Except for leave-one-out (n-fold cross-validation), which is always complete, k-fold cross-validation is estimating complete k-fold cross-validation using a single split of the data into the folds. Repeating cross-validation multiple times using different splits into folds provides a better Monte Carlo estimate to the complete cross-validation at an added cost. In **stratified cross-validation** the folds are stratified so that they contain approximately the same proportions of labels as the original dataset.
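The accuracy estimate in the quote can be computed directly, as in the sketch below. Everything here is an assumption made for illustration: the six-instance toy dataset, the two hand-picked fold layouts, and the deliberately trivial majority-class inducer standing in for $\mathcal{I}$. The function names are hypothetical.

```python
from collections import Counter
from math import comb


def majority_inducer(training_set):
    """Trivial stand-in for the inducer I: always predict the majority training label."""
    majority_label = Counter(y for _, y in training_set).most_common(1)[0][0]
    return lambda v: majority_label


def cross_validation_accuracy(dataset, folds, inducer):
    """acc_cv = (1/n) * sum over <v_i, y_i> in D of delta(I(D minus D_(i), v_i), y_i)."""
    n = len(dataset)
    correct = 0
    for fold in folds:
        held_out = set(fold)
        # Train on D minus D_(i): every instance outside the current test fold.
        training_set = [dataset[i] for i in range(n) if i not in held_out]
        classifier = inducer(training_set)
        # delta is 1 when the predicted label equals the true label, else 0.
        correct += sum(1 for i in fold if classifier(dataset[i][0]) == dataset[i][1])
    return correct / n


# Toy dataset: four instances labelled "a" followed by two labelled "b".
dataset = [(i, "a") for i in range(4)] + [(i, "b") for i in range(4, 6)]

stratified_folds = [[0, 1, 4], [2, 3, 5]]    # each fold: two "a", one "b"
unstratified_folds = [[0, 1, 2], [3, 4, 5]]  # first fold: "a" only

stratified_acc = cross_validation_accuracy(dataset, stratified_folds, majority_inducer)
unstratified_acc = cross_validation_accuracy(dataset, unstratified_folds, majority_inducer)

# Number of test sets a complete 2-fold cross-validation would average over
# for m = 6 instances: choose m/k = 3 out of 6.
n_complete = comb(6, 3)

print(stratified_acc, unstratified_acc, n_complete)
```

In this toy case the stratified layout scores 4 of 6 correct while the unstratified layout scores 1 of 6, because the unstratified training sets no longer reflect the label proportions of the full dataset. A complete 2-fold cross-validation would average over 20 possible test sets.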