Difference between revisions of "Stratified K-Fold Cross-Validation Algorithm"

From GM-RKB
* (SciKit-Learn, 2020) https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html Retrieved:2020-2-14.
 
** QUOTE: [[Stratified K-Folds Cross-validator]] <P>Provides train/test indices to split data in train/test sets. <P>This [[cross-validation]] object is a variation of [[KFold]] that returns [[stratified fold]]s. The [[fold]]s are made by preserving the percentage of [[sample]]s for each class. <P>Read more in the [https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation User Guide].
 
=== 2020c ===
* ([[2020_StratifiedCrossValidationForUnb|Bey et al., 2020]]) ⇒ [[R. Bey]], [[R. Goussault]], [[M. Benchoufi]], and [[R. Porcher]] (2020). [https://arxiv.org/pdf/2001.08090.pdf "Stratified Cross-Validation For Unbiased And Privacy-Preserving Federated Learning"]. [https://arxiv.org/abs/2001.08090 ArXiv:2001.08090]
** QUOTE: [[Stratified cross-validation]] complements [[cross-validation]] with an initial [[stratification]] of [[EHR]] in [[fold]]s containing similar patients, thus ensuring that duplicates of a record are jointly present either in [[training]] or in [[validation fold]]s. [[Monte Carlo simulation]]s are performed to investigate the properties of [[stratified cross-validation]] in the case of a model [[data analysis]].
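The property described in this quote, that duplicates of a record end up together on one side of the split, can be illustrated with a small sketch. It is a simplification and not the procedure of Bey et al.: it assumes every record carries an exact key (the hypothetical <code>record_keys</code> below) that identifies its duplicates, whereas the paper stratifies by patient similarity. The function name <code>grouped_folds</code> and the greedy balancing rule are likewise assumptions made for illustration.

```python
from collections import defaultdict


def grouped_folds(record_keys, k):
    """Assign all records that share a key to the same fold."""
    groups = defaultdict(list)
    for index, key in enumerate(record_keys):
        groups[key].append(index)

    folds = [[] for _ in range(k)]
    # Greedy balancing: largest groups first, each into the currently smallest fold.
    for key in sorted(groups, key=lambda g: (-len(groups[g]), g)):
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(groups[key])
    return folds


# Hypothetical keys: "p1" and "p3" appear more than once (duplicated records).
record_keys = ["p1", "p2", "p1", "p3", "p4", "p3", "p5", "p1"]
folds = grouped_folds(record_keys, k=3)

# Record which folds each key landed in; every key should map to exactly one fold.
fold_of_key = defaultdict(set)
for fold_id, fold in enumerate(folds):
    for index in fold:
        fold_of_key[record_keys[index]].add(fold_id)
```

Because a whole group is placed at once, no key can be split between a training fold and a validation fold, at the cost of folds that may differ in size.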
  
 
=== 2017 ===
 
* ([[Sammut & Webb, 2017]]) ⇒ [[Claude Sammut]], and [[Geoffrey I. Webb]]. ([[2017]]). [https://link.springer.com/referenceworkentry/10.1007/978-1-4899-7687-1_788 "Stratified Cross Validation"]. In: ([[Sammut & Webb, 2017]]). [https://doi.org/10.1007/978-1-4899-7687-1_788 DOI:10.1007/978-1-4899-7687-1_788]
 
** QUOTE: [[Stratified Cross Validation]] is a form of [[cross validation]] in which the [[class distribution]] is kept as close as possible to being the same across all [[fold]]s.
 
=== 2013 ===
* (del Pozo et al., 2013) ⇒ [[Juan A. Fernandez del Pozo]], [[Pedro Larranaga]], and [[Concha Bielza]] (2013). [https://pdfs.semanticscholar.org/presentation/5a2a/4fc79644bebf6d3459273f6372c6a3ad3cf5.pdf "Stratified Cross-Validation in Multi-Label Classification Using Genetic Algorithms"]. In: TIN2010-20900-C04. Computational Intelligence Group Universidad Politecnica de Madrid.
** QUOTE: [[Stratified cross-validation]] reduces the [[variance]] of the [[estimate]]s and improves the [[estimation]] of the [[generalization performance]] of [[classifier algorithm]]s. <P>However, how to stratify a [[data set]] in a [[multi-label supervised classification]] setting is a hard problem, since each fold should try to mimic the [[joint probability distribution]] of the whole [[set]] of [[class variable]]s.
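One simple baseline for the multi-label case treats every distinct combination of labels as its own stratum and spreads each stratum across the folds. The sketch below shows that baseline only; it is not the genetic-algorithm method of the cited work, and the function name <code>label_combination_folds</code> and the toy data are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def label_combination_folds(label_sets, k):
    """Deal each distinct label combination round-robin across k folds."""
    by_combination = defaultdict(list)
    for index, labels in enumerate(label_sets):
        by_combination[frozenset(labels)].append(index)

    folds = [[] for _ in range(k)]
    offset = 0  # continue the round-robin across strata so fold sizes stay balanced
    for combination in sorted(by_combination, key=sorted):
        indices = by_combination[combination]
        for position, index in enumerate(indices):
            folds[(offset + position) % k].append(index)
        offset += len(indices)
    return folds


# Toy multi-label data: each item is the set of labels attached to one example.
label_sets = [{"a"}] * 4 + [{"a", "b"}] * 4 + [{"b", "c"}] * 2 + [set()] * 2
folds = label_combination_folds(label_sets, k=2)

# Count how often each label combination occurs in each fold.
combination_counts = [
    Counter(frozenset(label_sets[i]) for i in fold) for fold in folds
]
```

On this toy data every combination occurs often enough to be divided evenly. When many combinations occur fewer than k times, the strata can no longer be spread over all folds, which is one way to see why the quote calls the problem hard.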
 
----
 
__NOTOC__
 
[[Category:Concept]]
 
[[Category:Machine Learning]]
 

Revision as of 23:31, 14 February 2020

A Stratified K-Fold Cross-Validation Algorithm is a K-Fold Cross-Validation Algorithm in which the class distribution is kept approximately the same across all folds.
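
A minimal sketch of this definition, using only the Python standard library, is given below. It is one possible way to build stratified folds (dealing the examples of each class round-robin across the folds), not the algorithm of any particular library; the function name <code>stratified_k_fold</code>, the <code>seed</code> parameter, and the toy labels are assumptions made for illustration.

```python
from collections import Counter, defaultdict
import random


def stratified_k_fold(labels, k, seed=0):
    """Return k lists of indices with each class spread as evenly as possible."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    offset = 0  # continue the round-robin across classes so fold sizes stay balanced
    for label in sorted(by_class):  # assumes labels are mutually comparable
        indices = by_class[label]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[(offset + position) % k].append(index)
        offset += len(indices)
    return folds


# Toy imbalanced data set: 20% "pos", 80% "neg".
labels = ["pos"] * 20 + ["neg"] * 80
folds = stratified_k_fold(labels, k=5)

# Per-fold class counts; each fold serves once as the validation set.
fold_counts = [Counter(labels[i] for i in fold) for fold in folds]
for fold_id, counts in enumerate(fold_counts):
    print(fold_id, dict(counts))
```

With these toy labels each of the five folds holds 4 "pos" and 16 "neg" examples, so every fold keeps the 20/80 class ratio of the full data set.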



References

2020a

  • (Wikipedia, 2020) ⇒ https://www.wikiwand.com/en/Cross-validation_(statistics)#/Non-exhaustive_cross-validation Retrieved:2020-2-14.
    • In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results can then be averaged to produce a single estimation. The advantage of this method over repeated random sub-sampling (see below) is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used,[1] but in general k remains an unfixed parameter.

      For example, setting k = 2 results in 2-fold cross-validation. In 2-fold cross-validation, we randomly shuffle the dataset into two sets d0 and d1, so that both sets are equal size (this is usually implemented by shuffling the data array and then splitting it in two). We then train on d0 and validate on d1, followed by training on d1 and validating on d0.

      When k = n (the number of observations), k-fold cross-validation is equivalent to leave-one-out cross-validation[2]. In stratified k-fold cross-validation, the partitions are selected so that the mean response value is approximately equal in all the partitions. In the case of binary classification, this means that each partition contains roughly the same proportions of the two types of class labels. In repeated cross-validation the data is randomly split into k partitions several times. The performance of the model can thereby be averaged over several runs, but this is rarely desirable in practice.

  1. McLachlan, Geoffrey J.; Do, Kim-Anh; Ambroise, Christophe (2004). Analyzing microarray gene expression data. Wiley.
  2. "Elements of Statistical Learning: data mining, inference, and prediction. 2nd Edition". web.stanford.edu. Retrieved 2019-04-04.
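
The plain k-fold procedure described in the quote above can be sketched as follows. This is an illustrative standard-library version under stated assumptions, not a reference implementation: the function name <code>k_fold_splits</code> and the <code>seed</code> parameter are hypothetical, and folds are allowed to differ in size by one when k does not divide the number of observations.

```python
import random


def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when k does not divide n.
    base, extra = divmod(n, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# 5-fold split of 10 observations: each observation is validated exactly once.
splits = list(k_fold_splits(n=10, k=5))
validated = sorted(i for _, validation in splits for i in validation)

# Setting k = n gives leave-one-out: every validation set has a single observation.
leave_one_out = list(k_fold_splits(n=4, k=4))
```

Here <code>validated</code> contains every index from 0 to 9 exactly once, matching the statement that each observation is used for validation exactly once.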

2020b

2020c

2017

2013