A [[K-Fold Cross-Validation Algorithm]] is a [[Non-Exhaustive Cross-Validation Algorithm]] in which the [[training dataset]] is randomly partitioned into ''k'' equal-sized [[subsample]]s.

== References ==
 
=== 2020a ===
 
* (Wikipedia, 2020) ⇒ https://www.wikiwand.com/en/Cross-validation_(statistics)#/Non-exhaustive_cross-validation Retrieved:2020-2-14.
** In ''k''-fold cross-validation, the original sample is randomly partitioned into ''k'' equal sized subsamples. Of the ''k'' subsamples, a single subsample is retained as the validation data for testing the model, and the remaining ''k''&nbsp;−&nbsp;1 subsamples are used as training data. The cross-validation process is then repeated ''k'' times, with each of the ''k'' subsamples used exactly once as the validation data. The ''k'' results can then be averaged to produce a single estimation. The advantage of this method over repeated random sub-sampling (see below) is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used,<ref name="McLachlan">McLachlan, Geoffrey J.; Do, Kim-Anh; Ambroise, Christophe (2004). Analyzing microarray gene expression data. Wiley.</ref> but in general ''k'' remains an unfixed parameter. <P> For example, setting ''k''&nbsp;=&nbsp;''2'' results in 2-fold cross-validation. In 2-fold cross-validation, we randomly shuffle the dataset into two sets ''d''<sub>0</sub> and ''d''<sub>1</sub>, so that both sets are equal size (this is usually implemented by shuffling the data array and then splitting it in two). We then train on ''d''<sub>0</sub> and validate on ''d''<sub>1</sub>, followed by training on ''d''<sub>1</sub> and validating on&nbsp;''d''<sub>0</sub>. <P> When ''k''&nbsp;=&nbsp;''n'' (the number of observations), ''k''-fold cross-validation is equivalent to leave-one-out cross-validation<ref>[https://web.stanford.edu/~hastie/ElemStatLearn/ "Elements of Statistical Learning: data mining, inference, and prediction. 2nd Edition"]. web.stanford.edu. Retrieved 2019-04-04.</ref>. In ''stratified'' ''k''-fold cross-validation, the partitions are selected so that the mean response value is approximately equal in all the partitions. In the case of binary classification, this means that each partition contains roughly the same proportions of the two types of class labels. In ''repeated'' cross-validation the data is randomly split into ''k'' partitions several times. The performance of the model can thereby be averaged over several runs, but this is rarely desirable in practice.
 
<references/>
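
The fold-partition and train-validate loop described above can be sketched directly. The following minimal Python example is an illustrative sketch rather than code from the quoted source: the <code>fit</code>, <code>predict</code>, and <code>score</code> callables and the array inputs <code>X</code>, <code>y</code> are placeholder assumptions.

<pre>
import numpy as np

def k_fold_cv(X, y, k, fit, predict, score, seed=0):
    """Estimate performance by k-fold cross-validation.

    X, y : NumPy arrays of observations and targets.
    fit, predict, score : caller-supplied callables (placeholders).
    """
    rng = np.random.default_rng(seed)
    # Randomly partition the sample into k roughly equal-sized folds.
    folds = np.array_split(rng.permutation(len(X)), k)
    results = []
    for i in range(k):
        val_idx = folds[i]                                     # held out exactly once
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])  # remaining k-1 folds
        model = fit(X[train_idx], y[train_idx])
        results.append(score(y[val_idx], predict(model, X[val_idx])))
    return float(np.mean(results))  # the k results averaged into a single estimate
</pre>

Setting <code>k = len(X)</code> in this sketch yields the leave-one-out case noted above; the ''stratified'' variant would instead draw the partition per class label so that each fold preserves the label proportions (scikit-learn's <code>StratifiedKFold</code> implements this).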
 
=== 2020b ===
* (SciKit-Learn, 2020) ⇒ https://scikit-learn.org/stable/modules/cross_validation.html Retrieved:2020-02-15.
** QUOTE: In the basic approach, called [[k-fold CV]], the [[training set]] is split into k smaller [[set]]s (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the [[K-Fold Cross-Validation Task|k “folds”]]:
*** A [[model is trained]] using $k-1$ of the [[fold]]s as [[training data]];
*** the resulting [[model]] is validated on the remaining part of the [[data]] (i.e., it is used as a [[test set]] to compute a [[performance measure]] such as [[accuracy]]).
:: The [[performance measure]] reported by [[k-fold cross-validation]] is then the [[average]] of the values computed in the [[loop]]. This approach can be [[computationally expensive]], but does not waste too much [[data]] (as is the case when fixing an arbitrary [[validation set]]), which is a major advantage in problems such as [[inverse inference]] where the number of samples is very small.<P><div style="text-align:center"><html><img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png" width=50%/></html></div>
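
As a worked instance of the quoted procedure, the following sketch scores an estimator with scikit-learn's <code>cross_val_score</code> helper; the choice of dataset and estimator here is an arbitrary illustration, not taken from the quoted page.

<pre>
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(kernel="linear", C=1)

# cv=5: each of the 5 folds is used once as the held-out test set
# while the remaining 4 folds form the training data.
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.3f} (std: {scores.std():.3f})")
</pre>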
 
----
 
 
[[Category:Concept]]
 
