Chi-Square Test of Independence

From GM-RKB

A Chi-Square Test of Independence is a non-parametric test that is based on a chi-square statistic and is used for determining statistical independence or association between two categorical variables.



References

2017a

(...)The Chi-Square Test of Independence can only compare categorical variables. It cannot make comparisons between continuous variables or between categorical and continuous variables. Additionally, the Chi-Square Test of Independence only assesses associations between categorical variables, and cannot provide any inferences about causation.
If your categorical variables represent "pre-test" and "post-test" observations, then the chi-square test of independence is not appropriate. (...) data must meet the following requirements:
  1. Two categorical variables.
  2. Two or more categories (groups) for each variable.
  3. Independence of observations.
  • There is no relationship between the subjects in each group.
  • The categorical variables are not "paired" in any way (e.g. pre-test/post-test observations).
  4. Relatively large sample size.
  • Expected frequencies for each cell are at least 1.
  • Expected frequencies should be at least 5 for the majority (80%) of the cells.
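The sample-size requirements above can be checked directly from the expected counts. A minimal sketch in Python using NumPy/SciPy (the 2x2 table is made up for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative 2x2 contingency table (not from the text).
observed = np.array([[20, 30],
                     [25, 25]])

# chi2_contingency also returns the expected counts under independence.
_, _, _, expected = chi2_contingency(observed)

all_at_least_one = (expected >= 1).all()       # every expected count >= 1?
share_at_least_five = (expected >= 5).mean()   # fraction of cells with expected >= 5

print(all_at_least_one)
print(share_at_least_five >= 0.8)              # the 80% rule of thumb
```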
(...) The null hypothesis (H0) and alternative hypothesis (H1) of the Chi-Square Test of Independence can be expressed in two different but equivalent ways:
H0: "[Variable 1] is independent of [Variable 2]"
H1: "[Variable 1] is not independent of [Variable 2]"
(...)The test statistic for the Chi-Square Test of Independence is denoted [math]\displaystyle{ \chi^2 }[/math], and is computed as:
[math]\displaystyle{ \chi^2=\sum_{i=1}^R\sum_{j=1}^C\frac{(o_{ij}-e_{ij})^2}{e_{ij}} }[/math]
where [math]\displaystyle{ o_{ij} }[/math] is the observed cell count in the ith row and jth column of the table, [math]\displaystyle{ e_{ij} }[/math] is the expected cell count in the ith row and jth column of the table, computed as
[math]\displaystyle{ e_{ij}=\frac{\text{row } i \text{ total} \times \text{col } j \text{ total}}{\text{grand total}} }[/math]
The quantity [math]\displaystyle{ (o_{ij} - e_{ij}) }[/math] is sometimes referred to as the residual of cell (i, j).
The calculated [math]\displaystyle{ \chi^2 }[/math] value is then compared to the critical value from the [math]\displaystyle{ \chi^2 }[/math] distribution table with degrees of freedom [math]\displaystyle{ df = (R - 1)(C - 1) }[/math] and chosen significance level. If the calculated [math]\displaystyle{ \chi^2 }[/math] value > critical [math]\displaystyle{ \chi^2 }[/math] value, then we reject the null hypothesis.
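The computation above can be sketched end to end in Python; a minimal illustration, where the 2x2 table and the 0.05 significance level are assumptions, not from the text:

```python
import numpy as np
from scipy.stats import chi2

# Illustrative observed counts.
observed = np.array([[30, 10],
                     [20, 40]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# e_ij = (row i total * col j total) / grand total
expected = row_totals * col_totals / grand_total

# chi^2 = sum over cells of (o_ij - e_ij)^2 / e_ij
chi2_stat = ((observed - expected) ** 2 / expected).sum()

df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
critical = chi2.ppf(0.95, df)   # critical value at the 0.05 level

print(chi2_stat > critical)     # reject H0?
```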

2017b

The test consists of four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze sample data, and (4) interpret results.
State the hypotheses. A chi-square test for independence is conducted on two categorical variables. Suppose that Variable A has r levels, and Variable B has c levels. The null hypothesis states that knowing the level of Variable A does not help you predict the level of Variable B. That is, the variables are independent. The alternative hypothesis states that the variables are not independent.
Formulate analysis plan. The analysis plan describes how to use sample data to reject or fail to reject the null hypothesis. The plan should specify a significance level and should identify the chi-square test for independence as the test method.
Analyze sample data. Using sample data, find the degrees of freedom, expected frequencies, test statistic, and the P-value associated with the test statistic.
Degrees of freedom. The degrees of freedom (DF) is equal to:
[math]\displaystyle{ DF = (r - 1) * (c - 1) }[/math]
where [math]\displaystyle{ r }[/math] is the number of levels for one categorical variable, and [math]\displaystyle{ c }[/math] is the number of levels for the other categorical variable.
Expected frequencies. The expected frequency counts are computed separately for each level of one categorical variable at each level of the other categorical variable. Compute [math]\displaystyle{ r*c }[/math] expected frequencies, according to the following formula.
[math]\displaystyle{ E_{r,c} = (n_r * n_c) / n }[/math]
where [math]\displaystyle{ E_{r,c} }[/math] is the expected frequency count for level r of Variable A and level c of Variable B, [math]\displaystyle{ n_r }[/math] is the total number of sample observations at level r of Variable A, [math]\displaystyle{ n_c }[/math] is the total number of sample observations at level c of Variable B, and [math]\displaystyle{ n }[/math] is the total sample size.
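The formula above can be evaluated for all r*c cells at once with an outer product of the margins; a minimal NumPy sketch (the table values are illustrative):

```python
import numpy as np

# Illustrative 2x3 contingency table.
observed = np.array([[10, 20, 30],
                     [15, 25, 50]])

n_r = observed.sum(axis=1)    # row totals
n_c = observed.sum(axis=0)    # column totals
n = observed.sum()            # total sample size

# E_{r,c} = n_r * n_c / n for every cell at once.
expected = np.outer(n_r, n_c) / n

print(expected.sum() == n)    # expected counts preserve the grand total
```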
Test statistic. The test statistic is a chi-square random variable ([math]\displaystyle{ \chi^2 }[/math]) defined by the following equation.
[math]\displaystyle{ \chi^2 = \sum \frac{(O_{r,c} - E_{r,c})^2} {E_{r,c}} }[/math]
where [math]\displaystyle{ O_{r,c} }[/math] is the observed frequency count at level r of Variable A and level c of Variable B, and [math]\displaystyle{ E_{r,c} }[/math] is the expected frequency count at level r of Variable A and level c of Variable B.
P-value. The P-value is the probability of observing a sample statistic as extreme as the test statistic. Since the test statistic is a chi-square, use the Chi-Square Distribution Calculator to assess the probability associated with the test statistic. Use the degrees of freedom computed above.
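A minimal sketch of this step, assuming a statistic of 16.0 with 6 degrees of freedom (illustrative values, not from the text); SciPy's chi-square survival function plays the role of the distribution calculator:

```python
from scipy.stats import chi2

chi2_stat = 16.0   # assumed test statistic
df = 6             # assumed degrees of freedom

# P-value = P(chi-square with df degrees of freedom >= chi2_stat),
# i.e. the upper-tail probability (survival function).
p_value = chi2.sf(chi2_stat, df)

print(p_value < 0.05)   # significant at the 5% level?
```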

2017c

Suppose a random sample of 650 residents of a city of 1,000,000 is cross-classified by neighborhood (A, B, C, D) and occupational class:

              A     B     C     D   Total
White collar  90    60   104    95    349
Blue collar   30    50    51    20    151
No collar     30    40    45    35    150
Total        150   150   200   150    650
Let us take the proportion of the sample living in neighborhood A, 150/650, to estimate what proportion of the whole 1 million people live in neighborhood A. Similarly we take 349/650 to estimate what proportion of the 1 million people are white-collar workers. By the assumption of independence under the hypothesis we should "expect" the number of white-collar workers in neighborhood A to be
[math]\displaystyle{ 150\times\frac{349}{650} \approx 80.54. }[/math]
Then in that "cell" of the table, we have
[math]\displaystyle{ \frac{(\text{observed}-\text{expected})^2}{\text{expected}} = \frac{(90-80.54)^2}{80.54}. }[/math]
The sum of these quantities over all of the cells is the test statistic. Under the null hypothesis, it has approximately a chi-squared distribution whose number of degrees of freedom is
[math]\displaystyle{ (\text{number of rows}-1)(\text{number of columns}-1) = (3-1)(4-1) = 6. \, }[/math]
If the test statistic is improbably large according to that chi-squared distribution, then one rejects the null hypothesis of independence.
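The worked example above can be reproduced end to end with SciPy's chi-square test of independence:

```python
import numpy as np
from scipy.stats import chi2_contingency

# The 3x4 table from the example: rows are occupational classes,
# columns are neighborhoods A, B, C, D.
observed = np.array([[90, 60, 104, 95],    # white collar
                     [30, 50,  51, 20],    # blue collar
                     [30, 40,  45, 35]])   # no collar

stat, p_value, df, expected = chi2_contingency(observed)

print(df)                        # (3-1)*(4-1) = 6
print(round(expected[0, 0], 2))  # 150*349/650, the 80.54 computed above
print(p_value < 0.05)            # reject independence at the 5% level?
```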