Mastering the Chi-Square Test in R: From Theory to Practice

Lesson 4

Introduction to the Chi-Square Test

Hello, learners! Today's topic is a powerful statistical test known as the Chi-Square Test. This test allows us to assess whether significant differences exist between observed and expected frequencies in one or more categories. The test is often used in fields such as health sciences, business, and market research.

Are you ready to delve into the Chi-Square Test? Here we go!

What is the Chi-Square Test?

Think of the Chi-Square Test as a detective that checks whether what we observe aligns with what we expect. For example, if you have a bag of differently colored marbles and you predict how many of each color you will draw, the Chi-Square Test is the statistical tool that tells you whether your observations are consistent with your predictions.

The Chi-Square Test makes the following assumptions:

Randomness: The data has been randomly sampled.
Adequacy: Each cell in the table should have an expected count of at least five for the test to be reliable.
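
The adequacy assumption can be checked before running the test. The sketch below is a minimal illustration: the counts and the equal-probability hypothesis are made-up values chosen for the example, and the names probs, expected_counts, and adequate are hypothetical.

```r
# Illustrative counts and hypothesised probabilities (assumptions for this sketch)
observed <- c(30, 20, 15, 10, 25)
probs <- rep(1 / 5, 5)  # each of the five categories assumed equally likely

# Expected count per category = total sample size * hypothesised probability
expected_counts <- sum(observed) * probs

# TRUE only when every expected count reaches the rule-of-thumb minimum of 5
adequate <- all(expected_counts >= 5)

print(expected_counts)
print(adequate)
```

If adequate comes back FALSE, a common remedy is to merge sparse categories before testing.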

The Chi-Square Test calculates a test statistic, denoted $\chi^2$, that measures how far the observed counts diverge from the expected counts. Under the null hypothesis (the observed data match the expected distribution), this statistic approximately follows a chi-square distribution. The larger the statistic, the less likely it is that the gap between observed and expected counts is due to chance alone, and the stronger the evidence against the null hypothesis.
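
Written out, the statistic is usually defined as follows. The symbols are notation introduced here for illustration rather than taken from the lesson: $O_i$ is the observed count in category $i$, $E_i$ is the expected count in that category, and $k$ is the number of categories.

$$
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
$$

When the expected proportions are fully specified in advance, the statistic is compared against a chi-square distribution with $k - 1$ degrees of freedom.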

A Bag of Marbles

Let's say we have documented the color of each marble drawn from a bag of marbles. Given a predicted distribution of marble colors, we want to know whether our observations align with our expectations. Let's use R to examine this situation using the Chi-Square Test!

R
# Define the colours and their observed and expected counts
colors <- c('Red', 'Blue', 'Green', 'Yellow', 'Purple')
observed <- c(30, 20, 15, 10, 25)
expected <- c(20, 20, 20, 20, 20)

# Create a data frame to store the data
data <- data.frame(colors, observed, expected)

We now have our observed and expected color distribution for the marbles drawn.

Organizing Data

We can extract the observed and expected columns from the data frame as vectors like this:

R
# Extract observed and expected frequencies
observed_frequencies <- data$observed
expected_frequencies <- data$expected

Writing the data frame name, then the $ operator, then the column name extracts that column as a vector.
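
For comparison, here is a small sketch of other ways to reach the same values. It rebuilds the marble data frame so that it runs on its own; the names by_dollar, by_name, and red_count are hypothetical and exist only for this illustration.

```r
# Rebuild the marble data frame so the sketch is self-contained
data <- data.frame(
  colors = c("Red", "Blue", "Green", "Yellow", "Purple"),
  observed = c(30, 20, 15, 10, 25),
  expected = c(20, 20, 20, 20, 20)
)

by_dollar <- data$observed       # column access with the $ operator
by_name <- data[["observed"]]    # equivalent access by column name

# Row filter plus column name: the observed count for one color
red_count <- data[data$colors == "Red", "observed"]

print(identical(by_dollar, by_name))  # TRUE
print(red_count)                      # 30
```

The double-bracket form is handy when the column name is stored in a variable, which the $ operator cannot use directly.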

Performing the Chi-Square Test

Let's now perform the Chi-Square Test to check whether our observations differ significantly from our expectations:

R
# Perform Chi-Square Test
chi_square_test <- chisq.test(observed_frequencies, p = expected_frequencies / sum(expected_frequencies))

The chisq.test() function in R runs the chi-square test. Its first argument is the vector of observed frequencies. Its p argument takes the expected probability of each category rather than the expected counts, and these probabilities must sum to 1. That is why we divide the expected frequencies by their sum.
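
As an alternative to normalizing by hand, chisq.test() also has a rescale.p argument that rescales p to sum to 1. The sketch below compares the two approaches on the marble counts; the variable names test_manual and test_rescaled are hypothetical.

```r
observed_frequencies <- c(30, 20, 15, 10, 25)
expected_frequencies <- c(20, 20, 20, 20, 20)

# Option 1: convert the expected counts to probabilities by hand
test_manual <- chisq.test(observed_frequencies,
                          p = expected_frequencies / sum(expected_frequencies))

# Option 2: pass the raw expected counts and let R rescale them to sum to 1
test_rescaled <- chisq.test(observed_frequencies,
                            p = expected_frequencies,
                            rescale.p = TRUE)

# Compare the two results; both comparisons should print TRUE
same_statistic <- isTRUE(all.equal(unname(test_manual$statistic),
                                   unname(test_rescaled$statistic)))
same_p_value <- isTRUE(all.equal(test_manual$p.value, test_rescaled$p.value))

print(same_statistic)
print(same_p_value)
```

Without rescale.p = TRUE, passing values that do not sum to 1 makes chisq.test() stop with an error, so explicit normalization keeps the intent visible.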

R
# Print the chi-square statistic and P-value (P-value rounded to three decimals)
print(paste("Chi-Square Statistic:", chi_square_test$statistic))  # 12.5
print(paste("P-value:", round(chi_square_test$p.value, 3)))  # 0.014

The function provides us with a chi-square statistic and a P-value, which assist us in interpreting the test results.
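
To see where these two numbers come from, the same values can be reproduced by hand. This is a minimal sketch using the marble counts; chi_sq, df, and p_value are hypothetical names chosen for the illustration.

```r
observed <- c(30, 20, 15, 10, 25)
expected <- c(20, 20, 20, 20, 20)

# Squared deviation in each category, scaled by its expected count, then summed
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom for this test: number of categories minus 1
df <- length(observed) - 1

# Upper-tail probability of the chi-square distribution at the statistic
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

print(chi_sq)             # 12.5
print(round(p_value, 3))  # 0.014
```

The per-category terms are 5, 0, 1.25, 5, and 1.25, so Red and Yellow contribute most of the deviation.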

Interpretation

So, the results are: Chi-Square Statistic = 12.5 and P-value = 0.014.

The P-value is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. By convention, a P-value of 0.05 or less is taken to mean that the observed data deviate significantly from the expected data. Here the P-value of 0.014 is below 0.05, so we reject the null hypothesis: the observed marble distribution differs significantly from the one we expected.
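
The decision can also be written as code. In this sketch the 0.05 threshold is the conventional choice rather than a fixed rule, and the names alpha, reject_null, and critical_value are hypothetical.

```r
observed <- c(30, 20, 15, 10, 25)
alpha <- 0.05  # conventional significance level (an assumption of this sketch)

# Goodness-of-fit test against five equally likely colors
result <- chisq.test(observed, p = rep(1 / 5, 5))

# Decision via the p-value
reject_null <- result$p.value <= alpha

# Equivalent decision via the critical value of the chi-square distribution
critical_value <- qchisq(1 - alpha, df = unname(result$parameter))
exceeds_critical <- unname(result$statistic) > critical_value

print(reject_null)               # TRUE
print(round(critical_value, 2))  # 9.49
print(exceeds_critical)          # TRUE
```

Both routes agree: the statistic of 12.5 exceeds the critical value of about 9.49 for 4 degrees of freedom.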

Calculating Expected Frequencies

Now, let's learn how to calculate expected frequencies, which is useful when we do not have predefined expectations.

Suppose you have a bag containing marbles in 100 different colors. You randomly draw marbles from this bag, recording the color of each one. The resulting count for each color is stored in a vector called observed. Here we simulate those counts with runif(), which produces 100 values between 10 and 50, one per color:

R
# Generate a vector of observed frequencies
observed <- round(runif(100, min = 10, max = 50))

If each marble color were equally likely to be selected, the expected frequency for each color would be the same. We calculate it as the total number of draws divided by the number of different marble colors.

We create a vector, expected, where each item is this computed expected frequency. Here it is:

R
# Compute expected frequency
expected <- rep(sum(observed) / length(observed), length(observed))

Now, just as we did with the previous marble example, you can compare these observed and expected frequencies using the Chi-Square Test.
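
A minimal sketch of that comparison follows. The seed value is arbitrary and is there only to make the simulated counts reproducible; test_explicit and test_default are hypothetical names.

```r
set.seed(42)  # arbitrary seed, used only for reproducibility

# Simulated counts for 100 colors and the matching equal expected frequencies
observed <- round(runif(100, min = 10, max = 50))
expected <- rep(sum(observed) / length(observed), length(observed))

# Test with the expected frequencies converted to probabilities
test_explicit <- chisq.test(observed, p = expected / sum(expected))

# When p is omitted, chisq.test() assumes equal probabilities for all categories
test_default <- chisq.test(observed)

print(test_explicit$statistic)
print(test_explicit$parameter)  # degrees of freedom: 100 categories minus 1
print(test_explicit$p.value)
```

Because the expected frequencies are all equal, the two calls are equivalent, so the default form is enough whenever the hypothesis is a uniform distribution.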

Conclusion

You should be proud of having navigated through the realm of the Chi-Square Test! With R's built-in functions, testing your observations against expectations is straightforward.

Remember, the more you practice, the better you become. Some engaging exercises are coming your way to give you hands-on practice performing Chi-Square tests on real-world scenarios. Are you ready? Let the fun with R begin!
