Welcome to the second lesson on Mastering Hypothesis Testing with R! Our focus today is on the Mann-Whitney U test. We've engaged with T-tests previously and have now set our sights on the Mann-Whitney U test: a valuable tool when data do not meet the T-test's normality assumption. In this lesson, we'll unpack the nuances of the Mann-Whitney U test by applying it to a realistic dataset using R's wilcox.test() function.
We'll begin with non-parametric tests. They're also known as distribution-free tests because they work with data that do not follow a normal distribution. We turn to them when our data are skewed, contain outliers, or are ordinal. Ordinal data is a type in which the order of data points matters, but the size of the difference between them does not. For example, the sequence in which runners finish a race matters, but the exact time gap between each runner does not necessarily matter.
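Before reaching for a non-parametric test, it helps to confirm that your data really do depart from normality. Here is a minimal sketch, using hypothetical right-skewed data generated with rexp(), that checks the shape both visually and with the Shapiro-Wilk test:

# Hypothetical right-skewed data, used only for illustration
set.seed(42)
skewed_data <- rexp(30, rate = 0.2)

# Visual check: a histogram of skewed data shows a long right tail
hist(skewed_data, main = 'Distribution check', xlab = 'Value')

# Formal check: a Shapiro-Wilk p-value below 0.05 suggests the data
# deviate from normality, pointing us toward a non-parametric test
shapiro.test(skewed_data)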
The Mann-Whitney U test compares two independent groups when the dependent variable is ordinal, or continuous but not normally distributed. The test pools the values from both groups, ranks them, and sums the ranks within each group; similar rank sums suggest that the two groups do not differ significantly.
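To see this ranking idea concretely, here is a small sketch with made-up numbers that pools two groups, ranks every value, and compares the rank sums:

# Two made-up groups, purely for illustration
group_1 <- c(3, 7, 8)
group_2 <- c(5, 9, 12)

# Rank all values from both groups together
all_ranks <- rank(c(group_1, group_2))

# Split the ranks back into their groups and sum them
rank_sum_1 <- sum(all_ranks[1:length(group_1)])
rank_sum_2 <- sum(all_ranks[(length(group_1) + 1):length(all_ranks)])

# Very different rank sums hint at a difference between the groups
print(c(rank_sum_1, rank_sum_2))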
The Mann-Whitney U test yields two values: the U-statistic and the p-value. The U-statistic summarizes how the ranks are split between the two groups: values near the extremes (close to 0 or to the maximum possible value, the product of the two sample sizes) indicate strong separation between the groups, while values near the middle indicate heavy overlap. The p-value is interpreted the same way as in the T-test: if the p-value is less than 0.05, the difference is statistically significant and unlikely to be due to chance alone.
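If you are curious where the statistic comes from, here is a rough sketch, reusing the made-up groups from above, that derives the U-statistic for the first group from its rank sum; it matches the W value that wilcox.test() reports for that group:

# Illustrative only: computing the U-statistic for the first group by hand
g1 <- c(3, 7, 8)   # hypothetical group 1
g2 <- c(5, 9, 12)  # hypothetical group 2
n1 <- length(g1)

# Rank the pooled data, then take the rank sum of the first group
ranks <- rank(c(g1, g2))
rank_sum_1 <- sum(ranks[1:n1])

# U for group 1: rank sum minus its minimum possible value n1 * (n1 + 1) / 2
U1 <- rank_sum_1 - n1 * (n1 + 1) / 2
print(U1)  # same value as wilcox.test(g1, g2)$statistic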
To perform the U test, we use R's wilcox.test() function. This function takes two data samples as inputs and returns a test statistic (W) and a p-value (p). Check out this code for a better insight:
# Define two distinct data samples
data1 <- c(5, 22, 15, 18, 12, 17, 14)
data2 <- c(25, 24, 30, 19, 23)

# Perform the Mann-Whitney U test
result <- wilcox.test(data1, data2, exact = FALSE)

# Print the test statistic and p-value
print(paste('W-value:', result$statistic))  # 1
print(paste('p-value:', result$p.value))    # ~0.0094
Because the p-value is less than 0.05, this result suggests that we should reject the null hypothesis: the two samples appear to differ significantly.
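If you prefer to make the decision explicit in code, you can compare the stored p-value against your chosen significance level (0.05 here):

# Compare the p-value from the test above against the 0.05 threshold
alpha <- 0.05
if (result$p.value < alpha) {
  print('Reject the null hypothesis: the groups differ significantly.')
} else {
  print('Fail to reject the null hypothesis: no significant difference detected.')
}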
The exact = FALSE parameter in the wilcox.test() function instructs R not to use the exact distribution method for computing the p-value. This is particularly useful with larger samples, where calculating the exact p-value becomes computationally intensive. With exact = FALSE, the function instead approximates the p-value using a normal approximation to the distribution of the test statistic, making the computation more efficient; it also avoids the warning R issues when ties in the data prevent an exact p-value from being computed.
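If you want to see the effect of this setting for yourself, one option is to rerun the test on the first pair of samples (which contain no ties) with and without the exact method and compare the two p-values:

# Small, tie-free samples: the exact distribution can be computed directly
exact_result  <- wilcox.test(data1, data2, exact = TRUE)
approx_result <- wilcox.test(data1, data2, exact = FALSE)

# The p-values will not be identical, but here both lead to the same conclusion
print(exact_result$p.value)
print(approx_result$p.value)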
To illustrate the Mann-Whitney U test with real data, let's assume that we have information about the time users from two regions spent interacting on a website. The goal is to determine if there is a significant difference in user behaviour between the two regions.
# Data on time spent (in minutes) on the website by users
time_A <- c(31, 22, 39, 27, 35, 28, 34, 26, 23, 33)
time_B <- c(26, 25, 30, 28, 29, 28, 27, 30, 27, 28)

# Perform the Mann-Whitney U test
result <- wilcox.test(time_A, time_B, exact = FALSE)

# Print out the results
print(paste('W-value:', result$statistic))  # 60
print(paste('p-value:', result$p.value))    # ~0.47
Because the p-value (about 0.47) is well above 0.05, this result implies that there isn't a significant difference in time spent between the two regions.
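As a quick sanity check alongside the test, it can help to compare the two groups' medians and look at their spread side by side (the labels Region A and Region B below are just placeholders for the two regions):

# Compare the group medians
print(median(time_A))
print(median(time_B))

# A side-by-side boxplot makes the overlap between the regions easy to see
boxplot(time_A, time_B, names = c('Region A', 'Region B'),
        ylab = 'Time on site (minutes)')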
Great job! You've now grasped the fundamentals of the Mann-Whitney U test and how to use R to perform it. You're equipped to work with datasets that don't follow a normal distribution. Ready for the practice session? It will help reinforce your understanding and provide hands-on experience. Remember, practice is essential for mastering new techniques. Enjoy your learning journey!