
Very large sample sizes = statistical significance

"Everything in my data set is associated, because the sample size is so large." I have heard students suggest that you can find statistically significant associations between any two variables if the sample size is extremely large. I see a misunderstanding in that statement.

Assume the null hypothesis is true. You CHOOSE your type I/false positive error rate, typically 0.05, and sample size does not play a role.
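A small simulation can make this concrete. The sketch below is only an illustration under assumptions of my own: two independent standard normal variables, a two-sided test at 0.05 based on the Fisher z approximation, and hypothetical helper names (`pearson_r`, `false_positive_rate`). If the reasoning above holds, the share of 'significant' results should hover around 0.05 at every sample size.

```python
import math
import random


def pearson_r(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def false_positive_rate(n, reps=500, z_crit=1.96, seed=0):
    """Share of simulated null data sets declared 'significant'.

    Assumption: x and y are independent standard normal draws, so the
    null hypothesis (no correlation) is true by construction.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]
        # Fisher z approximation: atanh(r) * sqrt(n - 3) is roughly N(0, 1)
        z = math.atanh(pearson_r(x, y)) * math.sqrt(n - 3)
        if abs(z) > z_crit:
            hits += 1
    return hits / reps


rates = {n: false_positive_rate(n) for n in (20, 200, 2000)}
for n, rate in rates.items():
    print(f"n={n:5d}  share of 'significant' results: {rate:.3f}")
```

With only 500 replications per sample size the shares will wobble by a percentage point or so, but they should not drift upward as n grows.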

Two uncorrelated variables will not show a statistically significant association more often than your chosen false-positive rate, no matter how large the sample size is. Larger sample sizes simply let you 'find' smaller correlations. This has one practical implication and one causal implication:

Practically speaking, at some point a correlation can be so small that it makes no difference. For example: if parking your bicycle near the smoking area increases your chance of getting lung cancer by 0.00001%, then a study of a billion bicyclists might detect the increased risk - but it would not make sense to advise against parking there.
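To get a feel for how small a 'detectable' correlation becomes, here is a rough sketch. It assumes a two-sided test at 0.05 and the Fisher z approximation, and `smallest_detectable_r` is a hypothetical helper name. It returns the sample correlation that would just reach significance, which is not the same thing as a power calculation.

```python
import math


def smallest_detectable_r(n, z_crit=1.96):
    """Approximate |r| that just reaches two-sided significance.

    Assumption: Fisher z approximation, i.e. the result is significant
    when |atanh(r)| * sqrt(n - 3) exceeds z_crit.
    """
    return math.tanh(z_crit / math.sqrt(n - 3))


for n in (100, 10_000, 1_000_000, 1_000_000_000):
    print(f"n={n:>13,d}  |r| must exceed about {smallest_detectable_r(n):.5f}")
```

The threshold shrinks roughly with the square root of the sample size, so at a billion observations even a correlation far too small to matter in practice can pass the test.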

Causally speaking, the smaller a correlation, the more likely it is to be spurious. Weak confounders and residual confounding will be hard to eliminate. (If this is the only problem, a randomized trial may not be affected.) Note that