May 26, 2010

Hiring staff

Notes from OHR presentation on effective interviews

Post most positions for short periods. If you post "Open until filled" you may receive hundreds of applications.
Writing a good job description is the first step - lots of specifics
Write open-ended, historical questions based on the job description
Ask every candidate all of the questions, regardless of how well or badly they are doing (this reduces bias in your interview)
Look for evidence contrary to your first impression, whether good or bad

Data collection staff can be asked to simulate the collection process.

Very large sample sizes = statistical significance

"Everything in my data set is associated, because the sample size is so large." I have heard students suggest any that you can find statistically significant associations between any two variables, if the sample size is extremely large. I see a misunderstanding in that statement.

Assume the null hypothesis is true. You CHOOSE your type I/false positive error rate, typically 0.05, and sample size does not play a role.
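A small simulation can illustrate the point. This is only a sketch under assumptions of my own: both variables are independent standard normal draws, the p-value uses a normal approximation to the usual t statistic for a correlation, and the helper names (`pearson_r`, `two_sided_p`, `false_positive_rate`) are made up for this example.

```python
import math
import random


def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def two_sided_p(r, n):
    """Approximate two-sided p-value for H0: rho = 0.

    Uses a normal approximation to the t statistic, which is
    reasonable for the sample sizes simulated here.
    """
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return math.erfc(abs(t) / math.sqrt(2))


def false_positive_rate(n, n_sims=400, alpha=0.05, seed=0):
    """Share of simulated null data sets that come out 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # x and y are generated independently, so the null is true.
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]
        if two_sided_p(pearson_r(x, y), n) < alpha:
            hits += 1
    return hits / n_sims


rates = {n: false_positive_rate(n) for n in (50, 2000)}
for n, rate in rates.items():
    print(f"n={n}: share of 'significant' results = {rate:.3f}")
```

Both shares should land near the chosen 0.05, up to simulation noise, whether n is 50 or 2000.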

Two truly uncorrelated variables will show a "significant" association only at that chosen error rate, no matter how large the sample size is. Larger sample sizes simply let you 'find' smaller correlations. This has one causal implication and one practical implication:

Practically speaking, at some point a correlation can be so small that it makes no difference. For example: if parking your bicycle near the smoking area increases your chance of getting lung cancer by 0.00001%, then a study of a billion bicyclists might detect the increased risk, but it would not make sense to advise against parking there.
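To put rough numbers on how sample size trades off against the size of a detectable correlation, here is a sketch using the standard Fisher z approximation. The 80% power target and the function name `n_to_detect` are assumptions for illustration, not anything from a real study.

```python
import math
from statistics import NormalDist


def n_to_detect(r, alpha=0.05, power=0.80):
    """Approximate sample size needed to detect a true correlation r.

    Fisher z approximation for a two-sided test:
    n = ((z_alpha + z_power) / atanh(r))^2 + 3
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)


sizes = {r: n_to_detect(r) for r in (0.1, 0.01, 0.001)}
for r, n in sizes.items():
    print(f"r = {r}: roughly {n:,} observations")
```

Because the required n grows with roughly 1/r^2, each tenfold drop in the correlation needs about a hundred times more data. A huge sample can therefore flag a correlation far too small to matter.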

Causally speaking, the smaller a correlation, the more likely it is to be spurious. Weak confounders and residual confounding will be hard to eliminate. (If this is the only problem, a randomized trial may not be affected.) Note that

May 18, 2010

Longitudinal data (change scores)

Baseline adjustment is generally preferred to change scores because it is less constrained: the coefficient on the baseline value is estimated from the data rather than fixed at 1. In calculation terms it is the difference between models like these:

wt2 = 1*wt1 + error   (change score: coefficient fixed at 1)

wt2 = 0.973*wt1 + error   (baseline-adjusted: coefficient estimated)

Interestingly, you can use change on the left side and get "identical" results - i.e. treatment effects will be identical but the wt1 coefficient will change its estimate/units.

wt2-wt1 = -0.027*wt1 + error

That makes sense because you are modeling two equations like these:

y = ax + b

y-x = ax + b (alternatively y = (a+1)x + b , a linear transformation)
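The equivalence can be checked numerically. The sketch below simulates made-up data (the weights, the treatment effect of -3 and the noise level are all assumptions for illustration) and fits both versions with a small hand-rolled least-squares routine, so it needs nothing beyond the standard library. The helper names `solve` and `ols` are hypothetical.

```python
import random


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Simulated trial: alternating treatment assignment, invented weights.
rng = random.Random(1)
n = 200
treat = [i % 2 for i in range(n)]
wt1 = [rng.gauss(90, 12) for _ in range(n)]
wt2 = [5 + 0.95 * w - 3 * t + rng.gauss(0, 4) for w, t in zip(wt1, treat)]

# Columns: intercept, treatment indicator, baseline weight.
X = [[1.0, t, w] for t, w in zip(treat, wt1)]
adjusted = ols(X, wt2)                              # outcome: wt2
change = ols(X, [b - a for a, b in zip(wt1, wt2)])  # outcome: wt2 - wt1

print("treatment effect:", adjusted[1], change[1])
print("baseline coefficient:", adjusted[2], change[2])
```

The two treatment estimates agree up to floating-point error, while the baseline coefficient in the change model is the adjusted one minus 1.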

On to a more complicated question...

Study design: Randomized trial of weight loss
Question: How to model weight loss maintenance from time 4 to time 5

Warning: I went out on a limb to answer this question

The specific question was whether to adjust for time 4 ("baseline") when modeling change from 4 to 5 while also adjusting for change from 1 to 4. (Background: This is about maintenance, and weight loss from baseline to 4 may predict performance from 4 to 5. Do the successful people remain successful, or do the people who lost more have greater regain potential? Are there two effects here, and will the results be influenced by whether completeness of data is related to early performance?)

In other words, choose between these 2 models:

wt(5-4) = wt(4-1) + error

wt(5-4) = wt(4-1) + wt4 + error

The latter is algebraically equivalent to wt5 = wt4 + wt1 + error and gives identical results.
The former is NOT equivalent to wt5 = wt1 + error, but it's close. Therefore I prefer the model with wt4 because I don't want wt4 to drop out of the equation. This is my intuition rather than something I know to be correct.
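For reference, the algebra behind the equivalence claimed for the model with wt4 can be written out as follows. The intercept a and the coefficient names b and c are my own notation (the models above leave coefficients implicit), and e is the error term.

```latex
\begin{align*}
wt_5 - wt_4 &= a + b\,(wt_4 - wt_1) + c\,wt_4 + e \\
wt_5        &= a + (1 + b + c)\,wt_4 - b\,wt_1 + e
\end{align*}
```

Since b and c are both free, the coefficients on wt4 and wt1 in the second line can take any values, which is why fitting wt5 on wt4 and wt1 directly gives the same fit. In this notation the model without wt4 is the special case c = 0.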