### Feb 4, 2009

Head of the full NELS data:
HW.Hours StdMathScore
1 -0.04391888 59.514
2 0.26639017 47.954
3 0.30237000 42.799
4 0.28818846 49.205
5 0.64999216 52.519
6 0.23066355 44.493
Head of one random sample from the data:
HW.Hours StdMathScore
483 0.03774998 56.053
138 0.31898249 44.345
908 1.61900567 59.514
689 -1.80284598 44.641
130 -0.70691857 45.892
189 1.09322485 61.871
Head of another random sample:
HW.Hours StdMathScore
776 0.18038788 42.211
805 0.20964010 58.704
653 -0.05089566 63.270
52 1.37044411 64.522
329 1.20164587 45.597
569 0.10342771 54.949
> library(lattice)
> xyplot(StdMathScore~HW.Hours, data=nels, type=c("p","r"))
> model<-lm(StdMathScore~HW.Hours, data=nels)
> coef(model)
(Intercept) HW.Hours
51.987840 1.374264
We are now assuming that all 887 students in the NELS data constitute the entire population.

## Drawing a Random Sample

Sampling with replacement:
- We already have a random sample in "Sample1.txt"
- There is another random sample in "Sample2.txt"
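The samples in those files could have been drawn something like this (a sketch only; the toy `nels` data frame here stands in for the real data, and the column names are taken from the notes):

```r
# Toy stand-in for the real NELS data frame (values are made up).
set.seed(1)
nels <- data.frame(HW.Hours     = rnorm(887),
                   StdMathScore = rnorm(887, mean = 52, sd = 10))

# Draw a sample of 200 rows WITH replacement, as described above.
rows    <- sample(nrow(nels), size = 200, replace = TRUE)
sample1 <- nels[rows, ]
nrow(sample1)  # 200
```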
> xyplot(StdMathScore~HW.Hours, data=sample1, type=c("p","r"))
> xyplot(StdMathScore~HW.Hours, data=sample2, type=c("p","r"))
> model1<-lm(StdMathScore~HW.Hours, data=sample1)
> model2<-lm(StdMathScore~HW.Hours, data=sample2)
> coef(model1)
(Intercept) HW.Hours
53.4549682 0.3746038
> coef(model2)
(Intercept) HW.Hours
51.6687436 -0.2858852
> plot(StdMathScore~HW.Hours, data=sample1)
> abline(model1)
> abline(model,lty="dotted")
> abline(model2, lty="solid", lwd=2)
- The only reason the slopes and intercepts differ across random samples is sampling error.
- If we imagine an infinite number of samples, then the mean of all possible regression slopes equals the true population slope.
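That claim can be checked with a small simulation (a sketch, using made-up data in place of the real nels): fit the regression on many random samples and compare the average slope to the population slope.

```r
set.seed(1)
# Made-up "population" standing in for the 887 NELS students.
nels <- data.frame(HW.Hours = rnorm(887))
nels$StdMathScore <- 52 + 1.37 * nels$HW.Hours + rnorm(887, sd = 10)

# Slope when the whole population is used.
pop.slope <- coef(lm(StdMathScore ~ HW.Hours, data = nels))[["HW.Hours"]]

# Slope from each of 1000 random samples of n = 200 (with replacement).
slopes <- replicate(1000, {
  s <- nels[sample(nrow(nels), 200, replace = TRUE), ]
  coef(lm(StdMathScore ~ HW.Hours, data = s))[["HW.Hours"]]
})

mean(slopes)  # close to pop.slope; the scatter around it is sampling error
</imports>
```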

## Hypothesis Testing

H0: beta1 = 0
Ha: beta1 != 0
- The basic idea: take a sample of data and ask, under the null hypothesis, how likely it is that we would see the coefficient our sample produced.
Head of a new random sample, sample3:
HW.Hours StdMathScore
345 -0.3329931 42.432
759 -0.2136822 53.698
95 -1.0077991 49.205
325 0.2059000 53.698
355 -0.1177185 55.980
377 0.1413540 65.331
> model3<-lm(StdMathScore~HW.Hours, data=sample3)
> coef(model3)
(Intercept) HW.Hours
51.394680 1.782601
t statistic (one-sample case): t = (Ybar - HypValue) / (s / sqrt(n))
In regression: t = (beta1-hat - 0) / SE(beta1-hat)
For our data, n = 200: t = (1.78 - 0) / 0.5811
> (1.78-0)/.5811
[1] 3.063156
The p-value for a regression coefficient comes from pt(-t, df), the cumulative distribution function of the t distribution; using the negative of the t statistic gives the lower-tail area, which is then doubled for a two-sided test.
> pt(-3.07,198)
[1] 0.001220481
> 2*pt(-3.07,198)
[1] 0.002440963
p = .002
Since p is small, we reject the null hypothesis: the true regression slope is unlikely to be zero, which is evidence of some relationship between homework hours and math scores.
All of this is reported at once by summary(modelName):
> summary(model3)

Call:
lm(formula = StdMathScore ~ HW.Hours, data = sample3)

Residuals:
Min 1Q Median 3Q Max
-19.9886 -8.5163 -0.7377 8.2180 21.5530

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.3947 0.7055 72.853 < 2e-16 ***
HW.Hours 1.7826 0.5811 3.068 0.00246 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.959 on 198 degrees of freedom
Multiple R-squared: 0.04538, Adjusted R-squared: 0.04055
F-statistic: 9.411 on 1 and 198 DF, p-value: 0.002458
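The t statistic and p-value in that table can also be pulled out programmatically via coef(summary(...)), which returns the coefficient matrix. A sketch with made-up data standing in for sample3:

```r
set.seed(1)
d <- data.frame(HW.Hours = rnorm(200))
d$StdMathScore <- 51 + 1.8 * d$HW.Hours + rnorm(200, sd = 10)
m <- lm(StdMathScore ~ HW.Hours, data = d)

ctab <- coef(summary(m))  # columns: Estimate, Std. Error, t value, Pr(>|t|)
tval <- ctab["HW.Hours", "t value"]
pval <- ctab["HW.Hours", "Pr(>|t|)"]

# The reported p-value is exactly the hand computation from earlier:
all.equal(pval, 2 * pt(-abs(tval), df = m$df.residual))  # TRUE
```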

> anova(model3)
Analysis of Variance Table

Response: StdMathScore
Df Sum Sq Mean Sq F value Pr(>F)
HW.Hours 1 933.5 933.5 9.4113 0.002458 **
Residuals 198 19638.9 99.2
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
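Note that for a one-predictor model the anova F test and the coefficient t test are the same test: F = t^2 (here 9.411 ≈ 3.068^2), so the p-values match. A quick check on made-up data:

```r
set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- 2 * d$x + rnorm(200)
m <- lm(y ~ x, data = d)

tval <- coef(summary(m))["x", "t value"]
Fval <- anova(m)["x", "F value"]
all.equal(Fval, tval^2)  # TRUE: with a single predictor, F = t^2
```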