
February 9, 2009

Feb. 9, 2009

> nels<-read.table("NELSsample.txt", header=T)
> head(nels)
HW.Hours StdMathScore
345 -0.3329931 42.432
759 -0.2136822 53.698
95 -1.0077991 49.205
325 0.2059000 53.698
355 -0.1177185 55.980
377 0.1413540 65.331
> attach(nels)
> model<-lm(StdMathScore~HW.Hours, data=nels)
> summary(model)

Call:
lm(formula = StdMathScore ~ HW.Hours, data = nels)

Residuals:
Min 1Q Median 3Q Max
-19.9886 -8.5163 -0.7377 8.2180 21.5530

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.3947 0.7055 72.853 < 2e-16 ***
HW.Hours 1.7826 0.5811 3.068 0.00246 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.959 on 198 degrees of freedom
Multiple R-squared: 0.04538, Adjusted R-squared: 0.04055
F-statistic: 9.411 on 1 and 198 DF, p-value: 0.002458

Y-hat = 51.39 + 1.78(X)

H-null (intercept): beta-sub-zero = 0
H-null (slope): beta-sub-one = 0

From the above output we can tell that HW.Hours accounts for only about 5% of the variation in Math Score.
It also tells us that our 95% margin of error is about +/- 19.92 points (9.96 * 2).
BETA-hat-sub-one = 1.78
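
These same quantities can be pulled straight from the fitted model object; a quick sketch using the model fit above:

# sketch: extract R-squared, residual SE, and the slope from the model above
s <- summary(model)
s$r.squared               # ~0.045, about 5% of the variation explained
2 * s$sigma               # ~19.9, the rough 95% margin of error (2 * residual SE)
coef(model)["HW.Hours"]   # ~1.78, the estimated slope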

Confidence interval for slope

The 95% refers to the method: we are using a procedure that captures the true parameter 95% of the time.
> confint(model)
2.5 % 97.5 %
(Intercept) 50.0035113 52.785848
HW.Hours 0.6367181 2.928483

Our interval estimate for the slope in this case runs from about 0.64 to 2.93.
Remember we have been talking about the confidence interval for the parameter.
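
The confint() output above can be reproduced by hand from the estimate, its standard error, and a t critical value; a rough sketch:

# by hand: estimate +/- t-critical * SE (should match confint(model))
b  <- coef(summary(model))["HW.Hours", "Estimate"]     # 1.7826
se <- coef(summary(model))["HW.Hours", "Std. Error"]   # 0.5811
b + c(-1, 1) * qt(0.975, df = 198) * se                # about 0.64 to 2.93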

Other confidence intervals in regression besides parameter estimates


If our end goal is to use the model to predict, we are probably more interested in a confidence interval for the predictions we can make from that model.
We can get an interval for a predicted individual value, or we can get an interval for the conditional mean (the mean of all points at a particular value of X).

Predicting the conditional mean


mu-sub-Y|X (the mean of Y at a given X)
predict(modelName,interval="confidence")
> predict(model,interval="confidence")
fit lwr upr
345 50.80109 49.33697 52.26520
759 51.01377 49.58705 52.44049
95 49.59818 47.73843 51.45792
325 51.76172 50.36448 53.15895
...
> model.predictions<-predict(model,interval="confidence")
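
For the interval around a predicted individual value (rather than the conditional mean), the same predict() call takes interval="prediction"; a sketch (the HW.Hours value of 1 is just an illustrative choice):

# prediction interval for an individual value: wider than the confidence interval
predict(model, interval = "prediction")
predict(model, newdata = data.frame(HW.Hours = 1), interval = "prediction")
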
Confidence Bands
> library(NCStats)
(startup messages from NCStats and its dependencies omitted)
> help(prediciton.plot)
No documentation for 'prediciton.plot' in specified packages and libraries:
you could try '??prediciton.plot'
> help(prediction.plot)

> prediction.plot(model,interval="confidence",newdata=nels)
obs HW.Hours StdMathScore fit lwr upr
345 1 -0.332993081 42.432 50.80109 49.33697 52.26520
759 2 -0.213682155 53.698 51.01377 49.58705 52.44049
95 3 -1.007799147 49.205 49.59818 47.73843 51.45792
325 4 0.205899984 53.698 51.76172 50.36448 53.15895
...
>
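
If the NCStats package isn't available, a similar confidence band can be drawn with base R alone; a rough sketch, assuming the model and nels objects from above:

# base-R sketch of a confidence band (no NCStats needed)
xs   <- seq(min(nels$HW.Hours), max(nels$HW.Hours), length.out = 100)
band <- predict(model, newdata = data.frame(HW.Hours = xs), interval = "confidence")
plot(StdMathScore ~ HW.Hours, data = nels)
lines(xs, band[, "fit"])
lines(xs, band[, "lwr"], lty = "dashed")
lines(xs, band[, "upr"], lty = "dashed")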

February 4, 2009

Feb. 4, 2009

> nels<-read.table("NELS.txt", header=T)
> head(nels)
HW.Hours StdMathScore
1 -0.04391888 59.514
2 0.26639017 47.954
3 0.30237000 42.799
4 0.28818846 49.205
5 0.64999216 52.519
6 0.23066355 44.493
> sample1<- read.table("Sample1.txt", header=T)
> head(sample1)
HW.Hours StdMathScore
483 0.03774998 56.053
138 0.31898249 44.345
908 1.61900567 59.514
689 -1.80284598 44.641
130 -0.70691857 45.892
189 1.09322485 61.871
> sample2<-read.table("Sample2.txt", header=T)
> head(sample2)
HW.Hours StdMathScore
776 0.18038788 42.211
805 0.20964010 58.704
653 -0.05089566 63.270
52 1.37044411 64.522
329 1.20164587 45.597
569 0.10342771 54.949
> library(lattice)
> xyplot(StdMathScore~HW.Hours, data=nels, type=c("p","r"))
> model<-lm(StdMathScore~HW.Hours, data=nels)
> coef(model)
(Intercept) HW.Hours
51.987840 1.374264
We are now treating all 887 students in the NELS data as the entire population.

Drawing a Random Sample


With Replacement
-We already have a random sample in "Sample1.txt"
-There is another random sample in "Sample2.txt"
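
If we wanted to draw more samples like these ourselves, something along these lines would do it (a sketch; the sample size of 200 and the name sample4 are just assumptions for illustration):

# sketch: draw another random sample of 200 rows, with replacement, from nels
rows    <- sample(nrow(nels), size = 200, replace = TRUE)
sample4 <- nels[rows, ]
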
> xyplot(StdMathScore~HW.Hours, data=sample1, type=c("p","r"))
> xyplot(sd(StdMathScore)/StdMathScore~sd(HW.Hours)/HW.Hours, data=sample1, type=c("p","r"))
> xyplot((sd(StdMathScore)/StdMathScore)~(sd(HW.Hours)/HW.Hours), data=sample1, type=c("p","r"))
> xyplot(StdMathScore/sd(StdMathScore)~HW.Hours/sd(HW.Hours), data=sample1, type=c("p","r"))
> xyplot(StdMathScore~HW.Hours, data=nels, type=c("p","r"))
> xyplot(StdMathScore~HW.Hours, data=sample1, type=c("p","r"))
> xyplot(StdMathScore~HW.Hours, data=sample2, type=c("p","r"))
> model1<-lm(StdMathScore~HW.Hours, data=sample1)
> model2<-lm(StdMathScore~HW.Hours, data=sample2)
> coef(model1)
(Intercept) HW.Hours
53.4549682 0.3746038
> coef(model2)
(Intercept) HW.Hours
51.6687436 -0.2858852
> plot(StdMathScore~HW.Hours, data=sample1)
> abline(model1)
> abline(model,lty="dotted")
> abline(model2,lty="solid", lwd="2")
-The only reason the slopes and intercepts differ across different random samples is sampling error.
-If we imagine drawing an infinite number of samples, the mean of all possible regression slopes equals the true population slope.
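
That claim can be checked by simulation: repeatedly sample from the "population" (nels), refit the regression, and average the slopes. A sketch (1000 replicates and a sample size of 200 are arbitrary choices):

# simulation sketch: the mean of many sample slopes should sit near the
# population slope (~1.374 from coef(model) above)
slopes <- replicate(1000, {
  idx <- sample(nrow(nels), size = 200, replace = TRUE)
  coef(lm(StdMathScore ~ HW.Hours, data = nels[idx, ]))["HW.Hours"]
})
mean(slopes)   # should be close to 1.374
hist(slopes)   # the sampling distribution of the slope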

Hypothesis testing


Hnull: beta-sub-1 = 0
Halt: beta-sub-1 != 0
-The basic idea is that we take a sample of data and ask, under the null hypothesis, how likely we would be to see a coefficient like the one in our sample.
> sample3<- read.table("NELSSample.txt", header=T)
> head(sample3)
HW.Hours StdMathScore
345 -0.3329931 42.432
759 -0.2136822 53.698
95 -1.0077991 49.205
325 0.2059000 53.698
355 -0.1177185 55.980
377 0.1413540 65.331
> model3<-lm(StdMathScore~HW.Hours, data=sample3)
> coef(model3)
(Intercept) HW.Hours
51.394680 1.782601
t statistic (one-sample case): t = (Ybar - HypValue)/(s/sqrt(n))
in regression: t = (beta-hat-sub-one - 0)/(standard error of beta-hat-sub-one)
for our data n = 200, t = (1.78 - 0)/.5811
> (1.78-0)/.5811
[1] 3.063156
The p-value comes from pt(-t, df), the cumulative probability in the lower tail of a t distribution; we use the negative of the t statistic, and double the result for a two-sided test.
> pt(-3.07,198)
[1] 0.001220481
> 2*pt(-3.07,198)
[1] 0.002440963
p = .002
It is therefore unlikely that the true regression slope is zero, so we have evidence of some relationship.
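
The same t statistic and two-sided p-value can be read (or recomputed) directly from the coefficient table; a sketch:

# pull t from the coefficient table and recompute the two-sided p-value
ct <- coef(summary(model3))
t  <- ct["HW.Hours", "t value"]             # ~3.068
2 * pt(-abs(t), df = df.residual(model3))   # ~0.00246, matches Pr(>|t|)
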
summary(modelName)
> summary(model3)

Call:
lm(formula = StdMathScore ~ HW.Hours, data = sample3)

Residuals:
Min 1Q Median 3Q Max
-19.9886 -8.5163 -0.7377 8.2180 21.5530

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.3947 0.7055 72.853 < 2e-16 ***
HW.Hours 1.7826 0.5811 3.068 0.00246 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.959 on 198 degrees of freedom
Multiple R-squared: 0.04538, Adjusted R-squared: 0.04055
F-statistic: 9.411 on 1 and 198 DF, p-value: 0.002458

> anova(model3)
Analysis of Variance Table

Response: StdMathScore
Df Sum Sq Mean Sq F value Pr(>F)
HW.Hours 1 933.5 933.5 9.4113 0.002458 **
Residuals 198 19638.9 99.2
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>

February 2, 2009

Feb. 2, 2009


burt <- read.table("burt.txt", header = T)

attach(burt)

model<-lm(FostIQ~OwnIQ)

Regression Lingo Alert



Regress the outcome variable (Y) on the predictor variable (X): we regress Y on X.

coef(model)
(Intercept) OwnIQ
9.719491 0.907920

anova(model)
Analysis of Variance Table

Response: FostIQ
Df Sum Sq Mean Sq F value Pr(>F)
OwnIQ 1 9250.7 9250.7 169.42 < 2.2e-16 ***
Residuals 51 2784.7 54.6
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Fitted Model



-Drops error from the regression equation.

Y-hat-sub-i = (Intercept) + Slope*X-sub-i

(The regression equation includes the error term, so its Y is an observed value; in the fitted model the Y is an estimate and needs the hat.)
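
In R, fitted() returns exactly these Y-hat values; a quick sketch comparing them with the observed scores:

# fitted (Y-hat) values from the model, next to the observed FostIQ scores
head(fitted(model))   # points on the line: intercept + slope * OwnIQ
head(burt$FostIQ)     # observed values; the differences are the residuals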

library(lattice)

xyplot(FostIQ~OwnIQ,type=c("p","r"))

R^2



R^2 = SSmodel/SStotal

R^2 = 9251/12035 = .769

*76.9% of the variation in FostIQ is accounted for by variation in OwnIQ; 23.1% is not.

We cannot tell how the unexplained variation is split among other systematic components, measurement error, and individual variation.
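
R^2 can also be recomputed from the ANOVA sums of squares; a quick sketch:

# R^2 = SSmodel / SStotal, taken from the anova() table
a <- anova(model)
a["OwnIQ", "Sum Sq"] / sum(a[, "Sum Sq"])   # ~0.769
summary(model)$r.squared                    # same value from summary()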

Estimated Residual Variance



What is the variance of the scores around the conditional mean at each point on the line? (Remember that the line represents the mean of the conditional distribution of Y at each value of X.)

sigma-hat^2-sub-Y|X = SSresiduals/(n - number of parameters in the equation)

sigma-hat^2-sub-Y|X = 2785/(53 - 2) = 2785/51 ≈ 54.6

SD = sqrt(sigma-hat^2-sub-Y|X) = sqrt(54.6)

sqrt(54.6)
[1] 7.389181
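
The same residual variance and SD can be taken from the model object; a sketch:

# residual variance = SSresiduals / (n - 2); its square root is the residual SE
sum(resid(model)^2) / df.residual(model)   # ~54.6
summary(model)$sigma                       # ~7.39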

We can therefore be roughly 95% sure (about 2 SDs, or +/- 14.8 points) that an individual value will fall within about 15 points on either side of any particular predicted value.

9.72+.91*75
[1] 77.97

77.97-14.8
[1] 63.17

77.97+14.8
[1] 92.77

So for an OwnIQ score of 75 we would expect, with roughly 95% confidence, the FostIQ score to fall somewhere between about 63.2 and 92.8.
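
predict() gives this interval directly (and uses the exact standard error rather than the rough 2-SD rule); a sketch:

# prediction interval for an individual foster twin with OwnIQ = 75
predict(model, newdata = data.frame(OwnIQ = 75), interval = "prediction")
# roughly: fit near 78, with limits in the neighborhood of 63 and 93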