> nels<-read.table("NELSsample.txt", header=T)

> head(nels)

HW.Hours StdMathScore

345 -0.3329931 42.432

759 -0.2136822 53.698

95 -1.0077991 49.205

325 0.2059000 53.698

355 -0.1177185 55.980

377 0.1413540 65.331

> attach(nels)

> model<-lm(StdMathScore~HW.Hours, data=nels)

> summary(model)

Call:

lm(formula = StdMathScore ~ HW.Hours, data = nels)

Residuals:

Min 1Q Median 3Q Max

-19.9886 -8.5163 -0.7377 8.2180 21.5530

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 51.3947 0.7055 72.853 < 2e-16 ***

HW.Hours 1.7826 0.5811 3.068 0.00246 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.959 on 198 degrees of freedom

Multiple R-squared: 0.04538, Adjusted R-squared: 0.04055

F-statistic: 9.411 on 1 and 198 DF, p-value: 0.002458

Y-hat = 51.39 + 1.78(x)

H-naught (intercept): beta-sub-zero = 0

H-naught (slope): beta-sub-one = 0

From the above output we can tell that HW.Hours accounts for only about 5% of the variation in Math Score.

It also tells us that our 95% margin of error for an individual prediction is about +/- 19.92 (9.96*2) points.

BETA-hat-sub-one = 1.78

By reporting a 95% confidence interval, we are saying that we used a method that captures the true parameter 95% of the time.

> confint(model)

2.5 % 97.5 %

(Intercept) 50.0035113 52.785848

HW.Hours 0.6367181 2.928483

Our interval estimate for the slope in this case runs from 0.64 to 2.93.
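The confint() result can be reproduced by hand from the summary output: estimate +/- t* x SE, where t* is the 0.975 quantile of a t distribution with 198 df. A quick sketch using the values reported above:

```r
# 95% CI for the HW.Hours slope, built from the summary(model) output above
est   <- 1.7826                  # slope estimate
se    <- 0.5811                  # its standard error
tcrit <- qt(0.975, df = 198)     # critical t value, about 1.97
c(lower = est - tcrit * se, upper = est + tcrit * se)
```

This matches confint(model) to rounding: roughly 0.64 to 2.93.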

**Remember we have been talking about the confidence interval for the parameter.**

If our end goal is to use the model to predict, we are probably more interested in a confidence interval for the prediction we can make from that model.

We can get a confidence interval for the predicted individual value or we can get the interval for the conditional mean (mean of all points at a particular measurement).

mu-sub-Y|X (the mean of Y at a given value of X)

predict(modelName,interval="confidence")

> predict(model,interval="confidence")

fit lwr upr

345 50.80109 49.33697 52.26520

759 51.01377 49.58705 52.44049

95 49.59818 47.73843 51.45792

325 51.76172 50.36448 53.15895

...
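For predicting an individual student's score, rather than the conditional mean, interval="prediction" is the option we want; those intervals are much wider because they add in the residual scatter. Since the NELS data file isn't reproduced here, a self-contained sketch with simulated data of roughly the same shape (n = 200, intercept, slope, and residual SD near the fitted values above) shows the difference:

```r
# Simulated stand-in for the NELS sample (the real data file isn't included here)
set.seed(1)
x <- rnorm(200)
y <- 51.4 + 1.78 * x + rnorm(200, sd = 9.96)
fit <- lm(y ~ x)
new <- data.frame(x = 0)

predict(fit, newdata = new, interval = "confidence")  # narrow: mean of all students at x = 0
predict(fit, newdata = new, interval = "prediction")  # wide: a single new student at x = 0
```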

> model.predictions<-predict(model,interval="confidence")

> library(NCStats)

(dependency loading and masking messages omitted)

##########################################

## NCStats package by Derek H. Ogle ##

## type ?NCStats for documentation. ##

##########################################

Attaching package: 'NCStats'

> help(prediciton.plot)

No documentation for 'prediciton.plot' in specified packages and libraries:

you could try '??prediciton.plot'

> help(prediction.plot)

> prediction.plot(model,interval="confidence",newdata=nels)

obs HW.Hours StdMathScore fit lwr upr

345 1 -0.332993081 42.432 50.80109 49.33697 52.26520

759 2 -0.213682155 53.698 51.01377 49.58705 52.44049

95 3 -1.007799147 49.205 49.59818 47.73843 51.45792

325 4 0.205899984 53.698 51.76172 50.36448 53.15895

...

>

> head(nels)

HW.Hours StdMathScore

1 -0.04391888 59.514

2 0.26639017 47.954

3 0.30237000 42.799

4 0.28818846 49.205

5 0.64999216 52.519

6 0.23066355 44.493

> sample1<- read.table("Sample1.txt", header=T)

> head(sample1)

HW.Hours StdMathScore

483 0.03774998 56.053

138 0.31898249 44.345

908 1.61900567 59.514

689 -1.80284598 44.641

130 -0.70691857 45.892

189 1.09322485 61.871

> sample2<-read.table("Sample2.txt", header=T)

> head(sample2)

HW.Hours StdMathScore

776 0.18038788 42.211

805 0.20964010 58.704

653 -0.05089566 63.270

52 1.37044411 64.522

329 1.20164587 45.597

569 0.10342771 54.949

> library(lattice)

> xyplot(StdMathScore~HW.Hours, data=nels, type=c("p","r"))

> model<-lm(StdMathScore~HW.Hours, data=nels)

> coef(model)

(Intercept) HW.Hours

51.987840 1.374264

-We already have a random sample in "Sample1.txt"

-There is another random sample in "Sample2.txt"

> xyplot(StdMathScore~HW.Hours, data=sample1, type=c("p","r"))

> xyplot(sd(StdMathScore)/StdMathScore~sd(HW.Hours)/HW.Hours, data=sample1, type=c("p","r"))

> xyplot((sd(StdMathScore)/StdMathScore)~(sd(HW.Hours)/HW.Hours), data=sample1, type=c("p","r"))

> xyplot(StdMathScore/sd(StdMathScore)~HW.Hours/sd(HW.Hours), data=sample1, type=c("p","r"))

> xyplot(StdMathScore~HW.Hours, data=nels, type=c("p","r"))

> xyplot(StdMathScore~HW.Hours, data=sample1, type=c("p","r"))

> xyplot(StdMathScore~HW.Hours, data=sample2, type=c("p","r"))

> model1<-lm(StdMathScore~HW.Hours, data=sample1)

> model2<-lm(StdMathScore~HW.Hours, data=sample2)

> coef(model1)

(Intercept) HW.Hours

53.4549682 0.3746038

> coef(model2)

(Intercept) HW.Hours

51.6687436 -0.2858852

> plot(StdMathScore~HW.Hours, data=sample1)

> abline(model1)

> abline(model,lty="dotted")

> abline(model2,lty="solid", lwd="2")

-The only reason the slopes and intercepts differ across equations fit to different random samples is sampling error.

-If we imagine an infinite number of samples, the mean of all possible regression slopes is equal to the true population slope.
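That claim can be checked by simulation: pick a population with a known slope, draw many samples, fit the regression each time, and average the slopes. The population values below (slope 1.5, n = 50 per sample) are made up for illustration:

```r
# Simulate the sampling distribution of the regression slope
set.seed(42)
true.slope <- 1.5
slopes <- replicate(2000, {
  x <- rnorm(50)
  y <- 50 + true.slope * x + rnorm(50, sd = 10)
  coef(lm(y ~ x))["x"]
})
mean(slopes)   # close to the true slope of 1.5
sd(slopes)     # the sampling variability of the slope
```

The sd of the simulated slopes is the quantity that summary() reports as the coefficient's standard error.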

Hnull: beta-sub-1 = 0

Halt: beta-sub-1 != 0

-The basic idea is that you take a sample of data and ask, under the null hypothesis, how likely it is that we would see the coefficient we got from our sample.

> sample3<- read.table("NELSSample.txt", header=T)

> head(sample3)

HW.Hours StdMathScore

345 -0.3329931 42.432

759 -0.2136822 53.698

95 -1.0077991 49.205

325 0.2059000 53.698

355 -0.1177185 55.980

377 0.1413540 65.331

> model3<-lm(StdMathScore~HW.Hours, data=sample3)

> coef(model3)

(Intercept) HW.Hours

51.394680 1.782601

t statistic: t = (Ybar - hypothesized value)/(s/n^(1/2))

In regression: t = (beta-hat-sub-one - 0)/(standard error of beta-hat-sub-one)

For our data, n = 200: t = (1.78 - 0)/.5811

> (1.78-0)/.5811

[1] 3.063156

The p-value for a regression coefficient = 2*pt(-|t|, df): the cumulative (lower-tail) area of a t distribution at the negative of the t statistic, doubled for the two-sided test.

> pt(-3.07,198)

[1] 0.001220481

> 2*pt(-3.07,198)

[1] 0.002440963

p = .002

Since p is so small, it is unlikely we would see a slope this large if the true slope were zero, so we conclude there is some form of relationship.

summary(modelName)

> summary(model3)

Call:

lm(formula = StdMathScore ~ HW.Hours, data = sample3)

Residuals:

Min 1Q Median 3Q Max

-19.9886 -8.5163 -0.7377 8.2180 21.5530

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 51.3947 0.7055 72.853 < 2e-16 ***

HW.Hours 1.7826 0.5811 3.068 0.00246 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.959 on 198 degrees of freedom

Multiple R-squared: 0.04538, Adjusted R-squared: 0.04055

F-statistic: 9.411 on 1 and 198 DF, p-value: 0.002458

> anova(model3)

Analysis of Variance Table

Response: StdMathScore

Df Sum Sq Mean Sq F value Pr(>F)

HW.Hours 1 933.5 933.5 9.4113 0.002458 **

Residuals 198 19638.9 99.2

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
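For a simple regression the ANOVA table and the coefficient t test are the same test: the F statistic is the square of the slope's t statistic (here F = 9.411, and 3.068^2 is about 9.41), and the p-values agree. A quick check:

```r
t.stat <- 3.068                       # slope t value from summary(model3)
t.stat^2                              # about 9.41, the F value in the ANOVA table
2 * pt(-t.stat, df = 198)             # two-sided t-test p-value
pf(t.stat^2, df1 = 1, df2 = 198, lower.tail = FALSE)  # F-test p-value: identical
```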

>


> burt <- read.table("burt.txt", header = T)

> attach(burt)

> model<-lm(FostIQ~OwnIQ)

> coef(model)

(Intercept) OwnIQ

9.719491 0.907920

> anova(model)

Analysis of Variance Table

Response: FostIQ

Df Sum Sq Mean Sq F value Pr(>F)

OwnIQ 1 9250.7 9250.7 169.42 < 2.2e-16 ***

Residuals 51 2784.7 54.6

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

-The fitted model drops the error term from the regression equation.

Y-hat-sub-i = (Intercept) + Slope*X-sub-i

(The regression equation includes an error term, so its Y variable is an observed value, not an estimate; the fitted model drops the error term, so Y-hat is an estimate.)

library(lattice)

xyplot(FostIQ~OwnIQ,type=c("p","r"))

R^2 = SSmodel/SStotal

R^2 = 9251/12035 = .769

*76.9% of the variation in FostIQ is accounted for by variation in OwnIQ, and 23.1% is not.

We cannot tell how the unexplained variation divides among other systematic components, measurement error, and individual variation.
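The R^2 arithmetic above can be checked directly from the ANOVA table's sums of squares:

```r
# R^2 from the anova(model) output above
ss.model    <- 9250.7
ss.residual <- 2784.7
ss.total    <- ss.model + ss.residual   # 12035.4
ss.model / ss.total                     # about 0.769, the R^2
```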

What is the variance of the scores around the line at each point?

sigma-hat^2-sub-Y|X = SSresiduals/(n - number of parameters in the equation)

sigma-hat^2-sub-Y|X = 2785/(53 - 2) = 54.6

SD = sqrt(sigma-hat^2-sub-Y|X)

> sqrt(54.6)

[1] 7.389181

We can therefore be about 95% sure (2 SDs) that predicted values will fall within about 15 points on either side of any particular point estimate.

9.72+.91*75

[1] 77.97

77.97-14.8

[1] 63.17

77.97+14.8

[1] 92.77

So for an OwnIQ score of 75 we would expect the FostIQ score to be somewhere between 63.2 and 92.8.
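The "2 SDs" multiplier is an approximation; the exact multiplier is the 0.975 t quantile with 51 df, and predict(model, newdata = data.frame(OwnIQ = 75), interval = "prediction") would widen the band slightly further to account for uncertainty in the line itself. The rough arithmetic, collected in one place (all values from above):

```r
rse    <- sqrt(54.6)              # residual SD, about 7.39
est.75 <- 9.72 + 0.91 * 75        # point estimate at OwnIQ = 75, i.e. 77.97
tcrit  <- qt(0.975, df = 51)      # about 2.01, the exact value behind the "2 SDs" rule
c(est.75 - tcrit * rse, est.75 + tcrit * rse)   # roughly 63.1 to 92.8
```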


> model$residuals

(values for observations 1-36 omitted)

37 38 39 40 41 42

3.94891354 1.04099358 2.04099358 1.13307363 -5.86692637 -12.77484633

43 44 45 46 47 48

-12.49860620 4.59347385 -9.22236607 11.77763393 -6.13028602 0.96179402

49 50 51 52 53

-0.85404589 -1.57780576 4.79051441 -9.84116541 3.34299467

> model$fitted.values

1 2 3 4 5 6 7 8

71.45805 74.18181 75.99765 77.81349 80.53725 81.44517 83.26101 84.16893

9 10 11 12 13 14 15 16

84.16893 85.07685 86.89269 87.80061 88.70853 88.70853 90.52437 91.43229

17 18 19 20 21 22 23 24

92.34021 92.34021 93.24813 93.24813 94.15605 94.15605 94.15605 95.06397

25 26 27 28 29 30 31 32

95.97189 96.87981 96.87981 97.78773 97.78773 97.78773 97.78773 99.60357

33 34 35 36 37 38 39 40

100.51149 101.41941 102.32733 103.23525 105.05109 105.95901 105.95901 106.86693

41 42 43 44 45 46 47 48

106.86693 107.77485 110.49861 111.40653 113.22237 113.22237 114.13029 115.03821

49 50 51 52 53

116.85405 119.57781 123.20949 126.84117 128.65701

---Model fit can be defined as the proportion of the total variation explained by the model (the variation explained by the model divided by the total variation).

---SSmodel/SStotal (Sum of Squares model / Sum of Squares total) is the explained proportion; SSresidual/SStotal (Sum of Squares residuals / Sum of Squares total) is the unexplained proportion.

> anova(model)

Analysis of Variance Table

Response: FostIQ

Df Sum Sq Mean Sq F value Pr(>F)

OwnIQ 1 9250.7 9250.7 169.42 < 2.2e-16 ***

Residuals 51 2784.7 54.6

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

SSmodel = 9250.7 and SSresidual = 2784.7

>

-In deterministic models there is no error; a purely mathematical model is exact.

-We are using models that allow for error and use probability

-Allow for other systematic components that in many cases are not included or were not measured.

-Allow for measurement error (especially in the social sciences)

-Allow for individual variation within the unit of analysis that we are analyzing.

1. Identify systematic components

2. Assess the model fit (looking at residuals [good model = smaller residuals])

...fitting model to sample data

-Causality can only be determined through the design of the study, not through the analysis.

-Analysis discovers associations, correlations or covariation.

> burt <- read.table("burt.txtv", header = T)

> burt <- read.table("burt.txt", header = T)

> head(burt)

ID OwnIQ FostIQ

1 1 68 63

2 2 71 76

3 3 73 77

4 4 75 72

5 5 78 71

6 6 79 75

> attach(burt)

**Always a good idea to begin an analysis with a descriptive analysis and plots.**

> library(psych)

package 'psych' successfully unpacked and MD5 sums checked

> library(psych)

> describe(OwnIQ)

var n mean sd median trimmed mad min max range skew kurtosis se

1 1 53 97.36 14.69 96 97 14.83 68 131 63 0.24 -0.47 2.02

> describe(FostIQ)

var n mean sd median trimmed mad min max range skew kurtosis se

1 1 53 98.11 15.21 97 98.21 16.31 63 132 69 -0.02 -0.5 2.09

**Remember from 8261: a kernel density plot is better than a histogram.**

-There are actually some better ways to plot now than using the base plot command: plot()

> library(lattice)

-the "lattice" library has a LOT of plot styles

A density plot in lattice is -> densityplot(variable, kernel="e") (the argument name is lowercase "kernel"; "e" abbreviates the Epanechnikov kernel)

> densityplot(OwnIQ,Kernel="e")

-all of the "lattice" graphics functions allow you to enter a formula

> densityplot(~OwnIQ,data=burt,Kernel="e")

> densityplot(FostIQ,Kernel="e")

> densityplot(~FostIQ,data=burt,Kernel="e")

-in "lattice" library histogram ->histogram()

-in "lattice" library boxplot -> bwplot()

> bwplot(OwnIQ)

> histogram(OwnIQ)

> densityplot(OwnIQ,Kernel="e")

-in "lattice" scatter plot -> xyplot(formula, type="p")

-formula -> Y~X, where Y will be plotted on the y-axis and X on the x-axis

> xyplot(FostIQ~OwnIQ,type="p")

1. What is the direction of the relationship?

2. What is the type of relationship? (Is it linear?)

3. What is the strength of the relationship? (Are the points close or far from the line?)

4. What is the magnitude of the relationship? (Line slope?)

5. Are there any unusual observations? (not necessarily outlier)

-In a deterministic relationship all data points fall on the line, but there are many sources of error in social science, so our plots start to look like clouds.
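The direction and strength questions can be summarized in one number, the correlation coefficient r, which is the signed square root of R^2 (positive here because the slope is positive). A sketch using the Burt sums of squares from above; with the data attached, cor(OwnIQ, FostIQ) would give the same value:

```r
# r recovered from the ANOVA sums of squares for the Burt fit
r2 <- 9250.7 / (9250.7 + 2784.7)   # R^2, about 0.769
sqrt(r2)                           # r, about 0.88: a strong positive linear relationship
```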

>
