## February 9, 2009

### Feb. 9, 2009

HW.Hours StdMathScore
345 -0.3329931 42.432
759 -0.2136822 53.698
95 -1.0077991 49.205
325 0.2059000 53.698
355 -0.1177185 55.980
377 0.1413540 65.331
> attach(nels)
> model<-lm(StdMathScore~HW.Hours, data=nels)
> summary(model)

Call:
lm(formula = StdMathScore ~ HW.Hours, data = nels)

Residuals:
Min 1Q Median 3Q Max
-19.9886 -8.5163 -0.7377 8.2180 21.5530

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.3947 0.7055 72.853 < 2e-16 ***
HW.Hours 1.7826 0.5811 3.068 0.00246 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.959 on 198 degrees of freedom
Multiple R-squared: 0.04538, Adjusted R-squared: 0.04055
F-statistic: 9.411 on 1 and 198 DF, p-value: 0.002458

Y-hat = 51.39 + 1.78(x)

Hnull (intercept): Beta-sub-zero = 0
Hnull (coefficient): Beta-sub-one = 0

From the above output we can tell that HW.Hours accounts for only about 5% of the variation in Math Score (Multiple R-squared = 0.04538).
It also tells us that the residual standard error is 9.96, so roughly 95% of observed scores fall within about +/- 19.92 (9.96*2) points of the fitted line.
BETA-hat-sub-one = 1.78

## Confidence interval for slope

When we report a 95% confidence interval we are saying that we used a method that captures the true parameter in 95% of repeated samples.
> confint(model)
2.5 % 97.5 %
(Intercept) 50.0035113 52.785848
HW.Hours 0.6367181 2.928483

Our interval estimate for the slope in this case is anywhere from 0.64 to 2.93.
Remember that so far we have been talking about the confidence interval for a parameter (here, the population slope).
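The interval that confint() reports can be reproduced by hand from the summary() output. The sketch below plugs in the slope estimate (1.7826), its standard error (0.5811), and the residual degrees of freedom (198) from that output; the variable names are only for illustration.

```r
# Values copied from the summary(model) output in these notes
b1    <- 1.7826   # estimated slope for HW.Hours
se_b1 <- 0.5811   # standard error of the slope
df    <- 198      # residual degrees of freedom (n - 2)

# 95% interval: estimate +/- t critical value * standard error
t_crit   <- qt(0.975, df)
ci_slope <- c(lower = b1 - t_crit * se_b1,
              upper = b1 + t_crit * se_b1)
round(ci_slope, 2)
```

Up to rounding of the inputs, this should agree with the HW.Hours row printed by confint(model).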

## Other confidence intervals in regression besides parameter estimates

If our end goal is to use the model to predict, we are probably more interested in a confidence interval for the prediction that we can make based on that model.
We can get an interval for the predicted individual value, or we can get the interval for the conditional mean (the mean of all points at a particular value of X).
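To make the difference between the two intervals concrete, here is a minimal sketch using simulated data. The data frame `sim`, its coefficients, and the chosen X values are assumptions for illustration; this is not the NELS sample.

```r
# Simulated stand-in data (NOT the NELS sample); names and values are illustrative
set.seed(1)
sim <- data.frame(HW.Hours = rnorm(200))
sim$StdMathScore <- 51 + 1.8 * sim$HW.Hours + rnorm(200, sd = 10)

fit   <- lm(StdMathScore ~ HW.Hours, data = sim)
new_x <- data.frame(HW.Hours = c(-1, 0, 1))

# Interval for the conditional mean of Y at each X
ci_mean  <- predict(fit, newdata = new_x, interval = "confidence")
# Interval for a single new individual's Y at each X
pi_indiv <- predict(fit, newdata = new_x, interval = "prediction")

data.frame(new_x,
           fit      = ci_mean[, "fit"],
           mean_lwr = ci_mean[, "lwr"],  mean_upr = ci_mean[, "upr"],
           ind_lwr  = pi_indiv[, "lwr"], ind_upr  = pi_indiv[, "upr"])
```

Both intervals are centered on the same fitted value, but the individual interval is wider because it also has to cover the scatter of individuals around the conditional mean.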

### predicting the conditional mean

mu-sub-Y|X (the mean of Y conditioned on X)
predict(modelName,interval="confidence")
> predict(model,interval="confidence")
fit lwr upr
345 50.80109 49.33697 52.26520
759 51.01377 49.58705 52.44049
95 49.59818 47.73843 51.45792
325 51.76172 50.36448 53.15895
...
> model.predictions<-predict(model,interval="confidence")
Confidence Bands
> library(NCStats)

> help(prediction.plot)

> prediction.plot(model,interval="confidence",newdata=nels)
obs HW.Hours StdMathScore fit lwr upr
345 1 -0.332993081 42.432 50.80109 49.33697 52.26520
759 2 -0.213682155 53.698 51.01377 49.58705 52.44049
95 3 -1.007799147 49.205 49.59818 47.73843 51.45792
325 4 0.205899984 53.698 51.76172 50.36448 53.15895
...
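prediction.plot() comes from the NCStats package. Where that package is not available, a similar picture can be sketched with base R by evaluating predict() on a grid of X values. The example below uses simulated data (not the NELS sample), and all object names are assumptions for illustration.

```r
# Simulated stand-in data (NOT the NELS sample)
set.seed(2)
sim <- data.frame(HW.Hours = rnorm(200))
sim$StdMathScore <- 51 + 1.8 * sim$HW.Hours + rnorm(200, sd = 10)
fit <- lm(StdMathScore ~ HW.Hours, data = sim)

# Evaluate the conditional-mean interval on an evenly spaced grid of X values
grid <- data.frame(HW.Hours = seq(min(sim$HW.Hours), max(sim$HW.Hours),
                                  length.out = 100))
band <- predict(fit, newdata = grid, interval = "confidence")

# Scatter plot, fitted line, and dashed lower/upper confidence bands
plot(StdMathScore ~ HW.Hours, data = sim)
lines(grid$HW.Hours, band[, "fit"])
lines(grid$HW.Hours, band[, "lwr"], lty = "dashed")
lines(grid$HW.Hours, band[, "upr"], lty = "dashed")
```

The bands are narrowest near the mean of HW.Hours and flare out toward the extremes, which is why predictions far from the center of the data are less certain.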

## February 4, 2009

### Feb. 4, 2009

HW.Hours StdMathScore
1 -0.04391888 59.514
2 0.26639017 47.954
3 0.30237000 42.799
4 0.28818846 49.205
5 0.64999216 52.519
6 0.23066355 44.493
HW.Hours StdMathScore
483 0.03774998 56.053
138 0.31898249 44.345
908 1.61900567 59.514
689 -1.80284598 44.641
130 -0.70691857 45.892
189 1.09322485 61.871
HW.Hours StdMathScore
776 0.18038788 42.211
805 0.20964010 58.704
653 -0.05089566 63.270
52 1.37044411 64.522
329 1.20164587 45.597
569 0.10342771 54.949
> library(lattice)
> xyplot(StdMathScore~HW.Hours, data=nels, type=c("p","r"))
> model<-lm(StdMathScore~HW.Hours, data=nels)
> coef(model)
(Intercept) HW.Hours
51.987840 1.374264
We are assuming now that the 887 students in the NELS data are the entire population.

## Drawing a Random Sample

With Replacement
-We already have a random sample in "Sample1.txt"
-There is another random sample in "Sample2.txt"
> xyplot(StdMathScore~HW.Hours, data=sample1, type=c("p","r"))
> xyplot(sd(StdMathScore)/StdMathScore~sd(HW.Hours)/HW.Hours, data=sample1, type=c("p","r"))
> xyplot((sd(StdMathScore)/StdMathScore)~(sd(HW.Hours)/HW.Hours), data=sample1, type=c("p","r"))
> xyplot(StdMathScore/sd(StdMathScore)~HW.Hours/sd(HW.Hours), data=sample1, type=c("p","r"))
> xyplot(StdMathScore~HW.Hours, data=nels, type=c("p","r"))
> xyplot(StdMathScore~HW.Hours, data=sample1, type=c("p","r"))
> xyplot(StdMathScore~HW.Hours, data=sample2, type=c("p","r"))
> model1<-lm(StdMathScore~HW.Hours, data=sample1)
> model2<-lm(StdMathScore~HW.Hours, data=sample2)
> coef(model1)
(Intercept) HW.Hours
53.4549682 0.3746038
> coef(model2)
(Intercept) HW.Hours
51.6687436 -0.2858852
> plot(StdMathScore~HW.Hours, data=sample1)
> abline(model1)
> abline(model,lty="dotted")
> abline(model2,lty="solid", lwd=2)
-The only reason that the slopes and intercepts of the equations differ across random samples is sampling error.
-If we assume an infinite number of samples, then the mean of all possible regression slopes is equal to the true population slope.
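The two points above can be checked with a small simulation. This is a sketch under stated assumptions: the "population" below is simulated with 887 rows to mirror the setup, not the actual NELS data, and the helper `one_slope()` is a hypothetical name.

```r
# Simulated "population" of 887 cases (NOT the actual NELS data)
set.seed(3)
pop <- data.frame(HW.Hours = rnorm(887))
pop$StdMathScore <- 52 + 1.4 * pop$HW.Hours + rnorm(887, sd = 10)
pop_slope <- unname(coef(lm(StdMathScore ~ HW.Hours, data = pop))["HW.Hours"])

# Draw one random sample of n rows with replacement and return its slope
one_slope <- function(n = 200) {
  rows <- sample(nrow(pop), n, replace = TRUE)
  unname(coef(lm(StdMathScore ~ HW.Hours, data = pop[rows, ]))["HW.Hours"])
}

# Repeat many times to approximate the sampling distribution of the slope
slopes <- replicate(2000, one_slope())
c(population = pop_slope, mean_of_samples = mean(slopes), sd_of_samples = sd(slopes))
```

Individual sample slopes vary from draw to draw, but their average sits close to the population slope. The spread of the sample slopes is what the standard error in summary() estimates.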

## Hypothesis testing

Hnull: beta-sub-1 = 0
Halt: beta-sub-1 != 0
-the basic idea is that you get a sample of data and ask, under the NULL HYPOTHESIS, how likely it is that we would see a coefficient at least as extreme as the one from our sample.
HW.Hours StdMathScore
345 -0.3329931 42.432
759 -0.2136822 53.698
95 -1.0077991 49.205
325 0.2059000 53.698
355 -0.1177185 55.980
377 0.1413540 65.331
> model3<-lm(StdMathScore~HW.Hours, data=sample3)
> coef(model3)
(Intercept) HW.Hours
51.394680 1.782601
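t statistic for a mean: t = (Ybar - HypValue)/(s/n^(1/2)), where s is the sample standard deviation
in regression t = (BETA-hat-sub-one - 0)/Standard Error of BETA-hat-sub-one
for our data n = 200, t = (1.78 - 0)/.5811
> (1.78-0)/.5811
[1] 3.063156
p value for a regression = 2*pt(-|t|,df), where pt() gives the cumulative probability in a t distribution and df = n - 2 = 198. We use the negative value of the t stat to get the lower tail, then double it for a two-sided test.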
> pt(-3.07,198)
[1] 0.001220481
> 2*pt(-3.07,198)
[1] 0.002440963
p = .002
Because p < .05 we reject the null hypothesis: a slope this large would be very unlikely if the true regression slope were zero, so we have evidence of some form of linear relationship.
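The hand calculation above can be collected into a few lines. The sketch uses the unrounded slope (1.7826) and standard error (0.5811) from the coefficient table rather than the rounded 1.78, so the t value comes out slightly closer to the one summary() prints; the variable names are illustrative.

```r
# Values copied from the coefficient output for sample3
b1    <- 1.7826   # estimated slope
se_b1 <- 0.5811   # standard error of the slope
n     <- 200      # sample size
df    <- n - 2    # two estimated parameters: intercept and slope

t_stat  <- (b1 - 0) / se_b1           # test statistic under Hnull: slope = 0
p_value <- 2 * pt(-abs(t_stat), df)   # two-sided p value
c(t = t_stat, p = p_value)
```

Using -abs(t_stat) keeps the calculation correct even when the estimated slope is negative.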
summary(modelName)
> summary(model3)

Call:
lm(formula = StdMathScore ~ HW.Hours, data = sample3)

Residuals:
Min 1Q Median 3Q Max
-19.9886 -8.5163 -0.7377 8.2180 21.5530

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.3947 0.7055 72.853 < 2e-16 ***
HW.Hours 1.7826 0.5811 3.068 0.00246 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.959 on 198 degrees of freedom
Multiple R-squared: 0.04538, Adjusted R-squared: 0.04055
F-statistic: 9.411 on 1 and 198 DF, p-value: 0.002458

> anova(model3)
Analysis of Variance Table

Response: StdMathScore
Df Sum Sq Mean Sq F value Pr(>F)
HW.Hours 1 933.5 933.5 9.4113 0.002458 **
Residuals 198 19638.9 99.2
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
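The numbers in the anova() table tie back to the summary() output. As a rough check, the sketch below recomputes the F statistic, R-squared, and the residual standard error from the sums of squares printed above; the variable names are illustrative.

```r
# Values copied from the anova(model3) table
ss_model <- 933.5
ss_resid <- 19638.9
df_model <- 1
df_resid <- 198

ms_model  <- ss_model / df_model
ms_resid  <- ss_resid / df_resid              # estimated residual variance
f_stat    <- ms_model / ms_resid              # F = MS model / MS residual
r_squared <- ss_model / (ss_model + ss_resid) # proportion of variation explained
p_value   <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)

round(c(F = f_stat, R2 = r_squared, resid_se = sqrt(ms_resid), p = p_value), 4)
```

With a single predictor the F statistic is the square of the slope's t statistic (3.068^2 is about 9.41), which is why the two tests report the same p value.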

## February 2, 2009

### Feb. 2, 2009

attach(burt)

model<-lm(FostIQ~OwnIQ)

Regress the outcome variable (Y) on the predictor variable (X): we regress Y on X.

coef(model)
(Intercept) OwnIQ
9.719491 0.907920

anova(model)
Analysis of Variance Table

Response: FostIQ
Df Sum Sq Mean Sq F value Pr(>F)
OwnIQ 1 9250.7 9250.7 169.42 < 2.2e-16 ***
Residuals 51 2784.7 54.6
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

## Fitted Model

-Drops error from the regression equation.

Yvariable-hat-sub-i = (Intercept) + Slope*Xvariable-sub-i

(The regression equation includes error, so its Y is an observed value rather than an estimate. In the fitted model the Y is an estimate, so it needs the hat.)

library(lattice)

xyplot(FostIQ~OwnIQ,type=c("p","r"))

## R^2

R^2 = SSmodel/SStotal

R^2 = 9251/12035 = .769

*76.9% of the variation in FostIQ is accounted for by variation in OwnIQ and 23.1% is not.

We cannot tell how the unexplained variation is divided among other systematic components, measurement error, and individual variation.
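Written out with the sums of squares from the anova(model) table, the calculation looks like this (SS total is taken here as SS model plus SS residual, which is how the 12035 above was obtained):

```latex
R^2 = \frac{SS_{\text{model}}}{SS_{\text{total}}}
    = \frac{SS_{\text{model}}}{SS_{\text{model}} + SS_{\text{residual}}}
    = \frac{9250.7}{9250.7 + 2784.7} \approx 0.769,
\qquad
1 - R^2 = \frac{SS_{\text{residual}}}{SS_{\text{total}}}
        = \frac{2784.7}{12035.4} \approx 0.231
```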

## Estimated Residual Variance

What is the variance of the scores around the mean estimate at each point on the line? (Remember that the line represents the means of the conditional distributions at each point on the line.)

As a first idea: sigma-hat^2-sub-Y|X = SSresiduals/n

Adjusting for the parameters we estimated: sigma-hat^2-sub-Y|X = SSresiduals/(n - (parameters in equation))

sigma-hat^2-sub-Y|X = 2785/(53 - 2) = 54.6

SD = sqrt(sigma-hat^2-sub-Y|X) = sqrt(2785/(53 - 2))

sqrt(54.6)
[1] 7.389181

We therefore expect about 95% (2 SDs) of actual values to fall within about 15 points (2*7.39 = 14.8) either side of any particular point estimate.

9.72+.91*75
[1] 77.97

77.97-14.8
[1] 63.17

77.97+14.8
[1] 92.77

So for an OwnIQ score of 75 we would expect the FostIQ score to be somewhere between about 63.2 and 92.8.
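The steps in this section can be sketched in one place. The code uses the unrounded coefficients from coef(model), so the fitted value at OwnIQ = 75 comes out near 77.81 rather than the 77.97 obtained from the rounded 9.72 and .91. The plus-or-minus 2 SD range is the rough rule used in these notes, not an exact prediction interval; variable names are illustrative.

```r
# Values copied from anova(model) and coef(model) for the burt data
ss_resid <- 2784.7
n        <- 53
n_params <- 2          # intercept and slope
b0 <- 9.719491
b1 <- 0.907920

sigma2_hat <- ss_resid / (n - n_params)   # estimated residual variance
sigma_hat  <- sqrt(sigma2_hat)            # residual standard deviation

fit_75 <- b0 + b1 * 75                    # predicted FostIQ at OwnIQ = 75
rough_interval <- fit_75 + c(-2, 2) * sigma_hat   # rough 95% range (2 SD rule)
round(c(sigma_hat = sigma_hat, fit = fit_75,
        lower = rough_interval[1], upper = rough_interval[2]), 2)
```

For an interval that also accounts for uncertainty in the estimated line, predict(model, newdata, interval="prediction") is the usual route.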

# Basic Regression Equation

Outcome = functional form (line, curvilinear, etc.) + Residuals

## Equation of a line

y = mx + b
m = slope
b = y-intercept
m/1 = rise/run = deltaY/deltaX (delta = change in)

## Regression Model

Y-sub-i = Beta-sub-zero + Beta-sub-one*X-sub-i + epsilon-sub-i (error)
Beta-sub-zero + Beta-sub-one*X-sub-i is the functional form.
Some outcome = systematic component (functional form) plus some error
The regression model is actually about the population.
-We use sample data to get estimates for parameters (regression coefficients).
-The distribution of a variable on its own, ignoring the other variable, is the marginal distribution. The distribution of Y in the population at any particular point on the line (a given value of X) is the conditional distribution.
When reporting conditional distributions we write mu-sub-Y|X-sub-n, which means the mean of Y conditioned on X.
Y-hat-sub-i = Beta-hat-sub-zero + Beta-hat-sub-one*X-sub-i (without the error). Y-hat is always on the line.
> burt <- read.table("burt.txt", header = TRUE)
> library(psych)
> library(lattice)
> attach(burt)
> xyplot(FostIQ~OwnIQ,type="p")
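For reference, the three equations described in this section can be written in standard notation as follows. The middle line assumes that the errors average to zero at every value of X, which is the usual regression assumption rather than something stated in these notes.

```latex
\begin{aligned}
Y_i &= \beta_0 + \beta_1 X_i + \varepsilon_i
  && \text{(population regression model)}\\
\mu_{Y \mid X_i} &= \beta_0 + \beta_1 X_i
  && \text{(conditional mean of } Y \text{ at } X_i\text{)}\\
\hat{Y}_i &= \hat{\beta}_0 + \hat{\beta}_1 X_i
  && \text{(fitted model from sample data, no error term)}
\end{aligned}
```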

## Residuals

Residual: (Y-sub-i - Y-hat-sub-i)
Sum of squares residuals: [SIGMA-sub-i(Y-sub-i - Y-hat-sub-i)^2]

## Line of best fit

-The line that minimizes the sum of squares residuals.
-OLS is ordinary least squares regression.
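As a small illustration of "minimizes the sum of squares residuals", the sketch below computes the least squares slope and intercept by hand for the six burt rows printed elsewhere in these notes, and compares them with lm(). Because it uses only six rows, its estimates will differ from the full-data values (9.72 and .91). The names `burt6` and `ss_resid()` are hypothetical.

```r
# First six rows of the burt data as printed in these notes
burt6 <- data.frame(OwnIQ  = c(68, 71, 73, 75, 78, 79),
                    FostIQ = c(63, 76, 77, 72, 71, 75))

# Closed-form OLS estimates for a single predictor
b1 <- cov(burt6$OwnIQ, burt6$FostIQ) / var(burt6$OwnIQ)
b0 <- mean(burt6$FostIQ) - b1 * mean(burt6$OwnIQ)

# Sum of squared residuals for any candidate line
ss_resid <- function(intercept, slope) {
  sum((burt6$FostIQ - (intercept + slope * burt6$OwnIQ))^2)
}

fit <- lm(FostIQ ~ OwnIQ, data = burt6)
rbind(by_hand = c(b0, b1), lm = coef(fit))

# Nudging the slope away from the OLS value makes the fit worse
ss_resid(b0, b1) < ss_resid(b0, b1 + 0.1)
```

Any other intercept or slope gives a larger sum of squared residuals than the OLS pair, which is what "line of best fit" means here.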

# Linear Model Function in R = model <- lm(Y~X)

> model <- lm(FostIQ~OwnIQ)
-the model is now an object
> names(model)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
-We are interested in the coefficients because they give us the y-intercept and the slope.
> model$coefficients
(Intercept) OwnIQ
9.719491 0.907920
-this gives us first the y-intercept (BETA-sub-zero) and then the slope (BETA-sub-one)
Our estimates are then BETA-sub-zero-hat = 9.719491 and BETA-sub-one-hat = .907920
The equation therefore is FostIQ-hat = 9.72 + .91(OwnIQ) or Y-hat = 9.72 + .91(X-sub-i)
> xyplot(FostIQ~OwnIQ,type=c("p","r"))
The call for the scatter plot with the regression line is xyplot(Y~X,type=c("p","r"))

## Evaluating fit of the model

-Well, we have the best line possible but how well does it really fit?
-We use what is called ANOVA regression decomposition.
-Total variation = variation explained by the model + variation not explained by the model
*The distance between the regression line and the mean of all of the values is the variation explained by the model. The distance between a data point and the regression line is the variation not explained by the model. The distance between a data point and the mean of all points is the total variation.
*For the entire model we add up the squared values of all the total variations, explained variations, and unexplained variations.
> 9.72+.91*68
[1] 71.6
> mean(FostIQ)
[1] 98.1132
Residual = Actual - Predicted
Residual = 63-71.6
> 63-71.6
[1] -8.6
Remember it is the residuals we use to evaluate model fit.
> model$residuals
1 2 3 4 5 6
-8.45804807 1.81819206 1.00235215 -5.81348777 -9.53724764 -6.44516759
7 8 9 10 11 12
2.73899249 -2.16892746 8.83107254 0.92315258 -3.89268733 6.19939271
13 14 15 16 17 18
4.29147276 8.29147276 11.47563284 -11.43228711 -10.34020707 -4.34020707
19 20 21 22 23 24
-2.24812703 2.75187297 -7.15604698 4.84395302 4.84395302 -1.06396694
25 26 27 28 29 30
0.02811310 -3.87980685 12.12019315 -5.78772681 -2.78772681 14.21227319
31 32 33 34 35 36
15.21227319 5.39643328 -12.51148668 13.58059337 1.67267341 2.76475345
37 38 39 40 41 42
3.94891354 1.04099358 2.04099358 1.13307363 -5.86692637 -12.77484633
43 44 45 46 47 48
-12.49860620 4.59347385 -9.22236607 11.77763393 -6.13028602 0.96179402
49 50 51 52 53
-0.85404589 -1.57780576 4.79051441 -9.84116541 3.34299467
> model$fitted.values
1 2 3 4 5 6 7 8
71.45805 74.18181 75.99765 77.81349 80.53725 81.44517 83.26101 84.16893
9 10 11 12 13 14 15 16
84.16893 85.07685 86.89269 87.80061 88.70853 88.70853 90.52437 91.43229
17 18 19 20 21 22 23 24
92.34021 92.34021 93.24813 93.24813 94.15605 94.15605 94.15605 95.06397
25 26 27 28 29 30 31 32
95.97189 96.87981 96.87981 97.78773 97.78773 97.78773 97.78773 99.60357
33 34 35 36 37 38 39 40
100.51149 101.41941 102.32733 103.23525 105.05109 105.95901 105.95901 106.86693
41 42 43 44 45 46 47 48
106.86693 107.77485 110.49861 111.40653 113.22237 113.22237 114.13029 115.03821
49 50 51 52 53
116.85405 119.57781 123.20949 126.84117 128.65701
-Lack of fit can be defined by the proportion of the total variation not explained by the model (the variation unexplained by the model divided by the total variation)
-Model fit can be defined by the proportion of the total variation explained by the model (the variation explained by the model divided by the total variation)
-SSmodel/SStotal (Sum of Squares model / Sum of Squares total) and SSresidual/SStotal (Sum of Squares residuals / Sum of Squares total)
> anova(model)
Analysis of Variance Table

Response: FostIQ
Df Sum Sq Mean Sq F value Pr(>F)
OwnIQ 1 9250.7 9250.7 169.42 < 2.2e-16 ***
Residuals 51 2784.7 54.6
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
SSmodel = 9250.7 and SSresidual = 2784.7
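The decomposition can be verified directly from fitted values and residuals. The sketch below does this for the six burt rows printed elsewhere in these notes (so the sums of squares will not match the full-data table above); `burt6` is a hypothetical name.

```r
# First six rows of the burt data as printed in these notes
burt6 <- data.frame(OwnIQ  = c(68, 71, 73, 75, 78, 79),
                    FostIQ = c(63, 76, 77, 72, 71, 75))
fit <- lm(FostIQ ~ OwnIQ, data = burt6)

ss_total <- sum((burt6$FostIQ - mean(burt6$FostIQ))^2)  # data point vs. overall mean
ss_model <- sum((fitted(fit) - mean(burt6$FostIQ))^2)   # regression line vs. overall mean
ss_resid <- sum(residuals(fit)^2)                       # data point vs. regression line

c(total = ss_total, model = ss_model, residual = ss_resid,
  R2 = ss_model / ss_total)
```

The total always equals model plus residual for a least squares fit with an intercept, and the ratio SSmodel/SStotal is the same R-squared that summary() reports.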

# Mathematical and Statistical Models

## Mathematical Models

-deterministic models, there is no error in a mathematical model

## Statistical Model

-We are using models that allow for error and use probability
-Allow for other systematic components that in many cases are not included or were not measured.
-Allow for measurement error (especially in the social sciences)
-Allow for individual variation within the unit of analysis that we are analyzing.

### Goals of Creating Statistical Models

1. Identify systematic components
2. Assess the model fit (looking at residuals [good model = smaller residuals])

## How do we use statistical Models

Articulate Research Questions -> Outcome Variables, Focal (important) Predictors, Covariates (account or control for)
Postulate the statistical model (What is it going to look like?)
...fitting model to sample data
Determine if relationship is due to chance -> Does the model really work in the population or is it by chance?

## Regression is all about relationships and associations

-Causality can be only determined through the design of the study not through analysis.
-Analysis discovers associations, correlations or covariation.

ID OwnIQ FostIQ
1 1 68 63
2 2 71 76
3 3 73 77
4 4 75 72
5 5 78 71
6 6 79 75
> attach(burt)
It is always a good idea to begin an analysis with a descriptive analysis and plots.
> library(psych)
> describe(OwnIQ)
var n mean sd median trimmed mad min max range skew kurtosis se
1 1 53 97.36 14.69 96 97 14.83 68 131 63 0.24 -0.47 2.02
> describe(FostIQ)
var n mean sd median trimmed mad min max range skew kurtosis se
1 1 53 98.11 15.21 97 98.21 16.31 63 132 69 -0.02 -0.5 2.09
Remember from 8261: Kernel Density Plot is better than a histogram.
-There are actually some better ways to plot now than using the base plot command: plot()
> library(lattice)
-the "lattice" library has a LOT of plot styles
A density plot in lattice is -> densityplot(variable,kernel="e") (the argument name is lowercase; "e" selects the Epanechnikov kernel)
> densityplot(OwnIQ,kernel="e")
-all of the "lattice" graphics functions allow you to enter a formula
> densityplot(~OwnIQ,data=burt,kernel="e")
> densityplot(FostIQ,kernel="e")
> densityplot(~FostIQ,data=burt,kernel="e")
-in "lattice" library histogram -> histogram()
-in "lattice" library boxplot -> bwplot()
> bwplot(OwnIQ)
> histogram(OwnIQ)
> densityplot(OwnIQ,kernel="e")
-in "lattice" scatter plot -> xyplot(formula, type="p")
-formula -> Y~X where X will be plotted on the X axis and Y on the Y axis
> xyplot(FostIQ~OwnIQ,type="p")
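The formula interface makes it possible to avoid attach() altogether by passing the data frame explicitly. A minimal sketch, using only the six burt rows printed in these notes (the name `burt6` is hypothetical):

```r
library(lattice)

# First six rows of the burt data as printed in these notes
burt6 <- data.frame(OwnIQ  = c(68, 71, 73, 75, 78, 79),
                    FostIQ = c(63, 76, 77, 72, 71, 75))

# Formula interface: ~x for one variable, y ~ x for two variables
dens <- densityplot(~ OwnIQ, data = burt6, kernel = "epanechnikov")
box  <- bwplot(~ OwnIQ, data = burt6)
scat <- xyplot(FostIQ ~ OwnIQ, data = burt6, type = c("p", "r"))

# lattice functions return "trellis" objects; printing one,
# e.g. print(scat), draws it on the current graphics device
class(scat)
```

Storing the plots as objects is optional. Typing the call at the console, as in the transcript above, prints the plot immediately.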

## Five things to look for in a scatter plot

1. What is the direction of the relationship?
2. What is the type of relationship? (Is it linear?)
3. What is the strength of the relationship? (Are the points close or far from the line?)
4. What is the magnitude of the relationship? (Line slope?)
5. Are there any unusual observations? (not necessarily outlier)
-In a deterministic relationship all data points are on the line, but there are different types of error in social science, so our plots start to look like clouds.