### Jan 28, 2009

# Basic Regression Equation

Outcome = functional form (line, curvilinear, etc.) + Residuals

## Equation of a line

**y = mx + b**

m = slope, b = y-intercept

m/1 = rise/run = deltaY/deltaX (delta = change in)
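As a quick worked example (the two points are made up purely for illustration), a line passing through (1, 3) and (3, 7) has slope and intercept

$$
m = \frac{\Delta Y}{\Delta X} = \frac{7 - 3}{3 - 1} = 2, \qquad b = y - mx = 3 - 2(1) = 1,
$$

so the line is y = 2x + 1.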

## Regression Model

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, where $\varepsilon_i$ is the error.

$\beta_0 + \beta_1 X_i$ is the functional form. Some outcome = systematic component (functional form) plus some error.

**Regression model is actually about the population.**

-We use sample data to get estimates for the parameters (regression coefficients).

-The distribution of a variable on its own, ignoring the independent variable, is the marginal distribution. The distribution of Y in the population at any single point along the line (that is, at a given value of X) is the conditional distribution.
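The population-versus-sample idea can be sketched with a small simulation. Everything here is an assumption made for illustration: the "population" parameters `beta0` and `beta1`, the error spread, and the sample size are invented, and the data are simulated rather than taken from the Burt data set.

```r
# Hypothetical population parameters (assumed for illustration only)
set.seed(1)
beta0 <- 10
beta1 <- 0.9
n <- 200

# Draw a sample from the population model: Y = beta0 + beta1 * X + error
x <- rnorm(n, mean = 100, sd = 15)
epsilon <- rnorm(n, mean = 0, sd = 7)
y <- beta0 + beta1 * x + epsilon

# The fitted coefficients are sample estimates of beta0 and beta1
sim_model <- lm(y ~ x)
coef(sim_model)
```

The estimates should land near, but not exactly on, the assumed population values; a different seed gives a different sample and slightly different estimates.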

**When reporting conditional distributions we write $\mu_{Y|X_n}$, which means the mean of Y conditioned on X.**

$\hat{Y}_i = \beta_0 + \beta_1 X_i$ (without the error). $\hat{Y}$ is always on the line.

> burt <- read.table("burt.txt", header = TRUE)

> library(psych)

> library(lattice)

> attach(burt)

> xyplot(FostIQ~OwnIQ,type="p")

## Residuals

Residual: $(Y_i - \hat{Y}_i)$

Sum of squares residuals: $\sum_i (Y_i - \hat{Y}_i)^2$

## Line of best fit

**-Line that minimizes the sum of squares residuals.**

OLS is ordinary **least squares** regression.
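A minimal sketch of what "minimizes the sum of squares residuals" means in practice. The data are simulated (not the Burt data), and the closed-form slope and intercept formulas are the standard textbook OLS expressions, which these notes do not state; `ss_resid` is a helper name invented for this example.

```r
# Simulated data, assumed for illustration only
set.seed(2)
x <- runif(50, min = 60, max = 130)
y <- 10 + 0.9 * x + rnorm(50, sd = 7)

# Standard closed-form OLS estimates for a single predictor
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Sum of squared residuals for any candidate line (hypothetical helper)
ss_resid <- function(intercept, slope) {
  sum((y - (intercept + slope * x))^2)
}

# lm() should recover the same estimates
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))

# Nudging the intercept or the slope away from the OLS values
# makes the sum of squared residuals larger
ss_resid(b0, b1) < ss_resid(b0 + 1, b1)
ss_resid(b0, b1) < ss_resid(b0, b1 + 0.05)
```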

# Linear Model Function in R = model <- lm(Y~X)

> model <- lm(FostIQ~OwnIQ)

-The model is now an object.

> names(model)

[1] "coefficients" "residuals" "effects" "rank"

[5] "fitted.values" "assign" "qr" "df.residual"

[9] "xlevels" "call" "terms" "model"

-We are interested in the coefficients because they give us the y-intercept and the slope.

> model$coefficients

(Intercept) OwnIQ

9.719491 0.907920

-This gives us first the y-intercept ($\beta_0$) and then the slope ($\beta_1$). Our estimates are then $\hat{\beta}_0 = 9.719491$ and $\hat{\beta}_1 = 0.907920$.

The equation therefore is FostIQ-hat = 9.72 + .91(OwnIQ), or $\hat{Y}_i = 9.72 + .91 X_i$.

> xyplot(FostIQ~OwnIQ,type=c("p","r"))

The general call for a scatter plot with the regression line is xyplot(Y~X,type=c("p","r")).

## Evaluating fit of the model

-Well, we have the best line possible, but how well does it really fit?

-We use what is called the ANOVA regression decomposition.

-Total variation = variation explained by the model + variation not explained by the model

*The distance between the regression line and the mean of all of the values is the variation explained by the model. The distance between a data point and the regression line is the variation not explained by the model. The distance between a data point and the mean of all points is the total variation.

*For the entire model we add up the squared values of all the total variations, explained variations, and unexplained variations.

> 9.72+.91*68

[1] 71.6

> mean(FostIQ)

[1] 98.1132

Residual = Actual - Predicted

Residual = 63 - 71.6

> 63-71.6

[1] -8.6

Remember, it is the residuals we use to evaluate model fit.

> model$residuals

1 2 3 4 5 6

-8.45804807 1.81819206 1.00235215 -5.81348777 -9.53724764 -6.44516759

7 8 9 10 11 12

2.73899249 -2.16892746 8.83107254 0.92315258 -3.89268733 6.19939271

13 14 15 16 17 18

4.29147276 8.29147276 11.47563284 -11.43228711 -10.34020707 -4.34020707

19 20 21 22 23 24

-2.24812703 2.75187297 -7.15604698 4.84395302 4.84395302 -1.06396694

25 26 27 28 29 30

0.02811310 -3.87980685 12.12019315 -5.78772681 -2.78772681 14.21227319

31 32 33 34 35 36

15.21227319 5.39643328 -12.51148668 13.58059337 1.67267341 2.76475345

37 38 39 40 41 42

3.94891354 1.04099358 2.04099358 1.13307363 -5.86692637 -12.77484633

43 44 45 46 47 48

-12.49860620 4.59347385 -9.22236607 11.77763393 -6.13028602 0.96179402

49 50 51 52 53

-0.85404589 -1.57780576 4.79051441 -9.84116541 3.34299467

> model$fitted.values

1 2 3 4 5 6 7 8

71.45805 74.18181 75.99765 77.81349 80.53725 81.44517 83.26101 84.16893

9 10 11 12 13 14 15 16

84.16893 85.07685 86.89269 87.80061 88.70853 88.70853 90.52437 91.43229

17 18 19 20 21 22 23 24

92.34021 92.34021 93.24813 93.24813 94.15605 94.15605 94.15605 95.06397

25 26 27 28 29 30 31 32

95.97189 96.87981 96.87981 97.78773 97.78773 97.78773 97.78773 99.60357

33 34 35 36 37 38 39 40

100.51149 101.41941 102.32733 103.23525 105.05109 105.95901 105.95901 106.86693

41 42 43 44 45 46 47 48

106.86693 107.77485 110.49861 111.40653 113.22237 113.22237 114.13029 115.03821

49 50 51 52 53

116.85405 119.57781 123.20949 126.84117 128.65701
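The link between the residuals and the fitted values can be checked directly. This is a sketch on simulated data (assumed for illustration, not the Burt data); `resid()` and `fitted()` are the standard accessor functions for the same components that `model$residuals` and `model$fitted.values` expose.

```r
# Simulated data, assumed for illustration only
set.seed(3)
x <- runif(53, min = 60, max = 130)
y <- 10 + 0.9 * x + rnorm(53, sd = 7)
fit <- lm(y ~ x)

# Residual = Actual - Predicted, computed by hand
manual_resid <- y - fitted(fit)

# Should agree with the residuals stored in the model object
all.equal(unname(manual_resid), unname(resid(fit)))

# With an intercept in the model, the residuals sum to (numerically) zero
sum(resid(fit))
```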

---Model misfit can be defined by the proportion of the total variation not explained by the model (the variation unexplained by the model divided by the total variation)

---Model fit can be defined by a proportion of the total variation explained by the model (the variation explained by the model divided by the total variation)

---SSmodel/SStotal (Sum of Squares model / Sum of Squares total) and SSresidual/SStotal (Sum of Squares residuals / Sum of Squares total)
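The decomposition and the two ratios can be computed by hand as a rough sketch. The data below are simulated (assumed for illustration, not the Burt data), and the variable names `ss_total`, `ss_model`, and `ss_resid` are chosen for this example only.

```r
# Simulated data, assumed for illustration only
set.seed(4)
x <- runif(53, min = 60, max = 130)
y <- 10 + 0.9 * x + rnorm(53, sd = 7)
fit <- lm(y ~ x)

# Total: data point vs. mean; model: fitted value vs. mean;
# residual: data point vs. fitted value
ss_total <- sum((y - mean(y))^2)
ss_model <- sum((fitted(fit) - mean(y))^2)
ss_resid <- sum(resid(fit)^2)

# Total variation = explained + unexplained
all.equal(ss_total, ss_model + ss_resid)

# Proportion explained and proportion unexplained
ss_model / ss_total
ss_resid / ss_total

# The same sums of squares appear in the "Sum Sq" column of anova()
anova(fit)[["Sum Sq"]]
```

The proportion explained, `ss_model / ss_total`, is the quantity that `summary(fit)$r.squared` reports.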

> anova(model)

Analysis of Variance Table

Response: FostIQ

Df Sum Sq Mean Sq F value Pr(>F)

OwnIQ 1 9250.7 9250.7 169.42 < 2.2e-16 ***

Residuals 51 2784.7 54.6

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

SSmodel = 9250.7 and SSresidual = 2784.7
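Plugging the sums of squares from the ANOVA table (9250.7 for OwnIQ, 2784.7 for Residuals) into the two ratios gives, as a worked calculation with rounded results:

$$
SS_{total} = 9250.7 + 2784.7 = 12035.4
$$

$$
\frac{SS_{model}}{SS_{total}} = \frac{9250.7}{12035.4} \approx 0.769, \qquad \frac{SS_{residual}}{SS_{total}} = \frac{2784.7}{12035.4} \approx 0.231
$$

So roughly 77% of the total variation in FostIQ is explained by the model and roughly 23% is left unexplained.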
