
January 28, 2009

Jan 28, 2009

Basic Regression Equation

Outcome = functional form (line, curvilinear, etc.) + Residuals

Equation of a line

y = mx + b, where m = slope and b = y-intercept; m = rise/run = ΔY/ΔX (Δ = change in)
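A quick numeric check of the slope formula, using two made-up points (just an illustration, not the burt data):

# slope and intercept of the line through the made-up points (1, 5) and (3, 9)
x <- c(1, 3); y <- c(5, 9)
m <- diff(y) / diff(x)    # rise/run = (9 - 5)/(3 - 1) = 2
b <- y[1] - m * x[1]      # y-intercept = 5 - 2*1 = 3, so this line is y = 2x + 3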

Regression Model

Y_i = β0 + β1 X_i + ε_i (error term). β0 + β1 X_i is the functional form.
Some outcome = systematic component (functional form) + some error.
-The regression model is actually about the population.
-We use sample data to get estimates of the parameters (the regression coefficients).
-The distribution of the outcome variable ignoring the independent variable is the marginal distribution; the distribution of the outcome in the population at any point along the line (any particular value of X) is the conditional distribution.
-When reporting conditional distributions we write μ_Y|X, which means the mean of Y conditioned on X.
-Ŷ_i = β0 + β1 X_i (without the error term). Ŷ is always on the line.
> burt <- read.table("burt.txt", header = T)
> library(psych)
> library(lattice)
> attach(burt)
> xyplot(FostIQ~OwnIQ,type="p")
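A rough way to see marginal vs. conditional distributions with the burt data (a sketch, not from the class notes; the cut() bins are only a crude stand-in for conditioning on exact values of X):

# marginal distribution of the outcome, ignoring OwnIQ
densityplot(~FostIQ, data = burt)
# rough conditional distributions: FostIQ within three coarse bins of OwnIQ
densityplot(~FostIQ | cut(OwnIQ, 3), data = burt)
# bin-wise means as crude stand-ins for mu_Y|X; they should rise with OwnIQ, matching the positive slope found below
tapply(FostIQ, cut(OwnIQ, 3), mean)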

Residuals

Residual: Y_i - Ŷ_i. Sum of squared residuals: Σ_i (Y_i - Ŷ_i)^2
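A sketch of the same idea in R, for one candidate line through the burt data (b0 and b1 here are arbitrary made-up values, not estimates):

b0 <- 10; b1 <- 0.9              # arbitrary intercept and slope, just for illustration
Yhat <- b0 + b1 * OwnIQ          # predicted values from this candidate line
res  <- FostIQ - Yhat            # residuals: Y_i - Yhat_i
sum(res^2)                       # the sum of squared residuals this line leaves behind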

Line of best fit

-The line that minimizes the sum of squared residuals. OLS stands for ordinary least squares regression.
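As a check on what minimizing the sum of squared residuals produces, here is the textbook closed-form OLS solution (a sketch, not shown in class; it should agree with lm() below):

b1 <- cov(OwnIQ, FostIQ) / var(OwnIQ)    # slope = covariance of X and Y over variance of X
b0 <- mean(FostIQ) - b1 * mean(OwnIQ)    # the fitted line passes through the point of means
c(intercept = b0, slope = b1)            # should match coef(lm(FostIQ ~ OwnIQ))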

Linear model function in R: model <- lm(Y~X)

> model <- lm(FostIQ~OwnIQ)
-the model is now an object
> names(model)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
-We are interested in the coefficients because they give us the y-intercept and the slope.
> model$coefficients
(Intercept)       OwnIQ
   9.719491    0.907920
-This gives us first the y-intercept (β0) and then the slope (β1).
Our estimates are then β0-hat = 9.719491 and β1-hat = 0.907920.
The equation therefore is FostIQ-hat = 9.72 + .91(OwnIQ), or Ŷ = 9.72 + .91(X_i).
> xyplot(FostIQ~OwnIQ,type=c("p","r"))
The scatter plot with the regression line comes from xyplot(Y~X,type=c("p","r")).
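A follow-up sketch (not from the class transcript): getting a predicted value from the fitted model with predict(), and by hand with the rounded coefficients; OwnIQ = 68 is just the first case in the data.

coef(model)                                       # intercept and slope again
predict(model, newdata = data.frame(OwnIQ = 68))  # predicted FostIQ when OwnIQ = 68
9.72 + 0.91*68                                    # same prediction by hand: 71.6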

Evaluating fit of the model

-Well, we have the best line possible, but how well does it really fit?
-We use what is called the ANOVA regression decomposition:
-Total variation = variation explained by the model + variation not explained by the model
*The distance between the regression line (the fitted value) and the mean of all of the values is the variation explained by the model. The distance between a data point and the regression line is the variation not explained by the model. The distance between a data point and the mean of all points is the total variation.
*For the entire model we add up the squared values of all the total variations, explained variations, and unexplained variations.
For the first case (OwnIQ = 68, FostIQ = 63), the predicted value is:
> 9.72+.91*68
[1] 71.6
> mean(FostIQ)
[1] 98.1132
Residual = Actual - Predicted
Residual = 63 - 71.6
> 63-71.6
[1] -8.6
Remember, it is the residuals we use to evaluate model fit.
> model$residuals
1 2 3 4 5 6
-8.45804807 1.81819206 1.00235215 -5.81348777 -9.53724764 -6.44516759
7 8 9 10 11 12
2.73899249 -2.16892746 8.83107254 0.92315258 -3.89268733 6.19939271
13 14 15 16 17 18
4.29147276 8.29147276 11.47563284 -11.43228711 -10.34020707 -4.34020707
19 20 21 22 23 24
-2.24812703 2.75187297 -7.15604698 4.84395302 4.84395302 -1.06396694
25 26 27 28 29 30
0.02811310 -3.87980685 12.12019315 -5.78772681 -2.78772681 14.21227319
31 32 33 34 35 36
15.21227319 5.39643328 -12.51148668 13.58059337 1.67267341 2.76475345
37 38 39 40 41 42
3.94891354 1.04099358 2.04099358 1.13307363 -5.86692637 -12.77484633
43 44 45 46 47 48
-12.49860620 4.59347385 -9.22236607 11.77763393 -6.13028602 0.96179402
49 50 51 52 53
-0.85404589 -1.57780576 4.79051441 -9.84116541 3.34299467
> model$fitted.values
1 2 3 4 5 6 7 8
71.45805 74.18181 75.99765 77.81349 80.53725 81.44517 83.26101 84.16893
9 10 11 12 13 14 15 16
84.16893 85.07685 86.89269 87.80061 88.70853 88.70853 90.52437 91.43229
17 18 19 20 21 22 23 24
92.34021 92.34021 93.24813 93.24813 94.15605 94.15605 94.15605 95.06397
25 26 27 28 29 30 31 32
95.97189 96.87981 96.87981 97.78773 97.78773 97.78773 97.78773 99.60357
33 34 35 36 37 38 39 40
100.51149 101.41941 102.32733 103.23525 105.05109 105.95901 105.95901 106.86693
41 42 43 44 45 46 47 48
106.86693 107.77485 110.49861 111.40653 113.22237 113.22237 114.13029 115.03821
49 50 51 52 53
116.85405 119.57781 123.20949 126.84117 128.65701
---Model fit can be defined by the proportion of the total variation explained by the model (the variation explained by the model divided by the total variation)
---or, equivalently, by the proportion of the total variation not explained by the model (the variation unexplained by the model divided by the total variation)
---That is, SSmodel/SStotal (Sum of Squares model / Sum of Squares total) and SSresidual/SStotal (Sum of Squares residual / Sum of Squares total)
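Those quantities can be computed by hand from the model object above (a sketch; it should line up with the anova() output that follows):

SStotal <- sum((FostIQ - mean(FostIQ))^2)           # total variation
SSmodel <- sum((fitted(model) - mean(FostIQ))^2)    # variation explained by the model
SSresid <- sum(resid(model)^2)                      # variation not explained by the model
SSmodel / SStotal                                   # proportion explained (this is R-squared)
SSresid / SStotal                                   # proportion unexplained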
> anova(model)
Analysis of Variance Table

Response: FostIQ
Df Sum Sq Mean Sq F value Pr(>F)
OwnIQ 1 9250.7 9250.7 169.42 < 2.2e-16 ***
Residuals 51 2784.7 54.6
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
SSmodel = 9250.7 and SSresidual = 2784.7
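From those sums of squares, the proportion of variation explained is 9250.7/(9250.7 + 2784.7), roughly 0.77. A quick check in R (summary()'s r.squared slot is standard R, though it was not part of the class output):

9250.7 / (9250.7 + 2784.7)     # about 0.77 of the variation in FostIQ is explained by OwnIQ
summary(model)$r.squared       # should agree, up to rounding of the table values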

January 26, 2009

Jan 26, 2009


Mathematical and Statistical Models


Mathematical Models


-Deterministic models: there is no error term in a mathematical model.

Statistical Model


-We are using models that allow for error and use probability (a small simulated contrast with a deterministic model is sketched after this list)
-Allow for other systematic components that in many cases are not included or were not measured.
-Allow for measurement error (especially in the social sciences)
-Allow for individual variation within the unit of analysis that we are analyzing.
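A tiny simulated contrast between the two kinds of model (hypothetical numbers, nothing to do with the burt data):

set.seed(1)                                          # made-up example data
x <- 1:50
deterministic <- 10 + 0.9 * x                        # mathematical model: no error, points sit exactly on the line
statistical   <- deterministic + rnorm(50, sd = 5)   # same systematic part plus random error
plot(x, statistical)                                 # points scatter around the underlying line
lines(x, deterministic)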

Goals of Creating Statistical Models


1. Identify systematic components
2. Assess the model fit (looking at residuals [good model = smaller residuals])

How do we use statistical models


Articulate Research Questions -> Outcome Variables, Focal (important) Predictors, Covariates (account or control for)
Postulate the statistical model (What is it going to look like?)
...fitting model to sample data
Determine if relationship is due to chance -> Does the model really work in the population or is it by chance?

Regression is all about relationships and associations


-Causality can only be determined through the design of the study, not through the analysis.
-Analysis discovers associations, correlations or covariation.
> burt <- read.table("burt.txtv", header = T)

> burt <- read.table("burt.txt", header = T)
> head(burt)
ID OwnIQ FostIQ
1 1 68 63
2 2 71 76
3 3 73 77
4 4 75 72
5 5 78 71
6 6 79 75
> attach(burt)
Always a good idea to begin an analysis with a descriptive analysis and plots.
> library(psych)
> describe(OwnIQ)
var n mean sd median trimmed mad min max range skew kurtosis se
1 1 53 97.36 14.69 96 97 14.83 68 131 63 0.24 -0.47 2.02
> describe(FostIQ)
var n mean sd median trimmed mad min max range skew kurtosis se
1 1 53 98.11 15.21 97 98.21 16.31 63 132 69 -0.02 -0.5 2.09
Remember from 8261: a kernel density plot is better than a histogram.
-There are actually some better ways to plot now than using the base plot command: plot()
> library(lattice)
-the "lattice" library has a LOT of plot styles
A density plot in lattice is -> densityplot(variable,Kernel="e")
> densityplot(OwnIQ,Kernel="e")
-all of the "lattice" graphics functions allow you to enter a formula
> densityplot(~OwnIQ,data=burt,Kernel="e")
> densityplot(FostIQ,Kernel="e")
> densityplot(~FostIQ,data=burt,Kernel="e")
-in "lattice" library histogram ->histogram()
-in "lattice" library boxplot -> bwplot()
> bwplot(OwnIQ)
> histogram(OwnIQ)
> densityplot(OwnIQ,Kernel="e")
-in "lattice" scatter plot -> xyplot(formula, type="p")
-formula -> Y~X, where Y will be plotted on the y-axis and X on the x-axis
> xyplot(FostIQ~OwnIQ,type="p")

Five things to look for in a scatter plot


1. What is the direction of the relationship?
2. What is the type of relationship? (Is it linear?)
3. What is the strength of the relationship? (Are the points close or far from the line?)
4. What is the magnitude of the relationship? (Line slope?)
5. Are there any unusual observations? (not necessarily outliers)
-In a deterministic relationship all data points fall exactly on the line, but there are different types of error in social science, so our plots start to look like clouds (see the quick checks below).
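A couple of quick checks on these questions (a sketch, not from the class notes; in lattice, type "r" adds a least-squares line and "smooth" adds a loess curve):

cor(OwnIQ, FostIQ)                                             # sign gives the direction, size gives the strength
xyplot(FostIQ~OwnIQ, data = burt, type = c("p","r","smooth"))  # eyeball linearity, slope, and unusual points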