December 9, 2008

Dec. 9, 2008

> library(faraway)
> data(infmort)
> attach(infmort)
> interaction.plot(region,oil,mortality,fun=function(x) mean(x,na.rm=T),ylab="World Region",trace.label="Mortality", type = "b", pch = "19")
Warning message:
In legend(xleg, yleg, legend = ylabs, col = col, pch = if (type %in% :
not using pch[2..] since pch[1] has multiple chars
> ###Changing line and plot type in an interaction plot###
> interaction.plot(region,oil,mortality,fun=function(x) mean(x,na.rm=T),ylab="World Region",trace.label="Mortality", type = "b", pch = 19)
> interaction.plot(region,oil,mortality,fun=function(x) mean(x,na.rm=T),ylab="Mortality",trace.label="Oil Exports", type = "b", pch = 19)
> detach(infmort)
> vl<-read.table("VLSSperCapita.txt", header = T)
> attach(vl)
> head(vl)
household dong region urban dollar regionName
1 12225 2764.8999 1 0 184.32666 Northern Uplands
2 13314 940.9336 1 0 62.72891 Northern Uplands
3 12101 1786.9258 1 0 119.12839 Northern Uplands
4 7505 2130.5046 1 1 142.03364 Northern Uplands
5 11709 1149.1510 1 0 76.61007 Northern Uplands
6 15309 1461.8467 1 0 97.45645 Northern Uplands
> tapply(household,as.factor(region):as.factor(urban), mean,na.rm=T)
1:0 1:1 2:0 2:1 3:0 3:1 4:0 4:1
13760.817 6538.326 18388.126 3570.832 22780.238 6069.667 26260.526 8298.266
5:0 5:1 6:0 6:1 7:0 7:1
29076.707 NA 31996.681 3391.250 36262.457 10112.734
> 39262.457+10112.734
[1] 49375.19
> tapply(household,as.factor(region), mean,na.rm=T)
1 2 3 4 5 6 7
12188.52 13444.82 20231.17 20257.22 29076.71 17763.87 29630.96
> 36262.457+10112.734
[1] 46375.19
> (36262.457+10112.734)/2
[1] 23187.60

> tapply(household,as.factor(region):as.factor(urban), length)
1:0 1:1 2:0 2:1 3:0 3:1 4:0 4:1 5:0 5:1 6:0 6:1 7:0 7:1
672 187 783 392 600 108 502 252 368 NA 514 509 830 282
> (36262.457*830+10112.734*282)/1112
[1] 29630.96
> model<-aov(household~region+urban+region:urban)
> anova(model)
Analysis of Variance Table

Response: household
Df Sum Sq Mean Sq F value Pr(>F)
region 1 1.5928e+11 1.5928e+11 35772 < 2.2e-16 ***
urban 1 5.0572e+11 5.0572e+11 113580 < 2.2e-16 ***
region:urban 1 6.0949e+10 6.0949e+10 13688 < 2.2e-16 ***
Residuals 5995 2.6693e+10 4.4526e+06
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> model<-aov(household~urban+region+region:urban)
> anova(model)
Analysis of Variance Table

Response: household
Df Sum Sq Mean Sq F value Pr(>F)
urban 1 4.6756e+11 4.6756e+11 105008 < 2.2e-16 ***
region 1 1.9745e+11 1.9745e+11 44345 < 2.2e-16 ***
urban:region 1 6.0949e+10 6.0949e+10 13688 < 2.2e-16 ***
Residuals 5995 2.6693e+10 4.4526e+06
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> model3<-aov(mortality~region+oil+region:oil,data=infmort)
> model4<-aov(mortality~oil+region+region:oil,data=infmort)
> anova(model3)
Analysis of Variance Table

Response: mortality
Df Sum Sq Mean Sq F value Pr(>F)
region 3 210752 70251 12.8446 4.188e-07 ***
oil 1 42197 42197 7.7153 0.006610 **
region:oil 2 57433 28717 5.2505 0.006892 **
Residuals 94 514112 5469
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> anova(model4)
Analysis of Variance Table

Response: mortality
Df Sum Sq Mean Sq F value Pr(>F)
oil 1 60073 60073 10.9837 0.001306 **
region 3 192877 64292 11.7551 1.315e-06 ***
oil:region 2 57433 28717 5.2505 0.006892 **
Residuals 94 514112 5469
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> ###This happens because R uses sequential (Type I) sums of squares, so the first factor in the model soaks up as much of the explained variation as it can.
> ###So, if you change which factor you put in the model first, you change which factor gets credited first, and the sums of squares change.
> library(car)

Attaching package: 'car'


The following object(s) are masked from package:faraway :

logit,
vif

> anova(model)
Analysis of Variance Table

Response: household
Df Sum Sq Mean Sq F value Pr(>F)
urban 1 4.6756e+11 4.6756e+11 105008 < 2.2e-16 ***
region 1 1.9745e+11 1.9745e+11 44345 < 2.2e-16 ***
urban:region 1 6.0949e+10 6.0949e+10 13688 < 2.2e-16 ***
Residuals 5995 2.6693e+10 4.4526e+06
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> model<-aov(household~region+urban+region:urban)
> model2<-aov(household~urban+region+region:urban)
> anova(model)
Analysis of Variance Table

Response: household
Df Sum Sq Mean Sq F value Pr(>F)
region 1 1.5928e+11 1.5928e+11 35772 < 2.2e-16 ***
urban 1 5.0572e+11 5.0572e+11 113580 < 2.2e-16 ***
region:urban 1 6.0949e+10 6.0949e+10 13688 < 2.2e-16 ***
Residuals 5995 2.6693e+10 4.4526e+06
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> anova(model2)
Analysis of Variance Table

Response: household
Df Sum Sq Mean Sq F value Pr(>F)
urban 1 4.6756e+11 4.6756e+11 105008 < 2.2e-16 ***
region 1 1.9745e+11 1.9745e+11 44345 < 2.2e-16 ***
urban:region 1 6.0949e+10 6.0949e+10 13688 < 2.2e-16 ***
Residuals 5995 2.6693e+10 4.4526e+06
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> Anova(model, type="III")
Anova Table (Type III tests)

Response: household
Sum Sq Df F value Pr(>F)
(Intercept) 1.1875e+11 1 26670.3 < 2.2e-16 ***
region 2.5759e+11 1 57850.8 < 2.2e-16 ***
urban 1.1271e+10 1 2531.4 < 2.2e-16 ***
region:urban 6.0949e+10 1 13688.4 < 2.2e-16 ***
Residuals 2.6693e+10 5995
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> Anova(model2, type="III")
Anova Table (Type III tests)

Response: household
Sum Sq Df F value Pr(>F)
(Intercept) 1.1875e+11 1 26670.3 < 2.2e-16 ***
urban 1.1271e+10 1 2531.4 < 2.2e-16 ***
region 2.5759e+11 1 57850.8 < 2.2e-16 ***
urban:region 6.0949e+10 1 13688.4 < 2.2e-16 ***
Residuals 2.6693e+10 5995
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> Anova(model2, type="II")
Anova Table (Type II tests)

Response: household
Sum Sq Df F value Pr(>F)
urban 5.0572e+11 1 113580 < 2.2e-16 ***
region 1.9745e+11 1 44345 < 2.2e-16 ***
urban:region 6.0949e+10 1 13688 < 2.2e-16 ***
Residuals 2.6693e+10 5995
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> Anova(model2, type="III")
Anova Table (Type III tests)

Response: household
Sum Sq Df F value Pr(>F)
(Intercept) 1.1875e+11 1 26670.3 < 2.2e-16 ***
urban 1.1271e+10 1 2531.4 < 2.2e-16 ***
region 2.5759e+11 1 57850.8 < 2.2e-16 ***
urban:region 6.0949e+10 1 13688.4 < 2.2e-16 ***
Residuals 2.6693e+10 5995
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> ###NOTE: The problem with Type III sums of squares is that it does not obey marginality. It asks how much the interaction explains apart from the factors that make up the interaction, and that doesn't make sense.
> ###

Analysis of Variance is NOT a good way to analyze an unbalanced design. Regression is a much better way to analyze it.


> ###When determining the order in which to include factors in the model, put your controls into the model first and add the things you are interested in (focal predictors) last.
> model2<-aov(household~region:urban+urban+region)
> summary(model2)
Df Sum Sq Mean Sq F value Pr(>F)
urban 1 4.6756e+11 4.6756e+11 105008 < 2.2e-16 ***
region 1 1.9745e+11 1.9745e+11 44345 < 2.2e-16 ***
region:urban 1 6.0949e+10 6.0949e+10 13688 < 2.2e-16 ***
Residuals 5995 2.6693e+10 4.4526e+06
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

December 4, 2008

Dec. 4, 2008

ir<-read.csv("IronContent.csv",header=T)
attach(ir)
model<-aov(iron~food+type+food:type)
ir<-read.csv("Alcohol.csv",header=T)
ir<-read.csv("IronContent.csv",header=T)
alc<-read.csv("Alcohol.csv",header=T)
attach(alc)

The following object(s) are masked from package:datasets :

sleep

model<-aov(steercor~sleep+alcohol+sleep:alcohol)

summary(model)
Df Sum Sq Mean Sq F value Pr(>F)
sleep 1 1302.08 1302.08 24.9004 9.965e-06 ***
alcohol 1 1012.50 1012.50 19.3625 6.783e-05 ***
sleep:alcohol 1 112.50 112.50 2.1514 0.1496
Residuals 44 2300.83 52.29
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
interaction.plot(alcohol,sleep,steercor)
model2<-lm(iron~food+type+food:type)
summary(model2)

Call:
lm(formula = iron ~ food + type + food:type)

Residuals:
Min 1Q Median 3Q Max
-0.89750 -0.16062 0.05875 0.17750 0.59000

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.3300 0.1837 12.687 6.87e-13 ***
foodmeat -0.2725 0.2597 -1.049 0.303382
foodvegetables -1.0975 0.2597 -4.226 0.000243 ***
typeClay 0.1425 0.2597 0.549 0.587740
typeIron 1.3400 0.2597 5.159 1.98e-05 ***
foodmeat:typeClay -0.0225 0.3673 -0.061 0.951605
foodvegetables:typeClay 0.0850 0.3673 0.231 0.818733
foodmeat:typeIron 1.2825 0.3673 3.492 0.001669 **
foodvegetables:typeIron 0.2175 0.3673 0.592 0.558668
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3673 on 27 degrees of freedom
Multiple R-squared: 0.91, Adjusted R-squared: 0.8833
F-statistic: 34.13 on 8 and 27 DF, p-value: 3.623e-12

Interaction Plot

Remember that approximately parallel lines indicate that there are no interaction effects.

Model Choice

A simple model is better, so if we don't find a significant interaction effect we can use a model without an interaction effect. If we change the model like this we might find a change in the main effects (see the sketch below).
model3<-aov(steercor~sleep+alcohol)
summary(model)
Df Sum Sq Mean Sq F value Pr(>F)
sleep 1 1302.08 1302.08 24.9004 9.965e-06 ***
alcohol 1 1012.50 1012.50 19.3625 6.783e-05 ***
sleep:alcohol 1 112.50 112.50 2.1514 0.1496
Residuals 44 2300.83 52.29
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
summary(model3)
Df Sum Sq Mean Sq F value Pr(>F)
sleep 1 1302.08 1302.08 24.279 1.171e-05 ***
alcohol 1 1012.50 1012.50 18.880 7.842e-05 ***
Residuals 45 2413.33 53.63
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
-A model without the interaction effects is often called a Main Effects or Additive model.
-A model with an interaction term is a multiplicative model.
-If you are just doing exploratory work it is probably OK to drop the interaction effect, but if you are confirming theory that includes an interaction effect, you should probably leave it in the model.
-A lot of people don't support dropping the interaction term in ANOVA, but they do in regression, even though the two have the same statistical model.
-The main-effects model is nested in the interaction model.
-We can do tests of model fit.
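Because the main-effects model is nested in the interaction model, the two fits can be compared directly; a minimal sketch using the models already fit above (anova() on two nested aov fits gives the F test for the added interaction term):
model  <- aov(steercor ~ sleep + alcohol + sleep:alcohol)   # interaction model (as above)
model3 <- aov(steercor ~ sleep + alcohol)                   # main-effects (additive) model
anova(model3, model)                                        # F test of model fit for the interaction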

Unbalanced Designs

-Balanced designs are orthogonal, so you can add up the sums of squares for the factors and interactions and it will equal the total sum of squares.
-Unbalanced designs are non-orthogonal, so the combined sums of squares for the factors and interactions are greater than the total. It is more difficult to determine Eta-hat-squared or effect size for any particular factor or interaction.
-In unbalanced designs the factors become related to each other.
NOTE: don't drop people at random to try to balance the design.

Imputation

-Try to predict the missing values and fill them in.
-Assumes that missing data is missing at random, which is hard to prove.
-Imputation is hard to do AND it shrinks your variation (see the sketch below).
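A minimal sketch of why imputation shrinks the variation, using a hypothetical vector x with missing values and simple mean imputation:
x <- c(2, 4, NA, 6, NA, 8)                       # hypothetical data with NAs
x.imp <- ifelse(is.na(x), mean(x, na.rm=T), x)   # fill the NAs with the observed mean
var(x, na.rm=T)                                  # variance of the observed values (about 6.67)
var(x.imp)                                       # smaller (4.0): the imputed values add no spread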

Analyse unbalanced data

-probably the best option, but adds another level of complexity.
Type I Sums of Squares
Factors come in in order and then the interactions come in.
Type I is the default sums of squares that R uses.
Type II Sums of Squares
-does something similar, but it adjusts the sums of squares based on the main effects.
-Type II IS COMPLETELY INAPPROPRIATE IF YOU HAVE AN INTERACTION!
Type III Sums of Squares
-Unique contribution, so each factor and interaction gets adjusted to only what is unique for that factor or interaction.
-default in SPSS
It is a holy war in stats which sums of squares you should use in ANOVA.
YOU CAN GET THREE VERY DIFFERENT RESULTS BASED ON THE SUMS OF SQUARES THAT YOU USE
-Dr. Zieffler likes type one because that is what regression uses.
-Most social scientists use type 3 because that is what the default is in SPSS.
Changing sums of squares in R
library(car)
Anova(model,type="III")
Anova Table (Type III tests)

Response: steercor
Sum Sq Df F value Pr(>F)
(Intercept) 74.40 1 1.4229 0.239323
sleep 14.58 1 0.2789 0.600087
alcohol 405.00 1 7.7450 0.007906 **
sleep:alcohol 112.50 1 2.1514 0.149552
Residuals 2300.83 44
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Anova(model,type="II")
Anova Table (Type II tests)

Response: steercor
Sum Sq Df F value Pr(>F)
sleep 1302.1 1 24.9004 9.965e-06 ***
alcohol 1012.5 1 19.3625 6.783e-05 ***
sleep:alcohol 112.5 1 2.1514 0.1496
Residuals 2300.8 44
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
NOTE: There are a lot of statisticians who believe that Type III tests are dead wrong.
model<-aov(steercor~sleep+alcohol+time_day+sleep:alcohol+sleep:time_day+alcohol:time_day+sleep:alcohol:time_day)
summary(model)
Df Sum Sq Mean Sq F value Pr(>F)
sleep 1 1302.08 1302.08 48.8892 1.924e-08 ***
alcohol 1 1012.50 1012.50 38.0163 2.764e-07 ***
time_day 1 918.75 918.75 34.4962 7.122e-07 ***
sleep:alcohol 1 112.50 112.50 4.2240 0.046425 *
sleep:time_day 1 216.75 216.75 8.1383 0.006829 **
alcohol:time_day 1 50.00 50.00 1.8773 0.178278
sleep:alcohol:time_day 1 50.00 50.00 1.8773 0.178278
Residuals 40 1065.33 26.63
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model4<-aov(steercor~sleep+alcohol+time_day+sleep:alcohol+sleep:time_day+alcohol:time_day)
summary(model)
Df Sum Sq Mean Sq F value Pr(>F)
sleep 1 1302.08 1302.08 48.8892 1.924e-08 ***
alcohol 1 1012.50 1012.50 38.0163 2.764e-07 ***
time_day 1 918.75 918.75 34.4962 7.122e-07 ***
sleep:alcohol 1 112.50 112.50 4.2240 0.046425 *
sleep:time_day 1 216.75 216.75 8.1383 0.006829 **
alcohol:time_day 1 50.00 50.00 1.8773 0.178278
sleep:alcohol:time_day 1 50.00 50.00 1.8773 0.178278
Residuals 40 1065.33 26.63
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
summary(model4)
Df Sum Sq Mean Sq F value Pr(>F)
sleep 1 1302.08 1302.08 47.8650 2.148e-08 ***
alcohol 1 1012.50 1012.50 37.2198 3.110e-07 ***
time_day 1 918.75 918.75 33.7735 8.020e-07 ***
sleep:alcohol 1 112.50 112.50 4.1355 0.048498 *
sleep:time_day 1 216.75 216.75 7.9678 0.007314 **
alcohol:time_day 1 50.00 50.00 1.8380 0.182605
Residuals 41 1115.33 27.20
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model5<-aov(steercor~sleep+alcohol+time_day+sleep:alcohol+sleep:time_day)
summary(model5)
Df Sum Sq Mean Sq F value Pr(>F)
sleep 1 1302.08 1302.08 46.9286 2.379e-08 ***
alcohol 1 1012.50 1012.50 36.4917 3.469e-07 ***
time_day 1 918.75 918.75 33.1128 8.955e-07 ***
sleep:alcohol 1 112.50 112.50 4.0546 0.05049 .
sleep:time_day 1 216.75 216.75 7.8119 0.00779 **
Residuals 42 1165.33 27.75
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model6<-aov(steercor~sleep+alcohol+time_day+sleep:time_day)
summary(model6)
Df Sum Sq Mean Sq F value Pr(>F)
sleep 1 1302.08 1302.08 43.8160 4.590e-08 ***
alcohol 1 1012.50 1012.50 34.0713 6.307e-07 ***
time_day 1 918.75 918.75 30.9166 1.589e-06 ***
sleep:time_day 1 216.75 216.75 7.2938 0.009858 **
Residuals 43 1277.83 29.72
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

-Normally, we don't interpret main effects that make up interactions, but because there is no interaction involving alcohol we can interpret the main effect of alcohol.

December 2, 2008

Dec. 2, 2008

ir <- read.csv("IronContent.csv", header=T)
attach(ir)
head(ir)
case food type typenum foodnum typefood iron
1 1 legumes Aluminum 1 2 12 2.40
2 2 legumes Aluminum 1 2 12 2.17
3 3 legumes Aluminum 1 2 12 2.41
4 4 legumes Aluminum 1 2 12 2.34
5 5 legumes Clay 2 2 22 2.41
6 6 legumes Clay 2 2 22 2.43
model<- aov(iron~food+type+food:type)
summary(model)
Df Sum Sq Mean Sq F value Pr(>F)
food 2 9.2969 4.6484 34.456 3.699e-08 ***
type 2 24.8940 12.4470 92.263 8.531e-13 ***
food:type 4 2.6404 0.6601 4.893 0.004247 **
Residuals 27 3.6425 0.1349
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(model)
Analysis of Variance Table

Response: iron
Df Sum Sq Mean Sq F value Pr(>F)
food 2 9.2969 4.6484 34.456 3.699e-08 ***
type 2 24.8940 12.4470 92.263 8.531e-13 ***
food:type 4 2.6404 0.6601 4.893 0.004247 **
Residuals 27 3.6425 0.1349
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model
Call:
aov(formula = iron ~ food + type + food:type)

Terms:
food type food:type Residuals
Sum of Squares 9.296872 24.893956 2.640428 3.642500
Deg. of Freedom 2 2 4 27

Residual standard error: 0.3672974
Estimated effects may be unbalanced
summary(model)
Df Sum Sq Mean Sq F value Pr(>F)
food 2 9.2969 4.6484 34.456 3.699e-08 ***
type 2 24.8940 12.4470 92.263 8.531e-13 ***
food:type 4 2.6404 0.6601 4.893 0.004247 **
Residuals 27 3.6425 0.1349
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Main Effects


Food
Pot Type

Interaction Effects


Food*Type
NOTE: Because we have a significant interaction term, it is inappropriate to interpret any main effects that make up that interaction effect.

Interaction Plot


-"We just test hte hell out of too many things" - Andy Zieffler
?interaction.plot
interaction.plot(x.factor, trace.factor, response) [the first two arguments are the two factors in the analysis of variance; response is the DV]
interaction.plot(food,type,iron)
tapply(iron, food:colon,mean)
Error in tapply(iron, food:colon, mean) : object "colon" not found
tapply(iron, food:type,mean)
legumes:Aluminum legumes:Clay legumes:Iron meat:Aluminum
2.3300 2.4725 3.6700 2.0575
meat:Clay meat:Iron vegetables:Aluminum vegetables:Clay
2.1775 4.6800 1.2325 1.4600
vegetables:Iron
2.7900
If there are no significant interactions, we would see all parallel lines.
-You won't ever see a perfectly parallel interaction plot due to sampling error.
interaction.plot(type,food,iron)
There are a lot of statisticians who say that the complexity of these tests makes it more appropriate to infer results from plots rather than from complicated analyses.
If you are in an exploratory setting, it is probably OK to check only the plot rather than running a pairwise test.

November 25, 2008

Nov. 25, 2008

NOTE: The lower p value has the highest POWER.
Whatever lets us more easily reject the NULL gives us more power.

Trends and Forecasting in R


R can tell us about trends and how significant they are.
Trends are basically a specific type of regression model.
The best place to learn this is probably a business/forecasting course.
Basically you fit a line or a non-linear regression to the data points.
R's time-series tools (the ts class and related functions) are good for time series modeling.

Updating in R


You need to periodically check R packages, but you can update all at once.
Watch out for replacing old versions that have functions that you need.

Fit Indices and Structural Equation Modeling


Everything we do in the social sciences is about measuring latent traits, but there are always errors.
We can use the manifest variables (what we can measure) to help get at the latent variables (what we cannot measure),
but there are errors in measurement.
Structural equation modeling is a means of trying to estimate this error.
There are multiple packages: add-ons for SPSS and a product called AMOS.
We have a package in R, called "sem," that is not well developed.
FIT INDICES: How well does a particular model fit our data? One index of fit is the sum of squared errors.
You can interpret a fit index on its own or use it to compare models.

R and SPSS


Conceptual stuff translates on a theoretical and conceptual level, but not on a function or command level.
It is VERY VERY EASY to learn SPSS.

How do you read SPSS data into R


library(foreign)
read.suffix("FileName.suffix", to.data.frame = T)
e.g. for SPSS: new.data<-read.spss("FileName.sav", to.data.frame = T)

Rcommander


Point and click GUI for R
Also gives you the syntax
Very clunky for right now.
A good GUI is probably not a high priority

SPSS Reference


Go to Dr. Zieffler for a copy of SPSS to R comparison notes.

Quantitative Values to Factor Names


library(car)
recode(nameOfDataset$vector, "oldValue1='newFactorName1'; oldValue2='newFactorName2'")
Attach the data after you have recoded it.
You can recode a list or a range of values to one factor level using a colon (:) or c(X,Y,Z,...).
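A hypothetical example using the IronContent data from the Dec. 2 notes; the 3='Iron' mapping is an assumption, since only typenum values 1 and 2 appear in head(ir):
library(car)
ir$potType <- recode(ir$typenum, "1='Aluminum'; 2='Clay'; 3='Iron'")   # old value = 'new label'; 3='Iron' is assumed
ir$potType <- as.factor(ir$potType)                                    # make sure R treats the labels as a factor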

Creating tables in R


library(car)
library(xtable)
data(Prestige)
Warning message:
In data(Prestige) : data set 'Prestige' not found
library(car)
library(car)
data(Prestige)
attach(Prestige)

The following object(s) are masked from package:datasets :

women

m1<-aov(prestige~type)

package 'xtable' successfully unpacked and MD5 sums checked

The downloaded packages are in
C:\Documents and Settings\install\Local Settings\Temp\RtmpdcIxvZ\downloaded_packages
updating HTML package descriptions
library(xtable)
xt1 <- xtable(m1, caption = "A table", label = "tab:1")
print(xt1)
% latex table generated in R 2.7.2 by xtable 1.5-4 package
% Tue Nov 25 17:38:12 2008
\begin{table}[ht]
\begin{center}
\begin{tabular}{lrrrrr}
\hline
& Df & Sum Sq & Mean Sq & F value & Pr($>$F) \\
\hline
type & 2 & 19775.59 & 9887.80 & 109.59 & 0.0000 \\
Residuals & 95 & 8571.28 & 90.22 & & \\
\hline
\end{tabular}
\caption{A table}
\label{tab:1}
\end{center}
\end{table}
?xtable

Graphics in R


Paul Murrell is the author of the book R Graphics.

Charts and Bars


cars <- c(1, 3, 6, 4, 9)
pie(cars)
barbplot(cars)
Error: could not find function "barbplot"
barplot(cars)

Categorical data analysis


Chi-square, but you must have your data set up as a matrix
gendergap <- matrix(c(279, 73, 225, 165, 47, 191), byrow = TRUE,nrow = 2)
dimnames(gendergap) <- list(Gender = c("Females", "Males"), PartyID = c("Democrat", "Independent", "Republican"))

chisq.test(gendergap)

Pearson's Chi-squared test

data: gendergap
X-squared = 7.0095, df = 2, p-value = 0.03005

Chi-square
chisq.test(matrixName)

November 20, 2008

Nov. 20,2008

ir<-read.csv("IronContent.csv", header=T)
attach(ir)

Finding critical F in R


qf(lower-tail probability, df1, df2)
qf(.95,2,95)
[1] 3.092217

Two-Factor ANOVA Pt. 2


Y-sub-ijk = mu + alpha-sub-j + beta-sub-K + (alpha-beta)-sub-jK + epsilon-sub-ijK
alpha and beta are main effects and (alpha-beta)-sub-jk are the interaction effects.
epsilon-sub-ijK is error

Hypothesis


H-not: alpha-sub-j = 0 for all j
H-alt: alpha-subj != 0 for at least one j

or

H-not: beta-subK = 0 for all K
H-alt: beta-sub-K != 0 for at least one K
NOTE: a dot in a subscript means averaging over that index, giving the factor-one means, the factor-two means, and the grand mean

mu-dot 1 dot = mu dot 2 dot = mu dot 3 dot

or

mu-dot dot 1 = mu-dot dot 2 = mu-dot dot 3

Interaction effect hypotheses
H-not: (alpha-beta)-sub-jK = 0 for all j and K
H-alt: (alpha-beta)-sub-jK != 0 for at least one jK combination

model assumptions

1. Observations in each of the cells are independent
2. Normally distributed errors
3. Homogeneity of variance of errors
tapply(iron,food:type,var)
legumes:Aluminum legumes:Clay legumes:Iron meat:Aluminum
0.012333333 0.005091667 0.029800000 0.063491667
meat:Clay meat:Iron vegetables:Aluminum vegetables:Clay
0.386025000 0.394733333 0.053491667 0.211666667
vegetables:Iron
0.057533333
.394733333/.005091667
[1] 77.52536

We do not have homogeneity of variance with the raw variable (not residuals)

model<-aov(iron~food+type+food:type)
par(mfrow=c(2,2))
plot(model)
library(car)
qq.plot(model$residuals)
par(mfrow=c(1,1))
qq.plot(model$residuals)
In this case the assumptions are not met, but because there are only 4 observations in each cell, we should think about ignoring the violation.
plot(model)

NOTE: variance estimate is the sum of squares divided by degrees of freedom.

SIGMA(alpha-hat-sub-j^2) = variation due to factor 1
SIGMA(beta-hat-sub-K^2) = variation due to factor 2
SIGMA((alpha-beta)-hat-sub-jK^2) = variation due to the interaction
df for factor 1 = j - 1
df for factor 2 = K - 1
df for the interaction = (j - 1)(K - 1)
SIGMA(epsilon-hat-sub-ijK^2) = variation due to error
df for error = N - jK (total sample size minus the number of cells; see the quick check below)
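A quick check of these degrees of freedom for the iron data (j = 3 foods, K = 3 pot types, N = 36 observations), which should match the ANOVA table below:
j <- 3; K <- 3; N <- 36
j - 1               # df for factor 1 (food): 2
K - 1               # df for factor 2 (type): 2
(j - 1) * (K - 1)   # df for the interaction: 4
N - j * K           # df for error (residuals): 27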
anova(model)
Analysis of Variance Table

Response: iron
Df Sum Sq Mean Sq F value Pr(>F)
food 2 9.2969 4.6484 34.456 3.699e-08 ***
type 2 24.8940 12.4470 92.263 8.531e-13 ***
food:type 4 2.6404 0.6601 4.893 0.004247 **
Residuals 27 3.6425 0.1349
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
ANOVA Table = anova()
When you have a significant interaction we DO NOT interpret any main effects regardless of how significant.

We cannot look at them in isolation any longer because it is clear that they change when they are together.
If the interaction was not significant then you have a big dilemma.

Interpreting Interaction Effects


Post hoc comparisons
set up a variable that symbolizes each interaction cell and do a post hoc test (e.g., Fisher's, Tukey's)
-pairwise mean effects
Analysing how things vary within groups
-simple main effects (hard to do in R)
probably just take a look at the plot and you can "eyeball" the difference.

November 18, 2008

Nov. 18, 2008

NOTE: R has classes
R classifies any data with letters as factors.
If we have a factor that is stored as numbers we have to assign it as a factor in R.
In R it is called "coercion" to force R to think of integers as a factor.
-is.factor() asks if something is a factor or not.
-as.factor() forces R to think of a data vector as a factor.
--you have to force it as a factor every time you use it, or you could create a new variable using as.factor()
--e.g. oldVariable<-as.factor(oldVariable)
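A minimal sketch of coercion with a hypothetical numeric vector:
grp <- c(1, 1, 2, 2, 3, 3)
is.factor(grp)        # FALSE -- R sees plain numbers
grp <- as.factor(grp)
is.factor(grp)        # TRUE after coercion
levels(grp)           # "1" "2" "3"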
alco<-read.csv("Alcohol.csv",header=T)
head(alco)
sleep alcohol time_day steercor
1 1 1 1 4
2 1 1 1 18
3 1 1 1 8
4 1 1 1 10
5 1 2 1 23
6 1 2 1 15
NOTE: confint() - gives confidence intervals!

Finding functions in R


-e.g. help.search("what you're looking for")
does a string search on the packages you have downloaded.
Google it! "What you're looking for" plus R OR CRAN is probably the best search.
The "Introduction to R" PDF notes or "simpleR - Using R for Introductory Statistics" will probably have it.

Multiple factor (two-factor, between subjects) analysis of variance.


Main effects - the differences that are attributable to the factors alone and not in combination.
Interaction effects - the differences that are attributable to particular combinations of the factors.

Testing iron content in food by food type and pot type







            Iron  Aluminum  Clay
Meat          4       4       4
Veg           4       4       4
Legumes       4       4       4
(number of observations per cell)




-We typically call this type of analysis a "# of levels" X "# of levels" ANOVA (e.g. a 3 X 3 ANOVA).
ir<-read.csv("IronContent.csv", header=T)
attach(ir)
head(ir)
case food type typenum foodnum typefood iron
1 1 legumes Aluminum 1 2 12 2.40
2 2 legumes Aluminum 1 2 12 2.17
3 3 legumes Aluminum 1 2 12 2.41
4 4 legumes Aluminum 1 2 12 2.34
5 5 legumes Clay 2 2 22 2.41
6 6 legumes Clay 2 2 22 2.43
class(typenum)
[1] "integer"
-You need graphical representations of differences in pot, food, and combinations
boxplot(iron~food)
boxplot(iron~type)

boxplot for combinations


Interactions are specified in R by using a colon :
boxplot(iron~food:type)
Questions: Are there mean differences between the levels within each factor, or are there mean differences among the factor combinations (the interactions)?
We need to ask these questions because the analysis is different depending on the question.
-Getting descriptive statistics
tapply(iron, food, mean)
legumes meat vegetables
2.824167 2.971667 1.827500
tapply(iron, type, mean)
Aluminum Clay Iron
1.873333 2.036667 3.713333
tapply(iron, food:type, mean)
legumes:Aluminum legumes:Clay legumes:Iron meat:Aluminum
2.3300 2.4725 3.6700 2.0575
meat:Clay meat:Iron vegetables:Aluminum vegetables:Clay
2.1775 4.6800 1.2325 1.4600
vegetables:Iron
2.7900
tapply(iron, type:food, mean)
Aluminum:legumes Aluminum:meat Aluminum:vegetables Clay:legumes
2.3300 2.0575 1.2325 2.4725
Clay:meat Clay:vegetables Iron:legumes Iron:meat
2.1775 1.4600 3.6700 4.6800
Iron:vegetables
2.7900






            Iron  Aluminum  Clay
Meat        4.68    2.06    2.18
Veg         2.79    1.23    1.46
Legumes     3.67    2.33    2.47
Type mean   3.71    1.87    2.04
(cell and marginal means from the tapply() output above)

Equation Model


Y-sub-ijk = mu + alpha-sub-j + beta-sub-K + (alpha-beta)-sub-jK
alpha and beta are main effects and (alpha-beta)-sub-jK are the interaction effects.
alpha-sub-j = mu-sub-j - mu
beta-sub-K = mu-sub-K - mu
(alpha-beta)-sub-jK = mu-sub-jK - mu-sub-j - mu-sub-K + mu
Adding the error term gives the full model:
Y-sub-ijk = mu + alpha-sub-j + beta-sub-K + (alpha-beta)-sub-jK + epsilon-sub-ijK
epsilon-sub-ijK is error

November 13, 2008

Nov. 13, 2008

Multiple Comparisons (post hoc, contrast)

-pairwise contrasts
H-null: mu-Atkins = mu-LEARN
H-null: mu-Atkins = mu-Ornish
H-null: mu-Atkins = mu-Zone
H-null: mu-LEARN = mu-Ornish
H-null: mu-LEARN = mu-Zone
H-null: mu-Ornish = mu-Zone
diett<-read.table("DietData.dat",header=T)
attach(diett)
model<-aov(WeightChange~Diet)
summary(model)
Df Sum Sq Mean Sq F value Pr(>F)
Diet 3 1801 600 2.8255 0.03931 *
Residuals 244 51840 212
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
NOTE: per-contrast Type I error rate - the error rate for each individual contrast (alpha-sub-PC)
NOTE: familywise Type I error rate - the probability of making at least one Type I error across all the contrasts

Familywise Type I error


Familywise type one error rate = 1 - (1 - alpha-sub-PC)^K [K is number of tests total]
-Since we usually like a .05 error rate we solve the equation for alpha-sub-PC with alpha-sub-FW equalling .05
-Shorthand for that equation is alpha-sub-FW is approximately K(alpha-sub-PC); this is a conservative estimate.
1-((1-.05)^6)
[1] 0.2649081
or

6*.05
[1] 0.3
.3/6
[1] 0.05
So, alpha-sub-pc = desired familywise error rate / number of comparisons
Number of comparisons is #ofGroups(#ofGroups-1)/2
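A quick check of those two formulas for the four diet groups:
choose(4, 2)          # number of pairwise comparisons: 6
0.05 / choose(4, 2)   # Bonferroni-style per-contrast alpha: about .0083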
Fisher's Least Significant Difference (LSD) - the first multiple comparisons test there was.
-if you have exactly 3 groups this is the best test.
R code: pairwise.t.test(DV, IV, p.adjust.method="adjustment we want to use")
-for Fisher's LSD the adjustment argument is "none"
-read the help page for the various types of adjustment
pairwise.t.test(WeightChange,Diet,p.adjust="none")

Pairwise comparisons using t tests with pooled SD

data: WeightChange and Diet

Atkins LEARN Ornish
LEARN 0.105 - -
Ornish 0.038 0.651 -
Zone 0.006 0.271 0.520

P value adjustment method: none
-remember the p values have been adjusted, so we compare them to .05
The mean weight loss for Atkins vs. Ornish and Atkins vs. Zone appear to differ.
-The Bonferroni test is much more conservative than Fisher's (see the sketch below).
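For comparison, a sketch of the Bonferroni-adjusted version of the same call (output omitted):
pairwise.t.test(WeightChange, Diet, p.adjust.method="bonferroni")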
Tukey's Honestly Significant Difference (HSD)
-balanced design
-homogeneity of variance
Tukey-Kramer
-unbalanced design
-most software will automatically choose between HSD and Tukey-Kramer
Rcode library(multcomp)

package 'mvtnorm' successfully unpacked and MD5 sums checked
package 'multcomp' successfully unpacked and MD5 sums checked

The downloaded packages are in
C:\Documents and Settings\install\Local Settings\Temp\RtmpxVMZ1B\downloaded_packages
updating HTML package descriptions
diett<-read.table("DietData.dat",header=T)
attach(diett)
model<-aov(WeightChange~Diet)
library(multcomp)
Loading required package: mvtnorm
new<-glht(model,linfct(factorname="Tukey"))
new<-glht(model,linfct=mcp(factorname="Tukey"))
new<-glht(model,linfct=mcp(Diet="Tukey"))
summary(new)

Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: aov(formula = WeightChange ~ Diet)

Linear Hypotheses:
Estimate Std. Error t value p value
LEARN - Atkins == 0 4.196 2.582 1.625 0.3662
Ornish - Atkins == 0 5.406 2.593 2.085 0.1609
Zone - Atkins == 0 7.122 2.570 2.771 0.0307 *
Ornish - LEARN == 0 1.210 2.672 0.453 0.9690
Zone - LEARN == 0 2.926 2.650 1.104 0.6873
Zone - Ornish == 0 1.716 2.662 0.645 0.9173
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)

new<-glht(model,linfct=mcp(Diet="Dunnett"))
new<-glht(model,linfct=mcp(Diet="Tukey"))
new2<-glht(model,linfct=mcp(Diet="Dunnett"))
summary(new2)

Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Dunnett Contrasts


Fit: aov(formula = WeightChange ~ Diet)

Linear Hypotheses:
Estimate Std. Error t value p value
LEARN - Atkins == 0 4.196 2.582 1.625 0.2516
Ornish - Atkins == 0 5.406 2.593 2.085 0.0987 .
Zone - Atkins == 0 7.122 2.570 2.771 0.0168 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)

-Dunnett treats one group as the control.
-use contrasts(factorName) to find out which group is the control; the control shows up as the row of all 0s.
contrasts(Diet)
LEARN Ornish Zone
Atkins 0 0 0
LEARN 1 0 0
Ornish 0 1 0
Zone 0 0 1
In this case control is Atkins
Look up how to change it in the notes
NOTE: Bonferroni can get a little too conservative and therefore increases your Type II error rate and reduces power.
-This is a good argument for the Tukey test, which is a good compromise.
plot(new)
par(mar=c(5, 8, 4, 2) + 0.1)
plot(new)
par(mar=c(5, 4, 4, 2) + 0.1)
-Plotting these confidence intervals will start to show you how these things differ.
?par
?conf.int
No documentation for 'conf.int' in specified packages and libraries:
you could try 'help.search("conf.int")'

November 11, 2008

Nov. 11, 2008

diett<-load.table("DietData.dat", header=T)
Error: could not find function "load.table"
diett<-read.table("DietData.dat", header=T)
summary(diett)
Case Diet WeightChange
Min. : 4.0 Atkins:68 Min. :-63.383
1st Qu.: 74.5 LEARN :60 1st Qu.:-16.369
Median :149.5 Ornish:59 Median : -5.897
Mean :151.7 Zone :61 Mean : -7.674
3rd Qu.:229.5 3rd Qu.: 1.516
Max. :311.0 Max. : 34.943
attach(diett)
names(diett)
[1] "Case" "Diet" "WeightChange"
plot(density(WeightChange,ker="e"))

ANOVA Pt. 4, Power analysis


-Cohen's f = measure of effect size for a >=3 group ANOVA test
Cohen's f estimate= f-hat
model <- aov(WeightChange~Diet)
anova(model)
Analysis of Variance Table

Response: WeightChange
Df Sum Sq Mean Sq F value Pr(>F)
Diet 3 1801 600 2.8255 0.03931 *
Residuals 244 51840 212
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
library(MBESS)
F2Rsquare(2,8255,3,244)
Error in F2Rsquare(2, 8255, 3, 244) : unused argument(s) (244)
F2Rsquare(2.8255,3,244)
[1] 0.03357342
Cohen's f = sqrt(Eta-squared/(1 - Eta-squared))

+ Cohen's f = sqrt(Eta-squared/(1 - Eta-squared))
f-hat = sqrt(.034/(1 - .034))
Error in f - hat = sqrt(0.034/(1 - 0.034)) : object "f" not found
f-hat = sqrt(.034/(1 - .034))
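A working version of the hand calculation attempted above, as a sketch only:
library(MBESS)                      # already loaded above
eta2 <- F2Rsquare(2.8255, 3, 244)   # 0.03357342, as before
f.hat <- sqrt(eta2 / (1 - eta2))
f.hat                               # about 0.19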
Use Gpower to run a power analysis

November 6, 2008

Nov. 6, 2008

ANOVA pt.3

diett<-read.table("DietData.dat", header=T)
attach(diett)

R code: model<-aov(DependentV~IndependentV)
-Error in the model includes both individual differences
AND other factors we didn't include AND measurement error.


-Reporting ANOVA results F(3, 244) = 2.8255

model <- aov(WeightChange~Diet)
par(mfrow=c(2,2))
plot(model)
summary(model)
Df Sum Sq Mean Sq F value Pr(>F)
Diet 3 1801 600 2.8255 0.03931 *
Residuals 244 51840 212
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
NOTE: If you have ordinal data, you have to create categories out of it to use the ANOVA analysis.
library(MBESS)
F2Rsquare(2.8255,3,244)
[1] 0.03357342

-F2Rsquare(Fvalue,df numerator, df denominator)
omega-squared = (SumOfSquares-betweengroups - (K-1)*MS-withingroups) / (SumOfSquares-total + MS-withingroups)
R-squared or Eta-hat-squared = SumOfSquares-treatment / SumOfSquares-total
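A hand calculation from the ANOVA table above (SS-Diet = 1801, SS-Residuals = 51840, MS-within = 212, K = 4 groups); the Eta-hat-squared line matches the F2Rsquare() result:
ss.b <- 1801; ss.w <- 51840; ms.w <- 212; K <- 4
ss.t <- ss.b + ss.w
ss.b / ss.t                               # Eta-hat-squared, about .034
(ss.b - (K - 1) * ms.w) / (ss.t + ms.w)   # omega-hat-squared, about .022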


Fixed Effects v. Random Effects v. Mixed Effects Model

-if I randomly selected my test treatments from the total population of all treatments then I could draw my conclusions back to ALL TREATMENTS
-This is a Random Effects Model
-We have been using a Fixed Effects Model

Confidence Interval

Eta-squared

ci.R2(R2=Eta-squared, df.1, df.2, conf.level)

ci.R2(R2=.3357342,df.1=3,df.2=244,conf.level=.95)
$Lower.Conf.Limit.R2
[1] 0.2336431

$Prob.Less.Lower
[1] 0.025

$Upper.Conf.Limit.R2
[1] 0.4253084

$Prob.Greater.Upper
[1] 0.025

ci.R2(R2=.03357342,df.1=3,df.2=244,conf.level=.95)
$Lower.Conf.Limit.R2
[1] 0

$Prob.Less.Lower
[1] 0.025

$Upper.Conf.Limit.R2
[1] 0.08072746

$Prob.Greater.Upper
[1] 0.025


November 4, 2008

Nov. 4, 2008

NOTE: Levene's test is TOO SENSITIVE, don't use it. Use plots of the errors. Use a 4:1 ratio of variances or less.

ANOVA pt. 2

F = sigma-hat-squared (between groups) / sigma-hat-squared (within groups)
Variance = sigma-hat-squared = SIGMA((y-sub-i - mu-hat)^2) / (n - 1)
Any variance estimate is a sum of squares divided by degrees of freedom.
SIGMA((y-sub-i - mu-hat)^2) = total variation = variability due to treatment + variability due to general differences
SIGMA((y-sub-i - mu-hat)^2) = SIGMA((mu-hat-sub-j - mu-hat)^2) + SIGMA((y-sub-i - mu-hat-sub-j)^2)
NOTE: degrees of freedom for between groups is K (number of groups) - 1
NOTE: degrees of freedom for within groups is N - K, or total sample size minus number of groups.
-Sums of squares and degrees of freedom are additive.
Grand mean = mu-hat
F = [SIGMA((mu-hat-sub-j - mu-hat)^2)/(K - 1)] / [SIGMA((y-sub-i - mu-hat-sub-j)^2)/(N - K)]
-Sometimes the BG and WG variances are called mean squares in the ANOVA literature.
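To illustrate the formula, the F ratio can be rebuilt by hand from the diet ANOVA table that appears earlier in these notes (SS-between = 1801 on 3 df, SS-within = 51840 on 244 df):
ms.between <- 1801 / 3       # about 600
ms.within <- 51840 / 244     # about 212
ms.between / ms.within       # F is about 2.83, matching the anova() output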

Stat model for ANOVA

Y-sub-ij = mu + alpha-sub-j + epsilon-sub-ij
-alpha-sub-j is mu-sub-j - mu
-epsilon is error

diet <- read.table("DietData.dat",header=T)
attach(diet)
head(diet)
Case Diet WeightChange
1 117 Atkins -63.38290
2 4 Atkins -55.55649
3 135 Atkins -48.50170
4 6 Atkins -42.54922
5 50 Atkins -35.71489
6 19 Atkins -34.94327
boxplot(WeightChange~Diet)
tapply(WeightChange,Diet,mean)
Atkins LEARN Ornish Zone
-11.726647 -7.530623 -6.320541 -4.604409
tapply(WeightChange,Diet,var)
Atkins LEARN Ornish Zone
255.7337 186.7806 218.8879 183.1665
We can use the 4:1-or-less rule to check the homogeneity of variance assumption (here the largest/smallest ratio is about 1.4, so it is met).
summary(diet)
Case Diet WeightChange
Min. : 4.0 Atkins:68 Min. :-63.383
1st Qu.: 74.5 LEARN :60 1st Qu.:-16.369
Median :149.5 Ornish:59 Median : -5.897
Mean :151.7 Zone :61 Mean : -7.674
3rd Qu.:229.5 3rd Qu.: 1.516
Max. :311.0 Max. : 34.943

ANOVA Hypotheses

H-null: alpha-sub-j = 0 for all j
H-alt: alpha-sub-j != 0 for at least one group (j)
also
H-null: mu-sub1 = mu-sub2 = mu-sub3 = ... mu-subk
H-alt: not all mu-subj are equal

ANOVA assumptions

- errors for all groups are normally distributed, they have a mean of zero, and there is homogeneity of variance across all groups.
library(car)
model <- aov(WeightChange~Diet)
## To get all plots on one graph
par(mfrow=c(2,2))
plot(model)
names(model)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "contrasts" "xlevels" "call" "terms"
[13] "model"
library(car)
qq.plot(model$residuals)
par(mfrow=c(1,1))
qq.plot(model$residuals)

box.cox.powers(WeightChange + 64)
Box-Cox Transformation to Normality

Est.Power Std.Err. Wald(Power=0) Wald(Power=1)
1.5286 0.1617 9.451 3.2682

L.R. test, power = 0: 227.3437 df = 1 p = 0
L.R. test, power = 1: 12.7243 df = 1 p = 4e-04
WC <- (WeightChange+64)^1.5
model2 <- aov(WC~Diet)
## To get all plots on one graph
par(mfrow=c(2,2))
plot(model2)
qq.plot(model2$residuals)

October 21, 2008

Oct. 21, 2008

Two Sample T-test Continued

MJ<-read.csv("Marihuana.csv",header=T)
attach(MJ)
model <- aov(DSST~Group)
## To get all plots on one graph
par(mfrow=c(2,2))
plot(model)
names(model)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "contrasts" "xlevels" "call" "terms"
[13] "model"
model$fitted.values
1 2 3 4 5 6 7 8 9
-5.111111 -5.111111 -5.111111 -5.111111 -5.111111 -5.111111 -5.111111 -5.111111 -5.111111
10 11 12 13 14 15 16 17
0.250000 0.250000 0.250000 0.250000 0.250000 0.250000 0.250000 0.250000
library(car)
qq.plot(model$residuals)
par(mfrow=c(1,1))
qq.plot(model$residuals)
We QQ plot the residuals to determine whether they are normal, as required for the two-sample t-test.

Assumptions of the two-sample t-test

-independence (not too big a deal in the t-test)
-normality of errors
-errors have a mean of zero
-homogeneity of variance (homoscedasticity)
tapply(DSST,Group,var)
CU NS
30.21429 39.11111
30 to 39 is far less than 4:1, so the homoscedasticity assumption is met.
library(car)
levene.test(DSST,Group)
Levene's Test for Homogeneity of Variance
Df F value Pr(>F)
group 1 0.0112 0.917
15
NOTE: SPSS does Levene's test by default and it runs it incorrectly!
par(mfrow=c(2,2))
plot(model)
?t.test
NOTE: t.test() assumes the variances are unequal by default, so you must set var.equal=T for the pooled test.
t.test(DSST~Group,var.equal=T)

Two Sample t-test

data: DSST by Group
t = 1.866, df = 15, p-value = 0.08172
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.7625956 11.4848178
sample estimates:
mean in group CU mean in group NS
0.250000 -5.111111

We fail to reject the null hypothesis, because it seems like there is no difference between chronic users and naive subjects.
It seems like you don't gain anything from being a chronic user.
Either there are no significant differences, or we have a POWER issue.
NOTE: Degrees of freedom for a two-sample t-test is (n-sub-1 - 1) + (n-sub-2 - 1); see the quick check below.
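A quick check with the group sizes in this data set (9 naive subjects, 8 chronic users):
(9 - 1) + (8 - 1)   # 15, matching the df in the equal-variance t-test output above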
t.test(DSST~Group)

Welch Two Sample t-test

data: DSST by Group
t = 1.8811, df = 15, p-value = 0.07951
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.713513 11.435735
sample estimates:
mean in group CU mean in group NS
0.250000 -5.111111

Cohen's d for effect size of two sample t-test

Because we assume homogeneity of variance, we can use the pooled standard deviation of the two samples.
See the notes on the two-sample t-test for the pooled standard deviation.
When we use the pooled SD we are estimating d, and instead of calling it d-hat, we call it g in the two-sample t-test.
g is an estimate of the effect difference between the two samples.
g is a point estimate for d.
library(MBESS)
?smd
smd(Group.1=subset(DSST,Group=="NS"),Group.2=subset(DSST,Group=="CU"))
[1] -0.906721
NOTE: If you have a balanced design, we will get d-hat. If we have an unbalanced design we get g in R.
ci.smd(smd=-.906721,n.1=8,n.2=9)
$Lower.Conf.Limit.smd
[1] -1.898482

$smd
[1] -0.906721

$Upper.Conf.Limit.smd
[1] 0.1119626

ci.smd(smd=-.906721,n.1=9,n.2=8)
$Lower.Conf.Limit.smd
[1] -1.898482

$smd
[1] -0.906721

$Upper.Conf.Limit.smd
[1] 0.1119626

ci.smd(smd=.906721,n.1=9,n.2=8)
$Lower.Conf.Limit.smd
[1] -0.1119626

$smd
[1] 0.906721

$Upper.Conf.Limit.smd
[1] 1.898482


October 14, 2008

Oct. 14, 2008

Power

Assume in a t-test that the NULL hypothesis is true, so our decision is based around an assumption that the t-distribution is at ZERO.

If H-not is false then the test statistic is really a part of the non-central t not the central t.
Type II error - H-not is false but we fail to reject it.
Since we are attempting to reject the NULL hypothesis, we say POWER is the probability of rejecting the NULL hypothesis when the null is false.

What affects power


1. sample size
2. effect size we are looking for - We need to get an idea of the center (lambda) of the non-central distribution. Your lambda parameter best guess is the t-statistic.
3. type I error rate (alpha level) so if we change alpha level to .01 it makes it more difficult to reject the NULL.

Original POWER problem (Vitamin D / Osteoporosis)

H-not: mu = 1 H-alt: mu not= 1

We failed to reject the null in our t-test, so what was the POWER, and what was the probability of a Type II error?
We fail to reject when |t| <= 2.04; this is the critical value with 32 DF and alpha .05 (about 2 standard errors).
t(32) = .498

So check it out:
-|critical t to reject| <= (mu-hat - mu-not)/(sigma-hat/sqrt(N)) <= |critical t to reject|
-2.04 <= (mu-hat - mu-not)/(sigma-hat/sqrt(N)) <= 2.04
-2.04 <= (mu-hat - 1)/(.66/sqrt(33)) <= 2.04
.776 <= mu-hat <= 1.224
P(BETA) = P(.776 <= mu-hat <= 1.224) given that mu = 1.06
so we replace the mu of 1 with 1.06 in the above equation.
We find that P(-2.45 <= t <= 1.46)

We therefore use R to find the cumulative density for 1.46 and 32 DF and subtract the cumulative density of -2.45 and 32DF
The probability of making the error is
> pt(1.46,32)-pt(-2.45,32)
[1] 0.9130113
So POWER = 1 - P(BETA)
and the probability that I will make the correct decision (the power) is 1 - .9130113, or .087 or 8.7%
So we may assume that we failed to reject not because there was no difference but because we had little statistical power.
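A rough check of that conclusion in base R, assuming a one-sample test with the estimates from above (n = 33, difference 1.06 - 1 = 0.06, SD about 0.66); the power comes out well under .80, in the same ballpark as the hand calculation:
power.t.test(n = 33, delta = 0.06, sd = 0.66, sig.level = 0.05,
             type = "one.sample", alternative = "two.sided")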

Computing power with G*Power 3

It's easy....

NOTE: .8 is an adequate to high degree of power.

a-priori POWER

NOTE: don't compute post-hoc power! You should do an a-priori power analysis. NSF even wants you to do it before you get a grant.

Maybe do an a-priori power analysis based on Cohen's guidelines if you are in psych. Or base it on other effect sizes previously computed. Ask some other people in the field.

NOTE: you want to keep your sample size from getting too large, just so you don't get significant results for extremely small effect sizes.

POWER compromise

Maybe we can't pay for enough participants. We have a set N and a set power we want, maybe .80. We have to determine a BETA-to-alpha ratio, thinking about how each kind of error would play out in the real world.

October 9, 2008

Oct. 9, 2008


> ###NOTE: Check Out Fred Mosteller & another: Federalist Papers analysis, statistical analysis of text.###


> ###

Confidence intervals for d

###
-You MUST meet the normality assumption! If not, the interval will be wrong.
> -This limits your use of d.

> ###

Bootstrapping

###
> ###-another means of getting confidence intervals.###
> ###sample from a sample with replacement of the values each time you resample.###
> ###

Another means of getting an accurate representation

###
> ###-use trimmed mean and winsorized SD of sample in the d equation.###
> ###

Power

###
> ###Research question: Did these women get over one hour of sun a week?
> ###

read.table() or read.csv() has an argument called na.strings

###
> ###-read.csv("...",header=T,na.strings="[whatever non-data string placeholder was used]")###

> elderly<-read.csv("Osteoporosis.csv",header=T,na.strings="NA")
> attach(elderly)
> class(elderly$avg_sun)
[1] "numeric"
> plot(density(avg_sun,kernel="e"))
Error in density.default(avg_sun, kernel = "e") :
'x' contains missing values
> ###NOTE: density() has an na.rm parameter that defaults to false so it will think that there are no missing values.###
> plot(density(avg_sun,kernel="e",na.rm=T))
> summary(avg_sun)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.1120 0.5729 0.8889 1.0570 1.3960 2.9610 24.0000
> lenght(avg_sun)
Error: could not find function "lenght"
> length(avg_sun)
[1] 57
> 57-24
[1] 33
> ###So we have 33 usable entries###
> ###We have lost a lot of data to missingness, and we may have introduced bias into our estimates.###
> ###Is there something inherently different, concerning sun exposure, about the people who don't answer that question?###
> ###If you cannot make the argument that the missing data is random you have to make caveats about the missing data.###
> ###---the best way to deal with it is probably to ignore it and write about it in the limitations.###
> library(car)
> qq.plot(avg_sun)
> ###-looks OK, no severe problems###
> mean(avg_sun,na.rm=T)
[1] 1.057165
> sun<-avg_sun
> sd(sun,na.rm=T)
[1] 0.6590523
> t.test(sun,mu=1)

One Sample t-test

data: sun
t = 0.4983, df = 32, p-value = 0.6217
alternative hypothesis: true mean is not equal to 1
95 percent confidence interval:
0.8234756 1.2908552
sample estimates:
mean of x
1.057165

> ###---we use 1 in the t test for mu because we are testing if they get more than one hour.###
> ###

We clearly fail to reject the null based on both the p value and the confidence interval.

###
> ###---What do we do now? Why did we not find a significant effect?###
> ###---1. it is actually true that these people don't get enough sun.###
> ###---or 2. our sample was too small.###
> library(MBESS)
> smd(Mean.1=1.057165,Mean.2=1,s=.6590523)
[1] 0.08673818
> ###Mean.1 = sample mean, Mean.2=test mean, s=sample standard deviation###
> ###We have a very small effect###
> ###Probability of making a type two error is very high with small sample sizes.###
> ###d-hat above was only .09, translates to a small effect###
> ###Ways of decreasing the chances of making a type two error###
> ###1. increase the alpha level (loosening alpha trades Type I error for Type II error)###
> ###2. a larger population effect size###
> ###3. increase the sample size###
>
> ###NOTE: Non-central Z or t means that the center of the distribution is not at zero.###
> ###The probability of making a type two error is the area of the actual non-central distribution that falls in the region of the central distribution that would tell us to fail to reject (see the sketch below).###
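A quick sketch of that picture: the central t (null true) versus a non-central t (null false), here with 32 df and a hypothetical noncentrality of 2:
curve(dt(x, df = 32), from = -5, to = 7, ylab = "Density", xlab = "t")   # central t
curve(dt(x, df = 32, ncp = 2), add = TRUE, lty = "dashed")               # non-central t (ncp = 2 is made up)
abline(v = c(-1, 1) * qt(.975, df = 32), lty = "dotted")                 # fail-to-reject region
# the area of the dashed curve between the dotted lines is the type II error rate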
>

October 7, 2008

Oct. 7, 2008

R version 2.7.2 (2008-08-25)
Copyright (C) 2008 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> ###NOTE: standardization makes scores across differing scales comparable###
> football<- read.table("Football.dat",headers = TRUE)
Error in read.table("Football.dat", headers = TRUE) :
unused argument(s) (headers = TRUE)
> football<- read.table("Football.dat",header = TRUE)
> attach(football)
> utils:::menuInstallPkgs()
--- Please select a CRAN mirror for use in this session ---
trying URL 'http://www.ibiblio.org/pub/languages/R/CRAN/bin/windows/contrib/2.7/RColorBrewer_1.0-2.zip'
Content type 'application/zip' length 40465 bytes (39 Kb)
opened URL
downloaded 39 Kb

package 'RColorBrewer' successfully unpacked and MD5 sums checked

The downloaded packages are in
C:\Documents and Settings\install\Local Settings\Temp\Rtmp7VVMMa\downloaded_packages
updating HTML package descriptions
> ######################################################################################
> ##### Create a plot that helps readers interpret the results of the analysis. Note
> ##### that two functions from the RColorBrewer library are used to obtain an
> ##### appropriate color palette. The code is further commented below.
> ######################################################################################
>
> library(RColorBrewer)
> display.brewer.pal(3,"Paired") # This will plot the colors from the Paired palette.
> brewer.pal(3,"Paired")# This gives the hexadecimal values for the colors.
[1] "#A6CEE3" "#1F78B4" "#B2DF8A"
> # see http://www.colorbrewer.org or the RColorBrewer documentation.
>
>
> plot(WGR, ylim=c(0,.05), xlim=c(0,110), ,type="n", ylab="Density", xlab="White Graduation Rate")# This creates an empty plot for us to fill.
> rect(xleft = 55.20, ybottom = -1, xright = 63.12, ytop = 1, border=NA, col="#B2DF8A")# This adds the confidence region.
> box()# The previous command shaded over the outline of the plot,
> # so we re-draw it.
> lines(density(WGR, kernel="e"), lwd=1.5)# Add the density plot with a thicker line.
> rug(WGR, col="#1F78B4")# Add a rug plot.
> arrows(55.20,.005, 63.12, .005, code=3, length=.1)# Add an arrow
> arrows(55.20,.045, 63.12, .045, code=3, length=.1)# Add the second arrow.
> abline(v=76, lty="dashed", lwd = 2, col="#1F78B4")# Add the vertical line for the hypothesized value.
> text(59.16, .025, "95% Confidence Interval", cex = 1.5, srt = 90)# Add the vertical text.
>
>
> plot(BGR, ylim=c(0,.05), xlim=c(0,110), ,type="n", ylab="Density", xlab="Black Graduation Rate")
> rect(xleft = 40.10, ybottom = -1, xright = 47.98, ytop = 1, border=NA, col="#B2DF8A")
> box()
> lines(density(BGR, kernel="e"), lwd=1.5)
> rug(BGR, col="#1F78B4")
> arrows(40.10,.005, 47.98, .005, code=3, length=.1)
> arrows(40.10,.045, 47.98, .045, code=3, length=.1)
> abline(v=49, lty="dashed", lwd = 2, col="#1F78B4")
> text(44.04, .025, "95% Confidence Interval", cex = 1.5, srt = 90)
> ###
> ###
> ###
> ###
> ###

How different is our sample mean from our hypothesized value?


> ###The unstandardized effect size is the absolute value of mu-hat minus mu-NULL.###
> ###The confidence interval for the mean answers how confident we are about how close the sample mean is to the population mean.###
> ###(The width of the interval reflects the possible variation in our mean based on the sample size.)###
> ###

Cohen's d


> ###If you want to standardise something you divide by standard deviation.###
> ###Cohen's d takes the unstandardized effect size |mu.hat - mu.null| and divides it by the standard deviation.###
> ###because we don't know the actual population mean and SD we must estimate them, and d becomes d.hat###
> sd(BGR)
[1] 13.86813
> ###Cohen says a standardized effect size of .2 = small, .5 = moderate, .8 = large###
> ###These have somehow become the law, but they are relative. His estimates were for psychological data only. My guess is they probably don't really apply, but these things have been locked in.###
> ###Remember to know your field to determine what is a large or small effect size.###
> mean(BGR)
[1] 44.04
> (44.04-49)/13.87
[1] -0.3576063
> |(44.04-49)|
Error: unexpected '|' in "|"
> ###How to find standardised effect size in R###
> library(MBESS)
Error in library(MBESS) : there is no package called 'MBESS'
> utils:::menuInstallPkgs()
also installing the dependency ‘gsl’

trying URL 'http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.7/gsl_1.8-11.zip'
Content type 'application/zip' length 582918 bytes (569 Kb)
opened URL
downloaded 569 Kb

trying URL 'http://www.ibiblio.org/pub/languages/R/CRAN/bin/windows/contrib/2.7/MBESS_1.0.1.zip'
Content type 'application/zip' length 658354 bytes (642 Kb)
opened URL
downloaded 642 Kb

package 'gsl' successfully unpacked and MD5 sums checked
package 'MBESS' successfully unpacked and MD5 sums checked

The downloaded packages are in
C:\Documents and Settings\install\Local Settings\Temp\Rtmp7VVMMa\downloaded_packages
updating HTML package descriptions
> ### you need to download MBESS library###
> library(MBESS)
> smd(Mean.1=mean(BGR),Mean.2=49,s-sd(BGR))
Error in smd(Mean.1 = mean(BGR), Mean.2 = 49, s - sd(BGR)) :
object "s" not found
> smd(Mean.1=mean(BGR),Mean.2=49,s=sd(BGR))
[1] -0.3576547
> ###If you are doing a 2-sided test, you have to remember to take the absolute value of your standardized effect size because you are not interested in directionality.###
>

> ###

Confidence Interval for d

###
> ###NOTE: At this point APA is going to require an interval estimate of effect size.###
> ###Lucky for us R has a function to figure out the d.hat confidence interval.###
> ###NOTE: With small effect sizes we often get very wide effect size confidence intervals.
>

> ###

ci.sm()

###
> ###-Used to calculate the standardized effect size confidence interval.###
> ###ci.sm(sm = standardized effect, N = n)###
> ci.sm(sm=smd(Mean.1=mean(BGR),Mean.2=49,s=sd(BGR)),N=50)
[1] "The 0.95 confidence limits for the standardized mean are given as:"
$Lower.Conf.Limit.Standardized.Mean
[1] -0.6419787

$Standardized.Mean
[1] -0.3576547

$Upper.Conf.Limit.Standardized.Mean
[1] -0.06990362

>

October 2, 2008

Oct. 2, 2008

ERROR

                         Reality
                  H-null = TRUE            H-null = False
Decision
  reject H-null   Type I error (alpha)     correct (1 - beta)
  fail to reject  correct (1 - alpha)      Type II error (beta)
-The probability of making a Type I error is your alpha level.
NOTE: if you increase your probability of making a Type I error you decrease the probability of making a Type II error, and vice versa, but it is not a linear relationship.
NOTE: the significance rate should probably be better than one in twenty, or .95.

R version 2.7.2 (2008-08-25)
Copyright (C) 2008 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> football<-read.table("Football.dat",header=T)
> attach(football)
NOTE: Comment the new code for each lab.
NOTE: At thirty the sampling distribution tends to become normal, but not always.

ERROR


See table and notes above.

What is the effect and is it useful?


NOTE: APA says that you must include an effect size.
>

Estimates


Point estimate: the best single-number guess. To guess the population mean we would use the sample mean.
> mean(WGR)
[1] 59.16
> mean(BGR)
[1] 44.04
NOTE: remember the t-test is about making sure that the populations are different in the way we see the samples to be different.
Interval estimate: an interval where we are "pretty sure" that the population characteristic lies.
-we want to make it useful and we want to be confident in it. Sampling error and the degree of confidence affect the width of the interval.
Point estimate - error and point estimate + error make up the bounds of our interval.
point est. - error <= parameter <= point est. + error
P(-1.96 <= Z <= 1.96) = .95
A Z score is the observation minus the mean, divided by the standard deviation of the distribution, so:
P((mu-hat - 1.96(sigma-hat/root-n)) <= mu <= (mu-hat + 1.96(sigma-hat/root-n))) = .95
mu-hat=59.16
sigma-hat=13.92
n=50
> sd(WGR)
[1] 13.92305
> err<-1.96*sd(WGR)/sqrt(50)
> mean(wgr)-err
Error in mean(wgr) : object "wgr" not found
> mean(WGR)-err
[1] 55.30073
> mean(WGR)+err
[1] 63.01927
our .95 confidence interval is therefore 55.30 - 63.02
> t.test(WGR, mu=76,alt="two.sided")

One Sample t-test

data: WGR
t = -8.5525, df = 49, p-value = 2.766e-11
alternative hypothesis: true mean is not equal to 76
95 percent confidence interval:
55.20311 63.11689
sample estimates:
mean of x
59.16

See above "95 percent confidence interval"
> t.test(WGR,mu=76,alt="two.sided",conf.level=.99)

One Sample t-test

data: WGR
t = -8.5525, df = 49, p-value = 2.766e-11
alternative hypothesis: true mean is not equal to 76
99 percent confidence interval:
53.88313 64.43687
sample estimates:
mean of x
59.16

Our .01 or 99 percent confidence interval is 53.88 to 64.44
When you give an interval estimate you are either right or wrong; a particular interval either contains the parameter or it doesn't.
Confidence intervals only work for two-sided tests. For a one-sided test you have to compute an asymmetric confidence interval.
> help(t.test)
If one did a one-sided test, one would not provide this type of effect size.
>