Recently in Praxes Category

Creating accessible tables and graphs with R

| No Comments

I was recently given the task of improving the accessibility of a statistical report, with the goal of meeting level AA of the Web Content Accessibility Guidelines 2.0. The report contains about 600 pages of tables and graphs in a DOC file. The rule of thumb for tables is to ensure that a visually impaired person can use the tab key to navigate easily across cells in a row from the top left of the table to the bottom right. Additionally, tables and graphs should contain alternative titles and descriptions.

Last year, I produced the report with the Rlogo.jpg packages xtable and R2HTML. James Falkofske notes that hypertext markup language (HTML) files are inherently fluid, allowing readers with a wide range of abilities and assistive technologies to access the content. HTML tables with little or no nesting generally meet the navigation guidelines, even when opened with Word and saved as a word processor document. However, alternative titles and descriptions are not applied to tables and graphs automatically.

It would be very time consuming to retroactively add alternative text to 600 pages of tables and graphs in Word, so I decided to look for ways to modify the defaults in the xtable and R2HTML packages. The example below shows that it's possible to add alternative text while writing tables and graphs to a HTML file. For tables, the goal is to add title and summary attributes to the <TABLE> tag. Those attributes will show up as alternative text when the HTML file is opened in Word. For graphs, the goal is to add title and "alt" attributes to the <IMG> tag. The graph's "alt" attribute will carry over to the alternative description field when the HTML file is opened in Word, but the title won't. However, adding title attributes will allow web browsers to display a float-over title for graphs as well as tables. Try it out by floating your cursor over the table and graph below.

########################################
#Load libraries.
library(xtable) #for tables
library(R2HTML) #for graphs
library(datasets) #for data

########################################
#Initiate a HTML file to export results.

#The file can be named with a DOC extension to facilitate opening it with Word.
name.file.htm <- "Accessible Tables and Graphs.doc"
HTML.title(
  "Add alternative titles and descriptions to tables and graphs in a HTML-based DOC file", 
  file = name.file.htm, HR = 1, append = F)

########################################
#Accessible table example

#Assign alternative title and description to a single object.
html.table.attributes <- paste('border=1', 
  'title="Table of days of frost and years of life expectancy"', 
  'summary="Table of days of frost and years of life expectancy"')

#The xtable(caption) argument places the title in the first row of the table, making it 
#more difficult to designate table headers to repeat across multiple pages in a document.  
#Produce the table title with HTML.title() instead.
HTML.title("Days of frost and years of life expectancy", file = name.file.htm, HR = 3)

#Create table and write to file.
data.frost.exp <- as.data.frame(state.x77[, c("Frost", "Life Exp")])
names(data.frost.exp)[2] <- "Life"
table.xtable <- xtable(data.frost.exp, digits = c(NA, 0, 2))
print(table.xtable, type = "html", html.table.attributes = html.table.attributes, 
  file = name.file.htm, border = 1, append = T)
HTML("<br>", file = name.file.htm)

########################################
#Accessible graph example

#Modify HTMLInsertGraph() by adding Title and Alt arguments.
HTMLInsertGraph <- function (GraphFileName = "", Caption = "", 
  Title = "untitled graph", Alt = "a graph", 
  GraphBorder = 1, Align = "center", 
  WidthHTML = 500, HeightHTML = NULL, 
  file = get(".HTML.file"), append = TRUE, ...) {
    cat("\n", file = file, append = append, ...)
    cat(paste("<p align=", Align, "><img src='", GraphFileName, 
      "' title='", Title, "' alt='", Alt, "' border=", GraphBorder, 
      if (!is.null(WidthHTML)) 
        paste(" width=", WidthHTML, sep = "") 
        else "", 
      if (!is.null(HeightHTML)) 
        paste(" height=", HeightHTML, sep = "")
        else "", ">", sep = "", collapse = ""), 
      file = file, append = TRUE, sep = "")
    if (Caption != "") 
      cat(paste("<br><i class=caption>", Caption, "</i>"), 
        file = file, append = TRUE, sep = "")
    invisible(return(TRUE))
}

#Assign alternative title and description to objects.
Title <- "Graph of days of frost and years of life expectancy"
Alt <- "Graph of days of frost and years of life expectancy"

name.file.plot <- "Days of frost and years of life expectancy.png"
png(name.file.plot, width = 350, height = 350)
plot(data.frost.exp, main = "Days of frost and years of life expectancy", family = "serif")
abline(lm(Life ~ Frost, data.frost.exp), col = "red")
dev.off()
HTMLInsertGraph(GraphFileName = name.file.plot, file = name.file.htm, Align = "left", 
  WidthHTML = 350, HeightHTML = 350, append = T, Title = Title, Alt = Alt)
HTML("<br>", file = name.file.htm)

Days of frost and years of life expectancy

Frost Life
Alabama 20 69.05
Alaska 152 69.31
Arizona 15 70.55
Arkansas 65 70.66
California 20 71.71
Colorado 166 72.06
Connecticut 139 72.48
Delaware 103 70.06
Florida 11 70.66
Georgia 60 68.54
Hawaii 0 73.60
Idaho 126 71.87
Illinois 127 70.14
Indiana 122 70.88
Iowa 140 72.56
Kansas 114 72.58
Kentucky 95 70.10
Louisiana 12 68.76
Maine 161 70.39
Maryland 101 70.22
Massachusetts 103 71.83
Michigan 125 70.63
Minnesota 160 72.96
Mississippi 50 68.09
Missouri 108 70.69
Montana 155 70.56
Nebraska 139 72.60
Nevada 188 69.03
New Hampshire 174 71.23
New Jersey 115 70.93
New Mexico 120 70.32
New York 82 70.55
North Carolina 80 69.21
North Dakota 186 72.78
Ohio 124 70.82
Oklahoma 82 71.42
Oregon 44 72.13
Pennsylvania 126 70.43
Rhode Island 127 71.90
South Carolina 65 67.96
South Dakota 172 72.08
Tennessee 70 70.11
Texas 35 70.90
Utah 137 72.90
Vermont 168 71.64
Virginia 85 70.08
Washington 32 71.72
West Virginia 100 69.48
Wisconsin 149 72.48
Wyoming 173 70.29


Graph of days of frost and years of life expectancy

Best breweries in Minnesota: Value-added ratings

| 1 Comment

A good friend of mine told me to check out Beer Advocate online. Their rating system does a good job of balancing simplicity (A-F grades) with thoroughness (measures of central tendency and dispersion). I also appreciate the way it encourages raters and readers to consider multiple dimensions of a beer's quality--its look, smell, taste, and feel--in addition to overall quality.

As a practitioner of multilevel and latent variable modeling, one aspect of Beer Advocate's rating system has given me pause: a lack of adjustment for consistently high- or low-raters. What's concerning about a simple average of ratings? Let's imagine a plausible scenario: a motivated group of beer aficionados with high standards get their hands on a limited-release beer and publish their reviews on Beer Advocate. The beer could be world class, but if only tough raters have tried it, then the average rating will appear lower. Conversely, a large number of uncritical raters could review a mediocre beer, resulting in an excellent rating on average.

Is my concern warranted? If so, which beers and breweries are rated highest after adjusting for raters and other factors? To answer those questions, I wrote some Rlogo.jpg code to gather Beer Advocate ratings of beers produced by Minnesota breweries. The code relies heavily on the XML and stringr packages. After gathering the ratings, I decided to check my assumption of within-rater consistency. The intraclass correlation coefficient of r = 0.13 indicates a small degree of consistency within raters, but enough to justify my concern.

I specified a multilevel model to obtain value-added ratings of beers and breweries while adjusting for rater and beer characteristics. For simplicity, I decided to focus on the overall rating, although I would like to consider multiple dimensions of beer quality (look, smell, taste, and feel) in future analyses. The model specifies random effects for raters, cross-classified with beers nested in breweries. The fixed effects were the serving type experienced by the rater for a given beer (e.g., can), the beer's alcohol by volume (ABV), and its style (e.g., porter).

.

Note that I transformed the overall rating from a 1-5 scale to logits. I did so to avoid predictions below the floor of 1 point and above the ceiling of 5 points on the original scale (i.e., to enforce lower and upper asymptotes) and to spread the ratings out at the tails (e.g., make it "harder" to go from 4.5 to 5 than going from 3.5 to 4). The chart below shows the resulting ogival relationship between the original point system and the logits.

Plot showing transformation from points to logits
Points_to_Logits.png

I estimated the model parameters with the lme4 package. Given the large number of fixed effects (and the need for future blog content), I'll describe the influence of serving type, ABV, and style at a later date. The value added by breweries after adjusting for rater and beer characteristics is shown in the table below. Value-added logits were transformed back to the original scale, with values in the table representing point deviations from the brewery average, but the logit and point ratings are virtually the same. August Schell Brewing Company recently won national accolades, but it came in second to a relative newcomer, Fulton Beer. Summit Brewing and Surly Brewing followed closely in the third and fourth spots. The results show some discrepancies between the value-added ratings and Beer Advocate's grades, which are based on the simple average of all beer ratings. For example, Harriet Brewing received an A-, compared to Schell's B grade, but Harriet's value-added rating is below average.

Breweries sorted by value-added ratings
Brewery Value-added (logits) Value-added (points) Beer Advocate grade
(average beer rating)
Fulton Beer 0.85 0.80 A-
August Schell Brewing Co., Inc. 0.83 0.79 B
Summit Brewing Company 0.80 0.76 B+
Surly Brewing Company 0.71 0.68 A-
Minneapolis Town Hall Brewery 0.27 0.27 A-
Lake Superior Brewing Company 0.18 0.18 B
Fitger's Brewhouse 0.10 0.10 B+
Olvalde Farm & Brewing Company 0.08 0.08 A-
Granite City Food & Brewery 0.04 0.04 B-
Flat Earth Brewing Company 0.03 0.03 B+
Brau Brothers Brewing Co., LLC 0.02 0.02 B
Pig's Eye Brewing Company 0.00 0.00 D+
Harriet Brewing -0.09 -0.09 A-
Lift Bridge Brewery -0.11 -0.11 B+
Leech Lake Brewing Company -0.14 -0.14 B+
Barley John's Brew Pub -0.26 -0.26 B+
McCann's Food & Brew -0.27 -0.26 B+
St. Croix Brewing Company, LLC -0.69 -0.66 C
Cold Spring Brewing Co. -0.99 -0.91 D+
Great Waters Brewing -1.37 -1.19 B+


A different picture emerges when one considers that value-added ratings are point estimates within a range of certainty. The prediction-interval plot below shows that we can't confidently conclude that Fulton Beer scored above average, let alone out performed Shell's, Summit, or Surly. We can confidently say, though, that Shell's, Summit, and Surly scored above average, and along with Minneapolis Town Hall Brewery, they significantly out-performed Cold Spring Brewing Co. and Great Waters Brewing.

Plot of breweries' value-added ratings with 95% prediction intervals
Breweries_Value-Added.png

The results highlight some of the benefits of multilevel modeling and the limits of value-added ratings. By controlling for beer characteristics and separately estimating variance components for raters and beers nested in breweries, I have reduced the degree to which raters' tastes and brewers' preferences for certain types of beers are confounding brewery ratings. That is, the value-added ratings reflect the brewery's overall quality with greater fidelity. Nevertheless, we should keep in mind that value-added ratings are based on error unexplained by theory/fixed effects, that they are point estimates within a range of certainty, and that one should interpret value-added ratings with caution because personal preferences still count. This analysis has motivated me to try beers by the breweries with high value-added ratings, but I know from experience that I like beers brewed by Brau Brothers, Lift Bridge, and Barley John's, even though their value-added point estimates place them in the bottom half of Minnesota breweries.

Using irtoys, plyr, and ggplot2 for test development

| No Comments

My job as a Quantitative Analyst at the Minnesota Department of Education requires me to "provide evaluation, statistical, and psychometric support to the Division of Research and Assessment through the entire process of design, development, and reporting to the Minnesota assessment system." That's an accurate job description, but a little vague. What I really do is write a lot of Rlogo.jpg code to apply item response theory (IRT) and compare item and test characteristics across multiple years and/or forms to maintain high quality.

Rlogo.jpg is great for plotting item and test characteristics across multiple years and/or forms, but it requires using a few packages in concert (i.e., to my knowledge there is no single package designed for this purpose). The irtoys package provides many useful functions for test development under IRT; the plyr package allows one to apply irtoys' functions to subsets of item parameters in a data set (e.g., by strand); and the ggplot2 package allows one to overlay and facet item and test curves.

In the following reproducible example, I graphically compare item and test characteristics of the 2009 and 2010 grade 3 Minnesota Comprehensive Assessments of reading ability. Note that for simplicity, I have excluded two polytomous items from 2009 and a few dichotomous items from 2010 from the published item parameter estimates. Also note that items have been assigned to strands arbitrarily in this example. Graphs such as these allow us to answer questions like, "Did the 2010 form yield as much information (i.e., measure with equal or less error) at the cut scores as the 2009 form?" The graphs also provide guidance for replacing undesirable items.


########################################
#Combine two years' worth of item parameter estimates from grade 3.

data.pars.2009 <- matrix(ncol = 3, byrow = T, c(0.57, -1.24, 0.17, 0.93, -0.74, 0.25, 
  1.45, -0.45, 0.19, 0.97, -0.5, 0.13, 0.95, -0.8, 0.62, 1.07, -0.38, 0.24, 0.87, 
  0.37, 0.22, 0.61, -1.45, 0.02, 0.84, -1.03, 0.19, 0.72, -0.23, 0.12, 0.93, -1.6, 
  0.25, 0.68, -2.02, 0.06, 1.1, -1.5, 0.36, 0.74, -1.6, 0.51, 1.01, -1.01, 0.21, 
  0.81, -1.52, 0.16, 0.93, -0.5, 0.11, 0.34, -0.75, 0.02, 0.92, -0.93, 0.19, 1.14, 
  0.21, 0.26, 0.54, -0.59, 0.1, 0.86, -0.52, 0.28, 1.04, -1.77, 0.05, 0.84, -1.68, 
  0.02, 1.46, -1.7, 0.17, 1.3, -1.16, 0.21, 0.51, -0.62, 0.1, 1.09, -1.04, 0.17, 
  0.66, -2.06, 0.09, 1.11, -0.61, 0.11, 1.34, -0.9, 0.26, 1.3, -1.24, 0.19, 1.37, 
  -1.72, 0.24, 1.09, -1.37, 0.16, 0.89, -0.94, 0.08, 1.24, -1.38, 0.14, 0.79, -1.32, 
  0.11, 1.09, -1.69, 0.13))
data.pars.2009 <- data.frame(2009, data.pars.2009)
names(data.pars.2009) <- c("year", "a", "b", "c")
data.pars.2009$strand <- factor(rep(1:4, length.out = 38), labels = paste("Strand", 1:4))

data.pars.2010 <- matrix(ncol = 3, byrow = T, c(1, -2.02, 0.05, 0.61, -0.88, 0.24, 
  0.74, -1.36, 0.08, 0.93, 0.63, 0.25, 0.94, 0.38, 0.23, 0.56, -0.14, 0.32, 0.7, 
  -0.73, 0.08, 1.1, -2.54, 0.02, 1.17, -0.36, 0.22, 0.46, 0.76, 0.09, 0.77, -1.53, 
  0.09, 1.5, -1.27, 0.18, 1.01, -1.41, 0.1, 1.63, -1.61, 0.21, 1.33, -1.63, 0.13, 
  1.11, -1.01, 0.19, 1.05, -1.17, 0.07, 0.88, -0.33, 0.23, 0.56, -0.65, 0.28, 1.03, 
  0.45, 0.13, 1.29, -1.73, 0.12, 0.96, -1.54, 0.15, 0.62, -1.33, 0.09, 0.67, -1.41, 
  0.06, 0.74, 0.16, 0.3, 1.33, -0.94, 0.26, 1.31, -1.71, 0.23, 1.14, -1.42, 0.16, 
  0.96, -0.87, 0.09, 1.22, -1.36, 0.14, 0.86, -1.19, 0.17, 0.94, -2.48, 0.02, 0.81, 
  -1.96, 0.02, 0.98, -0.86, 0.13, 0.8, -0.68, 0.11, 0.72, -1.67, 0.04, 0.63, 0.32, 
  0.16, 1.24, -1.64, 0.33))
data.pars.2010 <- data.frame(2010, data.pars.2010)
names(data.pars.2010) <- c("year", "a", "b", "c")
data.pars.2010$strand <- factor(rep(1:4, length.out = 38), labels = paste("Strand", 1:4))

data.pars <- rbind(data.pars.2009, data.pars.2010)
data.pars$year <- factor(data.pars$year)
data.pars$id <- factor(1:nrow(data.pars))

#Apply D scaling constant.
data.pars$a <- data.pars$a * 1.7

#Theta values for plotting
thetas <- seq(-3, 3, by = 0.1)

#Theta cut scores
cuts.grade3 <- c(-1.40, -0.84, -0.01)


########################################
#Plot overlapping test characteristic curves.

funk <- function(z) with(trf(z[, c("a", "b", "c")], x = thetas), data.frame(Ability = x, f))
data.plot <- ddply(data.pars, "year", funk)

g1 <- ggplot(data.plot, aes(x = Ability, y = f)) + 
  geom_vline(xintercept = cuts.grade3, linetype = 2, lwd = 1, color = "darkgrey") +
  geom_line(aes(color = year, linetype = year), lwd = 1) + 
  scale_colour_manual(name = "Year", values = c("darkgoldenrod1", "darkblue")) + 
  scale_linetype_manual(name = "Year", values = 2:1) + 
  scale_y_continuous(name = "Expected score") + 
  opts(title = "Test characteristic curves")
g1

TCCs.png

The test characteristic curves appear comparable up to the "exceeds proficiency standard" cut score, above which the test became more difficult compared to 2009.


########################################
#Plot overlapping conditional standard errors of measurement (from test information).

funk <- function(z) with(tif(z[, c("a", "b", "c")], x = thetas), data.frame(Ability = x, f))
data.plot <- ddply(data.pars, "year", funk)
data.plot$f <- 1/sqrt(data.plot$f)

g1 <- ggplot(data.plot, aes(x = Ability, y = f)) + 
  geom_vline(xintercept = cuts.grade3, linetype = 2, lwd = 1, color = "darkgrey") +
  geom_line(aes(color = year, linetype = year), lwd = 1) + 
  scale_colour_manual(name = "Year", values = c("darkgoldenrod1", "darkblue")) + 
  scale_linetype_manual(name = "Year", values = 2:1) + 
  scale_y_continuous(name = "CSEM") + 
  opts(title = "Conditional standard errors of measurement")
g1

CSEMs.png

The conditional standard errors of measurement appear comparable, overall, except for the 2010 standard error at the exceeds-proficiency cut, which appears to slightly exceed the 0.25 rule of thumb.


########################################
#Plot overlapping CSEMs faceted by strand.

funk <- function(z) with(tif(z[, c("a", "b", "c")], x = thetas), data.frame(Ability = x, f))
data.plot <- ddply(data.pars, c("year", "strand"), funk)
data.plot$f <- 1/sqrt(data.plot$f)

g1 <- ggplot(data.plot, aes(x = Ability, y = f)) + 
  geom_vline(xintercept = cuts.grade3, linetype = 2, lwd = 1, color = "darkgrey") +
  geom_line(aes(color = year, linetype = year), lwd = 1) + 
  scale_colour_manual(name = "Year", values = c("darkgoldenrod1", "darkblue")) + 
  scale_linetype_manual(name = "Year", values = 2:1) + 
  facet_wrap( ~ strand) + 
  scale_y_continuous(name = "CSEM") + 
  opts(title = "Conditional standard errors of measurement")
g1

CSEMs_Strand.png

It looks like strand 3 could use a highly discriminating difficult item to bring its CSEM in line with the corresponding 2009 CSEM at the exceeds-proficiency cut.


########################################
#Plot item characteristic curves faceted by strand.

funk <- function(z) with(irf(z[, c("a", "b", "c")], x = thetas), data.frame(Ability = x, f))
data.plot <- ddply(data.pars, c("id", "year", "strand"), funk)

g1 <- ggplot(data.plot, aes(x = Ability, y = f)) + 
  geom_vline(xintercept = cuts.grade3, linetype = 2, lwd = 1, color = "darkgrey") +
  geom_line(aes(group = id:year, color = year, linetype = year)) + 
  scale_colour_manual(name = "Year", values = c("darkgoldenrod1", "darkblue")) + 
  scale_linetype_manual(name = "Year", values = 2:1) + 
  facet_wrap( ~ strand) + 
  scale_y_continuous(name = "Probability", breaks = seq(0, 1, by = 0.25), limits = c(0, 1)) + 
  opts(title="Item characteristic curves")
g1

ICCs_Strand.png

Compared to 2009, there are fewer items from strands 1 and 2 that students are able to guess easily (i.e., high c parameters/lower asymptotes). As noted above, strand 3 lacks an item that discriminates at the exceeds-proficiency cut. Replacing the poorly discriminating difficult item in strand 2 with a higher discriminating one would also reduce measurement error of high-ability students.

Plotting survey responses representing three or more dimensions

| No Comments

Bob Williams recently posted a question to EVALTALK about open source software for plotting survey responses in three dimensions:

I'm wondering how to display Likert scale data in three dimensions (ie the Likert scores for each subject on 3 scale dimensions). My graphic software is not that sophisticated and I haven't been able to find an open source equivalent.... that can plot something along three dimensions (X,Y,Z) like this:
          X, Y, Z
Subject A 2, 5, 8
Subject B 3, 5, 9
Subject C 3, 7, 3

Another evaluator, Bill Harris, recommended Rlogo.jpg and ggplot2 in particular.

Bob's question and the mention of Rlogo.jpg inspired me to write up a brief demonstration of plotting multi-dimensional data. As I have mentioned in the past (see here, here, and here), plotting or mapping three or more dimensions can be difficult and potentially misleading because plots and maps are inherently two-dimensional. I would say that plotting three survey response dimensions actually requires plotting four dimensions: X, Y, Z, and the frequency of each X, Y, and Z combination. That's because evaluators commonly have large data sets with repeated response patterns. I simulated a larger data set from Bob's example in order to illustrate plotting four dimensions.

#Open a recording window (if using Windows).
windows(record=T)

#Manually read in the data.
data.3D <- data.frame(Subject = c("A", "B", "C"),
  X = c(2, 3, 3),
  Y = c(5, 5, 7),
  Z = c(8, 9, 3))
data.3D

#Larger data sets will contain frequently repeated combinations of responses.
#Simulate a larger data set with repeated responses combinations.
data.4D <- data.3D
for(i in 1:200) {
  temp <- data.frame(Subject=factor(paste(as.character(data.3D[, 1]), i, sep=".")), 
  X=round(jitter(data.3D[, 2], 7), 0), 
  Y=round(jitter(data.3D[, 2], 7), 0), 
  Z=round(jitter(data.3D[, 4], 7), 0))
  data.4D <- rbind(data.4D, temp)
}
library(plyr)
data.4D <- ddply(data.4D, c("X", "Y", "Z"), "nrow")
names(data.4D)[which(names(data.4D)=="nrow")] <- "Frequency"
data.4D <- data.4D[order(data.4D$Frequency, decreasing=T), ]

#Scatterplot matrices
pairs(data.3D[2:4], xlim=c(0, 10), ylim=c(0, 10), 
  main="Responses to three Likert scale items")
library(psych)
pairs.panels(sapply(data.4D[1:3], jitter), xlim=c(0, 10), ylim=c(0, 10), ellipses=F, 
  hist.col="blue", pch=1, scale=T, main="Responses to three Likert scale items")

Scatterplot_Matrix1_EVALTALK.png Scatterplot_Matrix2_EVALTALK.png

#Load ggplot2 library; set theme without gray background.
library(ggplot2)
theme_set(theme_bw())

#Create a scatterplot with size and color varying by the third dimension.
plot1 <- ggplot(data=data.3D, aes(x=X, y=Y, color=Z, size=Z)) + 
  geom_point() + 
  xlim(0, 10) + 
  ylim(0, 10) + 
  scale_colour_gradient(breaks=0:10, limits=c(0, 10), low="lightblue", high="darkblue") + 
  scale_size_continuous(breaks=0:10, limits=c(0, 10)) + 
  opts(title="Responses to three Likert scale items")
plot1

#Create a jittered scatterplot with color varying by the third dimension 
#and size by response combination frequency.
plot2 <- ggplot(data=data.4D, aes(x=X, y=Y, color=Z, size=Frequency)) + 
  geom_jitter() + 
  xlim(0, 10) + 
  ylim(0, 10) + 
  scale_colour_gradient(breaks=0:10, limits=c(0, 10), low="lightblue", high="darkblue") + 
  scale_size_continuous() + 
  opts(title="Responses to three Likert scale items")
plot2

ggplot1_EVALTALK.png ggplot2_EVALTALK.png

#Reshape data from wide to longer.
data.3D.longer <- melt(data.3D, id.vars=c("Subject", "Y"), variable_name="Item")
names(data.3D.longer)[which(names(data.3D.longer)=="value")] <- "Response"
data.3D.longer

#Create a scatterplot with faceted second and third dimensions.
plot3 <- ggplot(data=data.3D.longer, aes(x=Response, y=Y)) + 
  geom_point(color="blue") + 
  xlim(0, 10) + 
  ylim(0, 10) + 
  facet_grid(. ~ Item) + 
  opts(title="Responses to three Likert scale items")
plot3

#Reshape simulated data from wide to longer.
data.4D.longer <- melt(data.4D, id.vars=c("Y", "Frequency"), variable_name="Item")
names(data.4D.longer)[which(names(data.4D.longer)=="value")] <- "Response"
head(data.4D.longer)

#Create a scatterplot with faceted second and third dimensions 
#and size varying by response combination frequency.
plot4 <- ggplot(data=data.4D.longer, aes(x=Response, y=Y, size=Frequency)) + 
  geom_jitter(color="blue") + 
  xlim(0, 10) + 
  ylim(0, 10) + 
  facet_grid(. ~ Item) + 
  opts(title="Responses to three Likert scale items")
plot4

ggplot3_EVALTALK.png ggplot4_EVALTALK.png

Following Douglas Bates' advice for those required to produce p-values for fixed-effects terms in a mixed-effects model, I wrote a function to perform a likelihood ratio test for each term in an lmer() object. Bates has championed the notion that calculating the p-values for fixed effects is not trivial. That's because with unbalanced, multilevel data, the denominator degrees of freedom used to penalize certainty are unknown (i.e., we're uncertain about how uncertain we should be). As the author of lme4, the foremost mixed-effects modeling package in Rlogo.jpg, he has practiced what he preaches by declining to approximate denominator degrees of freedom as SAS does. Bates contends that alternative inferential approaches make p-values unnecessary. I agree with that position, focusing instead on information criteria and effect sizes. However, as a program evaluator, I recognize that some stakeholders find p-values useful for understanding findings.

A likelihood ratio test can be used to test if the sample size is large. According to Fitzmaurice, Laird, and Ware, twice the difference between the maximized log-likelihoods of two nested models, , represents the degree to which the reduced model is inadequate. When the sample size is large, with degrees of freedom equal to the difference in parameters between the full and reduced models, . The p-value is .

What if the sample size is not large? A p-value based on will be too liberal (i.e., the type I error rate will exceed the nominal p-value). More conservatively, we might say that with numerator and denominator degrees of freedom. According to Snijders and Bosker, the effective sample size lies somewhere between total micro-observations (i.e., at level one) and clusters randomly sampled in earlier stages (i.e., at higher levels). Formally, the effective sample size is , where observations are nested within each cluster and intraclass correlation is . Even if is large, (and statistical power) could be quite small if is small and is large: . Unbalanced designs, modeling three or more levels, and cross-level interactions add to our uncertainty about the denominator degrees of freedom.

The function I wrote chews up the lmer() model call and concatenates the frame and model matrix slots, after which it iteratively fits (via maximum likelihood instead of restricted ML) models reduced by each fixed effect and compares them to the full model, yielding a vector of p-values based on . As the example shows, the function can handle shortcut formulas whereby lower order terms are implied by an interaction term. The function doesn't currently handle weights, glmer() objects, or on-the-fly transformations of the dependent variable [e.g., log(dep.var) ~ ...]. The accuracy of resulting p-values depends on large sample properties, as discussed above, so I don't recommend using the function with small samples. I'm working on another function that will calculate p-values based on the effective sample size estimated from intraclass correlation. I will post that function in a future entry. I'm sure the following function could be improved, but I wanted to go ahead share it with other applied researchers whose audience likes p-values. Please let me know if you see ways to make it better.

p.values.lmer <- function(x) {
  summary.model <- summary(x)
  data.lmer <- data.frame(model.matrix(x))
  names(data.lmer) <- names(fixef(x))
  names(data.lmer) <- gsub(pattern=":", x=names(data.lmer), replacement=".", fixed=T)
  names(data.lmer) <- ifelse(names(data.lmer)=="(Intercept)", "Intercept", names(data.lmer))
  string.call <- strsplit(x=as.character(x@call), split=" + (", fixed=T)
  var.dep <- unlist(strsplit(x=unlist(string.call)[2], " ~ ", fixed=T))[1]
  vars.fixef <- names(data.lmer)
  formula.ranef <- paste("+ (", string.call[[2]][-1], sep="")
  formula.ranef <- paste(formula.ranef, collapse=" ")
  formula.full <- as.formula(paste(var.dep, "~ -1 +", paste(vars.fixef, collapse=" + "), 
                  formula.ranef))
  data.ranef <- data.frame(x@frame[, 
                which(names(x@frame) %in% names(ranef(x)))])
  names(data.ranef) <- names(ranef(x))
  data.lmer <- data.frame(x@frame[, 1], data.lmer, data.ranef)
  names(data.lmer)[1] <- var.dep
  out.full <- lmer(formula.full, data=data.lmer, REML=F)
  p.value.LRT <- vector(length=length(vars.fixef))
  for(i in 1:length(vars.fixef)) {
    formula.reduced <- as.formula(paste(var.dep, "~ -1 +", paste(vars.fixef[-i], 
                       collapse=" + "), formula.ranef))
    out.reduced <- lmer(formula.reduced, data=data.lmer, REML=F)
    print(paste("Reduced by:", vars.fixef[i]))
    print(out.LRT <- data.frame(anova(out.full, out.reduced)))
    p.value.LRT[i] <- round(out.LRT[2, 7], 3)
  }
  summary.model@coefs <- cbind(summary.model@coefs, p.value.LRT)
  summary.model@methTitle <- c("\n", summary.model@methTitle, 
                           "\n(p-values from comparing nested models fit by maximum likelihood)")
  print(summary.model)
}

library(lme4)
library(SASmixed)
lmer.out <- lmer(strength ~ Program * Time + (Time|Subj), data=Weights)
p.values.lmer(lmer.out)

Yields:

Linear mixed model fit by REML 
(p-values from comparing nested models fit by maximum likelihood)
Formula: strength ~ Program * Time + (Time | Subj) 
   Data: Weights 
  AIC  BIC logLik deviance REMLdev
 1343 1383 -661.7     1313    1323
Random effects:
 Groups   Name        Variance Std.Dev. Corr   
 Subj     (Intercept) 9.038486 3.00641         
          Time        0.031086 0.17631  -0.118 
 Residual             0.632957 0.79559         
Number of obs: 399, groups: Subj, 57

Fixed effects:
                Estimate Std. Error   t value p.value.LRT
(Intercept)     79.99018    0.68578 116.64000       0.000
ProgramRI        0.07009    1.02867   0.07000       0.944
ProgramWI        1.11526    0.95822   1.16000       0.235
Time            -0.02411    0.04286  -0.56000       0.564
ProgramRI:Time   0.12902    0.06429   2.01000       0.043
ProgramWI:Time   0.18397    0.05989   3.07000       0.002

Correlation of Fixed Effects:
            (Intr) PrgrRI PrgrWI Time   PrRI:T
ProgramRI   -0.667                            
ProgramWI   -0.716  0.477                     
Time        -0.174  0.116  0.125              
ProgrmRI:Tm  0.116 -0.174 -0.083 -0.667       
ProgrmWI:Tm  0.125 -0.083 -0.174 -0.716  0.477

Structural equation modeling for theory-driven evaluation

| No Comments

Just to show that I'm not the biggest slacker-blogger on the Web, I want to direct you to my guest-post on AEA365: A tip-a-day by and for evaluators. I chose my "tip" because I think structural equation modeling (SEM) and logic modeling would complement each other very well, but very few researchers have combined the two approaches. Those of us who use SEM know how important it is to have strong prior theory for model fit and valid conclusions. We could learn a lot from evaluators who are skilled at developing logic models. Conversely, theory-driven evaluators could improve their practice by carefully attending to statistical power, construct validity, attenuation due to measurement error, and the decomposition of total effects. I am very interested in hearing others' opinions on this issue, so please leave a comment here or at AEA365.

A logic model (left) operationalized as a partial mediation growth model (right)
AEA_Tip-a-Day_Logic_Model.png AEA_Tip-a-Day_Path_Diagram.png

I think I have settled on a dissertation topic: spatiotemporal piecewise regression evaluation. Spatiotemporal piecewise regression (SPR) refers to the analysis of longitudinal data from a spatial regression discontinuity (SRD) design with multiple pre-test observations. SRD offers a way to quasi-experimentally estimate local average treatment effects (LATEs) of geographically implemented programs or policies; SPR is a way to estimate change in LATEs over time. My dissertation will describe SPR methodology in detail, including validity threats, and demonstrate an SPR evaluation of an educational program. As discussed by Shadish, Cook, and Campbell, multiple pre-test observations will allow me to address validity questions.

I decided to conduct a preliminary SPR analysis of data from the educational program to determine if the topic would be feasible and to include some results in a conference paper proposal. As discussed in earlier posts (here, here, and here), Rlogo.jpg can be used to plot regression discontinuity fitted lines, showing the LATE at the treatment assignment cutoff point. The ggplot2 package can be used to plot SPR fitted lines and change in LATEs over time, but it's not easy. One reason is that ggplot2 has a steep learning curve and limits control over the legend's appearance, although its curve is not as steep as lattice's. Another reason SPR plots are difficult to produce is that they represent several dimensions: north/south/east/west (reduced to one-dimensional distance), time, and the outcome. The SPR plots below show that participation in the educational program is associated with an initially positive, small LATE that diminishes over time.

Spatiotemporal piecewise regression fitted line plots
spatiotemporal_piecewise_regression_LATEs.png spatiotemporal_piecewise_regression.png

Compressing KML files with R

| No Comments

Compressing Google Earth files with Rlogo.jpg is not well documented on the Web, so I thought I should share my approach. I am working on my presentation for the upcoming Minnesota Assessment Conference. I will be presenting geographic maps of 2010 Minnesota assessment results. To minimize clutter in the maps and enhance the experience of stakeholders, I am creating two versions of each map: a static PDF map with no school district labels and a Web-based, interactive map. Each PDF will link to a keyhole markup language (KML) file through Google Maps, much like this proficiency map described here.

The KML file sizes exceed Google Maps' interface limit, so I had to find a way to compress them into KMZ files. I could compress each KML file using Google Earth or 7-Zip, but since I am already creating the maps with Rlogo.jpg, I decided to find a way to compress the KML files via code. Gunzipping is easy in Rlogo.jpg (e.g., see gzfile), but Google Maps will not accept gunzipped KML files. Thanks to Duncan Temple Lang, the zip() function in library(Rcompression) will create KMZ files with standard ZIP compression. Here's a reproducible example:

library(maptools)
data(wrld_simpl)
sw <- slot(wrld_simpl[wrld_simpl$NAME=="South Africa",], "polygons")[[1]]
tf <- tempfile()
kmlPolygon(sw, kmlfile=paste(tf, ".kml", sep=""), name="South Africa", col="#df0000aa", lwd=5, border=4, kmlname="KMZ test", kmldescription="<i>KMZ</i> file created with <a href='http://www.r-project.org'>R</a>.")
install.packages("Rcompression", repos="http://www.omegahat.org/R")
library(Rcompression)
zip(zipfile=paste(tf, ".kmz", sep=""), files=paste(tf, ".kml", sep=""))
paste(tf, ".kmz", sep="") #path and filename

The Early Childhood Environment Rating Scale (Revised; ECERS-R) was designed to measure process quality in child care centers, but the data set that I have been analyzing contains ratings of center classrooms and family child care (FCC) homes. A sizable portion of the program's budget was spent on training ECERS-R raters. Training them to reliably use both the ECERS-R and the Family Day Care Rating Scale (FDCRS) was cost prohibitive.

Was it unfair to rate FCC homes with the ECERS-R instrument? I conducted a differential item functioning (DIF) analysis to answer that question and remedy any apparent biases. Borrowing a quasi-experimental matching technique (not for the purpose of inferring causality), I employed propensity score weighting to match center classrooms with FCC homes of similar quality. For each item j, classroom/home i's propensity score (i.e., predicted probability of being an FCC home) was estimated using the following logistic regression model:

,

where FCC is a focal group dummy variable (1 if home; 0 if center) and Totalc stands for corrected total score (i.e., mean of item scores excluding item j). Classroom weights were calculated from the inverse of a site's predicted probability of its actual type and normalized so the weights sum to the original sample size:

.

Such weights can induce overlapping distributions (i.e., balanced groups). In this case, the weights were were applied via weighted least squares (WLS) regression of item scores on corrected total scores and the focal group dummy, as well as an interaction term to consider the possibility of nonuniform/crossing DIF:

.

By minimizing , WLS gave classrooms/homes with weights greater than 1 more influence over parameter estimates and decreased the influence of those with weights less than 1.

The results suggest that three provisions for learning items function differentially and in a non-uniform manner, as indicated by statistically significant interactions (see the table below). None of the language/interaction items exhibited DIF. The regression coefficients suggest that differentially functioning ECERS items did not consistently favor one type of care over the other. After dropping the three differentially functioning items from the provisions for learning scale and re-running the DIF analysis, none of the remaining items exhibited differential functioning.

Summary of WLS estimates: Differentially functioning ECERS items

Estimate Std. Error t value Pr(> |t|)
Item 8: Gross motor equipment
Intercept 3.014 2.044 1.475 0.149
Corrected total 0.523 0.445 1.175 0.248
FCC home -5.234 2.278 -2.297 0.028
Corrected*FCC 1.161 0.498 2.332 0.026
Item 22: Blocks
Intercept -3.990 1.919 -2.079 0.045
Corrected total 1.608 0.403 3.994 0.000
FCC home 3.929 2.076 1.892 0.067
Corrected*FCC -0.893 0.437 -2.045 0.049
Item 25: Nature/science
Intercept -4.802 2.566 -1.872 0.070
Corrected total 1.833 0.540 3.393 0.002
FCC home 5.996 2.771 2.164 0.038
Corrected*FCC -1.443 0.584 -2.471 0.019

Plot of weighted observations and fitted lines: Three crossing DIF items and one fair item (Item 19: Fine motor activities)
ECERS_DIF.png

Scores from the final provisions for learning scale exhibited good reliability of α = 0.87. Scores from the language/interaction scale also exhibited good reliability of α = 0.86. None of the items detracted problematically from overall reliability (i.e., the most that α increased after dropping any item was 0.01 in one instance).

I want to caution readers against generalizing the factor structure and DIF findings beyond this study because the sample is small in size (n = 38 classrooms/homes) and was drawn from a specific population (urban child care businesses subject to local market conditions and state-specific regulations). I largely avoided making statistical inferences in the preceding analyses by exploring the data and comparing relative fit, but in this case I used a significance level of 0.05 to infer DIF. The data do not provide a large amount of statistical power, so there could be more items that truly function differentially. For the final paper, I may apply a standardized effect size criterion. Standardized effect sizes are used to identify DIF with very large samples because the p-values would lead one to infer DIF for almost every item. De-emphasizing p-values in favor of effect sizes may be appropriate with small samples, too.

I followed up the exploratory factor analysis (EFA) of the Early Childhood Environment Rating Scale (ECERS-R) with a confirmatory model comparison of the six-, two-, and one-factor solutions. The six-factor specification adhered to the original subscales in the ECERS-R instrument (after dropping highly skewed items); the two-factor specification simply restricted the minor loadings from the EFA to zero; and the one-factor specification restricted items to zero if they loaded less than |0.3| in the one-factor EFA solution.

The six-factor model did not converge, and the two-factor model exhibited better fit than the one-factor model. These results support earlier findings by Sakai and colleagues (2003) and Cassidy and colleagues (2005) that the ECERS-R measures two latent factors: provisions for learning and language/interaction experienced by children.

The goal of this analysis was to find a factorially simple solution. Higher order or hierarchical models of greater complexity would be worth considering in light of evolving theories and needs surrounding measures of child care quality and given that:

  • several ECERS-R items cross-loaded in the EFAs and were subsequently excluded
  • the overall fit of the two-factor model was poor
  • the estimated correlation between the latent factors was somewhat large (0.68).

Comparison of one- and two-factor models

CFA fit measureOne factorTwo factors
Model χ2750, df = 377, p < 0.01306, df = 208, p < 0.01
χ2 (null model)1139, df = 406653, df = 231
Goodness-of-fit index0.5090.645
Adjusted goodness-of-fit index0.4330.569
RMSEA index0.1640.113
Bentler-Bonnett NFI0.3410.532
Tucker-Lewis NNFI0.4520.742
Bentler CFI0.4910.768
SRMR0.1150.107
BIC-621-451

Measurement model (click image)
ECERS_Path_Diagram.png

The Early Childhood Environment Rating Scale, revised edition (ECERS-R; Harms, Clifford, & Cryer, 1998), is an instrument used widely to observe and rate levels of process quality in child care centers. The authors have divided the instrument into six subscales to help ensure broad and flexible coverage of various aspects of child care quality. (The ECERS-R actually contains seven subscales, but the last one pertains to parents and staff instead of process quality experienced by children.) Psychometric analyses suggest the ECERS-R actually measures two latent dimensions of quality at most. Perlman, Zellman, and Le (2004) concluded that process quality, as measured by the ECERS-R, is unidimensional. Sakai and colleagues (2003) and Cassidy and colleagues (2005) concluded that the ECERS-R measures two latent dimensions, although their factor loadings and interpretations differ across the two studies.

I am writing a paper to present at the upcoming American Educational Research Association (AERA) conference in Denver. The paper will report results from a structural equation mediation model of the influence of on-site child care professional development on school readiness through child care quality. Given the lack of agreement among the earlier psychometric analyses of the ECERS-R, I conducted an exploratory factor analysis to see if I could replicate findings from one of the earlier studies. As shown in the table below, my preliminary results and interpretations do not align perfectly with either of the two-factor solutions from the earlier studies, but the similarities helped me decide which items to drop and which of my interpretations to keep. I plan to use a confirmatory factor analysis to formally compare model fit between the one- and two-factor solution before estimating the mediation model. I hope the summary below will help others who are wrestling with the possibility of multiple dimensions of child care quality as measured by the ECERS-R.

Summary of ECERS-R two-factor solutions from three studies

Item numberItemSubscaleSakai and colleagues (2003)Cassidy and colleagues (2005)Moore (2010)Decision
1Indoor spaceSpace and furnishingsProvisions for learning  Drop
2Furniture for care, play, and learningSpace and furnishingsTeaching and interactions Provisions for learningDiscrepancy
3Furnishings for relaxationSpace and furnishingsTeaching and interactionsMaterials/activities Discrepancy
4Room arrangementSpace and furnishingsProvisions for learning  Drop
5Space for privacySpace and furnishings Materials/activitiesProvisions for learningKeep
6Space for gross motorSpace and furnishings  Provisions for learningDiscrepancy
7Child-related displaySpace and furnishingsTeaching and interactions  Drop
8Gross motor equipmentSpace and furnishingsProvisions for learning Provisions for learningKeep
9Greeting/departingPersonal care routinesTeaching and interactions  Drop
10Meals/ snacksPersonal care routinesProvisions for learning  Drop
11Nap/restPersonal care routines  Provisions for learningDiscrepancy
12Toileting/diaperingPersonal care routinesProvisions for learning  Drop
13Health practicesPersonal care routinesProvisions for learning  Drop
14Safety practicesPersonal care routinesProvisions for learning  Drop
15Books and picturesLanguage-reasoningProvisions for learningMaterials/activitiesProvisions for learningKeep
16Encouraging children to communicateLanguage-reasoning  Provisions for learningDiscrepancy
17Using language to develop reasoning skillsLanguage-reasoningTeaching and interactionsLanguage/interactionLanguage/interactionKeep
18Informal use of languageLanguage-reasoningTeaching and interactionsLanguage/interactionLanguage/interactionKeep
19Fine motorActivitiesTeaching and interactionsMaterials/activitiesProvisions for learningKeep
20ArtActivities Materials/activities Drop
21Music/movementActivitiesTeaching and interactions Language/interactionKeep
22BlocksActivitiesProvisions for learningMaterials/activitiesProvisions for learningKeep
23Sand/waterActivitiesProvisions for learning  Drop
24Dramatic playActivitiesProvisions for learningMaterials/activitiesProvisions for learningKeep
25Nature/scienceActivities Materials/activitiesProvisions for learningKeep
26Math/numbersActivitiesTeaching and interactionsMaterials/activitiesProvisions for learningKeep
27Use of TV, video, and/or computersActivities  Language/interactionDiscrepancy
28Promoting acceptance of diversityActivitiesProvisions for learning Provisions for learningKeep
29Supervision of gross motor activitiesInteraction  Language/interactionDiscrepancy
30General supervision of childrenInteractionProvisions for learningLanguage/interaction Discrepancy
31DisciplineInteraction Language/interactionLanguage/interactionKeep
32Staff-child interactionsInteractionProvisions for learningLanguage/interactionLanguage/interactionKeep
33Interactions among childrenInteractionTeaching and interactionsLanguage/interactionLanguage/interactionKeep
34ScheduleProgram structureProvisions for learning  Drop
35Free playProgram structureTeaching and interactions  Drop
36Group timeProgram structureProvisions for learningLanguage/interactionLanguage/interactionKeep
Note: Empty cells indicate loadings less than |0.3|, cross-loadings greater than |0.3|, or items skewed greater than |2|. Item 37, which asks about provisions for children with disabilities was excluded due to high missingness.

Multilevel Rasch: Estimation with R

| 2 Comments

My last post described how test development designs could be thought of as multistage sampling designs and how multilevel Rasch could be used to estimate parameters under that assumption. I decided to use Rlogo.jpg to compare estimates from the usual Rasch model, the generalized linear model (GLM) with a logit link, and generalized linear mixed-effects regression (GLMER), treating items as fixed then random effects. The GLMER model with item fixed effects exhibited the best fit of the Law School Admission Test (LSAT) data provided with the ltm package. Moreover, intraclass correlation has reduced the effective sample size relative to simple random sampling, lending further support to the multilevel Rasch approach.

#Load libraries.
library(ltm)
library(lme4)
library(xtable)

#Prepare data for GLM.
LSAT.long <- reshape(LSAT, times=1:5, timevar="Item", varying=list(1:5), direction="long")
names(LSAT.long) <- c("Item", "Score", "ID")
LSAT.long$Item <- as.factor(LSAT.long$Item)
LSAT.long$ID <- as.factor(LSAT.long$ID)

#Compute Rasch, GLM, and GLMER estimates and compare fit.
out.rasch <- rasch(LSAT, constraint=cbind(ncol(LSAT)+1, 1))
print(xtable(summary(out.rasch)$coefficients, digits=3), type="html")

value std.err z.vals
Dffclt.Item1 -2.872 0.129 -22.307
Dffclt.Item2 -1.063 0.082 -12.946
Dffclt.Item3 -0.258 0.077 -3.363
Dffclt.Item4 -1.388 0.086 -16.048
Dffclt.Item5 -2.219 0.105 -21.166
Dscrmn 1.000


out.glm <- glm(Score ~ Item, LSAT.long, family="binomial")
print(xtable(summary(out.glm)$coefficients, digits=3), type="html")

Estimate Std. Error z value Pr(> |z|)
(Intercept) 2.498 0.119 20.933 0.000
Item2 -1.607 0.138 -11.635 0.000
Item3 -2.285 0.135 -16.899 0.000
Item4 -1.329 0.141 -9.450 0.000
Item5 -0.597 0.152 -3.930 0.000


out.glmer <- glmer(Score ~ Item + (1|ID), LSAT.long, family="binomial")
print(xtable(summary(out.glmer)@coefs, digits=3, caption="Fixed effects"), type="html", caption.placement="top")
print(xtable(summary(out.glmer)@REmat, digits=3, caption="Random effects"), type="html", caption.placement="top", include.rownames=F)

Fixed effects
Estimate Std. Error z value Pr(> |z|)
(Intercept) 2.705 0.129 21.029 0.000
Item2 -1.711 0.145 -11.774 0.000
Item3 -2.467 0.142 -17.353 0.000
Item4 -1.406 0.148 -9.498 0.000
Item5 -0.623 0.160 -3.880 0.000

Random effects
Groups Name Variance Std.Dev.
ID (Intercept) 0.502 0.70852


out.glmer.re <- glmer(Score ~ (1|Item) + (1|ID), LSAT.long, family="binomial")
print(xtable(summary(out.glmer.re)@coefs, digits=3, caption="Fixed effects"), type="html", caption.placement="top")
print(xtable(summary(out.glmer.re)@REmat[,-6], digits=3, caption="Random effects"), type="html", caption.placement="top", include.rownames=F)

Fixed effects
Estimate Std. Error z value Pr(> |z|)
(Intercept) 1.448 0.379 3.818 0.000


Random effects
Groups Name Variance Std.Dev.
ID (Intercept) 0.45193 0.67226
Item (Intercept) 0.70968 0.84243


print(xtable(AICs.Rasch.GLM <- data.frame(Rasch=summary(out.rasch)$AIC, GLM=summary(out.glm)$aic, GLMER.fe=summary(out.glmer)@AICtab$AIC, GLMER.re=summary(out.glmer.re)@AICtab$AIC, row.names="AIC"), caption="Akaike's information criteria (AICs)"), type="html", caption.placement="top")

Akaike's information criteria (AICs)
Rasch GLM GLMER.fe GLMER.re
AIC 4956.11 4996.87 4950.80 4977.25


#Easiness estimates.
easiness <- data.frame(Rasch=out.rasch$coefficients[,1],
GLM=c(out.glm$coefficients[1], out.glm$coefficients[1]+out.glm$coefficients[-1]),
GLMER.fe=c(fixef(out.glmer)[1], fixef(out.glmer)[1]+fixef(out.glmer)[-1]),
GLMER.re=fixef(out.glmer.re)+unlist(ranef(out.glmer.re)$Item))
print(xtable(easiness, digits=3), type="html")

Rasch GLM GLMER.fe GLMER.re
Item 1 2.872 2.498 2.705 2.527
Item 2 1.063 0.891 0.994 0.917
Item 3 0.258 0.213 0.237 0.224
Item 4 1.388 1.169 1.299 1.201
Item 5 2.219 1.901 2.082 1.938


#Estimated probabilities of a correct response.
pr.correct <- sapply(easiness, plogis)
row.names(pr.correct) <- row.names(easiness)
print(xtable(pr.correct, digits=3), type="html")

Rasch GLM GLMER.fe GLMER.re
Item 1 0.946 0.924 0.937 0.926
Item 2 0.743 0.709 0.730 0.714
Item 3 0.564 0.553 0.559 0.556
Item 4 0.800 0.763 0.786 0.769
Item 5 0.902 0.870 0.889 0.874


#Difficulty estimates.
difficulties <- easiness*-1
print(xtable(difficulties, digits=3), type="html")

Rasch GLM GLMER.fe GLMER.re
Item 1 -2.872 -2.498 -2.705 -2.527
Item 2 -1.063 -0.891 -0.994 -0.917
Item 3 -0.258 -0.213 -0.237 -0.224
Item 4 -1.388 -1.169 -1.299 -1.201
Item 5 -2.219 -1.901 -2.082 -1.938


#Calculate design effects and effective samples sizes from intraclass correlation coefficients and sampling unit sizes.
multistage.consequences <- function(ICC, N) {
M <- nrow(LSAT.long)
n <- M/N #number of responses per item
deff <- 1+(n-1)*ICC
M.effective <- trunc(M/deff)
return(data.frame(ICC, M, N, n, deff, M.effective))
}
model.ICC.Item <- glmer(Score ~ 1 + (1|Item), family=binomial, data=LSAT.long)
ICC.Item <- as.numeric(VarCorr(model.ICC.Item)$Item)/(as.numeric(VarCorr(model.ICC.Item)$Item)+pi^2/3)
multistage.Item <- multistage.consequences(ICC.Item, 5)
model.ICC.ID <- glmer(Score ~ 1 + (1|ID), family=binomial, data=LSAT.long)
ICC.ID <- as.numeric(VarCorr(model.ICC.ID)$ID)/(as.numeric(VarCorr(model.ICC.ID)$ID)+pi^2/3)
multistage.ID <- multistage.consequences(ICC.ID, 1000)
multistage <- data.frame(cbind(t(multistage.Item), t(multistage.ID)))
names(multistage) <- c("Item", "ID")
print(xtable(multistage, digits=3), type="html")

Item ID
ICC 0.159 0.070
M 5000.000 5000.000
N 5.000 1000.000
n 1000.000 5.000
deff 160.295 1.281
M.effective 31.000 3903.000


#Plot test characteristic curves
plot.tcc <- function(difficulties, add, col, lty) {
thetas <- seq(-3.8,3.8,length=100)
temp <- matrix(nrow=length(difficulties), ncol=length(thetas))
for(i in 1:length(thetas)) {for(theta in thetas) temp[,which(thetas==theta)] <- plogis(theta-difficulties)}
if(missing(add)) {plot(thetas, colSums(temp), ylim=c(0,5), col=NULL, ylab=expression(tau), xlab=expression(theta), main="Test characteristic curves")}
lines(thetas, colSums(temp), col=col, lty=lty)
}
plot.tcc(difficulties=difficulties[,1], col="black", lty=1)
plot.tcc(difficulties=difficulties[,2], add=T, col="blue", lty=2)
plot.tcc(difficulties=difficulties[,3], add=T, col="red", lty=3)
plot.tcc(difficulties=difficulties[,4], add=T, col="green", lty=4)
legend("bottomright", c("Rasch", "GLM", "GLMER, fixed items", "GLMER, random items "), col=c("black", "blue", "red", "green"), lty=1:4)

Multilvel_Rasch_Test_Characteristic_Curves.png

Multilevel Rasch

| 2 Comments

After years of studying and applying quantitative social science research methods, it has become easier to catch glimpses of larger connections between seemingly independent methodological approaches. I have learned, for example, that survey development has a lot in common with test development. I have also learned that longitudinal data analysis is like spatial/geographic data analysis in that both attend to reference-based dependencies (i.e., one dimensional time and two-or-more dimensional space).

Many methods seem independent because they are taught in isolation as topics courses (e.g., survey methods) or only within disciplines where they are traditionally applied (e.g., social network analysis within sociology). I credit my professors with teaching me to see the larger connections. I also credit the University of Minnesota with offering interdisciplinary courses, such as Latent Variable Measurement Models and Path Analysis taught by Melanie Wall in Biostatistics--a course traditionally taught by psychologists.

Item response theory (IRT) scaling and multilevel modeling are two other quantitative methods that have more in common than they seem at first glance. The Rasch model is usually expressed as

,

where represents a test taker's latent ability, and represents the difficulty of an item. The Rasch model can also be expressed as a generalized linear model (GLM) with a logit link:

,

where is a dummy variable (1 if a correct response, 0 otherwise) and is the easiness of an item.

Test development designs can be thought of as multistage sampling designs in which test takers represent primary sampling units and item responses represent secondary sampling units nested within test takers. Items can also be thought of as primary sampling units in which responses are nested. If item responses exhibit sizable intraclass correlation, then mixed-effects modeling may be appropriate to account for loss of statistical power relative to simple random sampling. Mixed-effects models may also help with vertical scaling and identifying differential item functioning (DIF) by considering fixed effects for test taker age and group membership. Some authors treat items as fixed effects; while others treat them as random effects. In the latter case, the Rasch model can be expressed as a two-level mixed model with crossed random effects:

.

Regression discontinuity gallery: R code for nonparametric estimation

| No Comments

Choosing a functional form for the relationship between the assignment and outcome variable from a regression discontinuity design is an important step in estimating the local average treatment effect. Plotting nonparametric smoothers can suggest a functional form, and nonparametric inference may be more appropriate when a curvilinear relationship defies classification. Some researchers exclusively prefer locally-weighted regression because they feel it objectively and conservatively excludes distal observations, although the choice of bandwidth can be subjective.

The following example shows a way to obtain and plot nonparametric estimates with Rlogo.jpg. The approach yields an effect size estimate of about 2.1 standard deviations, which is over twice as large as the true effect (1 σ) and much worse than the parametric estimates (1.1 SD). Increasing the default span from 0.75 to 1 (not shown) improved the estimate slightly (1.3 SD), as did reducing it to 0.5 (1.6 SD). It would be interesting to conduct a simulation study to compare parametric and nonparametric effect size recovery over many random samples while systematically varying regression functions, bandwidths, and cutoff locations.

#Simulate and plot the data as described in the first part of the previous entry.

#Use the full sample to fit a locally-weighted regression line.
lo <- loess(y ~ x, surface="direct")
x.lo <- min(x):max(x)
pred <- predict(lo, data.frame(x=x.lo))
lines(x.lo, pred, col="red")

#Use treatment observations to fit a locally-weighted regression line.
lo.tx <- loess(y ~ x, data.frame(y, x)[which(x<0),], surface="direct")
x.lo.tx <- 0:min(x)
pred.tx <- predict(lo.tx, data.frame(x=x.lo.tx))
lines(x.lo.tx, pred.tx, col="blue")

#Use control observations to fit a locally-weighted regression line.
lo.c <- loess(y ~ x, data.frame(y, x)[which(x>=0),], surface="direct")
x.lo.c <- 0:max(x)
pred.c <- predict(lo.c, data.frame(x=x.lo.c))
lines(x.lo.c, pred.c, col="blue")

#Add legend.
legend("bottomright", bg="white", cex=.75, pt.cex=.75, c("Treatment observation", "Control observation", "Cutoff", "Population regression line ", "Piecewise local smoother", "Full local smoother"), lty=c(NA, NA, 2, 1, 1, 1), lwd=c(NA, NA, 2, 2, 1, 1), col=c("darkgrey", "darkgrey", "darkgrey", "black", "blue", "red"), pch=c(4, 1, NA, NA, NA, NA))

#Nonparametric estimate [95% confidence interval] of local average treatement effect.
yhat.tx <- predict(lo.tx, 0, se=T)
yhat.c <- predict(lo.c, 0, se=T)
yhat.tx$fit - yhat.c$fit
c((yhat.tx$fit - qt(0.975, yhat.tx$df)*yhat.tx$se.fit) - (yhat.c$fit + qt(0.975, yhat.c$df)*yhat.c$se.fit), (yhat.tx$fit + qt(0.975, yhat.tx$df)*yhat.tx$se.fit) - (yhat.c$fit - qt(0.975, yhat.c$df)*yhat.c$se.fit))

Yields 21.28786 [7.453552, 35.122162].

Local_Effect_Continuous_Quadratic_Relationship_Nonparametric.png

Regression discontinuity gallery: R code

| No Comments

A reader who is studying political science at the University of Chicago asked me how I plotted the regression discontinuity simulations in my previous blog entry. Here's a reproducible example for those of you conducting regression discontinuity analyses with Rlogo.jpg:

#Randomly sample from X~N(0, 100) and conveniently use mean cutoff to maximize power and bypass centering.
set.seed(100)
x <- rnorm(200, 0, 10)
z <- ifelse(x<0, 1, 0)
tx <- which(z==1)

#Randomly sample from T~N(50, 100) and specify the regression function.
set.seed(101)
y <- rnorm(200, 50, 10) + 10*z + 0.5*x - 0.025*x^2 #1 σ effect size

#Plot the observations and regression lines.
plot(y ~ x, col=NULL, xlab="Assignment", ylab="Outcome", ylim=c(min(y)-10, max(y)+10))
abline(v=0, lty=2, lwd=2, col="darkgrey")
points(x[tx], y[tx], col="darkgrey", pch=4)
points(x[-tx], y[-tx], col="darkgrey")
curve(50 + 10 + 0.5*x - 0.025*x^2, min(x)-10, 0, add=T, lwd=2)
curve(50 + 0.5*x - 0.025*x^2, 0, max(x)+10, add=T, lwd=2)
title(main="Large effect at cutoff; continuous quadratic relationship\n")
mtext(expression(italic(Y[i]) == 50 + 10*italic(Z[i]) + 0.5*italic(X[i]) - 0.025*italic(X[i])^2 + italic(epsilon[i])), line=.5)

I prefer to plot lines with curve() instead of predict() when the regression equation is simple. The former only requires coefficients and produces nice smooth curves; the latter requires a data frame and may result in jagged lines but is well-suited for lengthy equations with many interactions. Here's a way to plot fitted lines with parameter estimates from lm():

#Fit a regression discontinuity model.
model <- lm(y ~ z + x + I(x^2))
print(xtable(model), type="html")

Estimate Std. Error t value Pr(> |t|)
(Intercept) 49.5272 1.4028 35.31 0.0000
z 11.1037 2.1343 5.20 0.0000
x 0.5231 0.1173 4.46 0.0000
I(x^2) -0.0311 0.0055 -5.67 0.0000

#Add the fitted lines and legend.
coefs <- coefficients(model)
curve(coefs[1] + coefs[2] + coefs[3]*x + coefs[4]*x^2, min(x)-10, 0, add=T, col="blue")
curve(coefs[1] + coefs[3]*x + coefs[4]*x^2, 0, max(x)+10, add=T, col="blue")
legend("bottomright", bg="white", cex=.75, pt.cex=.75, c("Treatment observation ", "Control observation", "Cutoff", "Population regression line ", "Fitted line"), lty=c(NA, NA, 2, 1, 1), lwd=c(NA, NA, 2, 2, 1), col=c("darkgrey", "darkgrey", "darkgrey", "black", "blue"), pch=c(4, 1, NA, NA, NA))

Note the decreasing accuracy of predictions further away from the cutoff, where fewer observations lie.

Local_Effect_Continuous_Quadratic_Relationship_with_Fitted_Line.png

What happens when one misspecifies the functional form? In the following example, the linear misspecification poorly represents the true conditional mean, especially beyond the cutoff. However, the local effect size estimate (i.e., at the cutoff) compares favorably to the estimate from the correct specification. Both have overestimated the true local effect by about one-tenth of a standard deviation.

#Fit a misspecified regression discontinuity model.
model <- lm(y ~ z + x)
print(xtable(model), type="html")

Estimate Std. Error t value Pr(> |t|)
(Intercept) 46.9717 1.4296 32.86 0.0000
z 11.0570 2.2969 4.81 0.0000
x 0.4742 0.1259 3.77 0.0002

#Add the fitted lines to the first plot and update the legend.
coefs <- coefficients(model)
curve(coefs[1] + coefs[2] + coefs[3]*x, min(x)-10, 0, add=T, col="red")
curve(coefs[1] + coefs[3]*x, 0, max(x)+10, add=T, col="red")
legend("bottomright", bg="white", cex=.75, pt.cex=.75, c("Treatment observation ", "Control observation", "Cutoff", "Population regression line ", "Fitted line", "Misspecified fitted line"), lty=c(NA, NA, 2, 1, 1, 1), lwd=c(NA, NA, 2, 2, 1, 1), col=c("darkgrey", "darkgrey", "darkgrey", "black", "blue", "red"), pch=c(4, 1, NA, NA, NA, NA))

Local_Effect_Continuous_Quadratic_Relationship_with_Misspecified_Fitted_Line.png

Regression discontinuity gallery: Simulations in R

| No Comments

I recently presented a paper on spatial regression discontinuity at the annual conference of the American Evaluation Association in Orlando. I wanted to graphically illustrate regression discontinuity and its spatial analogue, so I simulated some examples in Rlogo.jpg. Plots such as these and William Trochim's can be a good way to convey some of the key concepts of regression discontinuity design and analysis, such as curvilinear relationships between the assignment and outcome variable, local treatment effects, and the questionable validity of extrapolating program inferences beyond the cutoff.

Local effect (left) and no effect (right) with continuous linear relationship between the assignment and outcome variable
Local_Effect_Continuous_Linear_Relationship.png No_Effect_Continuous_Linear_Relationship.png

Average effect (left) and no effect (right) with no relationship between the assignment and outcome variable
Average_Effect_No_Relationship.png No_Effect_No_Relationship.png

Local effects with curvilinear relationships between the assignment and outcome variable
Local_Effect_Continuous_Quadratic_Relationship.png Local_Effect_Continuous_Cubic_Relationship.png

Extrapolation: Local effect (left) and no local effect (right) with potentially larger effects beyond the cutoff due to discontinuous linear relationship between the assignment and outcome variable*
Local_Effect_Discontinuous_Linear_Relationship.png No_Effect_Discontinuous_Linear_Relationship.png
*Note: Extrapolations beyond the cutoff are rarely valid. Repetitious and abundant distal pretest observations may support extrapolating effect size estimates beyond the cutoff.

Spatial regression discontinuity: Local effect (left) and no effect (right) with continuous linear relationship between the assignment (distance from border) and outcome variable
Local_Effect_Continuous_Linear_SpatialRD.png No_Effect_Continuous_Linear_Relationship_SpatialRD.png

R workshop: Introduction to geographic mapping and spatial analysis

| 1 Comment

Back in March I led a workshop on geographic mapping and spatial analysis with R. I finally got around to running my notes and syntax through Sweave to create a single document. Click on the image below to access the workshop notes.

R_Workshop_Spatial_032609-006.png

Spatially enable evaluation: Bell Museum of Natural History

| No Comments
Thumbnail image for Touch and See Room - All Visitor Paths - 080809 - Cropped.png

Frances Lawrenz is one of my advisors. She knew about my interest in spatially enable evaluation and put me in touch with Bob Tornberg, a graduate student in Educational Policy and Administration who was leading an evaluation for the Bell Museum of Natural History. I had been looking for an opportunity to apply mapping and/or spatial analysis to a micro-level evaluation setting, such as a classroom. Bob wanted to give the primary intended users (PIUs) of the evaluation some information about paths traveled by visitors to the museum's Touch and See Room. Spatially enabled evaluation sounded like a mutually beneficial approach, so we decided to collaborate. I'm glad that Bob involved me before data collection began because I was able to suggest data recording procedures that later facilitated the mapping and analysis. I don't think it would be appropriate to share the statistical results here, but I think it's okay to share one of the maps. I think the maps turned out well, and I'm looking forward to hearing the PIUs' impressions.

Structural equation modeling of complex sample data in R: An update

| No Comments

My last post demonstrated a structural equation model (SEM) of complex sample data in Rlogo.jpg. I attempted to replicate Laura Stapleton's example, but my standard errors were larger than expected.

I wrote to Laura, and she graciously reviewed my results. She attributed the standard error discrepancies to a change in IRT score variances. Specifically, the ECLS-K math and reading scores have been re-scaled since the original data set was published, resulting in larger variances in the currently published version. She supplied me with her version of the original public-use data set.

The updated results with the original data set are nearly identical to her published results. I feel confident now about using ECLS-B jackknife replicate weights for SEM in Rlogo.jpg, and I learned a lot from attempting the replication.

Path diagram of U.S. kindergartners' cognitive achievement (indicated by reading, math, and general ability scores) regressed on number of residences in last four months Path_Diagram_Stapleton's_Data.png Barplot_of_Replication_Results_Stapleton's_Data.png

Structural equation modeling of complex sample data in R

| No Comments

I am using data from the Early Childhood Longitudinal Study (ECLS) Birth Cohort for my research assistantship with Judy Temple. I have an analysis in mind that will involve factor analysis and path analysis simultaneously (i.e., a structural equation model).

The ECLS-B data and other large microdata sets represent the population, offer good statistical power, and provide comprehensive measures, making them suitable for structural equation modeling. However, those advantages are often achieved through stratified cluster sampling, which nests participants within primary sampling units in order to ensure adequate representation of strata and hold down data collection costs. Moreover, individuals representing small groups in the population are oversampled, which requires analytically re-weighting those cases downward to reflect population proportions but not down-weighting sample sizes in the standard error calculations. Calculating standard errors under complex sampling conditions is not straightforward compared to simple random sampling.

Is it possible to fit structural equation models of complex sample data in Rlogo.jpg? Several statistical software programs, including the survey package, can perform standard analyses (e.g., means, generalized linear models) in a manner appropriate for complex sample data. However, hardly any programs offer the ability to fit structural equation models to such data. Using some guidance offered by John Fox, author of the sem package, and an excellent article by Laura Stapleton, I decided to give it a try with R.

Stapleton used data from the ECLS Kindergarten Cohort and two commercial statistical software packages to demonstrate a structural equation model that applies sampling weights and accounts for multistage sampling. Because the ECLS-K is publicly available and Rlogo.jpg is free, I was able to attempt her jackknife example. As hoped, the replication yielded parameter estimates that were comparable to Stapelton's, as well as standard errors that were larger than the naive standard errors. However, my jackknife standard errors were consistently larger than Stapleton's. I don't yet know why they were so large, but it will be good practice for me to find out. It will also be good practice to replicate her example of bootstrapping standard errors. I welcome any feedback about this approach.