# Recently in Praxes Category

## Measures of Effective Teaching weighting schemes: An improved plot

I came across the following plot while reading "Ensuring Fair and Reliable Measures of Effective Teaching".

Maybe I've spent too much time reading Andrew Gelman's blog because I noticed two issues with the plot that could be improved. The most serious issue is that the first two sets of bars display correlation coefficients, but the last set of bars displays squared correlation coefficients. It doesn't matter that the first two sets represent a type of validity and the last represents reliability. What matters is that the bars incorrectly imply a common scale for the entire figure and could mislead readers. Perhaps the authors made the somewhat common mistake of assuming that a reliability coefficient is a correlation coefficient because both are denoted r? It's probably just a graphic design mistake that the authors didn't catch. The second issue I noticed is far less serious (and is debatable). Since the bars do not represent counts, the coefficients should be displayed as dots.

I used to create an improved (but less colorful) version of their plot that displays only correlation coefficients with dots instead of bars. The new version shows more clearly that each variety of teacher effectiveness composite score "predicts" itself (from a test-retest perspective) better than any of them predict the criteria. In other words, the improved plot illustrates how score reliability is a necessary but insufficient condition that can act as a ceiling on validity.

## Applying generalizability theory with R: A package

I have contributed my generalizability functions to the QME: Quantitative Methods in Education package. I described my efforts in an earlier post. This project continues to be a great learning experience for me, especially in terms of seeing how classical test theory, generalizability theory, and multilevel modeling overlap. I will probably submit a proposal to demonstrate the package at either NCME or AERA in 2014. You can use the following commands to install the package:

library(devtools)
install_github("QME", "zief0002")


## Which schools are closing achievement gaps in Minnesota?

Minnesota has large and persistent achievement gaps. It's an empirically supported proposition no matter how you look at it. But if we are going to tackle the problem, it does matter how we look at it! Well-intentioned advocates and researchers sometimes talk about "the achievement gap" as if there is one definition. Are they referring to test score differences? Or graduation rates? Or gaps in educational opportunities? Are they referring to math? Or reading? Or science? Are they referring to students of color? Or students in poverty? Or English learners? Or special education students? Or all four groups? They don't often say. Referring to "the achievement gap" is good rhetoric, but I contend that a lack of specificity is impeding progress. I also contend that we can learn from schools that are closing achievement gaps.

Advocates and researchers would be wise to use the Minnesota Department of Education's (MDE) achievement gap reduction measure when possible. MDE quantifies gaps as the difference between the growth of historically under-achieving groups relative to more advantaged groups. The calculation yields a negative number if a school helped reduce achievement gaps in Minnesota and a positive number if a school added to existing gaps. (See page 78 of Minnesota's approved No Child Left Behind (NCLB) waiver for the methodology.) Here's why I think advocates and researchers should consistently publicize and track the achievement gap reduction measure:

• It's official. Minnesota's accountability system holds schools accountable for achievement gap reduction (i.e., it's one of the Multiple Measures). It's recognized by the U.S. Department of Education, it's reported annually by MDE, and it's increasingly familiar to teachers, administrators, and parents.
• It's fair. Student growth is a better measure of the contribution of a school or intervention than proficiency rates, which reflect the economic affluence and selectivity. When we over-emphasize proficiency it prevents us from finding schools in lower-income areas that are closing gaps. Growth measures only compare students and schools to those with similar achievement at the beginning of the school year.
• It's embodies a theory of action. As long as some groups are starting the school year behind, then the only way to close gaps is to accelerate their learning relative to their more advantaged peers. It's a race, and traditional achievement gap measures paint a picture of the starting line. The achievement gap reduction measure tells the pace of learning over a school year.

I used to create an interactive map to help advocates and researchers become familiar with the achievement gap reduction measure and encourage recognition of schools that are helping close achievement gaps in Minnesota. The data underlying the map are publicly available from MDE's Data Center. MDE does not publish achievement gap reduction for each student group--it only publishes overall achievement gap reduction measures (i.e., student-weighted averages across all historically lower-achieving student groups). I averaged the annual values over three years (2010-2012) to improve precision. You can zoom in and click on a school to learn more about it. Now that we know which schools are helping close achievement gaps in Minnesota, we should recognize their successes and learn from them. What are the successful schools doing? How can we encourage other schools to adopt their strategies?

Schools helping reduce achievement gaps in Minnesota: Top schools by type
Type School District Direction of gaps Achievement gap reduction
ALP Bemidji High ALP BEMIDJI PUBLIC SCHOOL DISTRICT Closing -0.13
CHARTER Twin Cities International Elem TWIN CITIES INTERNATIONAL ELEM SCH. Closing -0.33
CHARTER Rochester Math and Science Academy ROCHESTER MATH AND SCIENCE ACADEMY Closing -0.18
CHARTER Seed Academy and Harvest Prep school HARVEST PREP SCHOOL-SEED ACADEMY Closing -0.18
CHARTER Schoolcraft Learning Community SCHOOLCRAFT LEARNING COMMUNITY CHTR Closing -0.15
ELEM Atheneum Elem INVER GROVE HEIGHTS SCHOOLS Closing -0.71
ELEM Winsted Elem HOWARD LAKE-WAVERLY-WINSTED Closing -0.39
ELEM Kimberly Lane Elem WAYZATA PUBLIC SCHOOL DISTRICT Closing -0.36
ELEM Withrow Elem STILLWATER AREA PUBLIC SCHOOL DIST. Closing -0.32
ELEM Lake Marion Elem LAKEVILLE PUBLIC SCHOOL DISTRICT Closing -0.31
ELEM Goodview Elem WINONA AREA PUBLIC SCHOOL DISTRICT Closing -0.31
ELEM Forest Hills Elem EDEN PRAIRIE PUBLIC SCHOOL DISTRICT Closing -0.28
ELEM Washington Elem MANKATO PUBLIC SCHOOL DISTRICT Closing -0.27
ELEM Frazee-Vergas Elem FRAZEE-VERGAS PUBLIC SCHOOL DIST. Closing -0.26
HIGH Arlington High SIBLEY EAST SCHOOL DISTRICT Closing -0.31
HIGH St. Anthony Village High ST. ANTHONY-NEW BRIGHTON SCHOOLS Closing -0.21
HIGH Hawley High HAWLEY PUBLIC SCHOOL DISTRICT Closing -0.12
HIGH Mounds View High MOUNDS VIEW PUBLIC SCHOOL DISTRICT Closing -0.08
MIDD Minnetonka East MINNETONKA PUBLIC SCHOOL DISTRICT Closing -0.21
MIDD Valley View Middle EDINA PUBLIC SCHOOL DISTRICT Closing -0.18
MIDD Minnetonka West Middle MINNETONKA PUBLIC SCHOOL DISTRICT Closing -0.14
MIDD New Prague Middle NEW PRAGUE AREA SCHOOLS Closing -0.12
MIDD Wayzata Central Middle WAYZATA PUBLIC SCHOOL DISTRICT Closing -0.12
MIDD Sauk Rapids-Rice Middle SAUK RAPIDS-RICE PUBLIC SCHOOLS Closing -0.12
MIDD Mahtomedi Middle MAHTOMEDI PUBLIC SCHOOL DISTRICT Closing -0.10
MIDD Sartell Middle SARTELL-ST. STEPHEN SCHOOL DISTRICT Closing -0.10
OTHER Park Spanish Immersion El. ST. LOUIS PARK PUBLIC SCHOOL DIST. Closing -0.20
OTHER School of Environmental Studies ROSEMOUNT-APPLE VALLEY-EAGAN Closing -0.19
OTHER Friedell Middle ROCHESTER PUBLIC SCHOOL DISTRICT Closing -0.09

Remembering Hap Casmey inspired me to write this entry. I enjoyed getting to know Hap later in his life and hearing about his influence on education policy in Minnesota.

## Evaluation 2012 presentation

The annual conference of the American Evaluation Association came to Minneapolis last week. I enjoyed getting re-acquainted with many of my evaluation colleagues and learning about evaluation studies from around the world without leaving the city I call home. I presented a poster titled "Teacher evaluation and the Standards of Effective Instruction rating instrument: Psychometric considerations".

## Applying generalizability theory with R

| 1 Comment

One of the things I like about my current job is getting to apply generalizability theory. I have to confess that I didn't appreciate G theory at first, but now that I'm analyzing data from performance assessments (i.e., ratings of instructional effectiveness) I think it's great for a number of reasons. I like the concept of universe scores, treating items as a random facet, estimating multiple sources of error, the ability to focus on different objects of measurement (e.g., ratees and raters), and G theory's strong emphasis on finding ways to increase generalizability (i.e., decision studies). And as someone who has far more experience with item response theory (IRT) where conditional standard errors of measurement (CSEMs) are the norm, I'm especially intrigued by the prospect of using G theory to estimate CSEMs.

It's pretty obvious from this blog that I like statistical programming with because it helps me learn methods far better than simply reading about them or using a pre-existing software package. In that vein, I decided to write some functions to conduct G and D studies with . I've been using synthetic data from Brennan's book and Shavelson and Webb's book to ensure that my results match theirs. The nice thing about using for G theory is that you can choose functions for many different packages to fully apply G theory. For example, by using lme4 to estimate variance components, arm to extract standard errors of random intercepts, and ggplot2 to create faceted plots, I am able to create "vertical ruler" plots, which are prevalent among practitioners of many-facets Rasch. Such plots make it easy to identify the (statistically significantly) highest and lowest performers as well as the most lenient and stringent raters.

The plots below illustrate how I've been implementing G theory with . The plots were produced from Brennan's synthetic data set number 4, and the dependability coefficients match those from GENOVA. I've never developed an package, but given that no one else has so far, I'd be interested in doing so with the code I'm developing.

## Estimating congeneric reliability with R

| 1 Comment

I was recently tasked with estimating the reliability of scores composed of items on different scales. The lack of a common scale could be seen in widely varying item score means and variances. As discussed by Graham, it's not widely known in applied educational research that Cronbach's alpha assumes essential tau-equivalence and underestimates reliability when the assumption isn't met. A friend made a good suggestion to consider stratified alpha, which is commonly used to estimate internal consistency when scores are composed of multiple choice items (scored 0-1) and constructed response items (e.g., scored 0-4). However, the assessment with which I was concerned does not have clear item-type strata. I decided to estimate congeneric reliability because it makes few assumptions (e.g., unidimensionality) and doesn't require grouping items into essentially tau-equivalent strata.

With the help of Graham's paper and a LISREL tutorial by Raykov I wrote an program that estimates congeneric reliability. The program uses Fox's sem() library to conduct the confirmatory factor analysis with minimal constraints (variance of common true score fixed at 1 for identifiability). The estimated loadings and error variances are then summed to calculate reliability (i.e., the ratio of true score variance to observed score variance) as:

$\hat{\rho}_{XX'} = \frac { \left( \sum \hat{\lambda}_i \right) ^ 2} {\left( \sum \hat{\lambda}_i \right) ^ 2 + \sum \hat{\delta}_{i} }$.

I obtained a congeneric reliability estimate of 0.80 and an internal consistency estimate of 0.58 for the scores I was analyzing. If the items had been essentially tau-equivalent, then the reliability estimates would have been the same. If I had assumed tau-equivalence, than I would have underestimated the reliability of the total scores (and overestimated standard errors of measurement). The example below replicates Graham's heuristic example.

> ############################################
> #Replicate results from Graham (2006) to check congeneric reliability code/calculations.
>
> library(sem)
> library(psych)
> library(stringr)
>
> #Variance/covariance matrix
> S.graham <- readMoments(diag = T, names = paste("x", 1:7, sep = ""))
1:  4.98
2:  4.60  5.59
4:  4.45  4.42  6.30
7:  3.84  3.81  3.66  6.44
11:  5.71  5.67  5.52  4.91 11.86
16: 23.85 23.68 22.92 19.87 34.28 127.65
22: 46.53 46.20 44.67 38.57 62.30 244.36 471.95
29:
>
> ############################################
> #A function to estimate and compare congeneric reliability and internal consistency
> funk.congeneric <- function(cfa.out) {
+   names.loadings <- str_detect(names(cfa.out$coeff), "loading") + names.errors <- str_detect(names(cfa.out$coeff), "error")
+   r.congeneric <- sum(cfa.out$coeff[names.loadings]) ^ 2 / + (sum(cfa.out$coeff[names.loadings]) ^ 2 + sum(cfa.out$coeff[names.errors])) + round(c("Congeneric" = r.congeneric, "Alpha" = alpha(cfa.out$S)$total$raw_alpha), 2)
+ }
>
> ############################################
> #Congeneric model; tau-equivalent items
> model.graham <- specifyModel()
6: x1 <-> x1, error1
7: x2 <-> x2, error2
8: x3 <-> x3, error3
9: x4 <-> x4, error4
10: x5 <-> x5, error5
11: T <-> T, NA, 1
12:
> cfa.out <- sem(model = model.graham, S = S.graham, N = 60)
> summary(cfa.out)

Model Chisquare =  0.11781   Df =  5 Pr(>Chisq) = 0.99976
Chisquare (null model) =  232.13   Df =  10
Goodness-of-fit index =  0.9992
RMSEA index =  0   90% CI: (NA, NA)
Bentler-Bonnett NFI =  0.99949
Tucker-Lewis NNFI =  1.044
Bentler CFI =  1
SRMR =  0.0049092
AIC =  20.118
AICc =  4.6076
BIC =  41.061
CAIC =  -25.354

Normalized Residuals
Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-0.038700 -0.008860 -0.000002  0.004700  0.002450  0.119000

R-square for Endogenous Variables
x1     x2     x3     x4     x5
0.9286 0.8175 0.6790 0.4962 0.5968

Parameter Estimates
Estimate Std Error z value Pr(>|z|)
error1   0.35559  0.17469   2.0356  4.1793e-02 x1 <--> x1
error2   1.02000  0.25339   4.0255  5.6861e-05 x2 <--> x2
error3   2.02222  0.41688   4.8509  1.2293e-06 x3 <--> x3
error4   3.24471  0.62679   5.1767  2.2583e-07 x4 <--> x4
error5   4.78227  0.94911   5.0387  4.6871e-07 x5 <--> x5

Iterations =  21
> pathDiagram(cfa.out, edge.labels = "values", ignore.double = F, rank.direction = "TB")


> funk.congeneric(cfa.out)
Congeneric      Alpha
0.91       0.91
>
> ############################################
> #Congeneric model; tau-inequivalent items
> model.graham <- specifyModel()
6: x1 <-> x1, error1
7: x2 <-> x2, error2
8: x3 <-> x3, error3
9: x4 <-> x4, error4
10: x7 <-> x7, error7
11: T <-> T, NA, 1
12:
> cfa.out <- sem(model = model.graham, S = S.graham, N = 60)
> summary(cfa.out)

Model Chisquare =  0.0072298   Df =  5 Pr(>Chisq) = 1
Chisquare (null model) =  353.42   Df =  10
Goodness-of-fit index =  0.99995
RMSEA index =  0   90% CI: (NA, NA)
Bentler-Bonnett NFI =  0.99998
Tucker-Lewis NNFI =  1.0291
Bentler CFI =  1
SRMR =  0.0010915
AIC =  20.007
AICc =  4.497
BIC =  40.951
CAIC =  -25.464

Normalized Residuals
Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-2.76e-02 -1.27e-04 -1.00e-07 -1.92e-03  1.96e-04  3.70e-03

R-square for Endogenous Variables
x1     x2     x3     x4     x7
0.9303 0.8171 0.6778 0.4942 0.9902

Parameter Estimates
Estimate Std Error z value  Pr(>|z|)
error1    0.34688 0.090227   3.84451 1.2080e-04 x1 <--> x1
error2    1.02240 0.200635   5.09584 3.4720e-07 x2 <--> x2
error3    2.02973 0.382466   5.30694 1.1148e-07 x3 <--> x3
error4    3.25764 0.605376   5.38118 7.3998e-08 x4 <--> x4
error7    4.64661 6.387765   0.72742 4.6697e-01 x7 <--> x7

Iterations =  38
> pathDiagram(cfa.out, edge.labels = "values", ignore.double = F, rank.direction = "TB")


> funk.congeneric(cfa.out)
Congeneric      Alpha
0.99       0.56


## Converting Rasch units (RITs) to thetas

How do you place Measures of Academic Progress (MAP) Rash unit scores (RITs) on the normal theta metric (and vice versa)? I searched the Web but couldn't find a straightforward answer. What I've gathered from various sources is that RIT scores are just logits multiplied by 10 and centered on 200. From there, logits can be transformed to thetas based on the probabilities of the respective distributions, as shown in the code below.

RIT scores are new to me because I've worked mostly with the 3-parameter model until recently. I've noticed that a lot psychometricians who use item response theory (IRT) fall into one of two camps: Rasch or 3PL. My personal preference is somewhere in the middle. I think 1) it's unrealistic to assume items discriminate equally and 2) estimates of a are useful for test construction and adaptive selection. The c parameter, on the other hand, is costly to estimate and not entirely essential. So I guess you could say I'm a dispassionate 2PL guy.

> #Simulate some thetas (z-scores).
> thetas <- sort(rnorm(20))
> thetas
[1] -1.34485345 -1.08206871 -1.04673807 -0.98990444
[5] -0.96802821 -0.81766629 -0.56671883 -0.18795885
[9] -0.17779446 -0.08314090 -0.04615256  0.13611619
[13]  0.13796440  0.43172163  0.47472683  0.57965345
[17]  0.64315758  0.71571489  1.10205842  1.16067809
>
> #Convert thetas to probabilities then logits.
> p <- pnorm(thetas)
> logits <- log(p / (1 - p))
>
> #Convert logits to RITs.
> RITs <- logits * 10 + 200
> RITs
[1] 176.7823 181.8148 182.4653 183.5001 183.8947 186.5552
[7] 190.8243 196.9958 197.1587 198.6728 199.2634 202.1739
[13] 202.2035 206.9477 207.6533 209.3914 210.4565 211.6874
[19] 218.5559 219.6538
>
> #Place RITs on normal theta metric.
> funk.RITs.thetas <- function(x) {
+   qnorm(plogis((x - 200) / 10))
+ }
> funk.RITs.thetas(RITs)
[1] -1.34485345 -1.08206871 -1.04673807 -0.98990444
[5] -0.96802821 -0.81766629 -0.56671883 -0.18795885
[9] -0.17779446 -0.08314090 -0.04615256  0.13611619
[13]  0.13796440  0.43172163  0.47472683  0.57965345
[17]  0.64315758  0.71571489  1.10205842  1.16067809


## Creating accessible tables and graphs with R

I was recently given the task of improving the accessibility of a statistical report, with the goal of meeting level AA of the Web Content Accessibility Guidelines 2.0. The report contains about 600 pages of tables and graphs in a DOC file. The rule of thumb for tables is to ensure that a visually impaired person can use the tab key to navigate easily across cells in a row from the top left of the table to the bottom right. Additionally, tables and graphs should contain alternative titles and descriptions.

Last year, I produced the report with the packages xtable and R2HTML. James Falkofske notes that hypertext markup language (HTML) files are inherently fluid, allowing readers with a wide range of abilities and assistive technologies to access the content. HTML tables with little or no nesting generally meet the navigation guidelines, even when opened with Word and saved as a word processor document. However, alternative titles and descriptions are not applied to tables and graphs automatically.

It would be very time consuming to retroactively add alternative text to 600 pages of tables and graphs in Word, so I decided to look for ways to modify the defaults in the xtable and R2HTML packages. The example below shows that it's possible to add alternative text while writing tables and graphs to a HTML file. For tables, the goal is to add title and summary attributes to the <TABLE> tag. Those attributes will show up as alternative text when the HTML file is opened in Word. For graphs, the goal is to add title and "alt" attributes to the <IMG> tag. The graph's "alt" attribute will carry over to the alternative description field when the HTML file is opened in Word, but the title won't. However, adding title attributes will allow web browsers to display a float-over title for graphs as well as tables. Try it out by floating your cursor over the table and graph below.

########################################
library(xtable) #for tables
library(R2HTML) #for graphs
library(datasets) #for data

########################################
#Initiate a HTML file to export results.

#The file can be named with a DOC extension to facilitate opening it with Word.
name.file.htm <- "Accessible Tables and Graphs.doc"
HTML.title(
"Add alternative titles and descriptions to tables and graphs in a HTML-based DOC file",
file = name.file.htm, HR = 1, append = F)

########################################
#Accessible table example

#Assign alternative title and description to a single object.
html.table.attributes <- paste('border=1',
'title="Table of days of frost and years of life expectancy"',
'summary="Table of days of frost and years of life expectancy"')

#The xtable(caption) argument places the title in the first row of the table, making it
#more difficult to designate table headers to repeat across multiple pages in a document.
#Produce the table title with HTML.title() instead.
HTML.title("Days of frost and years of life expectancy", file = name.file.htm, HR = 3)

#Create table and write to file.
data.frost.exp <- as.data.frame(state.x77[, c("Frost", "Life Exp")])
names(data.frost.exp)[2] <- "Life"
table.xtable <- xtable(data.frost.exp, digits = c(NA, 0, 2))
print(table.xtable, type = "html", html.table.attributes = html.table.attributes,
file = name.file.htm, border = 1, append = T)
HTML("<br>", file = name.file.htm)

########################################
#Accessible graph example

#Modify HTMLInsertGraph() by adding Title and Alt arguments.
HTMLInsertGraph <- function (GraphFileName = "", Caption = "",
Title = "untitled graph", Alt = "a graph",
GraphBorder = 1, Align = "center",
WidthHTML = 500, HeightHTML = NULL,
file = get(".HTML.file"), append = TRUE, ...) {
cat("\n", file = file, append = append, ...)
cat(paste("<p align=", Align, "><img src='", GraphFileName,
"' title='", Title, "' alt='", Alt, "' border=", GraphBorder,
if (!is.null(WidthHTML))
paste(" width=", WidthHTML, sep = "")
else "",
if (!is.null(HeightHTML))
paste(" height=", HeightHTML, sep = "")
else "", ">", sep = "", collapse = ""),
file = file, append = TRUE, sep = "")
if (Caption != "")
cat(paste("<br><i class=caption>", Caption, "</i>"),
file = file, append = TRUE, sep = "")
invisible(return(TRUE))
}

#Assign alternative title and description to objects.
Title <- "Graph of days of frost and years of life expectancy"
Alt <- "Graph of days of frost and years of life expectancy"

name.file.plot <- "Days of frost and years of life expectancy.png"
png(name.file.plot, width = 350, height = 350)
plot(data.frost.exp, main = "Days of frost and years of life expectancy", family = "serif")
abline(lm(Life ~ Frost, data.frost.exp), col = "red")
dev.off()
HTMLInsertGraph(GraphFileName = name.file.plot, file = name.file.htm, Align = "left",
WidthHTML = 350, HeightHTML = 350, append = T, Title = Title, Alt = Alt)
HTML("<br>", file = name.file.htm)


### Days of frost and years of life expectancy

Frost Life
Alabama 20 69.05
Arizona 15 70.55
Arkansas 65 70.66
California 20 71.71
Connecticut 139 72.48
Delaware 103 70.06
Florida 11 70.66
Georgia 60 68.54
Hawaii 0 73.60
Idaho 126 71.87
Illinois 127 70.14
Indiana 122 70.88
Iowa 140 72.56
Kansas 114 72.58
Kentucky 95 70.10
Louisiana 12 68.76
Maine 161 70.39
Maryland 101 70.22
Massachusetts 103 71.83
Michigan 125 70.63
Minnesota 160 72.96
Mississippi 50 68.09
Missouri 108 70.69
Montana 155 70.56
New Hampshire 174 71.23
New Jersey 115 70.93
New Mexico 120 70.32
New York 82 70.55
North Carolina 80 69.21
North Dakota 186 72.78
Ohio 124 70.82
Oklahoma 82 71.42
Oregon 44 72.13
Pennsylvania 126 70.43
Rhode Island 127 71.90
South Carolina 65 67.96
South Dakota 172 72.08
Tennessee 70 70.11
Texas 35 70.90
Utah 137 72.90
Vermont 168 71.64
Virginia 85 70.08
Washington 32 71.72
West Virginia 100 69.48
Wisconsin 149 72.48
Wyoming 173 70.29

## Best breweries in Minnesota: Value-added ratings

| 1 Comment

A good friend of mine told me to check out Beer Advocate online. Their rating system does a good job of balancing simplicity (A-F grades) with thoroughness (measures of central tendency and dispersion). I also appreciate the way it encourages raters and readers to consider multiple dimensions of a beer's quality--its look, smell, taste, and feel--in addition to overall quality.

As a practitioner of multilevel and latent variable modeling, one aspect of Beer Advocate's rating system has given me pause: a lack of adjustment for consistently high- or low-raters. What's concerning about a simple average of ratings? Let's imagine a plausible scenario: a motivated group of beer aficionados with high standards get their hands on a limited-release beer and publish their reviews on Beer Advocate. The beer could be world class, but if only tough raters have tried it, then the average rating will appear lower. Conversely, a large number of uncritical raters could review a mediocre beer, resulting in an excellent rating on average.

Is my concern warranted? If so, which beers and breweries are rated highest after adjusting for raters and other factors? To answer those questions, I wrote some code to gather Beer Advocate ratings of beers produced by Minnesota breweries. The code relies heavily on the XML and stringr packages. After gathering the ratings, I decided to check my assumption of within-rater consistency. The intraclass correlation coefficient of r = 0.13 indicates a small degree of consistency within raters, but enough to justify my concern.

I specified a multilevel model to obtain value-added ratings of beers and breweries while adjusting for rater and beer characteristics. For simplicity, I decided to focus on the overall rating, although I would like to consider multiple dimensions of beer quality (look, smell, taste, and feel) in future analyses. The model specifies random effects for raters, cross-classified with beers nested in breweries. The fixed effects were the serving type experienced by the rater for a given beer (e.g., can), the beer's alcohol by volume (ABV), and its style (e.g., porter).

\begin{align*} \small{\text{Level 1: Ratings}}\\ \text{ln}\left(\tfrac{(Overall_{ijkl}-1)\div 4}{1-(Overall_{ijkl}-1)\div 4}\right) &= \beta_{0jkl} + \mathbf{Served}_{ij} \boldsymbol{\beta}_1 + e_{ijkl}\\ \small{\text{Level 2: Raters and beers}}\\ \beta_{0jkl} &= \beta_{00l} + \beta_{01} ABV_k + \mathbf{Style}_k \boldsymbol{\beta}_{02} + r_{0j} + r_{0kl}\\ \small{\text{Level 3: Breweries}}\\ \beta_{00l} &= \beta_{000} + r_{00l} \end{align*}.

Note that I transformed the overall rating from a 1-5 scale to logits. I did so to avoid predictions below the floor of 1 point and above the ceiling of 5 points on the original scale (i.e., to enforce lower and upper asymptotes) and to spread the ratings out at the tails (e.g., make it "harder" to go from 4.5 to 5 than going from 3.5 to 4). The chart below shows the resulting ogival relationship between the original point system and the logits.

Plot showing transformation from points to logits

I estimated the model parameters with the lme4 package. Given the large number of fixed effects (and the need for future blog content), I'll describe the influence of serving type, ABV, and style at a later date. The value added by breweries after adjusting for rater and beer characteristics is shown in the table below. Value-added logits were transformed back to the original scale, with values in the table representing point deviations from the brewery average, but the logit and point ratings are virtually the same. August Schell Brewing Company recently won national accolades, but it came in second to a relative newcomer, Fulton Beer. Summit Brewing and Surly Brewing followed closely in the third and fourth spots. The results show some discrepancies between the value-added ratings and Beer Advocate's grades, which are based on the simple average of all beer ratings. For example, Harriet Brewing received an A-, compared to Schell's B grade, but Harriet's value-added rating is below average.

(average beer rating)
Fulton Beer 0.85 0.80 A-
August Schell Brewing Co., Inc. 0.83 0.79 B
Summit Brewing Company 0.80 0.76 B+
Surly Brewing Company 0.71 0.68 A-
Minneapolis Town Hall Brewery 0.27 0.27 A-
Lake Superior Brewing Company 0.18 0.18 B
Fitger's Brewhouse 0.10 0.10 B+
Olvalde Farm & Brewing Company 0.08 0.08 A-
Granite City Food & Brewery 0.04 0.04 B-
Flat Earth Brewing Company 0.03 0.03 B+
Brau Brothers Brewing Co., LLC 0.02 0.02 B
Pig's Eye Brewing Company 0.00 0.00 D+
Harriet Brewing -0.09 -0.09 A-
Lift Bridge Brewery -0.11 -0.11 B+
Leech Lake Brewing Company -0.14 -0.14 B+
Barley John's Brew Pub -0.26 -0.26 B+
McCann's Food & Brew -0.27 -0.26 B+
St. Croix Brewing Company, LLC -0.69 -0.66 C
Cold Spring Brewing Co. -0.99 -0.91 D+
Great Waters Brewing -1.37 -1.19 B+

A different picture emerges when one considers that value-added ratings are point estimates within a range of certainty. The prediction-interval plot below shows that we can't confidently conclude that Fulton Beer scored above average, let alone out performed Shell's, Summit, or Surly. We can confidently say, though, that Shell's, Summit, and Surly scored above average, and along with Minneapolis Town Hall Brewery, they significantly out-performed Cold Spring Brewing Co. and Great Waters Brewing.

Plot of breweries' value-added ratings with 95% prediction intervals

The results highlight some of the benefits of multilevel modeling and the limits of value-added ratings. By controlling for beer characteristics and separately estimating variance components for raters and beers nested in breweries, I have reduced the degree to which raters' tastes and brewers' preferences for certain types of beers are confounding brewery ratings. That is, the value-added ratings reflect the brewery's overall quality with greater fidelity. Nevertheless, we should keep in mind that value-added ratings are based on error unexplained by theory/fixed effects, that they are point estimates within a range of certainty, and that one should interpret value-added ratings with caution because personal preferences still count. This analysis has motivated me to try beers by the breweries with high value-added ratings, but I know from experience that I like beers brewed by Brau Brothers, Lift Bridge, and Barley John's, even though their value-added point estimates place them in the bottom half of Minnesota breweries.

## Using irtoys, plyr, and ggplot2 for test development

My job as a Quantitative Analyst at the Minnesota Department of Education requires me to "provide evaluation, statistical, and psychometric support to the Division of Research and Assessment through the entire process of design, development, and reporting to the Minnesota assessment system." That's an accurate job description, but a little vague. What I really do is write a lot of code to apply item response theory (IRT) and compare item and test characteristics across multiple years and/or forms to maintain high quality.

is great for plotting item and test characteristics across multiple years and/or forms, but it requires using a few packages in concert (i.e., to my knowledge there is no single package designed for this purpose). The irtoys package provides many useful functions for test development under IRT; the plyr package allows one to apply irtoys' functions to subsets of item parameters in a data set (e.g., by strand); and the ggplot2 package allows one to overlay and facet item and test curves.

In the following reproducible example, I graphically compare item and test characteristics of the 2009 and 2010 grade 3 Minnesota Comprehensive Assessments of reading ability. Note that for simplicity, I have excluded two polytomous items from 2009 and a few dichotomous items from 2010 from the published item parameter estimates. Also note that items have been assigned to strands arbitrarily in this example. Graphs such as these allow us to answer questions like, "Did the 2010 form yield as much information (i.e., measure with equal or less error) at the cut scores as the 2009 form?" The graphs also provide guidance for replacing undesirable items.


########################################
#Combine two years' worth of item parameter estimates from grade 3.

data.pars.2009 <- matrix(ncol = 3, byrow = T, c(0.57, -1.24, 0.17, 0.93, -0.74, 0.25,
1.45, -0.45, 0.19, 0.97, -0.5, 0.13, 0.95, -0.8, 0.62, 1.07, -0.38, 0.24, 0.87,
0.37, 0.22, 0.61, -1.45, 0.02, 0.84, -1.03, 0.19, 0.72, -0.23, 0.12, 0.93, -1.6,
0.25, 0.68, -2.02, 0.06, 1.1, -1.5, 0.36, 0.74, -1.6, 0.51, 1.01, -1.01, 0.21,
0.81, -1.52, 0.16, 0.93, -0.5, 0.11, 0.34, -0.75, 0.02, 0.92, -0.93, 0.19, 1.14,
0.21, 0.26, 0.54, -0.59, 0.1, 0.86, -0.52, 0.28, 1.04, -1.77, 0.05, 0.84, -1.68,
0.02, 1.46, -1.7, 0.17, 1.3, -1.16, 0.21, 0.51, -0.62, 0.1, 1.09, -1.04, 0.17,
0.66, -2.06, 0.09, 1.11, -0.61, 0.11, 1.34, -0.9, 0.26, 1.3, -1.24, 0.19, 1.37,
-1.72, 0.24, 1.09, -1.37, 0.16, 0.89, -0.94, 0.08, 1.24, -1.38, 0.14, 0.79, -1.32,
0.11, 1.09, -1.69, 0.13))
data.pars.2009 <- data.frame(2009, data.pars.2009)
names(data.pars.2009) <- c("year", "a", "b", "c")
data.pars.2009$strand <- factor(rep(1:4, length.out = 38), labels = paste("Strand", 1:4)) data.pars.2010 <- matrix(ncol = 3, byrow = T, c(1, -2.02, 0.05, 0.61, -0.88, 0.24, 0.74, -1.36, 0.08, 0.93, 0.63, 0.25, 0.94, 0.38, 0.23, 0.56, -0.14, 0.32, 0.7, -0.73, 0.08, 1.1, -2.54, 0.02, 1.17, -0.36, 0.22, 0.46, 0.76, 0.09, 0.77, -1.53, 0.09, 1.5, -1.27, 0.18, 1.01, -1.41, 0.1, 1.63, -1.61, 0.21, 1.33, -1.63, 0.13, 1.11, -1.01, 0.19, 1.05, -1.17, 0.07, 0.88, -0.33, 0.23, 0.56, -0.65, 0.28, 1.03, 0.45, 0.13, 1.29, -1.73, 0.12, 0.96, -1.54, 0.15, 0.62, -1.33, 0.09, 0.67, -1.41, 0.06, 0.74, 0.16, 0.3, 1.33, -0.94, 0.26, 1.31, -1.71, 0.23, 1.14, -1.42, 0.16, 0.96, -0.87, 0.09, 1.22, -1.36, 0.14, 0.86, -1.19, 0.17, 0.94, -2.48, 0.02, 0.81, -1.96, 0.02, 0.98, -0.86, 0.13, 0.8, -0.68, 0.11, 0.72, -1.67, 0.04, 0.63, 0.32, 0.16, 1.24, -1.64, 0.33)) data.pars.2010 <- data.frame(2010, data.pars.2010) names(data.pars.2010) <- c("year", "a", "b", "c") data.pars.2010$strand <- factor(rep(1:4, length.out = 38), labels = paste("Strand", 1:4))

data.pars <- rbind(data.pars.2009, data.pars.2010)
data.pars$year <- factor(data.pars$year)
data.pars$id <- factor(1:nrow(data.pars)) #Apply D scaling constant. data.pars$a <- data.pars$a * 1.7 #Theta values for plotting thetas <- seq(-3, 3, by = 0.1) #Theta cut scores cuts.grade3 <- c(-1.40, -0.84, -0.01) ######################################## #Plot overlapping test characteristic curves. funk <- function(z) with(trf(z[, c("a", "b", "c")], x = thetas), data.frame(Ability = x, f)) data.plot <- ddply(data.pars, "year", funk) g1 <- ggplot(data.plot, aes(x = Ability, y = f)) + geom_vline(xintercept = cuts.grade3, linetype = 2, lwd = 1, color = "darkgrey") + geom_line(aes(color = year, linetype = year), lwd = 1) + scale_colour_manual(name = "Year", values = c("darkgoldenrod1", "darkblue")) + scale_linetype_manual(name = "Year", values = 2:1) + scale_y_continuous(name = "Expected score") + opts(title = "Test characteristic curves") g1  The test characteristic curves appear comparable up to the "exceeds proficiency standard" cut score, above which the test became more difficult compared to 2009.  ######################################## #Plot overlapping conditional standard errors of measurement (from test information). funk <- function(z) with(tif(z[, c("a", "b", "c")], x = thetas), data.frame(Ability = x, f)) data.plot <- ddply(data.pars, "year", funk) data.plot$f <- 1/sqrt(data.plot$f) g1 <- ggplot(data.plot, aes(x = Ability, y = f)) + geom_vline(xintercept = cuts.grade3, linetype = 2, lwd = 1, color = "darkgrey") + geom_line(aes(color = year, linetype = year), lwd = 1) + scale_colour_manual(name = "Year", values = c("darkgoldenrod1", "darkblue")) + scale_linetype_manual(name = "Year", values = 2:1) + scale_y_continuous(name = "CSEM") + opts(title = "Conditional standard errors of measurement") g1  The conditional standard errors of measurement appear comparable, overall, except for the 2010 standard error at the exceeds-proficiency cut, which appears to slightly exceed the 0.25 rule of thumb.  ######################################## #Plot overlapping CSEMs faceted by strand. funk <- function(z) with(tif(z[, c("a", "b", "c")], x = thetas), data.frame(Ability = x, f)) data.plot <- ddply(data.pars, c("year", "strand"), funk) data.plot$f <- 1/sqrt(data.plot$f) g1 <- ggplot(data.plot, aes(x = Ability, y = f)) + geom_vline(xintercept = cuts.grade3, linetype = 2, lwd = 1, color = "darkgrey") + geom_line(aes(color = year, linetype = year), lwd = 1) + scale_colour_manual(name = "Year", values = c("darkgoldenrod1", "darkblue")) + scale_linetype_manual(name = "Year", values = 2:1) + facet_wrap( ~ strand) + scale_y_continuous(name = "CSEM") + opts(title = "Conditional standard errors of measurement") g1  It looks like strand 3 could use a highly discriminating difficult item to bring its CSEM in line with the corresponding 2009 CSEM at the exceeds-proficiency cut.  ######################################## #Plot item characteristic curves faceted by strand. funk <- function(z) with(irf(z[, c("a", "b", "c")], x = thetas), data.frame(Ability = x, f)) data.plot <- ddply(data.pars, c("id", "year", "strand"), funk) g1 <- ggplot(data.plot, aes(x = Ability, y = f)) + geom_vline(xintercept = cuts.grade3, linetype = 2, lwd = 1, color = "darkgrey") + geom_line(aes(group = id:year, color = year, linetype = year)) + scale_colour_manual(name = "Year", values = c("darkgoldenrod1", "darkblue")) + scale_linetype_manual(name = "Year", values = 2:1) + facet_wrap( ~ strand) + scale_y_continuous(name = "Probability", breaks = seq(0, 1, by = 0.25), limits = c(0, 1)) + opts(title="Item characteristic curves") g1  Compared to 2009, there are fewer items from strands 1 and 2 that students are able to guess easily (i.e., high c parameters/lower asymptotes). As noted above, strand 3 lacks an item that discriminates at the exceeds-proficiency cut. Replacing the poorly discriminating difficult item in strand 2 with a higher discriminating one would also reduce measurement error of high-ability students. ## Plotting survey responses representing three or more dimensions | No Comments Bob Williams recently posted a question to EVALTALK about open source software for plotting survey responses in three dimensions: I'm wondering how to display Likert scale data in three dimensions (ie the Likert scores for each subject on 3 scale dimensions). My graphic software is not that sophisticated and I haven't been able to find an open source equivalent.... that can plot something along three dimensions (X,Y,Z) like this:  X, Y, Z Subject A 2, 5, 8 Subject B 3, 5, 9 Subject C 3, 7, 3 Another evaluator, Bill Harris, recommended and ggplot2 in particular. Bob's question and the mention of inspired me to write up a brief demonstration of plotting multi-dimensional data. As I have mentioned in the past (see here, here, and here), plotting or mapping three or more dimensions can be difficult and potentially misleading because plots and maps are inherently two-dimensional. I would say that plotting three survey response dimensions actually requires plotting four dimensions: X, Y, Z, and the frequency of each X, Y, and Z combination. That's because evaluators commonly have large data sets with repeated response patterns. I simulated a larger data set from Bob's example in order to illustrate plotting four dimensions. #Open a recording window (if using Windows). windows(record=T) #Manually read in the data. data.3D <- data.frame(Subject = c("A", "B", "C"), X = c(2, 3, 3), Y = c(5, 5, 7), Z = c(8, 9, 3)) data.3D #Larger data sets will contain frequently repeated combinations of responses. #Simulate a larger data set with repeated responses combinations. data.4D <- data.3D for(i in 1:200) { temp <- data.frame(Subject=factor(paste(as.character(data.3D[, 1]), i, sep=".")), X=round(jitter(data.3D[, 2], 7), 0), Y=round(jitter(data.3D[, 2], 7), 0), Z=round(jitter(data.3D[, 4], 7), 0)) data.4D <- rbind(data.4D, temp) } library(plyr) data.4D <- ddply(data.4D, c("X", "Y", "Z"), "nrow") names(data.4D)[which(names(data.4D)=="nrow")] <- "Frequency" data.4D <- data.4D[order(data.4D$Frequency, decreasing=T), ]

#Scatterplot matrices
pairs(data.3D[2:4], xlim=c(0, 10), ylim=c(0, 10),
main="Responses to three Likert scale items")
library(psych)
pairs.panels(sapply(data.4D[1:3], jitter), xlim=c(0, 10), ylim=c(0, 10), ellipses=F,
hist.col="blue", pch=1, scale=T, main="Responses to three Likert scale items")


#Load ggplot2 library; set theme without gray background.
library(ggplot2)
theme_set(theme_bw())

#Create a scatterplot with size and color varying by the third dimension.
plot1 <- ggplot(data=data.3D, aes(x=X, y=Y, color=Z, size=Z)) +
geom_point() +
xlim(0, 10) +
ylim(0, 10) +
scale_colour_gradient(breaks=0:10, limits=c(0, 10), low="lightblue", high="darkblue") +
scale_size_continuous(breaks=0:10, limits=c(0, 10)) +
opts(title="Responses to three Likert scale items")
plot1

#Create a jittered scatterplot with color varying by the third dimension
#and size by response combination frequency.
plot2 <- ggplot(data=data.4D, aes(x=X, y=Y, color=Z, size=Frequency)) +
geom_jitter() +
xlim(0, 10) +
ylim(0, 10) +
scale_colour_gradient(breaks=0:10, limits=c(0, 10), low="lightblue", high="darkblue") +
scale_size_continuous() +
opts(title="Responses to three Likert scale items")
plot2


#Reshape data from wide to longer.
data.3D.longer <- melt(data.3D, id.vars=c("Subject", "Y"), variable_name="Item")
names(data.3D.longer)[which(names(data.3D.longer)=="value")] <- "Response"
data.3D.longer

#Create a scatterplot with faceted second and third dimensions.
plot3 <- ggplot(data=data.3D.longer, aes(x=Response, y=Y)) +
geom_point(color="blue") +
xlim(0, 10) +
ylim(0, 10) +
facet_grid(. ~ Item) +
opts(title="Responses to three Likert scale items")
plot3

#Reshape simulated data from wide to longer.
data.4D.longer <- melt(data.4D, id.vars=c("Y", "Frequency"), variable_name="Item")
names(data.4D.longer)[which(names(data.4D.longer)=="value")] <- "Response"

#Create a scatterplot with faceted second and third dimensions
#and size varying by response combination frequency.
plot4 <- ggplot(data=data.4D.longer, aes(x=Response, y=Y, size=Frequency)) +
geom_jitter(color="blue") +
xlim(0, 10) +
ylim(0, 10) +
facet_grid(. ~ Item) +
opts(title="Responses to three Likert scale items")
plot4


## Linear mixed-effects regression p-values in R: A likelihood ratio test function

Following Douglas Bates' advice for those required to produce p-values for fixed-effects terms in a mixed-effects model, I wrote a function to perform a likelihood ratio test for each term in an lmer() object. Bates has championed the notion that calculating the p-values for fixed effects is not trivial. That's because with unbalanced, multilevel data, the denominator degrees of freedom used to penalize certainty are unknown (i.e., we're uncertain about how uncertain we should be). As the author of lme4, the foremost mixed-effects modeling package in , he has practiced what he preaches by declining to approximate denominator degrees of freedom as SAS does. Bates contends that alternative inferential approaches make p-values unnecessary. I agree with that position, focusing instead on information criteria and effect sizes. However, as a program evaluator, I recognize that some stakeholders find p-values useful for understanding findings.

A likelihood ratio test can be used to test $\inline H_0:\beta=0$ if the sample size is large. According to Fitzmaurice, Laird, and Ware, twice the difference between the maximized log-likelihoods of two nested models, $\inline G^2 = 2(\hat l_\text{full}-\hat l_\text{reduced})$, represents the degree to which the reduced model is inadequate. When the sample size is large, $\inline G^2 \sim \chi^2$ with degrees of freedom equal to the difference in parameters between the full and reduced models, $\inline m_\text{full} - m_\text{reduced}$. The p-value is $\inline \text{Pr} ( \chi^2 > G^2 | H_0)$.

What if the sample size is not large? A p-value based on $\inline \chi^2$ will be too liberal (i.e., the type I error rate will exceed the nominal p-value). More conservatively, we might say that $\inline G^2 \sim F$ with $\inline {m_\text{full} - m_\text{reduced}$ numerator and $\inline N_\text{effective}-m_\text{full}-1$ denominator degrees of freedom. According to Snijders and Bosker, the effective sample size lies somewhere between $\inline M$ total micro-observations (i.e., at level one) and $\inline N$ clusters randomly sampled in earlier stages (i.e., at higher levels). Formally, the effective sample size is $\inline N_\text{effective}=\tfrac{Nn}{1+(n-1)\rho}$, where $\inline n$ observations are nested within each cluster and intraclass correlation is $\inline \rho=\tfrac{\tau^2}{\tau^2+\sigma^2}$. Even if $\inline M=Nn$ is large, $\inline N_\text{effective}$ (and statistical power) could be quite small if $\inline N$ is small and $\inline \rho}$ is large: $\inline \tfrac{5*100}{1+(100-1)0.8}=6.2$. Unbalanced designs, modeling three or more levels, and cross-level interactions add to our uncertainty about the denominator degrees of freedom.

The function I wrote chews up the lmer() model call and concatenates the frame and model matrix slots, after which it iteratively fits (via maximum likelihood instead of restricted ML) models reduced by each fixed effect and compares them to the full model, yielding a vector of p-values based on $\inline \chi^2 (1)$. As the example shows, the function can handle shortcut formulas whereby lower order terms are implied by an interaction term. The function doesn't currently handle weights, glmer() objects, or on-the-fly transformations of the dependent variable [e.g., log(dep.var) ~ ...]. The accuracy of resulting p-values depends on large sample properties, as discussed above, so I don't recommend using the function with small samples. I'm working on another function that will calculate p-values based on the effective sample size estimated from intraclass correlation. I will post that function in a future entry. I'm sure the following function could be improved, but I wanted to go ahead share it with other applied researchers whose audience likes p-values. Please let me know if you see ways to make it better.

p.values.lmer <- function(x) {
summary.model <- summary(x)
data.lmer <- data.frame(model.matrix(x))
names(data.lmer) <- names(fixef(x))
names(data.lmer) <- gsub(pattern=":", x=names(data.lmer), replacement=".", fixed=T)
names(data.lmer) <- ifelse(names(data.lmer)=="(Intercept)", "Intercept", names(data.lmer))
string.call <- strsplit(x=as.character(x@call), split=" + (", fixed=T)
var.dep <- unlist(strsplit(x=unlist(string.call)[2], " ~ ", fixed=T))[1]
vars.fixef <- names(data.lmer)
formula.ranef <- paste("+ (", string.call[[2]][-1], sep="")
formula.ranef <- paste(formula.ranef, collapse=" ")
formula.full <- as.formula(paste(var.dep, "~ -1 +", paste(vars.fixef, collapse=" + "),
formula.ranef))
data.ranef <- data.frame(x@frame[,
which(names(x@frame) %in% names(ranef(x)))])
names(data.ranef) <- names(ranef(x))
data.lmer <- data.frame(x@frame[, 1], data.lmer, data.ranef)
names(data.lmer)[1] <- var.dep
out.full <- lmer(formula.full, data=data.lmer, REML=F)
p.value.LRT <- vector(length=length(vars.fixef))
for(i in 1:length(vars.fixef)) {
formula.reduced <- as.formula(paste(var.dep, "~ -1 +", paste(vars.fixef[-i],
collapse=" + "), formula.ranef))
out.reduced <- lmer(formula.reduced, data=data.lmer, REML=F)
print(paste("Reduced by:", vars.fixef[i]))
print(out.LRT <- data.frame(anova(out.full, out.reduced)))
p.value.LRT[i] <- round(out.LRT[2, 7], 3)
}
summary.model@coefs <- cbind(summary.model@coefs, p.value.LRT)
summary.model@methTitle <- c("\n", summary.model@methTitle,
"\n(p-values from comparing nested models fit by maximum likelihood)")
print(summary.model)
}

library(lme4)
library(SASmixed)
lmer.out <- lmer(strength ~ Program * Time + (Time|Subj), data=Weights)
p.values.lmer(lmer.out)


Yields:

Linear mixed model fit by REML
(p-values from comparing nested models fit by maximum likelihood)
Formula: strength ~ Program * Time + (Time | Subj)
Data: Weights
AIC  BIC logLik deviance REMLdev
1343 1383 -661.7     1313    1323
Random effects:
Groups   Name        Variance Std.Dev. Corr
Subj     (Intercept) 9.038486 3.00641
Time        0.031086 0.17631  -0.118
Residual             0.632957 0.79559
Number of obs: 399, groups: Subj, 57

Fixed effects:
Estimate Std. Error   t value p.value.LRT
(Intercept)     79.99018    0.68578 116.64000       0.000
ProgramRI        0.07009    1.02867   0.07000       0.944
ProgramWI        1.11526    0.95822   1.16000       0.235
Time            -0.02411    0.04286  -0.56000       0.564
ProgramRI:Time   0.12902    0.06429   2.01000       0.043
ProgramWI:Time   0.18397    0.05989   3.07000       0.002

Correlation of Fixed Effects:
(Intr) PrgrRI PrgrWI Time   PrRI:T
ProgramRI   -0.667
ProgramWI   -0.716  0.477
Time        -0.174  0.116  0.125
ProgrmRI:Tm  0.116 -0.174 -0.083 -0.667
ProgrmWI:Tm  0.125 -0.083 -0.174 -0.716  0.477


## Structural equation modeling for theory-driven evaluation

Just to show that I'm not the biggest slacker-blogger on the Web, I want to direct you to my guest-post on AEA365: A tip-a-day by and for evaluators. I chose my "tip" because I think structural equation modeling (SEM) and logic modeling would complement each other very well, but very few researchers have combined the two approaches. Those of us who use SEM know how important it is to have strong prior theory for model fit and valid conclusions. We could learn a lot from evaluators who are skilled at developing logic models. Conversely, theory-driven evaluators could improve their practice by carefully attending to statistical power, construct validity, attenuation due to measurement error, and the decomposition of total effects. I am very interested in hearing others' opinions on this issue, so please leave a comment here or at AEA365.

A logic model (left) operationalized as a partial mediation growth model (right)

## Regression discontinuity gallery: Spatiotemporal piecewise regression

I think I have settled on a dissertation topic: spatiotemporal piecewise regression evaluation. Spatiotemporal piecewise regression (SPR) refers to the analysis of longitudinal data from a spatial regression discontinuity (SRD) design with multiple pre-test observations. SRD offers a way to quasi-experimentally estimate local average treatment effects (LATEs) of geographically implemented programs or policies; SPR is a way to estimate change in LATEs over time. My dissertation will describe SPR methodology in detail, including validity threats, and demonstrate an SPR evaluation of an educational program. As discussed by Shadish, Cook, and Campbell, multiple pre-test observations will allow me to address validity questions.

I decided to conduct a preliminary SPR analysis of data from the educational program to determine if the topic would be feasible and to include some results in a conference paper proposal. As discussed in earlier posts (here, here, and here), can be used to plot regression discontinuity fitted lines, showing the LATE at the treatment assignment cutoff point. The ggplot2 package can be used to plot SPR fitted lines and change in LATEs over time, but it's not easy. One reason is that ggplot2 has a steep learning curve and limits control over the legend's appearance, although its curve is not as steep as lattice's. Another reason SPR plots are difficult to produce is that they represent several dimensions: north/south/east/west (reduced to one-dimensional distance), time, and the outcome. The SPR plots below show that participation in the educational program is associated with an initially positive, small LATE that diminishes over time.

Spatiotemporal piecewise regression fitted line plots

## Compressing KML files with R

Compressing Google Earth files with is not well documented on the Web, so I thought I should share my approach. I am working on my presentation for the upcoming Minnesota Assessment Conference. I will be presenting geographic maps of 2010 Minnesota assessment results. To minimize clutter in the maps and enhance the experience of stakeholders, I am creating two versions of each map: a static PDF map with no school district labels and a Web-based, interactive map. Each PDF will link to a keyhole markup language (KML) file through Google Maps, much like this proficiency map described here.

The KML file sizes exceed Google Maps' interface limit, so I had to find a way to compress them into KMZ files. I could compress each KML file using Google Earth or 7-Zip, but since I am already creating the maps with , I decided to find a way to compress the KML files via code. Gunzipping is easy in (e.g., see gzfile), but Google Maps will not accept gunzipped KML files. Thanks to Duncan Temple Lang, the zip() function in library(Rcompression) will create KMZ files with standard ZIP compression. Here's a reproducible example:

library(maptools)data(wrld_simpl)sw <- slot(wrld_simpl[wrld_simplNAME=="South Africa",], "polygons")[[1]]tf <- tempfile()kmlPolygon(sw, kmlfile=paste(tf, ".kml", sep=""), name="South Africa", col="#df0000aa", lwd=5, border=4, kmlname="KMZ test", kmldescription="<i>KMZ</i> file created with <a href='http://www.r-project.org'>R</a>.")install.packages("Rcompression", repos="http://www.omegahat.org/R")library(Rcompression)zip(zipfile=paste(tf, ".kmz", sep=""), files=paste(tf, ".kml", sep=""))paste(tf, ".kmz", sep="") #path and filename ## Exploring dimensions of process quality measured by the Early Childhood Environment Rating Scale: Differential item functioning analysis | No Comments The Early Childhood Environment Rating Scale (Revised; ECERS-R) was designed to measure process quality in child care centers, but the data set that I have been analyzing contains ratings of center classrooms and family child care (FCC) homes. A sizable portion of the program's budget was spent on training ECERS-R raters. Training them to reliably use both the ECERS-R and the Family Day Care Rating Scale (FDCRS) was cost prohibitive. Was it unfair to rate FCC homes with the ECERS-R instrument? I conducted a differential item functioning (DIF) analysis to answer that question and remedy any apparent biases. Borrowing a quasi-experimental matching technique (not for the purpose of inferring causality), I employed propensity score weighting to match center classrooms with FCC homes of similar quality. For each item j, classroom/home i's propensity score (i.e., predicted probability of being an FCC home) was estimated using the following logistic regression model: $\hat p_{ij} = \widehat {\text{Pr}} (FCC_i=1) = \text{logit}^{-1} (\hat \beta_0 + \hat \beta_1 Total_{cij})$, where FCC is a focal group dummy variable (1 if home; 0 if center) and Totalc stands for corrected total score (i.e., mean of item scores excluding item j). Classroom weights were calculated from the inverse of a site's predicted probability of its actual type and normalized so the weights sum to the original sample size: $weight_{ij}=\frac{n \left(\frac{FCC_i}{\hat p_{ij}}+\frac{1-FCC_i}{1-\hat p_{ij}}\right)} { \sum_{i=1}^{n} \left({\frac{FCC_i}{\hat p_{ij}} + \frac{1-FCC_i}{1-\hat p_{ij}}}\right)}$. Such weights can induce overlapping distributions (i.e., balanced groups). In this case, the weights were were applied via weighted least squares (WLS) regression of item scores on corrected total scores and the focal group dummy, as well as an interaction term to consider the possibility of nonuniform/crossing DIF: $Score_{ij} = \beta_0 + \beta_1 Total_{cij} + \beta_2 FCC_i + \beta_3 Total_{cij} FCC_i + \varepsilon_{ij}$. By minimizing $\inline \sum_{i=1}^n (weight_{ij} \cdot e_{ij}^2)$, WLS gave classrooms/homes with weights greater than 1 more influence over parameter estimates and decreased the influence of those with weights less than 1. The results suggest that three provisions for learning items function differentially and in a non-uniform manner, as indicated by statistically significant interactions (see the table below). None of the language/interaction items exhibited DIF. The regression coefficients suggest that differentially functioning ECERS items did not consistently favor one type of care over the other. After dropping the three differentially functioning items from the provisions for learning scale and re-running the DIF analysis, none of the remaining items exhibited differential functioning. Summary of WLS estimates: Differentially functioning ECERS items Estimate Std. Error t value Pr(> |t|) Item 8: Gross motor equipment Intercept 3.014 2.044 1.475 0.149 Corrected total 0.523 0.445 1.175 0.248 FCC home -5.234 2.278 -2.297 0.028 Corrected*FCC 1.161 0.498 2.332 0.026 Item 22: Blocks Intercept -3.990 1.919 -2.079 0.045 Corrected total 1.608 0.403 3.994 0.000 FCC home 3.929 2.076 1.892 0.067 Corrected*FCC -0.893 0.437 -2.045 0.049 Item 25: Nature/science Intercept -4.802 2.566 -1.872 0.070 Corrected total 1.833 0.540 3.393 0.002 FCC home 5.996 2.771 2.164 0.038 Corrected*FCC -1.443 0.584 -2.471 0.019 Plot of weighted observations and fitted lines: Three crossing DIF items and one fair item (Item 19: Fine motor activities) Scores from the final provisions for learning scale exhibited good reliability of α = 0.87. Scores from the language/interaction scale also exhibited good reliability of α = 0.86. None of the items detracted problematically from overall reliability (i.e., the most that α increased after dropping any item was 0.01 in one instance). I want to caution readers against generalizing the factor structure and DIF findings beyond this study because the sample is small in size (n = 38 classrooms/homes) and was drawn from a specific population (urban child care businesses subject to local market conditions and state-specific regulations). I largely avoided making statistical inferences in the preceding analyses by exploring the data and comparing relative fit, but in this case I used a significance level of 0.05 to infer DIF. The data do not provide a large amount of statistical power, so there could be more items that truly function differentially. For the final paper, I may apply a standardized effect size criterion. Standardized effect sizes are used to identify DIF with very large samples because the p-values would lead one to infer DIF for almost every item. De-emphasizing p-values in favor of effect sizes may be appropriate with small samples, too. ## Exploring dimensions of process quality measured by the Early Childhood Environment Rating Scale: Confirmatory model comparison | No Comments I followed up the exploratory factor analysis (EFA) of the Early Childhood Environment Rating Scale (ECERS-R) with a confirmatory model comparison of the six-, two-, and one-factor solutions. The six-factor specification adhered to the original subscales in the ECERS-R instrument (after dropping highly skewed items); the two-factor specification simply restricted the minor loadings from the EFA to zero; and the one-factor specification restricted items to zero if they loaded less than |0.3| in the one-factor EFA solution. The six-factor model did not converge, and the two-factor model exhibited better fit than the one-factor model. These results support earlier findings by Sakai and colleagues (2003) and Cassidy and colleagues (2005) that the ECERS-R measures two latent factors: provisions for learning and language/interaction experienced by children. The goal of this analysis was to find a factorially simple solution. Higher order or hierarchical models of greater complexity would be worth considering in light of evolving theories and needs surrounding measures of child care quality and given that: • several ECERS-R items cross-loaded in the EFAs and were subsequently excluded • the overall fit of the two-factor model was poor • the estimated correlation between the latent factors was somewhat large (0.68). Comparison of one- and two-factor models  CFA fit measure One factor Two factors Model χ2 750, df = 377, p < 0.01 306, df = 208, p < 0.01 χ2 (null model) 1139, df = 406 653, df = 231 Goodness-of-fit index 0.509 0.645 Adjusted goodness-of-fit index 0.433 0.569 RMSEA index 0.164 0.113 Bentler-Bonnett NFI 0.341 0.532 Tucker-Lewis NNFI 0.452 0.742 Bentler CFI 0.491 0.768 SRMR 0.115 0.107 BIC -621 -451 Measurement model (click image) ## Exploring dimensions of process quality measured by the Early Childhood Environment Rating Scale | No Comments The Early Childhood Environment Rating Scale, revised edition (ECERS-R; Harms, Clifford, & Cryer, 1998), is an instrument used widely to observe and rate levels of process quality in child care centers. The authors have divided the instrument into six subscales to help ensure broad and flexible coverage of various aspects of child care quality. (The ECERS-R actually contains seven subscales, but the last one pertains to parents and staff instead of process quality experienced by children.) Psychometric analyses suggest the ECERS-R actually measures two latent dimensions of quality at most. Perlman, Zellman, and Le (2004) concluded that process quality, as measured by the ECERS-R, is unidimensional. Sakai and colleagues (2003) and Cassidy and colleagues (2005) concluded that the ECERS-R measures two latent dimensions, although their factor loadings and interpretations differ across the two studies. I am writing a paper to present at the upcoming American Educational Research Association (AERA) conference in Denver. The paper will report results from a structural equation mediation model of the influence of on-site child care professional development on school readiness through child care quality. Given the lack of agreement among the earlier psychometric analyses of the ECERS-R, I conducted an exploratory factor analysis to see if I could replicate findings from one of the earlier studies. As shown in the table below, my preliminary results and interpretations do not align perfectly with either of the two-factor solutions from the earlier studies, but the similarities helped me decide which items to drop and which of my interpretations to keep. I plan to use a confirmatory factor analysis to formally compare model fit between the one- and two-factor solution before estimating the mediation model. I hope the summary below will help others who are wrestling with the possibility of multiple dimensions of child care quality as measured by the ECERS-R. Summary of ECERS-R two-factor solutions from three studies Item numberItemSubscaleSakai and colleagues (2003)Cassidy and colleagues (2005)Moore (2010)Decision 1Indoor spaceSpace and furnishingsProvisions for learning Drop 2Furniture for care, play, and learningSpace and furnishingsTeaching and interactions Provisions for learningDiscrepancy 3Furnishings for relaxationSpace and furnishingsTeaching and interactionsMaterials/activities Discrepancy 4Room arrangementSpace and furnishingsProvisions for learning Drop 5Space for privacySpace and furnishings Materials/activitiesProvisions for learningKeep 6Space for gross motorSpace and furnishings Provisions for learningDiscrepancy 7Child-related displaySpace and furnishingsTeaching and interactions Drop 8Gross motor equipmentSpace and furnishingsProvisions for learning Provisions for learningKeep 9Greeting/departingPersonal care routinesTeaching and interactions Drop 10Meals/ snacksPersonal care routinesProvisions for learning Drop 11Nap/restPersonal care routines Provisions for learningDiscrepancy 12Toileting/diaperingPersonal care routinesProvisions for learning Drop 13Health practicesPersonal care routinesProvisions for learning Drop 14Safety practicesPersonal care routinesProvisions for learning Drop 15Books and picturesLanguage-reasoningProvisions for learningMaterials/activitiesProvisions for learningKeep 16Encouraging children to communicateLanguage-reasoning Provisions for learningDiscrepancy 17Using language to develop reasoning skillsLanguage-reasoningTeaching and interactionsLanguage/interactionLanguage/interactionKeep 18Informal use of languageLanguage-reasoningTeaching and interactionsLanguage/interactionLanguage/interactionKeep 19Fine motorActivitiesTeaching and interactionsMaterials/activitiesProvisions for learningKeep 20ArtActivities Materials/activities Drop 21Music/movementActivitiesTeaching and interactions Language/interactionKeep 22BlocksActivitiesProvisions for learningMaterials/activitiesProvisions for learningKeep 23Sand/waterActivitiesProvisions for learning Drop 24Dramatic playActivitiesProvisions for learningMaterials/activitiesProvisions for learningKeep 25Nature/scienceActivities Materials/activitiesProvisions for learningKeep 26Math/numbersActivitiesTeaching and interactionsMaterials/activitiesProvisions for learningKeep 27Use of TV, video, and/or computersActivities Language/interactionDiscrepancy 28Promoting acceptance of diversityActivitiesProvisions for learning Provisions for learningKeep 29Supervision of gross motor activitiesInteraction Language/interactionDiscrepancy 30General supervision of childrenInteractionProvisions for learningLanguage/interaction Discrepancy 31DisciplineInteraction Language/interactionLanguage/interactionKeep 32Staff-child interactionsInteractionProvisions for learningLanguage/interactionLanguage/interactionKeep 33Interactions among childrenInteractionTeaching and interactionsLanguage/interactionLanguage/interactionKeep 34ScheduleProgram structureProvisions for learning Drop 35Free playProgram structureTeaching and interactions Drop 36Group timeProgram structureProvisions for learningLanguage/interactionLanguage/interactionKeep Note: Empty cells indicate loadings less than |0.3|, cross-loadings greater than |0.3|, or items skewed greater than |2|. Item 37, which asks about provisions for children with disabilities was excluded due to high missingness. ## Multilevel Rasch: Estimation with R | 2 Comments My last post described how test development designs could be thought of as multistage sampling designs and how multilevel Rasch could be used to estimate parameters under that assumption. I decided to use to compare estimates from the usual Rasch model, the generalized linear model (GLM) with a logit link, and generalized linear mixed-effects regression (GLMER), treating items as fixed then random effects. The GLMER model with item fixed effects exhibited the best fit of the Law School Admission Test (LSAT) data provided with the ltm package. Moreover, intraclass correlation has reduced the effective sample size relative to simple random sampling, lending further support to the multilevel Rasch approach.  #Load libraries. library(ltm) library(lme4) library(xtable) #Prepare data for GLM. LSAT.long <- reshape(LSAT, times=1:5, timevar="Item", varying=list(1:5), direction="long") names(LSAT.long) <- c("Item", "Score", "ID") LSAT.longItem <- as.factor(LSAT.long$Item) LSAT.long$ID <- as.factor(LSAT.long$ID) #Compute Rasch, GLM, and GLMER estimates and compare fit. out.rasch <- rasch(LSAT, constraint=cbind(ncol(LSAT)+1, 1)) print(xtable(summary(out.rasch)$coefficients, digits=3), type="html")

value std.err z.vals
Dffclt.Item1 -2.872 0.129 -22.307
Dffclt.Item2 -1.063 0.082 -12.946
Dffclt.Item3 -0.258 0.077 -3.363
Dffclt.Item4 -1.388 0.086 -16.048
Dffclt.Item5 -2.219 0.105 -21.166
Dscrmn 1.000

 out.glm <- glm(Score ~ Item, LSAT.long, family="binomial") print(xtable(summary(out.glm)$coefficients, digits=3), type="html") Estimate Std. Error z value Pr(> |z|) (Intercept) 2.498 0.119 20.933 0.000 Item2 -1.607 0.138 -11.635 0.000 Item3 -2.285 0.135 -16.899 0.000 Item4 -1.329 0.141 -9.450 0.000 Item5 -0.597 0.152 -3.930 0.000 out.glmer <- glmer(Score ~ Item + (1|ID), LSAT.long, family="binomial") print(xtable(summary(out.glmer)@coefs, digits=3, caption="Fixed effects"), type="html", caption.placement="top") print(xtable(summary(out.glmer)@REmat, digits=3, caption="Random effects"), type="html", caption.placement="top", include.rownames=F) Fixed effects Estimate Std. Error z value Pr(> |z|) (Intercept) 2.705 0.129 21.029 0.000 Item2 -1.711 0.145 -11.774 0.000 Item3 -2.467 0.142 -17.353 0.000 Item4 -1.406 0.148 -9.498 0.000 Item5 -0.623 0.160 -3.880 0.000 Random effects Groups Name Variance Std.Dev. ID (Intercept) 0.502 0.70852  out.glmer.re <- glmer(Score ~ (1|Item) + (1|ID), LSAT.long, family="binomial") print(xtable(summary(out.glmer.re)@coefs, digits=3, caption="Fixed effects"), type="html", caption.placement="top") print(xtable(summary(out.glmer.re)@REmat[,-6], digits=3, caption="Random effects"), type="html", caption.placement="top", include.rownames=F) Fixed effects Estimate Std. Error z value Pr(> |z|) (Intercept) 1.448 0.379 3.818 0.000 Random effects Groups Name Variance Std.Dev. ID (Intercept) 0.45193 0.67226 Item (Intercept) 0.70968 0.84243 print(xtable(AICs.Rasch.GLM <- data.frame(Rasch=summary(out.rasch)$AIC, GLM=summary(out.glm)$aic, GLMER.fe=summary(out.glmer)@AICtab$AIC, GLMER.re=summary(out.glmer.re)@AICtab$AIC, row.names="AIC"), caption="Akaike's information criteria (AICs)"), type="html", caption.placement="top") Akaike's information criteria (AICs) Rasch GLM GLMER.fe GLMER.re AIC 4956.11 4996.87 4950.80 4977.25  #Easiness estimates. easiness <- data.frame(Rasch=out.rasch$coefficients[,1], GLM=c(out.glm$coefficients[1], out.glm$coefficients[1]+out.glm$coefficients[-1]), GLMER.fe=c(fixef(out.glmer)[1], fixef(out.glmer)[1]+fixef(out.glmer)[-1]), GLMER.re=fixef(out.glmer.re)+unlist(ranef(out.glmer.re)$Item)) print(xtable(easiness, digits=3), type="html")

Rasch GLM GLMER.fe GLMER.re
Item 1 2.872 2.498 2.705 2.527
Item 2 1.063 0.891 0.994 0.917
Item 3 0.258 0.213 0.237 0.224
Item 4 1.388 1.169 1.299 1.201
Item 5 2.219 1.901 2.082 1.938

 #Estimated probabilities of a correct response. pr.correct <- sapply(easiness, plogis) row.names(pr.correct) <- row.names(easiness) print(xtable(pr.correct, digits=3), type="html")

Rasch GLM GLMER.fe GLMER.re
Item 1 0.946 0.924 0.937 0.926
Item 2 0.743 0.709 0.730 0.714
Item 3 0.564 0.553 0.559 0.556
Item 4 0.800 0.763 0.786 0.769
Item 5 0.902 0.870 0.889 0.874

 #Difficulty estimates. difficulties <- easiness*-1 print(xtable(difficulties, digits=3), type="html")

Rasch GLM GLMER.fe GLMER.re
Item 1 -2.872 -2.498 -2.705 -2.527
Item 2 -1.063 -0.891 -0.994 -0.917
Item 3 -0.258 -0.213 -0.237 -0.224
Item 4 -1.388 -1.169 -1.299 -1.201
Item 5 -2.219 -1.901 -2.082 -1.938

 #Calculate design effects and effective samples sizes from intraclass correlation coefficients and sampling unit sizes. multistage.consequences <- function(ICC, N) { M <- nrow(LSAT.long) n <- M/N #number of responses per item deff <- 1+(n-1)*ICC M.effective <- trunc(M/deff) return(data.frame(ICC, M, N, n, deff, M.effective)) } model.ICC.Item <- glmer(Score ~ 1 + (1|Item), family=binomial, data=LSAT.long) ICC.Item <- as.numeric(VarCorr(model.ICC.Item)$Item)/(as.numeric(VarCorr(model.ICC.Item)$Item)+pi^2/3) multistage.Item <- multistage.consequences(ICC.Item, 5) model.ICC.ID <- glmer(Score ~ 1 + (1|ID), family=binomial, data=LSAT.long) ICC.ID <- as.numeric(VarCorr(model.ICC.ID)$ID)/(as.numeric(VarCorr(model.ICC.ID)$ID)+pi^2/3) multistage.ID <- multistage.consequences(ICC.ID, 1000) multistage <- data.frame(cbind(t(multistage.Item), t(multistage.ID))) names(multistage) <- c("Item", "ID") print(xtable(multistage, digits=3), type="html")

Item ID
ICC 0.159 0.070
M 5000.000 5000.000
N 5.000 1000.000
n 1000.000 5.000
deff 160.295 1.281
M.effective 31.000 3903.000

 #Plot test characteristic curves plot.tcc <- function(difficulties, add, col, lty) { thetas <- seq(-3.8,3.8,length=100) temp <- matrix(nrow=length(difficulties), ncol=length(thetas)) for(i in 1:length(thetas)) {for(theta in thetas) temp[,which(thetas==theta)] <- plogis(theta-difficulties)} if(missing(add)) {plot(thetas, colSums(temp), ylim=c(0,5), col=NULL, ylab=expression(tau), xlab=expression(theta), main="Test characteristic curves")} lines(thetas, colSums(temp), col=col, lty=lty) } plot.tcc(difficulties=difficulties[,1], col="black", lty=1) plot.tcc(difficulties=difficulties[,2], add=T, col="blue", lty=2) plot.tcc(difficulties=difficulties[,3], add=T, col="red", lty=3) plot.tcc(difficulties=difficulties[,4], add=T, col="green", lty=4) legend("bottomright", c("Rasch", "GLM", "GLMER, fixed items", "GLMER, random items "), col=c("black", "blue", "red", "green"), lty=1:4)

## Multilevel Rasch

After years of studying and applying quantitative social science research methods, it has become easier to catch glimpses of larger connections between seemingly independent methodological approaches. I have learned, for example, that survey development has a lot in common with test development. I have also learned that longitudinal data analysis is like spatial/geographic data analysis in that both attend to reference-based dependencies (i.e., one dimensional time and two-or-more dimensional space).

Many methods seem independent because they are taught in isolation as topics courses (e.g., survey methods) or only within disciplines where they are traditionally applied (e.g., social network analysis within sociology). I credit my professors with teaching me to see the larger connections. I also credit the University of Minnesota with offering interdisciplinary courses, such as Latent Variable Measurement Models and Path Analysis taught by Melanie Wall in Biostatistics--a course traditionally taught by psychologists.

Item response theory (IRT) scaling and multilevel modeling are two other quantitative methods that have more in common than they seem at first glance. The Rasch model is usually expressed as

$\text{Pr}(\theta_s) = \tfrac{1}{1+e^{-(\theta_s-b_i)}}$,

where $\inline \theta_s$ represents a test taker's latent ability, and $\inline b_i$ represents the difficulty of an item. The Rasch model can also be expressed as a generalized linear model (GLM) with a logit link:

$\text{ln}\left(\tfrac{\text{Pr}(Y_{is}=1)}{\text{Pr}(Y_{is}=0)}\right)=\beta_i+\theta_s$,

where $\inline Y_{is}$ is a dummy variable (1 if a correct response, 0 otherwise) and $\inline \beta_i$ is the easiness of an item.

Test development designs can be thought of as multistage sampling designs in which test takers represent primary sampling units and item responses represent secondary sampling units nested within test takers. Items can also be thought of as primary sampling units in which responses are nested. If item responses exhibit sizable intraclass correlation, then mixed-effects modeling may be appropriate to account for loss of statistical power relative to simple random sampling. Mixed-effects models may also help with vertical scaling and identifying differential item functioning (DIF) by considering fixed effects for test taker age and group membership. Some authors treat items as fixed effects; while others treat them as random effects. In the latter case, the Rasch model can be expressed as a two-level mixed model with crossed random effects:

\begin{align*} \small{\text{Level 1: Item responses}}\\ \text{ln}\left(\tfrac{\text{Pr}(Y_{mis}=1)}{\text{Pr}(Y_{mis}=0)}\right) &= \beta_{0is} + e_{mis}\\ \small{\text{Level 2: Items and test takers}}\\ \beta_{0is} &= \beta_{00} + r_{0i} + r_{0s} \end{align*}.