# July 2011 Archives

## Best breweries in Minnesota: Value-added ratings

| 1 Comment

A good friend of mine told me to check out Beer Advocate online. Their rating system does a good job of balancing simplicity (A-F grades) with thoroughness (measures of central tendency and dispersion). I also appreciate the way it encourages raters and readers to consider multiple dimensions of a beer's quality--its look, smell, taste, and feel--in addition to overall quality.

As a practitioner of multilevel and latent variable modeling, one aspect of Beer Advocate's rating system has given me pause: a lack of adjustment for consistently high- or low-raters. What's concerning about a simple average of ratings? Let's imagine a plausible scenario: a motivated group of beer aficionados with high standards get their hands on a limited-release beer and publish their reviews on Beer Advocate. The beer could be world class, but if only tough raters have tried it, then the average rating will appear lower. Conversely, a large number of uncritical raters could review a mediocre beer, resulting in an excellent rating on average.

Is my concern warranted? If so, which beers and breweries are rated highest after adjusting for raters and other factors? To answer those questions, I wrote some code to gather Beer Advocate ratings of beers produced by Minnesota breweries. The code relies heavily on the XML and stringr packages. After gathering the ratings, I decided to check my assumption of within-rater consistency. The intraclass correlation coefficient of r = 0.13 indicates a small degree of consistency within raters, but enough to justify my concern.

I specified a multilevel model to obtain value-added ratings of beers and breweries while adjusting for rater and beer characteristics. For simplicity, I decided to focus on the overall rating, although I would like to consider multiple dimensions of beer quality (look, smell, taste, and feel) in future analyses. The model specifies random effects for raters, cross-classified with beers nested in breweries. The fixed effects were the serving type experienced by the rater for a given beer (e.g., can), the beer's alcohol by volume (ABV), and its style (e.g., porter).

\begin{align*} \small{\text{Level 1: Ratings}}\\ \text{ln}\left(\tfrac{(Overall_{ijkl}-1)\div 4}{1-(Overall_{ijkl}-1)\div 4}\right) &= \beta_{0jkl} + \mathbf{Served}_{ij} \boldsymbol{\beta}_1 + e_{ijkl}\\ \small{\text{Level 2: Raters and beers}}\\ \beta_{0jkl} &= \beta_{00l} + \beta_{01} ABV_k + \mathbf{Style}_k \boldsymbol{\beta}_{02} + r_{0j} + r_{0kl}\\ \small{\text{Level 3: Breweries}}\\ \beta_{00l} &= \beta_{000} + r_{00l} \end{align*}.

Note that I transformed the overall rating from a 1-5 scale to logits. I did so to avoid predictions below the floor of 1 point and above the ceiling of 5 points on the original scale (i.e., to enforce lower and upper asymptotes) and to spread the ratings out at the tails (e.g., make it "harder" to go from 4.5 to 5 than going from 3.5 to 4). The chart below shows the resulting ogival relationship between the original point system and the logits.

Plot showing transformation from points to logits

I estimated the model parameters with the lme4 package. Given the large number of fixed effects (and the need for future blog content), I'll describe the influence of serving type, ABV, and style at a later date. The value added by breweries after adjusting for rater and beer characteristics is shown in the table below. Value-added logits were transformed back to the original scale, with values in the table representing point deviations from the brewery average, but the logit and point ratings are virtually the same. August Schell Brewing Company recently won national accolades, but it came in second to a relative newcomer, Fulton Beer. Summit Brewing and Surly Brewing followed closely in the third and fourth spots. The results show some discrepancies between the value-added ratings and Beer Advocate's grades, which are based on the simple average of all beer ratings. For example, Harriet Brewing received an A-, compared to Schell's B grade, but Harriet's value-added rating is below average.

(average beer rating)
Fulton Beer 0.85 0.80 A-
August Schell Brewing Co., Inc. 0.83 0.79 B
Summit Brewing Company 0.80 0.76 B+
Surly Brewing Company 0.71 0.68 A-
Minneapolis Town Hall Brewery 0.27 0.27 A-
Lake Superior Brewing Company 0.18 0.18 B
Fitger's Brewhouse 0.10 0.10 B+
Olvalde Farm & Brewing Company 0.08 0.08 A-
Granite City Food & Brewery 0.04 0.04 B-
Flat Earth Brewing Company 0.03 0.03 B+
Brau Brothers Brewing Co., LLC 0.02 0.02 B
Pig's Eye Brewing Company 0.00 0.00 D+
Harriet Brewing -0.09 -0.09 A-
Lift Bridge Brewery -0.11 -0.11 B+
Leech Lake Brewing Company -0.14 -0.14 B+
Barley John's Brew Pub -0.26 -0.26 B+
McCann's Food & Brew -0.27 -0.26 B+
St. Croix Brewing Company, LLC -0.69 -0.66 C
Cold Spring Brewing Co. -0.99 -0.91 D+
Great Waters Brewing -1.37 -1.19 B+

A different picture emerges when one considers that value-added ratings are point estimates within a range of certainty. The prediction-interval plot below shows that we can't confidently conclude that Fulton Beer scored above average, let alone out performed Shell's, Summit, or Surly. We can confidently say, though, that Shell's, Summit, and Surly scored above average, and along with Minneapolis Town Hall Brewery, they significantly out-performed Cold Spring Brewing Co. and Great Waters Brewing.

Plot of breweries' value-added ratings with 95% prediction intervals

The results highlight some of the benefits of multilevel modeling and the limits of value-added ratings. By controlling for beer characteristics and separately estimating variance components for raters and beers nested in breweries, I have reduced the degree to which raters' tastes and brewers' preferences for certain types of beers are confounding brewery ratings. That is, the value-added ratings reflect the brewery's overall quality with greater fidelity. Nevertheless, we should keep in mind that value-added ratings are based on error unexplained by theory/fixed effects, that they are point estimates within a range of certainty, and that one should interpret value-added ratings with caution because personal preferences still count. This analysis has motivated me to try beers by the breweries with high value-added ratings, but I know from experience that I like beers brewed by Brau Brothers, Lift Bridge, and Barley John's, even though their value-added point estimates place them in the bottom half of Minnesota breweries.

## Using irtoys, plyr, and ggplot2 for test development

My job as a Quantitative Analyst at the Minnesota Department of Education requires me to "provide evaluation, statistical, and psychometric support to the Division of Research and Assessment through the entire process of design, development, and reporting to the Minnesota assessment system." That's an accurate job description, but a little vague. What I really do is write a lot of code to apply item response theory (IRT) and compare item and test characteristics across multiple years and/or forms to maintain high quality.

is great for plotting item and test characteristics across multiple years and/or forms, but it requires using a few packages in concert (i.e., to my knowledge there is no single package designed for this purpose). The irtoys package provides many useful functions for test development under IRT; the plyr package allows one to apply irtoys' functions to subsets of item parameters in a data set (e.g., by strand); and the ggplot2 package allows one to overlay and facet item and test curves.

In the following reproducible example, I graphically compare item and test characteristics of the 2009 and 2010 grade 3 Minnesota Comprehensive Assessments of reading ability. Note that for simplicity, I have excluded two polytomous items from 2009 and a few dichotomous items from 2010 from the published item parameter estimates. Also note that items have been assigned to strands arbitrarily in this example. Graphs such as these allow us to answer questions like, "Did the 2010 form yield as much information (i.e., measure with equal or less error) at the cut scores as the 2009 form?" The graphs also provide guidance for replacing undesirable items.


########################################
#Combine two years' worth of item parameter estimates from grade 3.

data.pars.2009 <- matrix(ncol = 3, byrow = T, c(0.57, -1.24, 0.17, 0.93, -0.74, 0.25,
1.45, -0.45, 0.19, 0.97, -0.5, 0.13, 0.95, -0.8, 0.62, 1.07, -0.38, 0.24, 0.87,
0.37, 0.22, 0.61, -1.45, 0.02, 0.84, -1.03, 0.19, 0.72, -0.23, 0.12, 0.93, -1.6,
0.25, 0.68, -2.02, 0.06, 1.1, -1.5, 0.36, 0.74, -1.6, 0.51, 1.01, -1.01, 0.21,
0.81, -1.52, 0.16, 0.93, -0.5, 0.11, 0.34, -0.75, 0.02, 0.92, -0.93, 0.19, 1.14,
0.21, 0.26, 0.54, -0.59, 0.1, 0.86, -0.52, 0.28, 1.04, -1.77, 0.05, 0.84, -1.68,
0.02, 1.46, -1.7, 0.17, 1.3, -1.16, 0.21, 0.51, -0.62, 0.1, 1.09, -1.04, 0.17,
0.66, -2.06, 0.09, 1.11, -0.61, 0.11, 1.34, -0.9, 0.26, 1.3, -1.24, 0.19, 1.37,
-1.72, 0.24, 1.09, -1.37, 0.16, 0.89, -0.94, 0.08, 1.24, -1.38, 0.14, 0.79, -1.32,
0.11, 1.09, -1.69, 0.13))
data.pars.2009 <- data.frame(2009, data.pars.2009)
names(data.pars.2009) <- c("year", "a", "b", "c")
data.pars.2009$strand <- factor(rep(1:4, length.out = 38), labels = paste("Strand", 1:4)) data.pars.2010 <- matrix(ncol = 3, byrow = T, c(1, -2.02, 0.05, 0.61, -0.88, 0.24, 0.74, -1.36, 0.08, 0.93, 0.63, 0.25, 0.94, 0.38, 0.23, 0.56, -0.14, 0.32, 0.7, -0.73, 0.08, 1.1, -2.54, 0.02, 1.17, -0.36, 0.22, 0.46, 0.76, 0.09, 0.77, -1.53, 0.09, 1.5, -1.27, 0.18, 1.01, -1.41, 0.1, 1.63, -1.61, 0.21, 1.33, -1.63, 0.13, 1.11, -1.01, 0.19, 1.05, -1.17, 0.07, 0.88, -0.33, 0.23, 0.56, -0.65, 0.28, 1.03, 0.45, 0.13, 1.29, -1.73, 0.12, 0.96, -1.54, 0.15, 0.62, -1.33, 0.09, 0.67, -1.41, 0.06, 0.74, 0.16, 0.3, 1.33, -0.94, 0.26, 1.31, -1.71, 0.23, 1.14, -1.42, 0.16, 0.96, -0.87, 0.09, 1.22, -1.36, 0.14, 0.86, -1.19, 0.17, 0.94, -2.48, 0.02, 0.81, -1.96, 0.02, 0.98, -0.86, 0.13, 0.8, -0.68, 0.11, 0.72, -1.67, 0.04, 0.63, 0.32, 0.16, 1.24, -1.64, 0.33)) data.pars.2010 <- data.frame(2010, data.pars.2010) names(data.pars.2010) <- c("year", "a", "b", "c") data.pars.2010$strand <- factor(rep(1:4, length.out = 38), labels = paste("Strand", 1:4))

data.pars <- rbind(data.pars.2009, data.pars.2010)
data.pars$year <- factor(data.pars$year)
data.pars$id <- factor(1:nrow(data.pars)) #Apply D scaling constant. data.pars$a <- data.pars$a * 1.7 #Theta values for plotting thetas <- seq(-3, 3, by = 0.1) #Theta cut scores cuts.grade3 <- c(-1.40, -0.84, -0.01) ######################################## #Plot overlapping test characteristic curves. funk <- function(z) with(trf(z[, c("a", "b", "c")], x = thetas), data.frame(Ability = x, f)) data.plot <- ddply(data.pars, "year", funk) g1 <- ggplot(data.plot, aes(x = Ability, y = f)) + geom_vline(xintercept = cuts.grade3, linetype = 2, lwd = 1, color = "darkgrey") + geom_line(aes(color = year, linetype = year), lwd = 1) + scale_colour_manual(name = "Year", values = c("darkgoldenrod1", "darkblue")) + scale_linetype_manual(name = "Year", values = 2:1) + scale_y_continuous(name = "Expected score") + opts(title = "Test characteristic curves") g1  The test characteristic curves appear comparable up to the "exceeds proficiency standard" cut score, above which the test became more difficult compared to 2009.  ######################################## #Plot overlapping conditional standard errors of measurement (from test information). funk <- function(z) with(tif(z[, c("a", "b", "c")], x = thetas), data.frame(Ability = x, f)) data.plot <- ddply(data.pars, "year", funk) data.plot$f <- 1/sqrt(data.plot$f) g1 <- ggplot(data.plot, aes(x = Ability, y = f)) + geom_vline(xintercept = cuts.grade3, linetype = 2, lwd = 1, color = "darkgrey") + geom_line(aes(color = year, linetype = year), lwd = 1) + scale_colour_manual(name = "Year", values = c("darkgoldenrod1", "darkblue")) + scale_linetype_manual(name = "Year", values = 2:1) + scale_y_continuous(name = "CSEM") + opts(title = "Conditional standard errors of measurement") g1  The conditional standard errors of measurement appear comparable, overall, except for the 2010 standard error at the exceeds-proficiency cut, which appears to slightly exceed the 0.25 rule of thumb.  ######################################## #Plot overlapping CSEMs faceted by strand. funk <- function(z) with(tif(z[, c("a", "b", "c")], x = thetas), data.frame(Ability = x, f)) data.plot <- ddply(data.pars, c("year", "strand"), funk) data.plot$f <- 1/sqrt(data.plot\$f)

g1 <- ggplot(data.plot, aes(x = Ability, y = f)) +
geom_vline(xintercept = cuts.grade3, linetype = 2, lwd = 1, color = "darkgrey") +
geom_line(aes(color = year, linetype = year), lwd = 1) +
scale_colour_manual(name = "Year", values = c("darkgoldenrod1", "darkblue")) +
scale_linetype_manual(name = "Year", values = 2:1) +
facet_wrap( ~ strand) +
scale_y_continuous(name = "CSEM") +
opts(title = "Conditional standard errors of measurement")
g1



It looks like strand 3 could use a highly discriminating difficult item to bring its CSEM in line with the corresponding 2009 CSEM at the exceeds-proficiency cut.


########################################
#Plot item characteristic curves faceted by strand.

funk <- function(z) with(irf(z[, c("a", "b", "c")], x = thetas), data.frame(Ability = x, f))
data.plot <- ddply(data.pars, c("id", "year", "strand"), funk)

g1 <- ggplot(data.plot, aes(x = Ability, y = f)) +
geom_vline(xintercept = cuts.grade3, linetype = 2, lwd = 1, color = "darkgrey") +
geom_line(aes(group = id:year, color = year, linetype = year)) +
scale_colour_manual(name = "Year", values = c("darkgoldenrod1", "darkblue")) +
scale_linetype_manual(name = "Year", values = 2:1) +
facet_wrap( ~ strand) +
scale_y_continuous(name = "Probability", breaks = seq(0, 1, by = 0.25), limits = c(0, 1)) +
opts(title="Item characteristic curves")
g1



Compared to 2009, there are fewer items from strands 1 and 2 that students are able to guess easily (i.e., high c parameters/lower asymptotes). As noted above, strand 3 lacks an item that discriminates at the exceeds-proficiency cut. Replacing the poorly discriminating difficult item in strand 2 with a higher discriminating one would also reduce measurement error of high-ability students.