What's not working in Stat 8051-52


If we remodel and restructure our applied Statistics courses for our MS program (and perhaps first-year PhDs), what kind of changes would you like to see?

Let's focus on Stat 8051 and 8052 for now. Here's my criticism of these courses as they are now:

  1. We have too much classical linear regression. There is nothing fundamentally wrong with that, except that many (most? all?) students now have a decent idea of what regression is before coming into our program. Most have had some exposure to regression methodology such as ordinary least squares, some software, some theory, and some experience with data analysis. So, do we need to begin the course where we currently do, and go as slowly as we do?
  2. Who's afraid of experimental design? We have, essentially, a full semester-long 4-credit course on very classical experimental design. Do we need all of that? How much of modern statistical research is on design of experiments, and why should it take precedence over every other applied statistical topic? Also, how much of the course is actually applicable in the modern world?
  3. Too little new and exciting material: a semester of regression and a semester of experimental design leave little or no room for any other topic. Is this a problem?


I was frequently asked about computational issues in linear regression on large datasets, such as what to do when the data cannot fit in main memory, or how online learning algorithms apply to OLS.

So, I was hoping the regression course could cover large-scale learning issues, since we are in the era of "big data"...
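To make the "data doesn't fit in memory" question concrete, here is a minimal sketch of one common idea: OLS only needs a handful of running sums, so the data can be read one chunk at a time and thrown away. This is an illustration for a single predictor, not a recommendation of a particular method; the function name and the tiny in-memory "chunks" are made up for the example, and in practice each chunk would come from a file or database.

```python
from typing import Iterable, List, Tuple

# Hypothetical stand-in for one chunk read from disk: a list of (x, y) pairs.
Chunk = List[Tuple[float, float]]


def streaming_simple_ols(chunks: Iterable[Chunk]) -> Tuple[float, float]:
    """Fit y = intercept + slope * x in a single pass over the chunks.

    Only five running totals are kept, so memory use does not grow
    with the number of observations.
    """
    n = 0
    sum_x = sum_y = sum_xx = sum_xy = 0.0
    for chunk in chunks:
        for x, y in chunk:
            n += 1
            sum_x += x
            sum_y += y
            sum_xx += x * x
            sum_xy += x * y

    # Centered sums of squares and cross-products from the running totals.
    sxx = sum_xx - sum_x * sum_x / n
    sxy = sum_xy - sum_x * sum_y / n

    slope = sxy / sxx
    intercept = (sum_y - slope * sum_x) / n
    return intercept, slope


# Toy data lying exactly on y = 1 + 2x, split across two chunks.
chunks = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
]
intercept, slope = streaming_simple_ols(chunks)
print(intercept, slope)
```

With several predictors the same idea applies: accumulate X'X and X'y chunk by chunk and solve the normal equations at the end. Raw running sums can lose precision on badly scaled data, which is exactly the sort of computational issue a course could discuss.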

Can we introduce more of today's industrial applications of experimental design? I know it is still very useful, for example in designing ad campaigns and running A/B tests...
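For readers who have not seen one, a basic A/B test is often analyzed as a comparison of two proportions. The sketch below uses a pooled two-proportion z-test with a normal approximation; the function name and the visitor and conversion counts are invented for illustration, and real campaigns usually need more care (sample size planning, multiple comparisons, sequential peeking).

```python
import math
from typing import Tuple


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z statistic, two-sided p-value) for H0: rate_a == rate_b."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: version A converts 100 of 1000 visitors, version B 130 of 1000.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```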

One bad thing I found was that these two courses emphasize knowing and remembering formulas for regression, ANOVA, etc. This is also tested in the in-class exams. Frankly, this is not very useful: the derivations of those formulas are mostly not covered in detail, so people who really want to do research on them do not have enough background to start with. Also, all those things can be done with software, and testing whether someone can do them by hand is not that important. This reinforces the author's third point: by focusing on these trivial things, we are missing a lot of new material.

Suggested extra topics: partial least squares, nonparametric regression, quantile regression, ordinal regression, high-dimensional data, asymptotic behavior of model selection criteria, penalized regression methods, etc. All of these are related to linear models, and I think they can be incorporated into these two semester-long courses. The 53-54 series should deal with other, more advanced and computational problems.

Sen made a great point. IT companies do experimental design, not in the agricultural way, but for things like online marketing campaigns (direct mail vs. phone) and A/B testing. I also feel that we focus too much on basic and traditional material; we need to learn some new ideas and methods. Mining of Massive Datasets is another great book I would recommend.

There are tons of topics to cover in the field of regression. We cover linear regression pretty well, but I would like to see broader and more advanced regression topics.

A couple of thoughts on this:

1. We need to recognize that we are training two groups of people for two very different purposes: MS candidates who will work in industry and PhD candidates who will become professors (let's ignore for a moment that some PhD candidates go into industry also). Those two groups have very different end destinations and need very different educations. This idea of making the sequence one-size-fits-all doesn't serve anybody. Since I'm an MS now in industry, I'm going to limit the remainder of my comments to what is best for that group.

2. Outside academia, almost no one uses R. Everything is SAS. I would estimate that not exposing our MS students to SAS decreases their market value by something like $20,000/year. It doesn't matter if R is intrinsically better than SAS; SAS dominates the corporate world, and nobody is going to go around and convince every analytics group in the world that they should switch. Also, nobody buys the old "SAS and R are basically the same" line, because it's not true. Every MS candidate should have enough SAS training by the end of the applied sequence that they could pass the BASE SAS Level 1 and 2 certifications, and should have to pass one of the specialized SAS certifications by the end of the second year. In fact, that should be a requirement of the plan B project. Doing at least some work in MATLAB would also be a good idea.

3. Nobody in industry is ever going to ask you to perform a proof. It's much more important to understand how to implement and interpret a broad range of methodologies. Theory should still be presented in lecture, but every homework and exam problem should use a real life data set -- not a random data set from a parametric distribution, but an honest-to-god, drawn-from-life, doesn't-fit-perfectly-into-our-assumptions data set. 95% of the data sets I work with do not follow a normal distribution. 95% of the methods I learned in grad school started out by assuming a normal distribution.

4. Programming optimization needs to be introduced into the curriculum. Right now, industry-wide, analysts are working with enormous data sets and computers that are too old and slow to handle them. Being able to write code that processes data as efficiently as possible is extremely important. Tricks like altering your system options and knowing how the various procedures process data in comparison to one another make a world of difference in that environment.

5. A much broader variety of methodologies should be taught in the applied sequence. In fact, by the end of the sequence, students should have at least a basic familiarity with all of the major statistical methods, including data mining, time series, nonparametric, multivariate, etc. If I have a problem at work, the first step toward solving it is knowing all the methods available to me. If I know a method exists, I can research it and figure it out. But if it didn't get covered in the applied sequence and I didn't happen to take that elective, that method might as well not exist for me. A very in-depth knowledge of one or two methods is not nearly as valuable to me as a basic familiarity with a lot of methods. Studying those methods independently and in greater detail is something you will have to learn as part of an applied plan B anyway.

6. Right now, everything is big data and machine learning. Most of the jobs I get contacted about are marketing driven; they are mostly looking for someone who can build an unsupervised learning model that will help them cluster consumers to sell them more stuff.

For PhD candidates, the sequence as it exists makes more sense (I would guess). In fact, I can see how it would work extremely well for someone entering a career in academia. But for masters students, I would gut it and start over.

Less regression is probably okay, but let's not remove it entirely. I think there's a lot of value in having one of our professors discuss regression to give it proper place in the students' toolbox, even if we don't spend a lot of time on it, and also make sure the students are all on the same page. We definitely don't want to give the impression that we're skipping it because it's outdated, and other methods are always to be preferred.

More clarity on experimental units, random effects, etc., especially at the MS level. I know of at least one recently graduated MS student who didn't really get what a random effect was, or what it meant that the data set we were discussing had more than one level of experimental unit.

Students don't have a good grasp of the computational tools they need. While I understand that our purpose is not to teach R, we absolutely cannot let these students graduate without being able to do basic tasks. For example, many of our students don't know how to make trellis-style plots (facets) or reshape data from wide to long format. (The new computing class will hopefully help; if so, these classes should coordinate.)
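For anyone unsure what "wide to long" means, here is a small sketch of the reshaping itself. It is written in plain Python rather than R purely to show the logic; the function name, column names, and scores are hypothetical, and in practice you would use whatever reshaping tool your software provides instead of a hand-written loop.

```python
from typing import Any, Dict, List


def wide_to_long(
    rows: List[Dict[str, Any]],
    id_col: str,
    value_cols: List[str],
    var_name: str = "variable",
    value_name: str = "value",
) -> List[Dict[str, Any]]:
    """Turn one row per subject into one row per (subject, measurement)."""
    long_rows = []
    for row in rows:
        for col in value_cols:
            long_rows.append(
                {id_col: row[id_col], var_name: col, value_name: row[col]}
            )
    return long_rows


# Wide format: one row per subject, one column per week (made-up scores).
wide = [
    {"subject": "s1", "week1": 5.1, "week2": 5.8},
    {"subject": "s2", "week1": 4.7, "week2": 5.0},
]

# Long format: one row per subject-week, which is what faceted plots
# and mixed-model software typically expect.
long_rows = wide_to_long(
    wide, "subject", ["week1", "week2"], var_name="week", value_name="score"
)
for r in long_rows:
    print(r)
```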

Although I'm a fan of the random effects/split plot sections of 8052, I think it's less important to go through the calculations for expected mean squares than it is to know how likelihood-based methods compare to the more traditional methods. What are those older methods? Are they ever still appropriate? Do the new methods give the same results?


This page contains a single entry by published on September 30, 2013 1:29 PM.
