Topic Models in Presidential Debates

The 2008 presidential debates are now behind us, which is kind of a relief from a political perspective, but from the point of view of statistical natural language processing it means one source of data has unfortunately dried up. With that in mind, I wanted to see if I could do anything interesting with the data.

I've been itching to try Andrew McCallum's Mallet software for language-related machine learning tasks to see how easy it is to use, so I thought this would be a good opportunity to give it a test drive. Bottom line: it was really easy to do what I'm about to show.

The tool I used is Mallet's topic modeling tool, which uses a Latent Dirichlet Allocation (LDA) model to build a model of the topics used in the data set. Functionally, you give it the number of topics you want and a text data set, and it gives you back the set of words that best represent each topic. So where do the topics come from? Well, the algorithm figures them out! Often, but not always, one can simply look at the representative words for each topic and give it a simple category label. Exactly how it works is a difficult subject, and even a layman's explanation would take too much time and space for today, but I'll keep that in mind for a future post. For now, let's get to the results.
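As a quick aside for readers who want to see the "number of topics plus text in, representative words out" idea in code, here is a minimal sketch of LDA trained with collapsed Gibbs sampling in plain Python. It is not Mallet's code and not what produced the results in this post; the function names (train_lda, top_words), the hyperparameters alpha and beta, and the four toy "documents" are all made up for illustration.

```python
import random
from collections import Counter

def train_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not Mallet's implementation).

    docs is a list of token lists; returns one Counter of word counts per topic.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # z[d][i] is the topic currently assigned to the i-th token of document d.
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * num_topics for _ in docs]          # tokens per topic in each document
    topic_word = [Counter() for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                        # tokens per topic overall
    for d, doc in enumerate(docs):
        for w, t in zip(doc, z[d]):
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Take this token's current assignment out of the counts.
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Weight of topic k: (how much document d uses k) * (how much k uses word w).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                # Resample the token's topic in proportion to those weights.
                t = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return topic_word

def top_words(topic_word, n=5):
    """Up to n most frequent words per topic, skipping words whose count fell to zero."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]

# Made-up toy documents, not taken from the debate transcripts.
docs = [
    "tax jobs economy tax spending".split(),
    "troops iraq war troops afghanistan".split(),
    "economy spending jobs tax".split(),
    "war iraq afghanistan troops".split(),
]
topics = top_words(train_lda(docs, num_topics=2), n=3)
for words in topics:
    print(" ".join(words))
```

Each printed line is one topic, in the same spirit as the word lists in this post. Sampling is random, so which words land in which topic depends on the seed and the number of iterations, and a corpus this small says nothing about how a real tool behaves on real text.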

The results below show 3 "experiments": one for each candidate's words alone, and a third that combines both candidates' words with the moderator's. For each experiment there are 5 lines, one per topic, and each topic is represented by the words most likely to be generated given that topic. You'll notice that some words appear in more than one topic -- this is OK! I will refrain from doing any analysis here; if anyone wants to, they can read the topic lists, try to work out their own topic labels, and post them in the comments, and I will do the same. Without further ado, here we go!

McCain Topic Models (Trained on only John McCain's responses)

obama people ll don american taxes country world voted give years long oil state energy war washington fundamental increase countries
america tax americans back friends record nuclear home things economy tough security working power sit national times tom jobs fine
senator united states spending lot iraq strategy thing general control job georgia troops important defense defeat afghanistan russians person ukraine
sen america joe money business campaign reform fact made small dollars jobs children party businesses wealth pay spread spending choice
ve time government health make president care point billion senate understand issue work cut great fought plan today young programs

Obama Topic Models (Trained on only Barack Obama's responses)

senator billion john deal iran afghanistan troops states problem nuclear united iraq back part pakistan world war military al place
sen policy point economic pay campaign education college ll additional afford future tough trade young taxes joe free based behalf
ve mccain important don work making economy means cut understand lot america fact put end companies middle change government credit
tax president years energy time spending country things oil bush year families working issues talk absolutely made good korea cuts
make people health care give percent money american plan provide system policies true issue crisis doesn insurance support businesses agree

All topic models (Trained on Obama, McCain, and moderator utterances)

senator obama time question afghanistan troops country security tonight war strategy russia pakistan tom ph georgia lead defense general street
sen obama ll give campaign america joe country pay reform time education plan billion voted jobs small trade free fine
spending united states america government back iraq understand nuclear americans lot don record taxes senate home business great friends washington
mccain make important president billion things point year john deal iran crisis problem means bush issues plan policies talk making
ve people health care tax years american energy work world economy oil money issue cut don percent fact working insurance

To reproduce this you need the following:
  1. Debate transcripts: Debate 1, Debate 2, Debate 3 (Click on "Print" and then copy/paste the text into a file.)
  2. Perl script to extract text by candidate (Download and change the extension from .txt to .pl)
  3. Mallet package and instructions on topic modeling
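For anyone who would rather not run Perl, step 2 can be sketched in Python along these lines. This is not the script linked above. It assumes each speaker turn in the pasted transcript starts with an upper-case label followed by a colon, which may not match the actual transcripts; the function name split_by_speaker and the sample text (with placeholder speaker labels) are invented for illustration.

```python
import re
from collections import defaultdict

# Assumed format: a turn starts with an upper-case speaker label and a colon,
# e.g. "MODERATOR: ...". Adjust the pattern if the real transcripts differ.
SPEAKER = re.compile(r"^([A-Z][A-Z .'-]*):\s*(.*)$")

def split_by_speaker(transcript):
    """Map each speaker label to the list of text lines attributed to that speaker."""
    turns = defaultdict(list)
    current = None
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        match = SPEAKER.match(line)
        if match:
            # A new turn begins: switch speaker and keep the rest of the line.
            current, line = match.group(1), match.group(2)
        if current is not None and line:
            # Unlabeled lines are treated as a continuation of the current speaker.
            turns[current].append(line)
    return dict(turns)

# Invented sample text, not an excerpt from any debate.
sample = """\
MODERATOR: First question.
CANDIDATE A: Thank you.

We should cut spending.
CANDIDATE B: We should invest in energy.
"""
by_speaker = split_by_speaker(sample)
print(by_speaker["CANDIDATE A"])
```

Writing each speaker's lines to a separate file or directory would then give one data set per experiment to hand to the topic modeling tool.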

Comments

OK there might be a few topics here that I can pick out.

McCain #3 and Obama #1 both seem to be roughly "Foreign policy"

McCain #2 seems to be economy - "tough" and "times", "jobs", "working", "tax", "economy", and even "nuclear", which McCain has associated with job-creation.

Obama #5 seems to be health care.

All #1 is defense/foreign policy
All #2 is economic issues?
All #5 is "domestic policy/programs" (pretty broad I know)

fascinating.

additionally, it's interesting how our human minds seek to comprehend the associated words in each topic. this is akin to removing all discourse and syntax and only using lexical semantics to understand meaning... yet we still use the other words in context to build our concept of coherence and meaning.

i wonder if giving a smaller list of representative words would render it more understandable? without the syntax and discourse structure, it is difficult to "chunk" the information within each topic, so perhaps a whole "memory element" in our short term memory is taken up by each word and we run out of space.