
Follow-up on topic models

After talking with some co-workers in the lab, I decided to re-run some of the experiments from below with different topic numbers. The rationale was that five topics seemed like a number that's kind of in "no-man's land." That is, you could say that there are fewer topics (foreign policy, domestic policy, and campaign b.s. was one factorization we came up with), or you could say that there are more topics (Iraq, Iran, Afghanistan, health care, energy policy, terrorism, tax policy, etc.), but choosing exactly five requires a very contrived and odd factorization of the discussion topics. There is probably not enough data to meaningfully fit more than five topics, so I went smaller. Here are the results for three topics:

Obama's Topic Models:

health people care give sen tax american ve plan provide cut insurance policies system lot working policy businesses companies economic
mccain make important president years don energy work point things percent making economy means understand bush true america fact put
ve senator billion john tax time spending deal country oil iran afghanistan problem year troops world states nuclear united iraq

McCain's Topic Models:

ve people ll don time american make tax americans back care friends record nuclear business billion home senate world things
senator obama united states spending lot president iraq point understand strategy work government cut thing great war washington issue sit
obama america sen health taxes country joe voted years money campaign government plan give reform increase fact made dollars history

This is a little bit sneaky, since I'm not really doing a task or evaluation here per se. Rather, I'm just running the training process and looking at the resulting models. As such, it's tempting to just eyeball the training results and pick the topic size whose models look the best (i.e., most interpretable to a human mind). If I had the data readily available, I would probably try every topic size I could, both below and above five. As it is, I decided to try to be a little bit intellectually honest and only post the number we came up with in discussions in the lab last week: three. I don't think the results are extremely illuminating here, but that's life!
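For the curious, here's a minimal sketch of what a training run like this could look like. I'm using gensim's LDA implementation in Python purely for illustration; the post doesn't depend on any particular toolkit, and the pass count and tokenization choices are my assumptions.

```python
# Minimal sketch of the train-and-eyeball process described above, using
# gensim's LDA (my choice for illustration, not necessarily what was used
# here). `documents` is a list of token lists, one per transcript line
# for a single candidate.
from gensim import corpora
from gensim.models import LdaModel

def train_topic_model(documents, num_topics=3, top_n=20):
    dictionary = corpora.Dictionary(documents)
    corpus = [dictionary.doc2bow(doc) for doc in documents]
    lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=20)
    # Print the top words for each topic, roughly like the lists above.
    for topic_id in range(num_topics):
        words = [w for w, _ in lda.show_topic(topic_id, topn=top_n)]
        print(f"Topic {topic_id}: {' '.join(words)}")
    return lda, dictionary
```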

One thing we can take away from this is the importance of finding the right representation of the training data. If I had more time, I would tinker with the training data a bit more. Right now every line in the data counts as one data point, and the line breaks are chosen essentially by the transcriber(s). This should roughly correspond to one topic per line, since a candidate's turn usually consists of talking about the topic raised by the moderator. But this is not foolproof, since the candidates often start by answering the question, then pivot to talk about something else they want to make sure gets mentioned. Finally, the transcription is not perfect, and there are some places where the line breaks may be arbitrary. If I were to do this analysis more carefully, I would use an automatic sentence segmenter and train the models sentence by sentence.
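As a sketch of that sentence-by-sentence preprocessing, something like NLTK's sentence tokenizer would do; the lowercasing and stopword filtering here are my own assumptions about cleanup, not anything from the transcripts.

```python
# Sketch: split a raw transcript into tokenized sentences for training,
# instead of relying on the transcriber's line breaks.
import nltk
from nltk.corpus import stopwords

nltk.download("punkt")
nltk.download("stopwords")

def transcript_to_sentences(raw_text):
    stop = set(stopwords.words("english"))
    documents = []
    for sentence in nltk.sent_tokenize(raw_text):
        tokens = [t.lower() for t in nltk.word_tokenize(sentence)
                  if t.isalpha() and t.lower() not in stop]
        if tokens:
            documents.append(tokens)
    return documents
```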

One more thing: what is this good for? Well, the analysis here and below might not be good for much, for the reasons mentioned in the last paragraph (it's hard to tell). But assuming there were more data and the models formed clear-cut topics, this could be very useful. One thing you might do is use these models to classify new texts. If there were a fourth debate, you could take its transcript and, in combination with the models, cluster each sentence in the new debate and assign it to the topics. The way this usually works in text clustering is through what's sometimes called "soft clustering," in which a sentence isn't assigned to a single cluster, but is assigned to each cluster with a certain weight. So if your three learned topic models roughly correspond to domestic policy, foreign policy, and campaign b.s., a new sentence about Iraq might classify as 70% foreign policy, 25% domestic policy, and 5% campaign b.s., since Iraq policy also has an impact on domestic policy, despite being by definition foreign.
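Concretely, with the model and dictionary from the training sketch above, the soft-clustering step might look like this; the topic labels and the example numbers in the comment are purely illustrative, not learned output.

```python
# Sketch: infer a topic distribution ("soft cluster" weights) for a new
# sentence, using the lda model and dictionary from the earlier sketch.
def classify_sentence(lda, dictionary, sentence_tokens,
                      labels=("domestic policy", "foreign policy", "campaign b.s.")):
    bow = dictionary.doc2bow(sentence_tokens)
    # minimum_probability=0 so every topic gets a weight, however small.
    dist = lda.get_document_topics(bow, minimum_probability=0.0)
    return {labels[topic_id]: round(prob, 3) for topic_id, prob in dist}

# e.g. classify_sentence(lda, dictionary, ["iraq", "troops", "surge"]) might
# come back as something like {"foreign policy": 0.70, "domestic policy": 0.25,
# "campaign b.s.": 0.05} if the learned topics lined up that way.
```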

So I think just that result is cool. But if you're not an AI researcher and are actually concerned with practicality, you might think: but what does that get you? Well, I'm sure there are lots of practical uses, but I'll just list a couple I can think of off the top of my head. You could use a system like this to automatically build a database of a candidate's stances on various issues. You could build a system that took questions from interested voters, assigned them to one of the topic categories, and then retrieved similar statements made by the candidate to try to answer the question. Any other ideas?
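As a rough sketch of that second idea (again assuming the model and dictionary from the sketches above), you could give the voter's question a topic distribution and pull back the candidate statements whose distributions are closest; cosine similarity over the topic weights is just one reasonable choice.

```python
# Sketch: route a voter's question to the candidate statements with the
# most similar topic distributions. `statements` is a list of
# (original_text, token_list) pairs for one candidate.
import numpy as np

def topic_vector(lda, dictionary, tokens):
    bow = dictionary.doc2bow(tokens)
    dist = lda.get_document_topics(bow, minimum_probability=0.0)
    return np.array([prob for _, prob in dist])

def answer_question(lda, dictionary, question_tokens, statements, top_k=3):
    q = topic_vector(lda, dictionary, question_tokens)
    scored = []
    for text, tokens in statements:
        s = topic_vector(lda, dictionary, tokens)
        sim = float(np.dot(q, s) / (np.linalg.norm(q) * np.linalg.norm(s) + 1e-12))
        scored.append((sim, text))
    return [text for _, text in sorted(scored, reverse=True)[:top_k]]
```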