The example commands given above are fairly simple, and this is clearly the best way to start. If I tried to get the system to understand "Computer, could you be a dear and roll down the rear starboard window halfway, please," I would probably not meet with early success. While Sphinx's per-word accuracy is fairly high, over a complete utterance the accuracy is roughly word_acc^N, where word_acc is the per-word accuracy and N is the number of words in the utterance. So with a per-word accuracy of, say, 90%, recognition of a mere 5-word utterance dips to 59%, and a 10-word utterance drops to 35%. By limiting the vocabulary, though, I can increase single-word recognition accuracy, and by keeping the commands as short "stock" phrases I can increase the total utterance recognition accuracy.
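To make the falloff concrete, here's the word_acc^N arithmetic as a one-liner (nothing Sphinx-specific, just the formula above):

```shell
# Utterance accuracy ~= (per-word accuracy) ^ (number of words).
# With 90% per-word accuracy, over 5 and then 10 words:
awk 'BEGIN { printf "5 words: %.2f\n10 words: %.2f\n", 0.9^5, 0.9^10 }'
# prints:
# 5 words: 0.59
# 10 words: 0.35
```

Those numbers drop off fast, which is why short stock phrases matter so much.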
I have started in this direction a little bit already: I used this corpus as a starting point for commands to be recognized, then uploaded that corpus file to the CMU online language modeling tool to obtain this language model file.
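For anyone who hasn't used the CMU tool: the corpus it expects is just a plain text file with one utterance per line. A minimal sketch of the format, built only from words mentioned elsewhere in this post (the actual corpus is the file linked above):

```shell
# Write a tiny example corpus file; these four phrases are illustrative,
# not the real contents of the linked corpus.
cat > car.corpus <<'EOF'
VOLUME UP
VOLUME DOWN
WINDOW UP
WINDOW DOWN
EOF
```

Uploading a file in that shape to the language modeling tool is what produces the language model.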
If you would like to try this on your own computer, download the files above, then install sphinx2. A sample language model will be set up in "/usr/local/share/sphinx2/lm/turtle/" (I'm not sure of the exact directory; I'm doing this from memory). The turtle directory contains a sample language model suitable for controlling a turtle graphics system. Since my environment is much different, I created a new directory called "/usr/local/share/sphinx2/lm/car/" and unpacked the above tarball there. Sphinx2 also installs a program called sphinx2-test in "/usr/local/bin"; it's a script that loads the turtle language model and calls sphinx-continuous, a program that just waits for speech input and outputs the most likely utterance. Make a copy of sphinx2-test called sphinx2-car and edit the script to use your car language model instead of the turtle model - it's only a couple of lines. Whew! That sounds like a lot of work, but if you're familiar with *nix at all it should be fairly trivial.
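Concretely, the steps boil down to something like this (every path here is from memory, and "car-lm.tar.gz" is a stand-in name for the tarball linked above; check against your own install):

```shell
# Unpack the car language model alongside the turtle one:
mkdir -p /usr/local/share/sphinx2/lm/car
tar -xf car-lm.tar.gz -C /usr/local/share/sphinx2/lm/car

# Copy the demo script and point it at the car model instead of turtle.
# In my copy, only a couple of lines referenced the turtle directory,
# so a blanket substitution does the job:
cp /usr/local/bin/sphinx2-test /usr/local/bin/sphinx2-car
sed -i 's|lm/turtle|lm/car|g' /usr/local/bin/sphinx2-car
```

If the substitution in your copy of the script isn't that clean, just open it in an editor and swap the turtle paths for the car ones by hand.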
I did all this on my computer and ran the sphinx2-car program. So far it hasn't missed a beat, as long as I "stick to the script." I tried it the morning after I had set it up and forgot that the language model doesn't know the word "radio" or "stereo," just the words "volume," "up," and "down," so I obviously wasn't met with much success. The next thing I tried was playing loud music near the microphone to see if that bothered it. Obviously, if I'm using this system to control my stereo there will be background music, and I hope that isn't a deal-breaker. The first experiment was The Shins - this worked fine - I think the singing is too high-pitched to be interpreted as a voice, since the system was probably trained mostly on male speakers. The next experiment was Johnny Cash. This is a much tougher test, because not only does he have a nice deep voice, but much of what constitutes "singing" for the Man in Black is indistinguishable from talking. The system did interpret some of this music as speech, but it didn't assign it any words that it knew. So, as long as there aren't any Johnny Cash songs where he yells out "Roll the window up!" or "Smash into the car in front of you!" I should be safe.
So, the simplest part of the project is effectively completed. The remaining work to be done in speech recognition is the following: