As I mentioned yesterday, now that I have some rudimentary speech recognition, I would like to try controlling my stereo from a computer connection. One thing that may help me is the fact that my car stereo has a removable face - this means that there is a very well-defined interface point between the controls (the face) and the brains (the non-face part).
As you can see, there is a 15-pin connection that connects the face to the back matter (I really wish I had a better name for that). The next sub-goal of this project is to hook up wires running from the back matter to the face. Using an oscilloscope (no, I don't have one of these, and yes, I know they are expensive), I can hopefully eavesdrop on the signals that are sent between the two parts. By experimenting and pressing buttons on the face, I should be able to capture signals, and then later try to replicate them on the ports of my laptop. The stuff I wrote yesterday about speech recognition was trivial for me, but this stuff, which is probably trivial to any EE undergrad, leaves me utterly afraid. Okay, Tim, take a deep breath, relax, and begin to subdivide the task. Thank you, other voice in my head, I feel much better now.
The first sub-sub-goal is now to come up with the wire connections between the face and the back matter. If I can't get this right, then there is no point in even worrying about an oscilloscope. There are some pictures below of the sides of the back matter/face connection.
As you can see, there are little pins sticking out on each side that grab onto holes on the side of the face. One possibility is that I can make use of those to make a wire connection that is able to hold itself in place using natural methods. Honestly, I'll probably just tape some wires onto a piece of wood, which itself will be taped to the car dash.
The example commands given above are fairly simple, and this is clearly the best way to start. If I try to get the system to understand "Computer, could you be a dear and roll down the rear starboard window halfway, please," I would probably not meet with early success. While the accuracy rate of each word in Sphinx is fairly high, the accuracy over a complete utterance is roughly (word_acc^N), where word_acc is the per-word accuracy and N is the number of words in the utterance (this treats the errors on each word as independent). Therefore, for a per-word accuracy of e.g. 90% and a mere 5 words, the utterance recognition accuracy dips to 59%, and at 10 words it drops to 35%. By limiting the vocabulary, though, I can increase the single-word recognition accuracy, and by keeping the commands as short "stock" phrases I can increase the total utterance recognition accuracy.
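To make the arithmetic concrete, here is a minimal sketch of the word_acc^N estimate in Python. The function name utterance_accuracy is just something I made up for illustration, and the numbers assume every word is recognized independently with the same accuracy, which is a simplification rather than anything Sphinx itself reports.

```python
def utterance_accuracy(word_acc: float, n_words: int) -> float:
    """Estimate the chance that every word in an utterance is recognized.

    Assumes each word is recognized independently with the same
    per-word accuracy, so the probabilities simply multiply.
    """
    return word_acc ** n_words


if __name__ == "__main__":
    # Per-word accuracy of 90%, as in the example in the text.
    for n in (1, 5, 10):
        print(f"{n:2d} words: {utterance_accuracy(0.90, n):.0%}")
```

Trying a higher per-word accuracy (say 0.98, as a guess at what a tiny vocabulary might buy) shows why short stock phrases plus a limited vocabulary are the way to go.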
I have started in this direction a little bit already. I used this corpus as a starting point for commands to be recognized. Using the CMU online language modeling tool, I uploaded the above-mentioned corpus file to obtain this language model file.
If you would like to try this on your own computer, download the files above. Then install sphinx2 on your computer. There will be a sample language model set up in "/usr/local/share/sphinx2/lm/turtle/". I'm not sure of the exact directory; I'm doing this from memory. The turtle directory contains a sample language model that is suitable for an environment to control a turtle graphics system. Since my environment is much different, I created a new directory called "/usr/local/share/sphinx2/lm/car/" and unpacked the above tarball there. Now, there is a program installed with sphinx2 called sphinx2-test, located in "/usr/local/bin". This is a script which loads the turtle language model and calls sphinx2-continuous, which is a program that just waits for speech input and outputs the most likely utterance. Now make a copy of sphinx2-test called sphinx2-car and edit the script to point to your car language model instead of the turtle model - it's only like two lines. Whew! That sounds like a lot of work, but if you're familiar with *nix at all it should be fairly trivial.
I did all this on my computer, and ran the sphinx2-car program. So far, it hasn't missed a beat, as long as I "stick to the script." I tried it the morning after I had set it up, and forgot that the language model doesn't know the word "radio" or "stereo," just the words "volume," "up," and "down," so I obviously wasn't met with much success. The next thing I tried was playing loud music near the microphone to see if that bothered it. Obviously if I'm using this system to control my stereo there will be background music, and I hope that isn't a deal-breaker. The first experiment was The Shins - this worked fine - I think the singing is too high-pitched to be interpreted as a voice, since the system was probably trained mostly on male voices. The next experiment was Johnny Cash. This is a much tougher test, because not only does he have a nice deep voice, but much of what constitutes "singing" for the Man in Black is indistinguishable from talking. The system did interpret some of this music as speech, but it didn't assign it any words that it knew. So, as long as there aren't any Johnny Cash songs where he yells out "Roll the window up!" or "Smash into the car in front of you!" I should be safe.
So, the simplest part of the project is effectively completed. The remaining work to be done in speech recognition is the following:
The tagline of this site is "Tracking the failure of my latest pie in the sky project." Sometimes I start projects that are a little too ambitious for my limited time, finances, and intelligence. One reason for keeping a project log (plog) is that with my potential failure or success out in the public, I may be more motivated to actually complete the project. The worst case scenario is that I'll have a record of my project, and I might be able to tell where I went wrong or right.