Reading, Life Indexing, and Self-Driving Cars

After the last entry I wrote about an article on self-driving cars, I looked forward to the new installment of the series of stories it was based on. I recommend reading that article if you're interested in the topic, but it didn't really provide enough grist for another entry on applied AI.

But while reading, I had an interesting thought. I noticed the author making many factual claims, which I believed, but I was somewhat surprised that they were not justified or explained. The standard blogging technique is to link to background reading or supporting arguments when a claim requires support but the argument does not fit within the scope of the entry. This is not always easy, though, and I can empathize with the author. He, like most technology professionals, probably does a lot of reading, both for pleasure and for work, and doesn't always realize while reading which points will be important to remember and reference later.

This is why I would really enjoy an application or web browser add-on that could index every webpage I read all day, every day, on every computer. This application would take advantage of the rapidly decreasing price of storage, which makes indexing even the entirety of the text that voracious readers consume a trivial task. The closest thing I have seen to something like this is Google's "Web History" widget, which keeps track of your Google searches. If you were looking to build something like this to be immediately useful, you could leverage any web history you have as a starting point, even if it doesn't make up the entirety of your reading list.
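To make the idea concrete, here is a minimal sketch of the storage side in Python, using the standard library's sqlite3 module. Everything about it is an assumption for illustration: the table layout, the function names (open_reading_log, record_page, pages_containing), and the idea that some browser add-on hands over each page's URL and plain text. It stores the full text of every page visit and answers a plain substring lookup.

```python
import sqlite3
import time


def open_reading_log(path=":memory:"):
    """Open (or create) the reading log. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT NOT NULL,"
        " read_at REAL NOT NULL,"
        " body TEXT NOT NULL)"
    )
    return conn


def record_page(conn, url, body, read_at=None):
    """Store the full, uncompressed text of one page visit."""
    if read_at is None:
        read_at = time.time()
    conn.execute(
        "INSERT INTO pages (url, read_at, body) VALUES (?, ?, ?)",
        (url, read_at, body),
    )
    conn.commit()


def pages_containing(conn, phrase):
    """Return URLs of pages whose text contains the phrase, newest first."""
    rows = conn.execute(
        "SELECT url FROM pages"
        " WHERE instr(lower(body), lower(?)) > 0"
        " ORDER BY read_at DESC",
        (phrase,),
    )
    return [url for (url,) in rows]
```

Passing a file path instead of ":memory:" would keep the log between sessions. Substring matching is only a placeholder here; the point is that collecting the raw text requires nothing clever.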

Now, applications. First of all, the inspiration behind this: the supporting link that eludes the voracious reader. The dream application would work like this: While writing a blog post (or journal article, or whatever), you realize a claim you are making would be more impactful with a direct citation. Simply copy a sequence of words (perhaps the "anchor text" you would use if you were hyperlinking), and paste it into the indexing engine.

In its simplest form, the results can come from a Google-type bag-of-words string matching algorithm, ranked by relevance. But more advanced approaches could use something like the topic model approaches we've discussed to return documents that match not only the keywords but also the topics represented by the search query. The most advanced step (and the subject of much of my research) is to use natural language understanding techniques to actually index the relations between referents in the documents, and within the query, and find a match based on a document making similar claims about these relations as your query phrase. Of course, there are many incremental steps between current research and this last technology, but there is no reason why the data collection part cannot begin right now, even without any simplifications or compression done to the text data.
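The simplest form, bag-of-words matching ranked by relevance, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any real search engine scores results: the relevance measure is a basic TF-IDF weighting chosen for this example, and the names tokenize and rank are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, documents):
    """Rank documents (a dict of id -> text) against a pasted phrase.

    Word order is ignored (bag of words). Each shared word contributes its
    frequency in the document times a smoothed inverse document frequency.
    Returns (doc_id, score) pairs, best first, omitting non-matches.
    """
    bags = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
    n_docs = len(bags)

    # Number of documents in which each word appears at least once.
    doc_freq = Counter()
    for bag in bags.values():
        doc_freq.update(bag.keys())

    query_terms = set(tokenize(query))
    scores = {}
    for doc_id, bag in bags.items():
        length = sum(bag.values()) or 1
        score = 0.0
        for term in query_terms:
            if term in bag:
                idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
                score += (bag[term] / length) * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Pasting a candidate anchor text as the query returns the pages that share the most distinctive words with it. The topic-model and relation-based approaches would replace the scoring step while the collected text stays the same.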