Thursday, July 12, 2007

Google History

GoogleAnswers is a simple Google API experiment. The program tries to answer historical questions whose answers are years. It does this by scanning the top 10 pages Google returns for the question (fetched from the Google cache) for years, i.e. sequences of four digits. The years that appear most often are good candidates for being the answer.
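
The counting step can be sketched roughly as follows. This is a minimal illustration, not the original program: it assumes the page texts have already been fetched, and both the name `count_years` and the 1800-2099 range in the pattern are assumptions made for the example.

```python
import re
from collections import Counter

# A year is taken to be exactly four digits in the range 1800-2099 that is
# not part of a longer run of digits (the range is an assumption).
YEAR_PATTERN = re.compile(r"(?<!\d)(?:1[89]\d\d|20\d\d)(?!\d)")


def count_years(pages):
    """Count how often each candidate year appears across the page texts."""
    counts = Counter()
    for text in pages:
        counts.update(int(match) for match in YEAR_PATTERN.findall(text))
    return counts


# Two short stand-in "pages" instead of real cached search results.
pages = [
    "The Wright brothers first flew in 1903. The 1903 flight lasted 12 seconds.",
    "Call 5551903123 for details. This page was last updated in 1999.",
]
year_counts = count_years(pages)
print(year_counts.most_common(2))
```

The lookarounds keep the pattern from picking a "year" out of a longer number such as a phone number.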

Of course some numbers occur more often than others on the Internet. 1999 appears far more often than 1907, so if we find 1907 as often as 1999 in the result pages, then 1907 is probably the answer. To correct for these frequency differences, I let Google search for every number from 1800 upwards and recorded the hit counts. This file is available separately. After the frequencies are corrected, a new list of candidates is established.
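
One way to apply such a correction is to divide each year's count in the result pages by its overall hit count. The sketch below assumes that simple ratio; the exact formula is not given here, the function name `correct_for_background` is made up, and the background numbers are invented purely for illustration.

```python
def correct_for_background(counts, background):
    """Rank years by page count relative to their overall web frequency.

    counts:     year -> occurrences in the result pages
    background: year -> overall hit count for that number (assumed input)
    """
    scores = {}
    for year, n in counts.items():
        base = background.get(year)
        if base:  # skip years with no (or zero) background count
            scores[year] = n / base
    return sorted(scores, key=scores.get, reverse=True)


# Illustrative numbers only: both years occur equally often in the pages,
# but 1999 is far more common on the web in general.
counts = {1907: 5, 1999: 5}
background = {1907: 1_000, 1999: 50_000}
ranked = correct_for_background(counts, background)
print(ranked)
```

With these inputs 1907 ranks ahead of 1999, which matches the reasoning above.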

The third refinement is to have Google count the number of pages on which each candidate answer occurs together with the question (i.e. what has been called the googleshare of the candidate answer for the question). The candidate with the most hits is our best guess. You could correct this result again for the relative frequencies of numbers on the Internet, but I'm not sure whether that helps.
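
A rough sketch of this ranking step follows. The search call is deliberately left as a function passed in by the caller: `hit_count` is a hypothetical callable that returns the number of results for a query, not a real Google API function, and the counts in the example are invented.

```python
def rank_by_cooccurrence(question, candidates, hit_count):
    """Order candidate years by how many hits 'question + year' gets.

    hit_count is a caller-supplied function: query string -> result count.
    """
    hits = {year: hit_count(f"{question} {year}") for year in candidates}
    return sorted(candidates, key=lambda year: hits[year], reverse=True)


# Stand-in for a real search backend, with made-up counts.
fake_hits = {
    "first powered flight 1903": 120_000,
    "first powered flight 1999": 8_000,
    "first powered flight 1907": 15_000,
}


def fake_hit_count(query):
    return fake_hits.get(query, 0)


best_first = rank_by_cooccurrence(
    "first powered flight", [1999, 1907, 1903], fake_hit_count
)
print(best_first)
```

Passing the counter in as a parameter keeps the ranking logic testable without an API key.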
Does it work? Not perfectly, but better than I expected. Events are placed pretty accurately in time. Events in people's lives less so, probably because people have so many events in their lives. But please experiment with the code and improve it!

Please note that you need a Google API key to use this code.
http://douweosinga.com/