June 13th, 2008

idea, intrigue

Say what you mean

So a while back I mentioned using Contextual Network Graphs as translation tools, then said nothing more about it.

Then at Google I/O I saw a talk by Jeff Dean that covered somewhat similar ideas, based on a statistical approach to machine translation (the relevant stuff starts at 40:53).

So my idea was this.

You take a corpus of text in a given language, the larger the better, and every time two words appear next to each other you either create a link between them or make the link between them stronger (depending on whether you've seen that juxtaposition before).
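To make that concrete, here's a minimal sketch of the graph-building step. The function name `build_graph`, the whitespace tokenising and the choice to strengthen a link by adding 1 each time are all my own assumptions for illustration, not a fixed recipe.

```python
from collections import defaultdict


def build_graph(corpus):
    """Return {word: {neighbour: weight}} built from adjacent word pairs.

    Assumption: a "link" is undirected and its strength is simply the
    number of times the two words were seen next to each other.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        words = sentence.lower().split()
        # zip(words, words[1:]) walks over every pair of neighbouring words.
        for left, right in zip(words, words[1:]):
            graph[left][right] += 1  # creates the link or strengthens it
            graph[right][left] += 1
    return graph


english = build_graph(["dog sled", "dog house", "dog food", "dog house"])
```

Seeing "dog house" twice leaves that link stronger than the "dog sled" one, which is the whole point.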

Then you do it again with another language. Then you analyse the two graphs to look for similar patterns. An easy way to do this would be to take a word that you know is the same in both languages, energise it in both graphs and then list the resulting energised nodes in both.

So, for example, you might have a corpus in both English and French containing the phrases "dog sled", "dog house", "dog food", "dog catcher" and "dog end" in the respective languages, and you happen to know that "dog" in French is "chien". So you activate the "dog" and "chien" nodes with the same amount of energy in both graphs, wait for the energy spread to settle and then list the energised nodes. Comparing them, you might find that "house" has a residual energy of (completely arbitrarily) 0.0034 and that "maison" has a residual energy of 0.0032. The closeness of those two values gives you a rough probability that the two words mean the same thing.
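Here's a rough sketch of the energising step on two toy graphs. Everything specific in it is an assumption made up for illustration: the `energise` function, the `decay` and `threshold` parameters, the rule that a node passes on a fixed fraction of its energy split by link strength, and the tiny hand-written graphs themselves.

```python
def energise(graph, seed, energy=1.0, decay=0.5, threshold=0.001):
    """Spread energy outwards from `seed` and return {word: residual energy}.

    Assumed rule: each node keeps the energy it receives and passes on
    `decay` times that amount, shared between its neighbours in proportion
    to link strength. Shares smaller than `threshold` are dropped, which
    is what makes the spread settle.
    """
    levels = {}
    pending = [(seed, energy)]
    while pending:
        node, received = pending.pop()
        levels[node] = levels.get(node, 0.0) + received
        total = sum(graph[node].values())
        for neighbour, weight in graph[node].items():
            share = received * decay * weight / total
            if share >= threshold:
                pending.append((neighbour, share))
    return levels


# Hand-made toy graphs with the same shape in both languages (hypothetical data).
english = {
    "dog": {"house": 2, "sled": 1, "food": 1},
    "house": {"dog": 2},
    "sled": {"dog": 1},
    "food": {"dog": 1},
}
french = {
    "chien": {"maison": 2, "traîneau": 1, "nourriture": 1},
    "maison": {"chien": 2},
    "traîneau": {"chien": 1},
    "nourriture": {"chien": 1},
}

english_levels = energise(english, "dog")
french_levels = energise(french, "chien")

# List both sets of energised nodes, strongest first, to compare by eye.
for levels in (english_levels, french_levels):
    print(sorted(levels.items(), key=lambda item: item[1], reverse=True))
```

Because the two toy graphs are deliberately identical in shape, "house" and "maison" end up with matching energies. With real corpora the values would only be roughly close, as in the 0.0034 versus 0.0032 example.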

You do that with a few other words and you end up with a bunch of statistics about which words are the same. Do that with a large enough corpus and, for each word you know, you have a list of words in the foreign language that it might be (with the probabilities that they're the same). Hell, you can even find out the probabilities of what they mean in different contexts (by activating the surrounding words as well) so that you can know that "free" can mean "gratuit" or "libre" depending on the context.
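As a sketch of how those statistics might be gathered, the snippet below takes the residual energies from several seed words and turns them into a table of candidate translations. The `similarity` formula, the `candidate_table` name and all the numbers are invented for illustration (the second run stands in for some other known word pair), so treat it as one possible scoring scheme rather than the right one.

```python
from collections import defaultdict


def similarity(a, b):
    """1.0 when two residual energies are equal, nearer 0.0 as they diverge."""
    top = max(a, b)
    return 1.0 - abs(a - b) / top if top else 0.0


def candidate_table(runs):
    """Combine several energising runs into {english: {french: score}}.

    Each run is a pair (english_levels, french_levels) produced by
    energising one known word pair. Scores are summed over the runs and
    then normalised so each English word's candidates add up to 1.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for english_levels, french_levels in runs:
        for en_word, en_energy in english_levels.items():
            for fr_word, fr_energy in french_levels.items():
                totals[en_word][fr_word] += similarity(en_energy, fr_energy)
    table = {}
    for en_word, scores in totals.items():
        norm = sum(scores.values())
        table[en_word] = {fr: s / norm for fr, s in scores.items()} if norm else {}
    return table


# Made-up residual energies from two different seed pairs.
runs = [
    ({"house": 0.0034, "food": 0.0012}, {"maison": 0.0032, "nourriture": 0.0011}),
    ({"house": 0.0020, "food": 0.0030}, {"maison": 0.0021, "nourriture": 0.0028}),
]
table = candidate_table(runs)
best_for_house = max(table["house"], key=table["house"].get)
```

The more seed words you run, the less any single coincidence in energy levels matters, which is why a bigger corpus and more known words should help.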

Now, this obviously has several flaws in it, but when I did a rough and ready test a few years back I was pleasantly surprised by the results. And I'm kind of gratified that Google's approach doesn't appear to be completely dissimilar.