Mon, Jan. 11th, 2010, 12:04 pm
I know what I like and I like what I know

I actually wrote this about 2 years ago and, for some reason, never posted it but I was having a conversation about it recently and figured I might as well publish. And be damned.

Searching isn't a one-step malarkey. Ignoring tokenising and stemming, thesauruses and normalisation, you have to select a working set of documents and then rank them.

How you rank them isn't a simple thing either - it's generally a multi-faceted score based on a whole host of factors, such as a document relevance score like Lucene's, recency, or a trust metric like Google's PageRank, amongst others.
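To make the "multi-faceted score" idea concrete, here's a minimal sketch that blends three factors into one number and sorts by it. The `Doc` fields, the 0..1 scaling and the weights are all made up for illustration; a real system would tune them and probably combine them less naively.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    text_score: float  # assumed: a Lucene-style relevance score scaled to 0..1
    recency: float     # assumed: 0..1, newer documents score higher
    trust: float       # assumed: 0..1, a PageRank-style trust metric


# Hypothetical weights, chosen only for this example.
WEIGHTS = {"text_score": 0.6, "recency": 0.1, "trust": 0.3}


def combined_score(doc):
    """Weighted sum of the individual ranking factors."""
    return sum(weight * getattr(doc, name) for name, weight in WEIGHTS.items())


def rank(docs):
    """Return the working set ordered best-first."""
    return sorted(docs, key=combined_score, reverse=True)


docs = [Doc("a", 0.9, 0.1, 0.1), Doc("b", 0.5, 0.9, 0.9)]
ranked_ids = [d.doc_id for d in rank(docs)]
```

Here document "b" outranks "a" despite the weaker text match, because its recency and trust make up the difference.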

That's enough to get you a basic result set but it's very passive. Shouldn't a search system react to its users and learn what they like, like a particularly skilled lover who is handy with a bottle of baby oil and a lightly blunted cheese grater?

So, you can build a click tracking network, which is a form of neural network. You train it using the raw tokens in the query (note, the raw tokens, not the normalised, stemmed and otherwise munged ones), what the results shown were and which of them the user clicked on.

Chapter 4 of "Programming Collective Intelligence" has some great stuff on this but, in brief - you build a multilayer perceptron (MLP) network consisting of, err, multiple layers of nodes.

The first layer will be the tokens entered in the query, each token representing one node. The last layer will be the weightings for the different matches that were returned (each match being one node). The middle layer (MLPs can have multiple layers but in this case we'll only use one) is never contacted by the outside world but it's still very important.

All nodes on a layer are conceptually connected to all the nodes on its neighbour layer(s).
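As a rough picture of that structure, the sketch below stores one weight per connection: every token node links to every hidden node, and every hidden node links to every document node. The function name `build_network`, the hidden layer size and the random starting weights are assumptions for illustration only.

```python
import random


def build_network(tokens, doc_ids, hidden_count=3, seed=0):
    """Create fully connected token -> hidden -> document weights."""
    rng = random.Random(seed)

    def small():
        # Small random starting weight; the range is an arbitrary choice.
        return rng.uniform(-0.2, 0.2)

    # w_in[token][j] is the connection from a token node to hidden node j.
    w_in = {t: [small() for _ in range(hidden_count)] for t in tokens}
    # w_out[j][doc_id] is the connection from hidden node j to a document node.
    w_out = [{d: small() for d in doc_ids} for _ in range(hidden_count)]
    return w_in, w_out


w_in, w_out = build_network(["polish", "sausage", "furniture"], ["A", "B"])
connections = sum(len(row) for row in w_in.values()) + sum(len(row) for row in w_out)
print(connections)  # 3 tokens * 3 hidden + 3 hidden * 2 documents = 15
```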

First off we train the network - when someone searches for 'polish sausage' and then clicks on Document A we use an algorithm called backpropagation to strengthen the connection between the Document A node and the nodes for 'polish' and 'sausage' using the hidden, middle layer.

Then the next time someone searches for 'polish sausage' we turn on the nodes representing the tokens and they, in turn, turn on nodes in the hidden layer, and when those get a strong enough input they'll turn on nodes in the output layer. In the end the output layer's nodes will have been turned on a certain amount and we can use this amount to calculate how strongly certain words are associated with certain Documents. We then adjust the search ranking accordingly and voilà! Bob is your Mother's Brother.
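Here's a minimal sketch of both halves, training and querying, in plain Python. It assumes tanh activations, a fixed vocabulary of tokens and documents known up front, and a learning rate of 0.5; the class name `ClickNet` and its methods are invented for this example, and the book's version differs in detail (it creates hidden nodes on demand and keeps everything in a database).

```python
import math
import random


class ClickNet:
    """Tiny three-layer perceptron: query tokens -> hidden nodes -> documents."""

    def __init__(self, tokens, doc_ids, hidden=4, seed=0):
        rng = random.Random(seed)
        self.tokens = list(tokens)
        self.doc_ids = list(doc_ids)
        self.hidden = hidden
        self.w_in = [[rng.uniform(-0.2, 0.2) for _ in range(hidden)]
                     for _ in self.tokens]
        self.w_out = [[rng.uniform(-0.2, 0.2) for _ in self.doc_ids]
                      for _ in range(hidden)]

    def feed_forward(self, query_tokens):
        """Turn on the query's token nodes and return each document's activation."""
        self.a_in = [1.0 if t in query_tokens else 0.0 for t in self.tokens]
        self.a_hid = [
            math.tanh(sum(a * self.w_in[i][j] for i, a in enumerate(self.a_in)))
            for j in range(self.hidden)
        ]
        self.a_out = [
            math.tanh(sum(a * self.w_out[j][k] for j, a in enumerate(self.a_hid)))
            for k in range(len(self.doc_ids))
        ]
        return dict(zip(self.doc_ids, self.a_out))

    def train(self, query_tokens, clicked_doc, rate=0.5):
        """One backpropagation step: clicked document targets 1, the rest 0."""
        self.feed_forward(query_tokens)
        targets = [1.0 if d == clicked_doc else 0.0 for d in self.doc_ids]
        # Error at each node times the slope of tanh at its current activation.
        d_out = [(t - o) * (1.0 - o * o) for t, o in zip(targets, self.a_out)]
        d_hid = [
            (1.0 - h * h) * sum(d * self.w_out[j][k] for k, d in enumerate(d_out))
            for j, h in enumerate(self.a_hid)
        ]
        for j, h in enumerate(self.a_hid):
            for k, d in enumerate(d_out):
                self.w_out[j][k] += rate * d * h
        for i, a in enumerate(self.a_in):
            for j, d in enumerate(d_hid):
                self.w_in[i][j] += rate * d * a


net = ClickNet(["polish", "sausage", "furniture"], ["A", "B"])
for _ in range(200):
    net.train(["polish", "sausage"], "A")    # clicked Document A
    net.train(["furniture", "polish"], "B")  # clicked Document B

sausage = net.feed_forward(["polish", "sausage"])
furniture = net.feed_forward(["furniture", "polish"])
```

The two result dictionaries hold a per-document activation that you could fold into the ranking score; the shared token 'polish' ends up being disambiguated by whichever token sits next to it.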

This turns out to be useful because, if someone were to search for 'furniture polish' instead, we can learn from our users that Document A is useless and instead rank Document B higher.

Observant readers will have noticed that this mechanism seems startlingly close to Contextual Network Graphs and they should be applauded. There are differences however and, unfortunately, the margin of this email is too narrow to contain them.

The neat thing is that we can do this at more than one level. We can track the clicks of everyone and we can also track all your clicks, and those will give us different weightings which we can combine somehow (such as, in the simplest case, use your weighting unless you haven't rated this query, in which case fall back to the crowd's rating) to tweak the scores that way.
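The simplest case mentioned there, yours if you have any history for the query and the crowd's otherwise, might look something like this. The dictionaries keyed by `(query, doc_id)` and the function name are assumptions for the sketch, not how any particular system stores things.

```python
def combined_weight(query, doc_id, personal, crowd):
    """Use the user's own weighting for a query they have history on,
    otherwise fall back to the crowd's weighting."""
    user_has_history = any(q == query for q, _ in personal)
    if user_has_history:
        # Documents this user never clicked for the query count as 0.
        return personal.get((query, doc_id), 0.0)
    return crowd.get((query, doc_id), 0.0)


personal = {("donkey", "doc1"): 0.9}
crowd = {
    ("donkey", "doc1"): 0.2,
    ("donkey", "doc2"): 0.7,
    ("mule", "doc3"): 0.5,
}

own = combined_weight("donkey", "doc1", personal, crowd)
own_unclicked = combined_weight("donkey", "doc2", personal, crowd)
fallback = combined_weight("mule", "doc3", personal, crowd)
```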

Or, if we know who your friends are, we can combine their weightings too, perhaps factoring how much we think you like them into the equation too.
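One way to factor in "how much we think you like them" is an affinity-weighted average of your friends' weightings. This is only a sketch under that assumption; the `affinity` values, the data layout and the function name are hypothetical.

```python
def social_weight(query, doc_id, friend_weights, affinity):
    """Average friends' weightings for (query, doc_id), weighted by affinity.

    friend_weights maps friend -> {(query, doc_id): weight}.
    affinity maps friend -> how much we think you like them (0..1).
    Friends with no opinion on this pair are left out of the average.
    """
    total = 0.0
    norm = 0.0
    key = (query, doc_id)
    for friend, weights in friend_weights.items():
        if key in weights:
            a = affinity.get(friend, 0.0)
            total += a * weights[key]
            norm += a
    return total / norm if norm else 0.0


friend_weights = {
    "alice": {("donkey", "doc1"): 1.0},
    "bob": {("donkey", "doc1"): 0.0},
    "carol": {("mule", "doc3"): 1.0},
}
affinity = {"alice": 0.9, "bob": 0.1, "carol": 0.5}

score = social_weight("donkey", "doc1", friend_weights, affinity)
```

With these made-up numbers Alice's opinion dominates Bob's because you like her more, and Carol is ignored because she has no history for the query.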

Now, at first, unless you were dealing with a closed social network, that may have been tricky but now we can utilise the Open Social Graph for maximum funitude. The clouds will part, birds will sing, rainbows will, err, bow and all will be a shiny and happy future with jet packs and hovercraft for all.

Apart from the pesky privacy thing raising its head again.

For example - say you only have one friend on this social network. When you're logged out, a search for "Donkey" gives you jolly pictures of the aforementioned beasts, perhaps wearing jaunty hats or chilling at the beach.

Then you log in and do the same search, and suddenly it brings back images of a more, shall we say, interspecies erotica nature.

At which point you have some handy blackmail material for your single friend.

Ah, you might say, can't we guard against that by, say, only using your friends' data if you have more than a certain number of friends?

See, if there's one thing we've learnt, it's that patterns can be extracted from even carefully anonymised data such as the Netflix recommendation data or, more pertinently, the AOL search results leak. So, if we implemented this, it's possible, nay likely even (since the universe seems to tend towards maximum lulz), that someone would figure out a way to extract the data and work out who had been clicking on what, and suddenly your clownsuit-wearing muskrat fetish is public knowledge.

Which, you know, is nothing to be ashamed of per se but might lead to awkward small talk at the next party you go to.
(Deleted comment)

Wed, Jan. 13th, 2010 11:24 pm (UTC)

Tokenise using the same thing that your search engine is, then either give each token its own id or hash the token into a numerical id. There's really nothing to stop you using the tokens themselves as the input, though.
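If you go the hashing route, something like the sketch below would do. The choice of SHA-1 and the bucket count are arbitrary assumptions; the main point is to use a stable hash, because Python's built-in hash() for strings is salted per process and so can't be stored between runs.

```python
import hashlib


def token_id(token, buckets=2**32):
    """Map a raw token to a stable numerical id."""
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    # Take the first 8 bytes as an integer and fold into the id space.
    return int.from_bytes(digest[:8], "big") % buckets


ids = [token_id(t) for t in ["polish", "sausage"]]
```

Two different tokens can in principle collide on the same id; with a large id space that's rare, but giving each token its own id avoids it entirely.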