Simon Wistow (deflatermouse) wrote,

Put my thang down flip it and reverse it

The most common form of search engine is what's called an Inverted Index. The concept is very simple - you take a document, break it up into tokens (although, as per my previous post, that's not as simple as you'd think), give the document an id, and then go to the inverted index and note that document id next to every token.

So, for example, say the document contains the phrase
the quick brown fox
and we give it the document id 1. Dropping stop words like "the", we index
brown: 1
fox: 1
quick: 1
Then we get two new documents
the quick and the dead
as cunning as a fox
and give them the ids 2 and 3 respectively.

Now the index looks like
brown: 1
cunning: 3 
dead: 2
fox: 1, 3
quick: 1, 2
and you repeat with any other documents you have lying around.

Then to search, all you need to do is take the query and tokenise it. Let's say the query is "foxy brown" and it gets tokenised (with a little stemming) to "brown, fox". Then we go through each token, look up the document ids in the inverted index and keep a running score. In this case
1: 2
3: 1
i.e. document 1 has a score of two and document 3 has a score of one. Clearly document 1 is the best match.
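The whole scheme above fits in a few lines of Python. This is a minimal sketch - the stop-word list and the split-on-whitespace tokeniser are, as noted earlier, vastly simplified stand-ins for the real thing:

```python
# Minimal inverted index: token -> set of document ids, with the
# running-score search described above (one point per matching token).
from collections import defaultdict

STOP_WORDS = {"the", "and", "a", "as", "is", "for"}

def tokenise(text):
    # Naive tokeniser: lowercase, split on whitespace, drop stop words.
    return [t for t in text.lower().split() if t not in STOP_WORDS]

class InvertedIndex:
    def __init__(self):
        self.index = defaultdict(set)   # token -> {doc_id, ...}

    def add(self, doc_id, text):
        for token in tokenise(text):
            self.index[token].add(doc_id)

    def search(self, query):
        scores = defaultdict(int)       # doc_id -> running score
        for token in tokenise(query):
            for doc_id in self.index.get(token, ()):
                scores[doc_id] += 1
        # Best-scoring documents first.
        return sorted(scores.items(), key=lambda kv: -kv[1])

idx = InvertedIndex()
idx.add(1, "the quick brown fox")
idx.add(2, "the quick and the dead")
idx.add(3, "as cunning as a fox")
print(idx.search("brown fox"))  # → [(1, 2), (3, 1)]
```

Searching for "brown fox" gives document 1 a score of two and document 3 a score of one, exactly as in the worked example.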

So that's inverted indexes in a nutshell. Easy, huh? Well, as ever with this whole search caboodle, it's not quite that easy. Why? Say there are two documents, one containing the phrase "foxy brown is a singer" and the other "The quick brown fox jumps over the lazy dog", and you search for "foxy brown" - you'd prefer the first document, not the second. So order is important. In which case we can store each token's position in the document as a tuple and then weight results whose tokens appear in the same order as the query. So, again, indexing "the quick brown fox" as document 1:
brown: {1, 3}
fox: {1, 4}
quick: {1, 2}
Except, well, do we take the document position before removing stop words or after? And order doesn't always work - if we're searching for "hot grits recipes" and there's a page with "recipes for hot grits" then you're going to prefer it over "saying 'natalie portman hot grits' on Slashdot is a recipe for becoming known as a troll", even though the second (probably irrelevant) page has the words in the right order.
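A positional index is only a small change to the structure: each posting becomes a (doc_id, position) tuple. In this sketch positions are counted over the raw token stream, i.e. before stop-word removal, which is why "quick" lands at position 2 as in the example above - the other convention works too, as long as indexing and querying agree:

```python
# Positional index: token -> list of (doc_id, position) postings.
# Positions count the raw token stream, so dropped stop words still
# "use up" a position.
from collections import defaultdict

STOP_WORDS = {"the", "and", "a", "as"}

positions = defaultdict(list)  # token -> [(doc_id, position), ...]

def add(doc_id, text):
    for pos, token in enumerate(text.lower().split(), start=1):
        if token not in STOP_WORDS:
            positions[token].append((doc_id, pos))

add(1, "the quick brown fox")
print(dict(positions))
# → {'quick': [(1, 2)], 'brown': [(1, 3)], 'fox': [(1, 4)]}
```

With positions stored, a scorer can check whether a document's matching tokens occur in the same order (or close together) and boost it accordingly.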

So another thing we can try is the distance between the words of the query. Actually what we do is take the cosine similarity between the query vector and the document vector, which is an application of a vector space model (of which more later). It's all explained here so I won't go into it.
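For the curious, here's a toy version of that vector space idea: documents and queries become term-count vectors, and we rank by cosine similarity (1.0 means the vectors point the same way, 0.0 means no terms in common). Real systems weight the counts (tf-idf and friends), but the raw counts show the shape of it:

```python
# Toy vector space model: term-count vectors plus cosine similarity.
import math
from collections import Counter

def cosine(a, b):
    # Dot product over the union of terms, divided by the vector norms.
    terms = set(a) | set(b)
    dot = sum(a[t] * b[t] for t in terms)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

doc1 = Counter("foxy brown is a singer".split())
doc2 = Counter("the quick brown fox jumps over the lazy dog".split())
query = Counter("foxy brown".split())

print(cosine(query, doc1), cosine(query, doc2))
# doc1 scores higher: it shares both "foxy" and "brown" with the query
```

Which gives us the ranking we wanted above: "foxy brown is a singer" beats the quick brown fox.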

The thing is that it's even more complicated than that - say, for example, you're crawling the web and indexing pages (I hear that there are a couple of companies doing that these days) - would you consider the phrase "Quick Foxes" in the title of a page more relevant than in the main body? Do you boost it if it's marked up as a headline, à la <h1>Quick Foxes</h1>? Does nearer the 'top' of the page make it more valuable than nearer the bottom? All good questions.

What Google did was use PageRank as an attempt to post-filter pages. How it works is you select a working set using your inverted index (and I think what a lot of people don't realise is that Google, as well as the better-known PageRank system, also had a blazingly fast inverted index which helped) and then sort that working set using PageRank.

PageRank itself is based on a summation formula, i.e. you calculate the PageRank score of a page Pi and then do it again based on the new value (and the new values of the pages linking to it) for n iterations. This in turn can be represented as a matrix - a sparse matrix, in fact (let's call it H) - which is good because it's much less effort to calculate the summation for a sparse matrix than for a dense one: O(nnz(H)) (where nnz(H) is the number of non-zero values in the matrix) as opposed to O(n²).

H in turn looks like a stochastic transition probability matrix for a Markov chain, which means you can bring a whole load of maths to bear on the problem. Which is good, because it's a biggie - it's been said (by Cleve Moler, the creator of MATLAB) that PageRank is "the world's largest matrix calculation". That was back in 2002, when the matrix was only of order 2.7 billion. It's significantly larger now.
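At toy scale the iteration is easy to see. Here's a power-iteration sketch on a hypothetical four-page link graph (the graph, the 50-iteration cutoff and the standard 0.85 damping factor are my choices, not anything from Google): each round, every page passes 85% of its rank out along its links in equal shares, plus a small "teleport" share that keeps the chain well-behaved so the iteration converges:

```python
# Power-iteration PageRank on a tiny hypothetical link graph.
links = {          # page -> pages it links to
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
pages = sorted(links)
n = len(pages)
d = 0.85                      # damping factor
rank = {p: 1.0 / n for p in pages}

for _ in range(50):           # iterate until (approximately) converged
    new = {p: (1 - d) / n for p in pages}   # teleport share
    for p, outs in links.items():
        share = d * rank[p] / len(outs)     # rank split across out-links
        for q in outs:
            new[q] += share
    rank = new

print(sorted(rank, key=rank.get, reverse=True))
# "c" comes out on top: it collects links from a, b and d
```

The real H is sparse because each row only has non-zero entries for the handful of pages a given page actually links to - which is what makes the O(nnz(H)) figure above such a win over the dense O(n²) case.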

And this leads to a problem - you can't recompute that matrix in real time; it's too big. This means that you can only recompute it every so often (I think the figure was once every 40 days), an event known colloquially by SEOs (Search Engine Optimisers) as "The Google Dance". However, a major downside is that this means your index isn't very fresh. And this displeases the bloggers, who are mighty and opinionated and require their egos soothed and the nurturing reassurance of a search engine listing. To be honest I have no idea how The Big G solved this - my guess would be to collect a subset of the internet and recompute that much smaller matrix daily, then combine the score somehow with the larger, slower matrix. Or maybe they figured out a way to only partially recompute the big matrix. Maybe the pigeons just got smarter. Who knows? Googlers are notoriously secretive - I heard that once a Googler roundhouse kicked someone to death because they accidentally witnessed an impromptu dinner-table Google Maps demo the evening before launch.

For those with more interest there's a fantastic book called PageRank and Beyond by Amy N. Langville and Carl D. Meyer that goes into a lot of detail about PageRank and the HITS system, which is remarkably similar (although HITS cares about both inbound and outbound links rather than just inbound). It's very maths-heavy but worth the read if you're that way inclined.

That said, during my time at Yahoo! I was led to believe that the current "PageRank" system is actually a series of scoring functions that add a weighting to each potential hit from the inverted index. One of these is the citation-based system commonly known as PageRank, but there are many more. Occasionally the power of each weighting is updated. I have no idea if this is true. It sounds plausible though, and if that's good enough for Wikipedia then it's good enough for me.

So, lurching unsteadily back to LJ and search - basically we have a pretty good system of scoring based on word position, but it'd be nice if we could create some sort of post-sorting function à la Google based on the author of the post, how popular they are, the popularity of (i.e. the number of replies to) their comments (although this could help trollers), and whether they're on your friends list or not. But this will have to wait for later.
Tags: inverted indexes, more than you ever wanted to know, pagerank, search
