Simon Wistow (deflatermouse) wrote,

Pesky Real World!

Occasionally I see people complaining that they can't search for "foo bar" in a search engine and only get back pages with that phrase in.

Want to know why? Ok, to have that ability they'd have to have an inverted index with every n-gram of words on every web page they index.

Yahoo! and Google have been having a pissing fight about whose index of pages is bigger, but let's be conservative and go with Yahoo!'s 20 billion objects, of which they claim 19.2 billion are web pages. Let's pretend they're lying and that it's only 10 billion pages.

Now, let's be conservative and pretend that, on average, a web page has 150 words. That makes, err, 150 factorial, or 5.71 times 10 to the power of 262, n-grams. Or, in scientific notation, 5.71E262.

The average character length of an English word is 5 and the average length (i.e. number of words) of an n-gram will be 75 so, assuming 1 byte per character, the average size of an index for a page will be

    5.71E262 x 5 x 75 x 1

which equals 2.14E265. To get the size of the index for every page (in Terabytes) we have to do

    (1E13 x 2.14E265)/(1024 x 1024 x 1024 x 1024)

which equals

    1.95E266

or, ironically, about 2E166 googol Terabytes.
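As a rough check, the arithmetic above can be reproduced in a few lines of Python. This is only a sketch: it assumes the 5.71E262 figure is 150 factorial (which it matches), and it uses the 1E13 page count from the formula as written, even though 10 billion pages would be 1E10 (which would shrink the total a thousandfold). The constant names are made up for illustration.

```python
import math

# Figures taken from the post; the names are illustrative only.
WORDS_PER_PAGE = 150
AVG_WORD_CHARS = 5
AVG_NGRAM_WORDS = 75
BYTES_PER_CHAR = 1
PAGES = 1e13                 # as used in the formula (10 billion would be 1e10)
BYTES_PER_TB = 1024 ** 4

# Assumption: the post's n-gram count per page is 150 factorial.
ngrams_per_page = math.factorial(WORDS_PER_PAGE)

# Index bytes for one page, then Terabytes for every page.
bytes_per_page = ngrams_per_page * AVG_WORD_CHARS * AVG_NGRAM_WORDS * BYTES_PER_CHAR
total_tb = PAGES * bytes_per_page / BYTES_PER_TB

print(f"n-grams per page: {ngrams_per_page:.2e}")
print(f"bytes per page:   {bytes_per_page:.2e}")
print(f"total Terabytes:  {total_tb:.2e}")
print(f"in googols:       {total_tb / 1e100:.2e}")
```

Swapping PAGES for 1e10 is the quickest way to see how little the page count matters next to the factorial.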

A 1TB LaCie Big Disk comes in at about 6.7 x 1.7 x 10.6 in, or approximately 120.7 cubic inches. Given that the volume of the Earth in cubic inches is

    4/3 x PI x  3963^3 x 5280^3  x 12^3

or 6.63E25 cubic inches, then to have an inverted index for every page we'd need a room the size of 3.55E242 Earths.
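The Earth comparison can be sketched the same way. The snippet assumes one 1TB disk per Terabyte of index, takes the total as roughly 1.95E266 Terabytes (what the formula (1E13 x 2.14E265)/1024^4 evaluates to), and uses the 3963 mile radius from the formula above. The variable names are illustrative.

```python
import math

# Disk dimensions in inches, as quoted in the post.
DISK_CUBIC_INCHES = 6.7 * 1.7 * 10.6

# Earth radius in miles, converted to inches (5280 feet x 12 inches per mile).
EARTH_RADIUS_MILES = 3963
INCHES_PER_MILE = 5280 * 12

# Volume of a sphere: 4/3 * pi * r^3.
earth_cubic_inches = 4 / 3 * math.pi * (EARTH_RADIUS_MILES * INCHES_PER_MILE) ** 3

# Assumption: total index size in Terabytes, from the earlier formula.
INDEX_TB = 1.95e266

earths = INDEX_TB * DISK_CUBIC_INCHES / earth_cubic_inches

print(f"disk volume:  {DISK_CUBIC_INCHES:.1f} cubic inches")
print(f"Earth volume: {earth_cubic_inches:.2e} cubic inches")
print(f"Earths:       {earths:.2e}")
```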

Ok, let's try and be a little more sensible and go with only n-grams of three words or fewer, of which there should be 50 x 75 x 150 (1312500) on each page. The new average n-gram length would then be 2.

    1312500 x 5 x 2 x 1

which means a measly 1.31E7 bytes of index per page and 1.19E8 Terabytes for the whole index.

Of course, this will still take up 1.44E10 cubic inches or, to put it another way, 52 million imperial (proper) gallons, which is, apparently, approximately the amount of water used in each transit of the Panama Canal.
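Here is the same back-of-the-envelope sum for the three-word case, as a sketch only. Note that 50 x 75 x 150 multiplies out to 562500 rather than 1312500; the sketch sticks with 1312500 because the figures that follow are derived from it. It also assumes 1E13 pages, 120.7 cubic inches per Terabyte of disk, and 277.42 cubic inches per imperial gallon (that last conversion is not from the post).

```python
# Figures from the post; names are illustrative only.
NGRAMS_PER_PAGE = 1_312_500      # as quoted, not the product 50 * 75 * 150
AVG_WORD_CHARS = 5
AVG_NGRAM_WORDS = 2
BYTES_PER_CHAR = 1
PAGES = 1e13
BYTES_PER_TB = 1024 ** 4
DISK_CUBIC_INCHES_PER_TB = 120.7
CUBIC_INCHES_PER_IMPERIAL_GALLON = 277.42   # assumption, 4.54609 litres

bytes_per_page = NGRAMS_PER_PAGE * AVG_WORD_CHARS * AVG_NGRAM_WORDS * BYTES_PER_CHAR
total_tb = PAGES * bytes_per_page / BYTES_PER_TB
cubic_inches = total_tb * DISK_CUBIC_INCHES_PER_TB
imperial_gallons = cubic_inches / CUBIC_INCHES_PER_IMPERIAL_GALLON

print(f"bytes per page:   {bytes_per_page:.2e}")
print(f"total Terabytes:  {total_tb:.2e}")
print(f"cubic inches:     {cubic_inches:.2e}")
print(f"imperial gallons: {imperial_gallons:.2e}")
```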

You could probably fake it using Latent Semantic Indexing to get a list of candidate pages and then weed out the ones which a simple grep shows don't have the phrase in. It wouldn't be perfect but it would give the illusion of working.

Slightly more doable might be to split the search phrase into n-grams of length=3 and then find all the pages with all those n-grams in and then do the grep. Maybe.
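A minimal sketch of that last idea, assuming whitespace tokenisation, an in-memory dict as the "index" and a plain list comparison standing in for grep. All the function names and sample pages here are hypothetical; a real engine would look nothing like this.

```python
from collections import defaultdict


def tokenize(text):
    """Lower-case and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def ngrams(words, n=3):
    """Return every contiguous run of n words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_index(pages, n=3):
    """Map each n-gram to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for gram in ngrams(tokenize(text), n):
            index[gram].add(page_id)
    return index


def contains_phrase(words, phrase_words):
    """The 'grep' step: is phrase_words a contiguous run inside words?"""
    m = len(phrase_words)
    return any(words[i:i + m] == phrase_words for i in range(len(words) - m + 1))


def phrase_search(phrase, pages, index, n=3):
    """Intersect the postings of the phrase's n-grams, then verify each candidate."""
    phrase_words = tokenize(phrase)
    if not phrase_words:
        return []
    grams = ngrams(phrase_words, n)
    if grams:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # Phrase shorter than n words: no n-grams to look up, so check every page.
        candidates = set(pages)
    return sorted(p for p in candidates
                  if contains_phrase(tokenize(pages[p]), phrase_words))


pages = {
    "a": "the quick brown fox jumps over the lazy dog",
    "c": "quick brown fox jumps over nothing",
    # Has both 3-grams of the query, but never the whole phrase in one run.
    "d": "brown fox jumps while the quick brown fox sleeps",
}
index = build_index(pages)
results = phrase_search("quick brown fox jumps", pages, index)
print(results)
```

Page "d" survives the n-gram intersection and is only thrown out by the final check, which is why the grep step is still needed.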

