October 14th, 2005

diesel, learning, evil, sweeti

c00k1ng h4XX0rs

Returning from Montreal and hopped up on jetlag and melatonin, I had an idea. london.food shows that there's at least a healthy overlap between geeks and cooking. Indeed cooking is a lot like hacking - it's creative but technical, spanning the gap between engineering and art. You can cargo cult by just blindly following the recipe or you can understand why you mix butter and oil together. And, of course, there's more than one way to do it.

So, my idea is to pitch a Cooking Hacks book that is part tutorial, part recipe book and part list of useful tips such as "If your dish is too salty then you can add milk", "whisk some butter into a sauce to make it glossy", "if you've got no sour cream add lemon juice and baking soda to normal cream", "to get griddle marks simply ..."

It appears to have some traction among the other denizens of the list so I'm going to plonk myself down in a pub with some of the more enthusiastic suspects and try and bash out a sample contents page and ideas for a chapter.

And, of course, that would another thing of my list of 100 things to do

diesel, learning, evil, sweeti

Pesky Real World!

Occasionally I see people complaining that they can't search for "foo bar" in a search engine and only get back pages with that phrase in.

Want to know why? Ok, to have that ability they'd have to have an inverted index with every n-gram of words on every web page they index.

Yahoo! and Google have been having a pissing fight about whos index of pages is bigger but let's be conservative and go with Yahoo!'s 20 billion objects of which they claim 19.2 billion of that is web pages. Let's pretend they're lying and that it's only 10 billion pages.

Now, let's be conservative and pretend that, on average a web page has 150 words on a page. That makes, err, 5.71 times 10 to the power of 262 n-grams. Or, in scientific notation 5.71E262.

The average character length of an English word is 5 and the average length (i.e number of words) of an n-gram will be 75 so, assuming 1 byte per character, the average size of an index for a page will be

    5.71E262 x 5 x 75 x 1

which equals 2.14E265. To get the size of the index for every page (in Terabytes) we have to do

    (1E13 x 2.14E265)/(1024 x 1024 x 1024 x 1024)

which equals


or, ironically, about 2E66 (also known as 2 undecillion) googolplex Terabytes.

A 1Tb LaCie Big Disk weighs in at about 6.7x1.7x10.6 in, or 120.7 cubic inches (approximately). Given that the volume of the earth in cubic inches is

    4/3 x PI x  3963^3 x 5280^3  x 12^3

or 6.62E25 cubic inches, then to have an inverted index for every page we'd need a room the size of 3.55E242 Earths.

Ok, let's try and be a little more sensible and go with only n-grams of three words or less of which there should be 50 x 75 x 150 (1312500 ) on each page. The new average n-gram length would then be 2.

    1312500 x 5 x 2 x 1

Which means a measly 1.31E7 bytes index per page and 1.19E8 Terabytes for the whole index.

Of course this will still take up 1.44E10 cubic inches, or to put it another way, 52 million imperial (proper) gallons which is, apparently, approximately the amount of water used in each transit of the Panama canal.

You could probably fake it using Latent Semantic Indexing to get a list of candidate pages and then weed out the ones which a simple grep shows doesn't have the phrase in. It wouldn't be perfect but it would give the illusion of working.

Slightly more doable might be to split the search phrase into n-grams of length=3 and then find all the pages with all those n-grams in and then do the grep. Maybe.