So, as previously mentioned inverted indexes have their problems - the chief one being that, well, they're quite dumb and really have no idea what you're searching for. So if you search for the phrase "kitty litter" you are unlikely to get any results back that mention "cat litter" even though, really, you should.
One way round that would be to have some sort of gigantic thesaurus and/or list of synonyms but in reality that probably wouldn't work very well since words can mean more than one thing or can have nuances you're probably looking at getting a lot of false positives. Which is worse than getting fewer results. Also it requires a lot of language specific knowledge and ... well, it's a pain in the arse to be frank.
But, that information is out there. The speaker of a language builds up a mental map of what words go together and we can even imply that a word we haven't heard probably means something like another one just from context - maybe there's a way to extract that information from our document corpus. And this is where LSI comes in. In essence what LSI does is stop thinking about what words documents have in them and starts thinking about what they're actually about. "MAGIC!" you say. "Surely some deeply scary Minsky-esque AI voodoo is going on here. Machines that read and think and understand! Lions lying down with Lambs. Pass me my jetpack and my flying car because surely the next step is automatons that will do my work for me!".
Sadly, no. In actuality LSI works despite (actually, probably because) it's a bit dumb. It's like the Idiot Savant of Search Engines. It doesn't know what it's doing it just does it.
The first form of LSI I'm going to talk about is Vector Space Engines. Of course, Wikipedia has the goods but there's a great article on Perl.com about building Vector Space Engines by the rather brilliant Maciej Ceglowski whose name keeps popping up all over the place when you talk about this stuff. For all you bigots, snobs and ignoramuses who are turning up your nose due to the fact that I've mentioned the word Perl - get over it. It's a really good article and you don't need to understand Perl to understand it. Also, Ruby's too slow and Python smells of wee.
The Cliff's Notes version is this - you tokenise, stem and generally munge your document and then you use each of those tokens as a value along an axis unique to that token (which probably appears in multiple documents). What you end up is a n-dimensional space (where n is the number of unique tokens in your corpus) with documents at points plotted in that space. In turn each document has a vector from the origin with a magnitude (i.e how many times each word occurs in the document), and a direction (i.e which words were in the document, and how common they are).
To do a search you take a query, find its vector and then take cosine distance between that and the vectors of all the documents in the index. As Maciej puts it
"Our intuition is that document vectors with many words in common will point in roughly the same direction, so the angle between two document vectors is a good measure of their similarity. Taking the cosine of that angle gives us a value from zero to one, which is handy. Documents with no words in common will have a cosine of zero; documents that are identical will have a cosine of one. Partial matches will have an intermediate value - the closer that value is to one, the more relevant the document."
Alternatively you can take an existing document's vector and compare it and find "more documents like these" for no extra cost. Nifty!
So how does this help us with the cat/kitty problem? Well, notice the phrase "document vectors with many words in common"? Therein lies the secret. Two articles talking about stuff for felines to crap in will use similar language - "tray", "bloody animal", "smelly", "all over my best shoes" etc etc so even if they use the word "cat" instead of "kitty" they'll still be in the same neighbourhood of our n-dimensional universe. Et voila!
So, why does nobody use Vector Space Engines then? They seem like the bee's knees. The mutt's nuts. The alpha and the omega.
In short ... building the vector is slow. Everytime a new term is added (i.e n is increased) everything has to get recalculated.
It's not completely unworkable of course. Google has the same issue with their PRank matrix (which is also a sparse matrix) but, in reality, it's not really practical.
On the other hand searching is fast. You have to compare your query vector against the vector of every document in the system but that's only O(n) (a different n this time) and can easily be farmed out to parallel machines. Which is nice.
However, let's look at an alternative - Contextual Network Graphs.
Now these are sexy, sexy beasts. If you buy me a drink and get me started on them then I can talk all night (even more than I normally do) about how they can be applied to every problem known to man from Spam to Automatic Document Translation to World Peace to The Schleswig-Holstein problem which Lord Palmerston once described as "so complicated, only three men in Europe have ever understood it. One was Prince Albert, who is dead. The second was a German professor who became mad. I am the third and I have forgotten all about it.”. But I'll refrain from going off on that particular tangent and stick to plain searching.
So, you start off in the same way as you always do with search - tokenising your new document. Then you create a new node in a graph for that document. You then add an edge from your Document Node to the Term Node of each of your tokens (creating a new node if one doesn't already exist). The weight of the edge (if you're feeling fruity) is the number of times the document appears. In fact, if you're in a really daring mood then each node<->node connection can have multiple edges coming in and out represrenting different metrics (where the Term appeared in the document, how rare a word it is etc etc).
Then to search the graph you energise one or more nodes.
Now, if you're searching 'normally' you tokenise the query and then energise the node of each of Term in the query. Each of those Term Nodes gets a certain amount of energy and then it energises the nodes connected to it (with a modification factor based on the weight of the edge). Then each of those energised nodes energises the nodes it's connected to and so on and so on with each iteration passing on a little less energy. You keep doing this until you hit an energy threshold at which point you look at the most energised nodes and these are your results. This is called Spread Activation Energy.
Now, obviously if you're doing a 'normal' search then you energise the Term nodes and pick the top most energised Doc nodes but what you could acutally do is energise a Document Node and find the other most energised Document Nodes (for a "more like this" search) or energise a Doc Node and find the most energised Term nodes (and thus get suggested categories or, for blog posts, tags for your doc which may not even appear in the body of the text) or Term to Term searches (for example we could suggest other search terms to use based on your most recent search or feed in all the interests of LJ users and then based on your current interests suggest other interests you might want to look at).
All using one handy-dandy graph.
And the best thing? Adding a new Document is O(1)!
By now, if you thought Vector Space engines were neat, you may be like me at the moment and be so excited that you've peed a little. It's ok. it's a common reaction to CNGs and is nothing to be ashamed about.
There is just one teeeny weeny eeensy little problem. Ickle. Tiny. Minute. Practically not worth mentioning.
Energising the graph is *mumble* really really slow *mumble*.
The annoying thing is that if we built the graph physically by say having ball bearings with voltmeters attached, wired together using metal cables of different resistances and then energised the graph by electrifying the source nodes and then selecting the most energised nodes for our results then searching would be so close to O(1) as to be practically indistinguishable to the naked eye. All I need is a large warehouse somewhere in the third world and a team of dextrous orphans abseiling down from the ceiling wielding soldering irons.
Sadly those concerned with so called "human rights" and other such liberal wishy washisms would put a stop to such endeavours so the next time you're lamenting the lack of a better search engine blame Amnesty International[*].
There is light at the end of the tunnel though. Moores law can only help (and this helps Vector Space engines as well, I've been playing around with recalculating the vectors using GPUs which are rather fast at doing such things) and people are looking at novel graph partitioning techniques and a few other sneaky tricks so maybe we'll see a proper CNG engine before the end of the decade. Which would be sweeeeeeeeeeet.
In the meantime read the stuff at Knowledgesearch.org which is a team from NITLE and Middlebury College (including Maciej) which has a wealth of material. In particular these slides from an EtCon session are a good primer.
Colophon: my user icon is from this picture by my evil twin Simon 'Other Simon' Batistoni. Despite his suggestion that I'm looking coy I'm actually practicing Avalance Beacon searching with my trusty DTS Tracker. And wearing my LJ hoodie.
Well, I thought it was appropriate and ever so slightly clever. And I don't care what you thing. So there. *sniff*
[*]for those with a sense of humour bypass I'm not actually encouraging this. In fcat you should donate money to Amnesty les we all end up in some gulag somewhere with our human rights (or other orifices) being violated.