Thu, Jul. 26th, 2007, 02:07 pm
I get knocked down, but I get up again

Warning! Achtung! けいせい! Ranting ahead. Much of it rendered somewhat incoherent by some sort of devil lurgy that has infested my mitochondria forcing me to do battle with the twin weapons of tea and sudafed. Bloody West Nile. I currently look very much like I do in this userpic except imagine more of a dressing gown motif. And I'm bleeding from my ears.

Incidentally that photo was taken when I worked for a sweat shop in Taiwan sewing laptops and Playstations for 16 hours a day. They didn't let us have chairs but the Benzedrine was free. I'm actually 6 in that photo. I had pituitary problems.

Anyway, so ranting then. I've had bits of this lying around for a while so I'll paste it here whilst I drink my Lemsip and then slope off back bedwards.

Think of an inverted index as being a database. It doesn't have an SQL interface but it's still a database. In a large scale production environment your databases need several attributes - reliability, scalability and performance.
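For anyone who hasn't met one, here's a minimal sketch of an inverted index in Python. It's a toy that lives in memory and has nothing to do with Lucene's actual file format; the names (`InvertedIndex`, `add`, `search`) are made up for illustration, and the whitespace tokenising is an assumption to keep it short.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy in-memory inverted index: maps each term to the set of
    document ids that contain it. Hypothetical, not Lucene's API."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Naive tokenisation (an assumption): lowercase, split on whitespace.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        # AND semantics: return the ids of documents containing every term.
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = InvertedIndex()
index.add(1, "tea and sudafed")
index.add(2, "tea and lemsip")
print(sorted(index.search("tea")))         # documents containing "tea"
print(sorted(index.search("tea lemsip")))  # documents containing both terms
```

Which is the "database" bit: you write records in, you query them back out, and everything hard (durability, concurrency, scale) is what this toy leaves out.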

It's been all quiet on the search engine front here, for a bunch of reasons too tedious to go into. We've been using parts of Lucene to build our search engine - in particular the file format which is, to be fair, very clever and very fast. It's tightly tuned for speed and size.

Initial tests proved positive but, after a while, we started to notice several things. For a start, Lucene is brittle. It corrupts easily, readily and often unrecoverably. An unclean shutdown can leave stuff in a state in which you don't really know what's actually happened and which operations have gone through and which haven't. You also have to roll your own locking for concurrent writes since writing isn't thread safe.
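By "roll your own locking" I mean something along these lines, sketched here in Python rather than Java. None of this is Lucene's API: `LockedWriter`, `ListWriter` and `add_document` are hypothetical names, and the stand-in writer just appends to a list. It only serialises writes between threads in one process; it does nothing about unclean shutdowns.

```python
import threading


class ListWriter:
    """Hypothetical stand-in for an index writer that isn't thread safe.
    It just appends documents to a list."""

    def __init__(self):
        self.docs = []

    def add_document(self, doc):
        self.docs.append(doc)


class LockedWriter:
    """Wraps a writer so that only one thread can write at a time."""

    def __init__(self, writer):
        self._writer = writer
        self._lock = threading.Lock()

    def add_document(self, doc):
        # The lock is held for the whole write and released on exit,
        # even if the underlying writer raises.
        with self._lock:
            self._writer.add_document(doc)


writer = LockedWriter(ListWriter())
threads = [
    threading.Thread(target=writer.add_document, args=(f"doc-{i}",))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A single lock like this is the bluntest option: every write queues behind every other write, which is fine for correctness and not so fine for throughput.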

There's also a bunch of things you have to do to make it scale.

In short - as far as I understand it, there's a whole load of work to be done to make it enterprise safe and ready. To be fair Lucene isn't an out of the box search engine - it's a search engine building toolkit - but I'm starting to wonder if it can ever be made to work reliably at the level that a very large site needs. This is not to knock Lucene, you understand; designing stuff for small websites has very different parameters from designing for large websites.

There's potential hope however - I have been playing around with Compass, which may have solved several of my issues. I hope. The problem is that it suffers from Java Documentation Syndrome.

It's possible that I'm spoiled by the documentation on CPAN. It's become fashionable to knock CPAN recently - decrying the quality of code (despite the fact that its openness is exactly what made it so successful) or saying that it's hard to install packages (although it is, in this respect, no different from any other package manager) or that certain packages need half of CPAN to install, because it engenders a culture in which every last thing is put into a separate package and turned into a dependency.

But it has a tonne of modules that just save time, it encourages sharing of code (in an educational sense, in a Free software sense and in a stylistic "code should be modular" sense), and it encourages (through tradition rather than by mechanical means) good documentation (on the whole) and wholesale unit testing. The thing I love most about CPAN is the synopsis section of a module's docs - when it's done well, half the time all I ever need to do is copy and paste that bit, change some variables and be done. Worst case scenario it provides an entry point into the module - you get the basic bit running and work from there.

Compare and contrast with the Java documentation I've come across, which has been almost uniformly awful. Maybe it's just me. Maybe I've just been unlucky. Maybe it's just Open Source Java projects. Documentation seems to be split into 4 parts - written docs, javadoc for the API which amounts to little more than pre and post conditions for methods (how many times have I gone to look at docs for an example and found public static void main(String[] args) as the only docs), a link to some JIRAs and a Wiki.

Part of the problem is that where in Perl everything's turned into a library, in Java everything's turned into a Framework. Do it MY WAY or don't do it at all. YOU MUST CONFORM.

There often doesn't seem to be an obvious 'main class' to use - I even found this with an Rsync library. When there is documentation it seems to assume that you're going to be running with something else - Ant, Maven, NetBeans, Beans (which I believe is different from NetBeans), J2EE, that you're going to want XML configuration files (sometimes there's a link to "programmatic configuration" which usually says "Do exactly the same as in the XML file". Err, ok. How do I put attributes in then?), that you're going to want to use all of it or nothing. Architectural diagrams and high level overviews are for wimps, commies and those that don't have the money to spend on Java consultants.

Maybe I'm just not used to the Java style and I'm rubbing up against my own expectations. Maybe I'm just grumpy from irregular sleep and vivid, sweaty but strangely abstract dreams. Maybe I'm right and Java documentation just sucks.

Thu, Jul. 26th, 2007 06:51 pm (UTC)

Good rant, and true. CPAN has spoiled you. It's the epitome of how good open source documentation can be. Java comes from an enterprisey world where you get paid the same regardless of your results. In other words, no impetus to JFDI.

I'm dipping my toes into Python at the moment, and finding a strangely divided community: lots of it feels very Java, with endless flashily-named projects for doing quite minor things. But next door, there are projects like Django which really do seem to JFDI.

And for heaven's sake, don't read Javadocs when you're ill. You'll just make it worse.

Tue, Aug. 21st, 2007 06:14 pm (UTC)
searchtools: lucene et al.

Have you seen Solr? It's based on Lucene and I think makes it more robust, but the near-realtime index update demands may be too much for it. Compass looks very interesting!

Have you considered starting with a fairly simple fulltext search implementation, tuned for short queries, and then getting more elaborate with stemming and context later?