April 22nd, 2005

diesel, learning, evil, sweeti

trimming bloglines

After a brief discussion with someone about so-called 'folksonomies' (euugh), and whether duplicate tags were leading to bad data, someone else mentioned that you can download the current 10,716 Bloglines categories.

So I threw together various Lingua::EN::* modules and got this, which takes a few seconds to run over the whole thing and trims the list by a healthy 2000-odd items (removing the stemming stage means you only trim 1520).
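For anyone who wants the general shape of the idea without the Perl, here is a rough sketch in Python. It is not the script mentioned above: the function names are made up for illustration, and `crude_stem` is a deliberately simple stand-in for a proper stemmer, so the numbers it produces on the real list would differ.

```python
import re


def crude_stem(word):
    # Hypothetical stand-in for a real stemmer: only handles a few plural endings.
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalise(tag, stem=True):
    # Lowercase, then treat any run of non-alphanumeric characters as a word break.
    words = re.sub(r"[^a-z0-9]+", " ", tag.lower()).split()
    if stem:
        words = [crude_stem(w) for w in words]
    return " ".join(words)


def trim(tags, stem=True):
    # Keep the first spelling seen for each normalised key, in input order.
    seen = {}
    for tag in tags:
        key = normalise(tag, stem)
        if key:  # drop tags that contain no letters or digits at all
            seen.setdefault(key, tag)
    return list(seen.values())


if __name__ == "__main__":
    sample = ["Blogs", "blog", "Web-Design", "web design",
              "Technology", "technologies", "News"]
    print(len(sample), "->", len(trim(sample)), "with stemming")
    print(len(sample), "->", len(trim(sample, stem=False)), "without stemming")
```

Passing `stem=False` skips the stemming stage, which is the same kind of comparison as the 2000-odd versus 1520 figures above: case and punctuation folding catches some duplicates on its own, and stemming catches the plural and inflected ones on top.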

If I could get hold of an ISO format English thesaurus then I reckon I could probably trim an extra 1000-odd off.