
Thu, Feb. 8th, 2007, 11:20 pm
Speak to me ... of universal laws

Continuing in the tradition of posting to my LJ about work rather than, you know, actually doing said work. As such I think it's time that you and I sat down and had a little talk about languages. Now, there's no need to be embarrassed - everybody's got a language whether they want to or not. Some people even have more than one. But we point and laugh at them and call them Polyglots, which is actually a pretty funny word when you come to think of it.

So how does this fit in with searching? Well, pull up a chair and I'll tell you.



Almost every search engine will have something called an Analyser (note my use of an 's' not a 'z'. This is because I hail from Albion and therefore stubbornly refuse to use your new-fangled American spellings). An Analyser takes the incoming text and splits it into tokens (or words to you and me). Then it normalises them. Basic normalisations include lower-casing all words and then removing stop words - words which are so common that they're useless to search on, like 'and', 'or', 'a' etc.
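
In toy Python it looks something like this (the function names and the stop list here are made up for illustration, not our actual code):

    import re

    # A deliberately tiny stop list - real analysers ship with much bigger ones.
    STOP_WORDS = {"a", "an", "and", "or", "the", "of", "to"}

    def analyse(text):
        """Tokenise, lower-case and drop stop words."""
        tokens = re.findall(r"\w+", text.lower())   # crude word splitter
        return [t for t in tokens if t not in STOP_WORDS]

    print(analyse("THE quick BRoWN foX jumps OVer tHe Lazy Dog"))
    # -> ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']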

The text coming in can either be the document or the query but it's important you use the same analyser on both. The point of the normalisation is to try and increase the number of matches you get so that, for example, if a document contained the words
    THE quick BRoWN foX jumps OVer tHe Lazy Dog

and you searched for
    quick fox

you'd still get a hit even though your capitalisation was different.
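
With the toy analyser above, that match falls out of a simple set test:

    doc   = set(analyse("THE quick BRoWN foX jumps OVer tHe Lazy Dog"))
    query = set(analyse("quick fox"))
    print(query <= doc)   # -> True: every query token appears in the document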

Other common forms of normalisation include stemming (turning 'jumps' and 'jumping' into 'jump') and number conversion (turning 'four hundred' into '400' or vice versa although then you might start running into issues with phone numbers etc).
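
A stemmer can be as crude as chopping known suffixes off the end - a toy sketch, purely to show the idea (real engines use a proper algorithm such as Porter's):

    # Toy suffix-stripping stemmer - not how a production stemmer works.
    SUFFIXES = ("ing", "ed", "s")

    def stem(token):
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                return token[:-len(suffix)]
        return token

    print(stem("jumps"), stem("jumping"), stem("jumped"))
    # -> jump jump jump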

The balance you're trying to find is getting enough results so that the search is useful (you'd be pretty annoyed if the document said 'quick brown fox jumping over lazy dog' and a search for 'jump fox' didn't match) without compromising quality (as an example, a hypothetical normaliser could remove everything except the first letter of each word, which would mean that 'quick brown fox' would match the query 'fire' - which would be useless).

So - all's well and good eh? Well, not quite. Not all languages are as 'easy' as English. Let's take, for example, Japanese.

Japanese has 3 "alphabets" - Kanji, Hiragana and Katakana. Kanji is logographic and a single glyph describes a whole word so that this

蚕

means 'silkworm'. However Kanji can be decomposed into the second alphabet which is Hiragana. Hiragana has 46 syllables (plus various modifiers) and the Hiragana for 'silkworm' is

かいこ



Or kaiko (ka i ko).

Similarly sushi (su shi) is:

すし



The third "alphabet" - Katakana - is also syllabic. In fact it's the same 46 syllables that Hiragana has. However it's used to spell out foreign words, so 'computer' is spelt

コンピューター



or 'ko n pyuu taa' (the 'pu' プ being a 'fu' フ with a 'maru' - a little circle - on the top right).

Bizarrely the word 'stapler' is

ホッチキス



or 'hotchikisu', probably because a brand of staplers imported into Japan was made by the E. H. Hotchkiss Company.

Learning Hiragana and Katakana doesn't take very long - a couple of weeks of easy study (I used index cards with the Japanese pasted on one side and the English written on the other, and just tested myself on the way to and from work) - and is well worth the effort (well, if you're a huge nerd like me).

But I digress.


So - back to search. What does that brief diversion into Japanese have to do with it? Well, how do we search for Japanese stuff? Theoretically someone could search for silkworm in three ways - Kanji, Hiragana, or western characters ('kaiko'), also called Romaji. Well - it seems clear - we decompose Kanji to Hiragana then turn Hiragana and Katakana into Romaji.

And how do we do that?

Well, for the Kanji we can use a lookup table - there are 3000 or so common Kanji so that's not too big. However there's a problem - there are 3 main ways of romanising Japanese characters: Kunrei, Nihon, and Hepburn, each with various advantages and disadvantages.
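
In sketch form the lookup approach might look like this (with deliberately tiny tables - a real one needs the 3000-odd common Kanji, the full kana set, and some way of choosing between a Kanji's multiple readings, which this happily ignores):

    # Tiny illustrative tables - nowhere near complete.
    KANJI_TO_KANA = {"蚕": "かいこ"}                       # silkworm
    KANA_TO_ROMAJI = {"か": "ka", "い": "i", "こ": "ko",
                      "す": "su", "し": "si"}              # Kunrei writes し as 'si'

    def to_romaji(text):
        # decompose any Kanji we know about into Hiragana...
        for kanji, kana in KANJI_TO_KANA.items():
            text = text.replace(kanji, kana)
        # ...then transliterate the kana character by character
        return "".join(KANA_TO_ROMAJI.get(ch, ch) for ch in text)

    print(to_romaji("蚕"))    # -> kaiko
    print(to_romaji("すし"))  # -> susi under Kunrei (Hepburn would give 'sushi')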

But let's say we stick with Kunrei. Now all we need to deal with is numbers. Numbers should be easy, right? Well, no - whilst the bare numbers are fairly easy (ichi, ni, san), counting things is a different kettle of fish. Different things in Japanese take different counters depending on their nature, so the first three counts with hon (long, cylindrical objects - trees, pens etc.) are ip-pon, ni-hon, san-bon but the first three with nin (people) are hitori, futari, san-nin.

Oh, and there's politeness levels too -

どちら



or dochira is the polite form of saying

どく



or doku or 'where' ("Which country are you from?" is "O-kuni wa dochira desuka").

So, basically - it's hard. Very hard.

And just to make matters worse - there are no spaces between words. So even tokenising is hard.
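
One common dodge is greedy longest-match against a dictionary - here's a toy version with a silly little lexicon, purely to show the shape of it:

    # Greedy longest-match segmentation against a (tiny, made-up) lexicon.
    LEXICON = {"かいこ", "は", "すし", "たべる"}   # silkworm, topic particle, sushi, to eat

    def segment(text):
        tokens, i = [], 0
        while i < len(text):
            # take the longest dictionary word starting at position i
            for j in range(len(text), i, -1):
                if text[i:j] in LEXICON:
                    tokens.append(text[i:j])
                    i = j
                    break
            else:
                tokens.append(text[i])   # unknown character - emit it on its own
                i += 1
        return tokens

    print(segment("かいこはすしをたべる"))   # 'the silkworm eats sushi'
    # -> ['かいこ', 'は', 'すし', 'を', 'たべる']   (を falls out as an unknown single character)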

"Ah!" you cry "But western languages are easy, no?".

Well, consider the French. Est-ce que vous and Est-ce que tu mean the same thing but have different levels of familiarity.

But, more importantly, think of German.

German has many, many compound words. Compound words are, basically, two or more nouns stuck together to form another noun. English has them - 'doghouse' is one for example. German takes it to extremes though. For example

Donaudampfschifffahrtsgesellschaftskapitänsunterhosen



which famously means "the underpants of a captain of the Danube steam ship company".

Less extreme - Kinder means children, Wagen means car, krank means ill and Haus means house. So - Wagenhaus means garage, Krankenhaus means hospital, Krankenwagen means ambulance and Kinderkrankenhauskrankenwagenhaus would mean "the garage for the ambulances of the children's hospital".
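
Decompounding can again be sketched as a dictionary problem - a toy recursive splitter over the words above (a real one needs a proper lexicon and smarter handling of the joining letters):

    # Toy dictionary-driven decompounding. The lstrip("s") lets it skip a
    # joining 's' (the 'Fugen-s', as in ...gesellschafts...).
    VOCAB = {"kinder", "kranken", "haus", "wagen"}

    def decompound(word):
        word = word.lower()
        if word in VOCAB:
            return [word]
        for i in range(1, len(word)):
            head, tail = word[:i], word[i:]
            if head in VOCAB:
                rest = decompound(tail.lstrip("s"))
                if rest:
                    return [head] + rest
        return None   # can't split it - index the whole word as-is

    print(decompound("Krankenwagen"))
    # -> ['kranken', 'wagen']
    print(decompound("Kinderkrankenhauskrankenwagenhaus"))
    # -> ['kinder', 'kranken', 'haus', 'kranken', 'wagen', 'haus']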

Once again the poor search engineer is confounded in his simple task to merely tokenise the input.

*sob*

Then add onto that people who have posts in more than one language. Or switch language MID-FRICKIN-POST ...

*double sob*

So if your favourite language isn't supported at launch, or support is a bit flaky, then you'll know why.

For what it's worth it looks like we'll probably be supporting English, Brazilian, Czech, German, Danish, Greek, Spanish, Finnish, French, Italian, Japanese, Korean, Norwegian, Portuguese, Russian, Swedish and Chinese from the start although the quality of each and how they'll actually pan out in production is another matter.

Thu, Feb. 8th, 2007 11:49 pm (UTC)
hachi

I'm happy to say that your link from Japanese to staplers worked perfectly in my mind. I didn't even have to read past the cut for that.

Thu, Feb. 8th, 2007 11:53 pm (UTC)
fetjuel

Hi! This is an interesting post about some confounding stuff.

May I offer some corrections of your Japanese?
- The characters used in Japanese are not really "alphabets"; that is, they don't contain letters representing phonemes. Instead, the kana are syllabaries, and kanji are logograms (not pictograms).
- "Computer" should be コンピューター, "konpyuutaa".
- The familiar form of "where" is どこ, "doko".

I admire your noble quest for good searching!

Fri, Feb. 9th, 2007 12:12 am (UTC)
deflatermouse

>The characters used in Japanese are not really "alphabets"

Yeah, I just figured it was the easiest term to use. You're absolutely right though.

>"Computer" should be コンピューター, "konpyuutaa
>The familiar form of "where" is どこ, "doko".

Whoops, you're right. I stopped my Japanese about a year and a half ago and it's gradually all been slipping away - I should really brush up on it.

Fri, Feb. 9th, 2007 12:12 am (UTC)
jimramsey

Seems like it might be easier to just get everyone to speak English. Or better yet, Esperanto.

Fri, Feb. 9th, 2007 12:22 am (UTC)
xb95

Good luck on that one, dude.

Neat post, Simon!

Sat, Feb. 10th, 2007 01:44 pm (UTC)
peshwengi

Fascinating!

It reminds me of when I was working in the "information retrieval" group at QMUL. They were working on very similar things. Although I wasn't really involved, I was just sticking to my TV project and didn't get into the search stuff...

Mon, Feb. 12th, 2007 08:42 pm (UTC)
osiris_correspo: English-Russian on-line translation

For the second month now I have been asking you to send me an English-Russian on-line translator (with special blog terminology). I am ready to pay for comfortable communication. Thank you in advance.

Wed, Feb. 14th, 2007 08:26 pm (UTC)
deflatermouse: Re: English-Russian on-line translation

Hmm, I'm not sure such a service exists. The only ones I've come across are Babel Fish and Google Translate.

Unfortunately since I'm employed already and also don't speak Russian I won't be much use to you.

Thu, Feb. 15th, 2007 05:22 pm (UTC)
osiris_correspo: Re: English-Russian on-line translation

Thank you! Good idea - to work with Google directly from one's own site.

Thu, Feb. 15th, 2007 12:02 am (UTC)
searchtools

Well, start with exact matches! Don't kill yourself on stemming or lemmatization or any of that stuff until you get exact matches working. And even after, query expansion is a tricky business, ya gotta make sure that the users are along for the ride.

While lightweight pluralization seems to work nicely (people don't even NOTICE that you're doing the right thing), more energetic query expansion can lead in very wrong directions. My classic example of this is stemming "Run, Lola, Run" and getting the movie "Ran". Just... wrong. And the more you push this envelope, the more likely you're going to be wrong and lose user confidence.

Do the straightforward stuff first?