Simon Wistow (deflatermouse) wrote,

Speak to me ... of universal laws

Continuing in the tradition of posting to my LJ about work rather than, you know, actually doing said work. As such I think it's time that you and I sat down and had a little talk about languages. Now, there's no need to be embarrassed - everybody's got a language whether they want to or not. Some people even have more than one. But we point and laugh at them and call them Polyglots which is actually a pretty funny word when you come to think of it.

So how does this fit in with searching? Well, pull up a chair and I'll tell you.



Almost every search engine will have something called an Analyser (note my use of an 's' not a 'z'. This is because I hail from Albion and therefore stubbornly refuse to use your new-fangled American spellings). An Analyser takes the incoming text and splits it into tokens (or words to you and me). Then it normalises them. Basic normalisations include lower-casing all words and then removing stop words - words which are so common that they're useless to search on, like 'and', 'or', 'a' etc etc.
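
Just to make that concrete, here's a minimal sketch in Python of what such an analyser might look like (the stop word list here is a tiny illustrative stand-in, not anything a real engine would ship with):

    import re

    # A tiny illustrative stop word list - real analysers use much longer ones.
    STOP_WORDS = {"the", "and", "or", "a", "an", "of", "to"}

    def analyse(text):
        # Split into tokens, lower-case them, and drop the stop words.
        tokens = re.findall(r"\w+", text.lower())
        return [t for t in tokens if t not in STOP_WORDS]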

The text coming in can either be the document or the query but it's important you use the same analyser on both. The point of the normalisation is to try and increase the number of matches you get so that, for example, if a document contained the words
    THE quick BRoWN foX jumps OVer tHe Lazy Dog

and you searched for
    quick fox

you'd still get a hit even though your capitalisation was different.
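
Running both sides through the sketch analyser above shows why the capitalisation no longer matters:

    doc = analyse("THE quick BRoWN foX jumps OVer tHe Lazy Dog")
    query = analyse("quick fox")

    print(doc)    # ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']

    # Every query token appears in the document's tokens, so it's a hit.
    print(all(token in doc for token in query))    # True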

Other common forms of normalisation include stemming (turning 'jumps' and 'jumping' into 'jump') and number conversion (turning 'four hundred' into '400' or vice versa although then you might start running into issues with phone numbers etc).
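
A real engine would reach for something like the Porter stemmer; a deliberately crude suffix-stripping sketch gets the flavour across:

    def crude_stem(token):
        # Naive suffix stripping - real stemmers are far more careful
        # about when a suffix can safely come off.
        for suffix in ("ing", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[:-len(suffix)]
        return token

    print(crude_stem("jumps"))      # 'jump'
    print(crude_stem("jumping"))    # 'jump'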

The balance you're trying to find is getting enough results so that the search is useful (you'd be pretty annoyed if the document said 'quick brown fox jumping over lazy dog' and a search for 'jump fox' didn't match) without compromising quality (as an example, a hypothetical normaliser could remove everything except the first letter of the word which would mean that 'quick brown fox' would match the query 'fire' which would be useless).

So - all's well and good eh? Well, not quite. Not all languages are as 'easy' as English. Let's take, for example, Japanese.

Japanese has 3 "alphabets" - Kanji, Hiragana and Katakana. Kanji is logographic and a single glyph describes a whole word so that this

蚕

means 'silkworm'. However Kanji can be decomposed into the second alphabet which is Hiragana. Hiragana has 46 syllables (plus various modifiers) and the Hiragana for 'silkworm' is

かいこ



Or kaiko (ka i ko).

Similarly sushi (su shi) is:

すし



The third alphabet - Katakana - is also syllabic. In fact it's the same 46 syllables that Hiragana has. However it's used to spell out foreign words, so 'computer' is spelt

コンピューター



or 'ko n pyuu taa' (the 'pyu' ピュ being a 'hi' ヒ with a 'maru' modifier on the top right, plus a small 'yu' ュ).

Bizarrely the word 'stapler' is

ホッチキス



or 'hotchikisu', probably because a brand of staplers imported into Japan was made by the E. H. Hotchkiss Company.

Learning Hiragana and Katakana doesn't take very long - a couple of weeks of easy study (I used index cards with the Japanese pasted on one side and the English written on the other, and then just tested myself on the way to and from work) - and is well worth the effort (well, if you're a huge nerd like me).

But I digress.


So - back to search. What does that brief diversion into Japanese have to do with it? Well, how do we search for Japanese stuff? Theoretically someone could search for silkworm in three ways - Kanji, Hiragana or western characters ('kaiko'), also called Romaji. Well - it seems clear - we decompose Kanji to Hiragana, then turn Hiragana and Katakana into Romaji.

And how do we do that?

Well, for the Kanji we can use a lookup table - there are 3,000 or so common Kanji so that's not too big. However there's a problem - there are three main ways of romanising Japanese characters: Kunrei, Nihon and Hepburn, each with various advantages and disadvantages.
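
The lookup table idea works for the kana too. A sketch with just a handful of entries shows both the mechanism and where the schemes disagree - し, for instance, is 'shi' in Hepburn but 'si' in Kunrei:

    # A few kana-to-romaji entries; a full table has one per syllable.
    HEPBURN = {"か": "ka", "い": "i", "こ": "ko", "す": "su", "し": "shi"}
    KUNREI = {"か": "ka", "い": "i", "こ": "ko", "す": "su", "し": "si"}

    def romanise(kana, table):
        return "".join(table[ch] for ch in kana)

    print(romanise("かいこ", HEPBURN))    # 'kaiko'
    print(romanise("すし", HEPBURN))      # 'sushi'
    print(romanise("すし", KUNREI))       # 'susi'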

But let's say we stick with Kunrei. Now all we need to deal with is numbers. Numbers should be easy, right? Well, no - whilst the bare cardinal numbers are fairly easy (ichi, ni, san), counting actual objects is a different kettle of fish. Different things in Japanese take different counters depending on their nature, so the first three counts with hon (long, cylindrical objects - trees, pens etc.) are ip-pon, ni-hon, san-bon, but the first three with nin (people) are hitori, futari, san-nin.

Oh, and there's politeness levels too -

どちら



or dochira is the polite form of saying

どこ



or doko or 'where' ("Which country are you from?" is "O-kuni wa dochira desu ka").

So, basically - it's hard. Very hard.

And just to make matters worse - there are no spaces between words. So even tokenising is hard.
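
One classic workaround (and only a workaround - real Japanese tokenisers like MeCab do full morphological analysis) is dictionary-based longest-match segmentation. A toy version, with a two-word vocabulary purely for illustration:

    def max_match(text, vocab):
        # Greedily take the longest dictionary word at each position,
        # falling back to a single character when nothing matches.
        tokens = []
        i = 0
        while i < len(text):
            for j in range(len(text), i, -1):
                if text[i:j] in vocab or j == i + 1:
                    tokens.append(text[i:j])
                    i = j
                    break
        return tokens

    print(max_match("すしかいこ", {"すし", "かいこ"}))    # ['すし', 'かいこ']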

"Ah!" you cry "But western languages are easy, no?".

Well, consider the French. 'Est-ce que vous ...' and 'Est-ce que tu ...' both mean 'do you ...' but have different levels of familiarity.

But, more importantly, think of German.

German has many, many compound words. Compound words are, basically, two or more nouns stuck together to form another noun. English has them - 'doghouse' for example. German takes it to extremes though. For example

Donaudampfschifffahrtsgesellschaftskapitänsunterhosen



which famously means "the underpants of a captain of the Danube steamship company".

Less extreme - Kinder means children, Wagen means car, Kranken means ill and Haus means house. So - Wagenhaus means garage, Krankenhaus means hospital, Krankenwagen means ambulance and Kinderkrankenhauskrankenwagenhaus means "the garage for the ambulances of the children's hospital".
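
So a decompounder is called for. A greedy recursive splitter over a toy vocabulary illustrates the idea - a production one would also need linking elements (that 's' in Donaudampfschifffahrts-), word frequencies and a real lexicon:

    VOCAB = {"kinder", "wagen", "kranken", "haus"}

    def split_compound(word, vocab=VOCAB):
        # Return a full decomposition into vocabulary words, or None.
        if word in vocab:
            return [word]
        for i in range(1, len(word)):
            head, tail = word[:i], word[i:]
            if head in vocab:
                rest = split_compound(tail, vocab)
                if rest is not None:
                    return [head] + rest
        return None

    print(split_compound("krankenwagen"))         # ['kranken', 'wagen']
    print(split_compound("kinderkrankenhaus"))    # ['kinder', 'kranken', 'haus']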

Once again the poor search engineer is confounded in his simple task to merely tokenise the input.

*sob*

Then add onto that people who have posts in more than one language. Or switch language MID-FRICKIN-POST ...

*double sob*

So if your favourite language isn't supported at launch, or support is a bit flaky, then you'll know why.

For what it's worth it looks like we'll probably be supporting English, Brazilian, Czech, German, Danish, Greek, Spanish, Finnish, French, Italian, Japanese, Korean, Norwegian, Portuguese, Russian, Swedish and Chinese from the start, although the quality of each and how they'll actually pan out in production is another matter.
Tags: french, japanese, language, more than you ever wanted to know, search, tokenisation, underpants