(keitai-l) Re: Sorting Yomi

From: Curt Sampson <cjs_at_cynic.net>
Date: 01/17/05
Message-ID: <Pine.NEB.4.58.0501171718430.2767@angelic-vtfw.cvpn.cynic.net>
On Mon, 17 Jan 2005, Alex Shinn wrote:

> At Mon, 17 Jan 2005 14:57:59 +0900 (JST), Curt Sampson wrote:
> >
> > This is not true, because sorts based on the numerical representation of
> > a kana can't give dakuon a lower precedence than kana following the kana
> > with dakuon. For example, 「じゃきょう」 sorts before 「しゃく」 in my
> > dictionary, but with a sort based on character codes, じ (0x3058) comes
> > after し (0x3057), and so じゃきょう would sort after even 「しんぬ」.
>
> Oops, sorry, don't mind me; I was asleep when I replied :(

I have made the exact same mistake on this list in the past.
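
The quoted example is easy to check. As a minimal sketch in Python, which compares strings code point by code point, a plain sorted() call stands in for a sort based on character codes:

```python
# Plain sorted() orders strings by Unicode code point.
words = ["じゃきょう", "しゃく", "しんぬ"]
by_code_point = sorted(words)

# じ (U+3058) is greater than し (U+3057), so the word starting with じ
# lands after every word starting with し, unlike in a dictionary.
print(by_code_point)
```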

> I think your algorithm works for hiragana-only input.

Right. But you could translate katakana in the same way, if you wanted,
with a little tweak or two to deal with elongation marks and so on, and
maybe adding a fourth digit if you really care to sort katakana after
hiragana when the words are exactly the same.
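
To make that concrete, here is a rough sketch in Python. It is not the algorithm from my page, just one way to get the same effect, and the function name kana_sort_key is made up for illustration. It assumes that Unicode NFD decomposition is an acceptable way to split a voiced kana into its base kana plus a mark, and that katakana in the range U+30A1 to U+30F6 can be folded onto hiragana by subtracting 0x60. The key compares all base kana first, then voicing, then script, so voicing and script only break ties. The elongation mark and the small-kana ordering are deliberately left untreated.

```python
import unicodedata

DAKUTEN = "\u3099"     # combining voiced sound mark
HANDAKUTEN = "\u309a"  # combining semi-voiced sound mark


def kana_sort_key(word):
    """Hypothetical three-level key: base kana, then voicing, then script."""
    base, voicing, script = [], [], []
    for ch in word:
        is_katakana = "\u30a1" <= ch <= "\u30f6"
        if is_katakana:
            # Fold katakana onto the matching hiragana code point.
            ch = chr(ord(ch) - 0x60)
        # NFD splits e.g. じ into し followed by the combining dakuten.
        decomposed = unicodedata.normalize("NFD", ch)
        base.append(decomposed[0])
        mark = decomposed[1:]
        voicing.append(1 if mark == DAKUTEN else 2 if mark == HANDAKUTEN else 0)
        # Extra level, only decisive when the words are otherwise identical.
        script.append(1 if is_katakana else 0)
    return (base, voicing, script)


dictionary_order = sorted(["しんぬ", "しゃく", "じゃきょう"], key=kana_sort_key)
script_tie_break = sorted(["カメラ", "かめら"], key=kana_sort_key)
```

With this key, じゃきょう sorts before しゃく because き comes before く once the voicing of じ is moved to a lower level, and かめら sorts before カメラ only because everything else about the two words is equal.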

> Including kanji, katakana and romaji the JIS standard includes 5
> collation levels - you can see an open source implementation of the
> full collation in Perl's Lingua::JA::Sort::JIS:

Ah, right. That was actually linked from my page, except due to an HTML
error it was hard to see. I didn't really understand the algorithm it
was using, though.

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974

Received on Mon Jan 17 10:26:55 2005