(keitai-l) Re: Supported Character Sets for I-mode

From: Curt Sampson <cjs_at_cynic.net>
Date: 01/12/06
Message-ID: <Pine.NEB.4.63.0601121528130.3138@angelic.cynic.net>
All right, this is my last post on the subject, more to clear up the
possible misperception that I'm advocating Unicode in all circumstances
than anything else, as well as to make some keitai gateway comments.

On Wed, 11 Jan 2006, Nick May wrote:

>>  that offers [such] a good compromise of clear standards,
>> ease of use, intertranslation with other character sets and reasonably
>> compact character encodings.
>
> Note the word "compromise".  It's the "one ring to bind them" view.

If this "one ring to bind them" view means that I think everybody
should use Unicode even when it doesn't suit the application, that's not
a correct characterization of my viewpoint. Let me explain it a
different way:

Every character set and encoding has tradeoffs. One of the tradeoffs
of Unicode is that it can't deal with a fair number of very complex,
mostly-typographical issues that are very infrequent in day-to-day work.
The benefit gained by this is that Unicode is simple enough that it's
not a lot of extra work for the ASCII- and ISO-8859-x-using world to
just use Unicode instead.

That's a huge advantage to the hanzi-using world, because it means that
they can get all their characters in places, such as filenames, where it
wouldn't normally be worth the effort. Given the choice between:

     a) no hanzi in filenames,
     b) having any hanzi you want in filenames, but being incompatible
        with computers outside of Asia, or
     c) having the hanzi you want in filenames 99.99% of the time, and
        this working on any computer in the world (in that the files are
        still usable even though you may not see the proper glyphs for
        the characters)

What's the sensible choice for the vast majority of the Japanese
population? Anybody who wishes to outright replace Unicode with
something else for all applications in Japan is saying that it's better
that thousands of Microsoft Word users can't easily exchange a document
with others outside of Japan just so that the odd person here and there
can have access to an extra character, if that.
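
To make (c) concrete, here's a quick Python sketch (the filename is just
one I made up for illustration); the point is that the UTF-8 bytes of a
kanji filename are storable and usable on any modern system, even where
the glyphs don't display:

    # A kanji filename, stored as UTF-8, round-trips fine even on systems
    # that can't display the glyphs. The name itself is invented.
    from pathlib import Path

    name = "東京メモ.txt"
    Path(name).write_text("hello", encoding="utf-8")

    print(name.encode("utf-8"))    # the UTF-8 bytes any filesystem can carry
    print(Path(name).exists())     # True: the file is there and usable by name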

> But that is not the only perspective. There is another perspective...
> We DO want to be able to encode all kanji in our national literature
> so that it can be kept in electronic form, we DO have to be able to
> handle every possible name that may appear on a driving license. 99%
> of 99%? Not even CLOSE!

That's fair enough. But if they're doing that now, they're not doing it
with EUC-JP, Shift_JIS, or ISO-2022-JP, and they're having to convert
to one of those three for web use or other interchange. Here's how Unicode
changes their situation:

     1. They continue using their custom encoding for their internal stuff.

     2. For interchange work, web sites, etc., they write a relatively
     simple converter, extremely similar to their existing EUC-JP or
     whatever converter(s), to translate between their internal encoding
     and a Unicode encoding.

     In fact, if they're that cheap, or hate it that much, they can just
     use their existing converter and then run the result through a
     separate converter (easily available for free) to translate between
     EUC-JP (or whatever) and a Unicode encoding. The only loss they
     experience there is that they might drop or change more special
     characters than if they did a direct conversion to/from a Unicode
     encoding.
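
To illustrate point 2, and the cheap chained version, here's a minimal
Python sketch; internal_to_eucjp() is a hypothetical stand-in for their
existing converter, and the second step is exactly the kind of
off-the-shelf codec transcode you can get for free:

    # The "separate converter, easily available for free" step: decode
    # EUC-JP, re-encode as UTF-8.
    def eucjp_to_utf8(eucjp_bytes: bytes) -> bytes:
        return eucjp_bytes.decode("euc_jp").encode("utf-8")

    # Their existing in-house converter would slot in ahead of it; this
    # placeholder just stands in for it.
    def internal_to_eucjp(data: bytes) -> bytes:
        raise NotImplementedError("the organization's existing converter")

    def internal_to_utf8(data: bytes) -> bytes:
        # The chained, "cheap" route: internal -> EUC-JP -> UTF-8. The only
        # extra loss is whatever already fell out in the EUC-JP step.
        return eucjp_to_utf8(internal_to_eucjp(data))

    # The free converter on its own:
    print(eucjp_to_utf8("日本語".encode("euc_jp")))   # UTF-8 bytes out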

The same sort of thing goes for typesetting programs and suchlike. If
they're already using a specialized character set, nobody's asking
them to replace that with Unicode. Unicode advocates are just asking
everybody using EUC-JP, Shift_JIS and ISO-2022-JP to replace those
with a Unicode encoding on their public face, which loses them little
to nothing, and gains them a lot.

> Imagine telling 1 in a hundred Americans they can't use their own name
> on their bank account...

You don't have to imagine this. The figure is probably not anywhere
near one in a hundred, but there are plenty of people out there who
consistently have their names mutilated via diacritical removal when
entered into a computer system. Some names got mutilated so often that
they were eventually changed to the mutilated version, such as my friend
Rhonda Carriere's. (At some point in the past there must have been an
accent aigu on that last 'e'.)
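
For the curious, the usual mutilation amounts to nothing more than this
small Python sketch (the accented spelling is just the one I'm guessing
at above):

    # Decompose the name and throw away the combining marks: this is
    # effectively what an ASCII-only system does to accented names.
    import unicodedata

    def asciify(name: str) -> str:
        decomposed = unicodedata.normalize("NFD", name)
        return "".join(c for c in decomposed if not unicodedata.combining(c))

    print(asciify("Carrieré"))   # -> Carriere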

Until we decide to go with some sort of standard graphical
representation for names (which is unlikely ever to happen), someone's
always going to lose out. Such as the person formerly known as Prince.

> If we are going to go through the pain of abandoning our legacy
> encoding, we want to do so to something that fixes ALL *OUR* problems,
> as far as possible, in a way that serves OUR linguistic culture.

It can't be done. As I pointed out above, something that fixes certain
Japanese linguistic problems is going to be so complex that other
cultures will refuse to implement the support necessary for it, and
you'll lose on the interchangeability front.

> My main practical objection to UNICODE now - UTF-8 at least - is that
> it is fat-arsed. The Oompah Loompah of encodings. Want to send data
> to keitai? You will send FAR fewer bytes with sjis. It doesn't matter
> whether we convert at the gateway or not - the fact is that SJIS/
> EUCJP is better suited to lowish bandwidth environments.

Ah, on to real keitai stuff now!

If a lower-bandwidth environment is the problem, I don't understand why
you feel conversion at the gateway doesn't matter. So long as what's on
the low-bandwidth network is suited to the low-bandwidth network, why does
having low-bandwidth-suitable material on the high-bandwidth network (or
not having it there) change anything?

The link between my webserver and DoCoMo's gateway is so ridiculously
fast and overprovisioned relative to the material I'm transmitting that
I'd be happy using an 8-byte-per-character encoding, never mind UTF-8. It
would make no difference.
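
For what it's worth, the size difference for pure Japanese text is
roughly 2 bytes per character in Shift_JIS against 3 in UTF-8, which a
quick Python check shows (the sample sentence is my own, not anything
real):

    # Encoded size of a short, made-up Japanese sentence in each encoding.
    # Kana and kanji take 2 bytes in Shift_JIS and EUC-JP, 3 in UTF-8, so
    # UTF-8 is about 1.5x larger for pure Japanese (and identical for ASCII).
    sample = "今日は天気がいいですね。"

    for enc in ("shift_jis", "euc_jp", "utf-8"):
        print(enc, len(sample.encode(enc)), "bytes")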

And the carrier is already doing fairly CPU- and memory-copy-intensive
conversions at the gateway anyway, particularly for protocol
conversions, so doing a few character set conversions as well is no big
deal.
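
A character set conversion at the gateway is tiny by comparison;
something along these lines is all it takes (a sketch only, and the
errors="replace" policy is my assumption, not anything DoCoMo actually
does):

    # Gateway-side sketch: UTF-8 from the content server in, Shift_JIS
    # for the handset out. errors="replace" turns unmappable characters
    # into '?'; a real gateway's policy is unknown to me.
    def to_handset(body_utf8: bytes) -> bytes:
        text = body_utf8.decode("utf-8", errors="replace")
        return text.encode("shift_jis", errors="replace")

    page = "天気予報".encode("utf-8")     # page body produced in UTF-8
    print(to_handset(page))               # delivered as Shift_JIS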

> It must have been a no-brainer for DOCOMO as to which encoding they
> went with when they first launched imode....

For the over-air encoding, sure. But I'm not talking about that at
all. I'm just talking about communication between the gateway and the
content server. And for that they already accept UTF-8 and convert it
appropriately. They just don't do it for web content.

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974

***   Contribute to the Keitai Developers' Wiki!   ***
***           http://www.keitai-dev.net/           ***
Received on Thu Jan 12 09:15:19 2006