(keitai-l) Re: Supported Character Sets for I-mode

From: Joe Bowbeer <joe.bowbeer_at_gmail.com>
Date: 01/13/06
Message-ID: <31f2a7bd0601122151g4dbb7ee2rcb77b7397b335bec@mail.gmail.com>
I think it's worthwhile to reference one more article from the IBM site.

Elliotte Harold's article makes a strong case for UTF-8:

http://www-128.ibm.com/developerworks/xml/library/x-utf8/

In particular, I think he makes some good points concerning Chinese,
Japanese, and Korean:

http://www-128.ibm.com/developerworks/xml/library/x-utf8/#N100CB

While he is not rigorous in his analysis, I suspect he is correct that:

1. the actual size increase of UTF-8 over UTF-16 probably isn't so large
2. the expansion may be offset by the natural compression of ideographic scripts
3. gzip'd UTF-8 will likely be close in size to gzip'd UTF-16

Addressing #1: In my experience developing graphically rich,
network-aware mobile apps in Eurospeak and Japanese, the size of text
content in whichever form is chosen is small compared to the size of
images and audio, and the penalty for bloated text is negligible
compared to the network latency.  The small screen size also helps to
minimize the impact of bloated text because small screens tend to
minimize the amount of text that is consumed.  I'm sure there are some
apps whose performance is very sensitive to the text encoding, but I
would have no problem choosing UTF-8 as a default in most cases.  In
my experience, text is peanuts (and latency kills).

Addressing #3: No rigor once again, but it seems reasonable to think
that UTF-8 and UTF-16 would compress to about the same size, given
that the information is the same in both cases.
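
A rough way to check point #3 rather than guess is to encode the same text
several ways and compare sizes before and after gzip. The sketch below is
only that: the sample string, the function names, and the choice of
encodings (EUC-JP is included because it comes up in the quoted question
below) are assumptions made for illustration, and a repeated sample
compresses far better than real content would, so substitute representative
text before drawing conclusions.

```python
import gzip

# Placeholder sample (an assumption, not real content). Repetition makes it
# compress unrealistically well; replace with representative text.
SAMPLE = "日本語のテキストは漢字とかなが混ざっています。" * 50


def sizes(text, encoding):
    """Return (raw byte count, gzip'd byte count) for text in one encoding."""
    raw = text.encode(encoding)
    return len(raw), len(gzip.compress(raw, compresslevel=9))


def compare(text, encodings=("utf-8", "utf-16-le", "euc_jp")):
    """Map each encoding name to its (raw, gzip'd) sizes for the same text."""
    return {enc: sizes(text, enc) for enc in encodings}


if __name__ == "__main__":
    for enc, (raw, packed) in compare(SAMPLE).items():
        print(f"{enc:10} raw={raw:6d} gzip={packed:6d}")
```

For this kana-and-kanji sample the raw sizes are 3 bytes per character in
UTF-8 and 2 in UTF-16 and EUC-JP; the gzip column is the number that matters
for the argument, and it depends heavily on the text being measured.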


For me, the most interesting aspect of UTF-8 is its Unicode nature,
and whether Unicode is the best character set for Japanese.

Joe.


On 1/12/06, Nick May <nick@kyushu.com> wrote:
> > , and basically eliminates any size differences in the
> > encodings.
>
> Are you claiming that 3 byte UTF-8 is SO much more compressible than
> 2 byte eucjp  that it is sufficient to make up the difference?  That
> would indeed be interesting and would remove a major issue with UTF-8.
>
Received on Fri Jan 13 07:51:28 2006