(keitai-l) Re: Supported Character Sets for I-mode

From: Alex Shinn <foof_at_synthcode.com>
Date: 01/13/06
Message-ID: <861wzcfxhy.wl@lain.inunome.com>
At Fri, 13 Jan 2006 13:25:39 +0900, Nick May wrote:
> 
> Are you claiming that 3 byte UTF-8 is SO much more compressible than  
> 2 byte eucjp  that it is sufficient to make up the difference?  That  
> would indeed be interesting and would remove a major issue with UTF-8.

Well, at the risk of getting into information theory, given two
different encodings of the same semantic data, an ideal compressor
should indeed be able to compress them to the same size.  Consider the
encoding-aware compressor that first converts the data to the smallest
possible encoding.
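
As a rough sketch of that thought experiment (not a real tool: the
function names are made up for illustration, and EUC-JP is simply
assumed to be the smallest encoding for the text at hand), in Python:

```python
import bz2

# Hypothetical "encoding-aware" compressor: normalize every input to one
# compact encoding (EUC-JP is an assumption here) before compressing.
def compress_normalized(data: bytes, source_encoding: str) -> bytes:
    text = data.decode(source_encoding)
    return bz2.compress(text.encode("euc_jp"))

# Inverse: decompress, then re-encode into whatever the caller wants.
def decompress_to(blob: bytes, target_encoding: str) -> bytes:
    text = bz2.decompress(blob).decode("euc_jp")
    return text.encode(target_encoding)

# Short Japanese sample, repeated so there is something to compress.
sample = "ある日の暮方の事である。" * 50

from_utf8 = compress_normalized(sample.encode("utf-8"), "utf-8")
from_euc = compress_normalized(sample.encode("euc_jp"), "euc_jp")

# Both inputs carry the same text, so they normalize to the same bytes
# and therefore compress to the same output.
print(len(from_utf8), len(from_euc))
```

The source encoding stops mattering once the input is normalized, which
is all the argument needs.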

Of course, we don't have ideal compressors.  If you take a good
compressor like bzip2 and run it on pure Japanese text (I took the
complete text of Rashomon) you get

  bzip2 compressed EUC-JP: 5575 bytes
  bzip2 compressed UTF-8:  5752 bytes

a mere 3% difference, which, yes, I consider "basically" the same.  At
that point there are much more important things I can spend my time
optimizing.  In this case we're also talking about data which is
significantly ASCII, so the difference will be even smaller.

To be fair, the compressor used by Apache (gzip) is not as good as
bzip2.  Taking a sample front page from http://slashdot.jp/ (which is
largely CSS driven):

  gzip compressed EUC-JP: 14918 bytes
  gzip compressed UTF-8:  15931 bytes

about 7%, which was higher than I expected but nonetheless not a
significant cause for concern.  To me it would be well worth the cost
of being able to use any language at all on a website.
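
If you want to repeat this kind of measurement on your own data,
something like the following Python sketch should do.  The helper
names and the sample sentence are placeholders of mine, not the files
measured above, so expect different numbers:

```python
import bz2
import gzip

def compressed_sizes(text: str) -> dict:
    """Return raw, gzip and bzip2 byte counts for each encoding."""
    sizes = {}
    for encoding in ("euc_jp", "utf-8"):
        raw = text.encode(encoding)
        sizes[encoding] = {
            "raw": len(raw),
            "gzip": len(gzip.compress(raw)),
            "bzip2": len(bz2.compress(raw)),
        }
    return sizes

def overhead_percent(sizes: dict, method: str) -> float:
    """How much larger UTF-8 is than EUC-JP for a method, in percent."""
    euc = sizes["euc_jp"][method]
    utf = sizes["utf-8"][method]
    return 100.0 * (utf - euc) / euc

# Placeholder input: substitute the text or page you actually serve.
sample = "ある日の暮方の事である。一人の下人が、羅生門の下で雨やみを待っていた。" * 20

sizes = compressed_sizes(sample)
for method in ("raw", "gzip", "bzip2"):
    print(method, sizes["euc_jp"][method], sizes["utf-8"][method],
          "%.1f%%" % overhead_percent(sizes, method))
```

A short repeated sentence compresses unrealistically well, so feed it
real pages before drawing conclusions.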

> UTF-8  may well have many advantages over euc-jp and sjis. But its  
> proponents do themselves, and it, a disservice to pretend that moving  
> to it does not involve trade-offs.

I never said this, and certainly didn't intend to give such an
impression; sorry if I did.  I was just addressing one aspect of your
concern, that of the excess bandwidth requirements of UTF-8, pointing
out that in a web application where bandwidth was a concern the
difference between encodings will be minimal.  There are other
applications where the size of UTF-8 may be more of a concern.

-- 
Alex
Received on Fri Jan 13 08:33:19 2006