(keitai-l) OT: Kanji, Hanzi and Unicode

From: James Santagata <jsanta_at_audiencetrax.com>
Date: 06/21/02
Message-Id: <5.1.0.14.0.20020620155620.00a1d8a0@mail.activemessage.com>
Hi Benjamin,

At 06:40 PM 6/20/02 +0900, you wrote:


>On Thursday, June 20, 2002, at 05:24 , James Santagata wrote:
>
> > I'm a little confused here how Unicode increases the number
> > of characters that need to be encoded. My understanding is
> > that Unicode only encodes characters, while a character's
> > physical or visual representation is provided by glyphs
> > whose delivery is provided by the fonts one selects.
>
>Depends on how you interpret the meaning of "character". What I meant to
>say was that the number of graphical representations that need to be
>encoded or dealt with increases.


You are right about semantics and the import of how a "character"
is defined. As a former US President once exclaimed, "it all
depends on what 'is' is."

And so it is with Unicode.

The premise of Unicode, though, is to unify redundant codepoints,
so under Unicode a specific character would possess one
specific codepoint.  Multiple renderings of the character would
be carried out by the glyphs.

So, for the roman character "a", that would be one code point,
but the glyphs could carry the representation of various styles like
a block letter, cursive style writing and so on.
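
A quick way to see this for yourself is with a small Python sketch. The
helper name describe() is just something made up for this illustration;
ord() and unicodedata.name() are from the standard library. It shows
that "a" has exactly one code point regardless of which font later
draws it:

```python
import unicodedata

def describe(ch):
    """Return the code point (as U+XXXX) and the Unicode name of ch."""
    return "U+%04X" % ord(ch), unicodedata.name(ch)

# The code point identifies the character only; block, cursive or
# italic shapes are chosen later by whatever font renders it.
print(describe("a"))
```

Nothing in that result says anything about how the letter looks, which
is the character/glyph split described above.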

In life, though, it always seems that great ideas face three roadblocks
- technical, monetary and political.

The technical aspects of this are pretty straightforward, and it isn't
going to really break anyone's bank (I think it actually saves
orders of magnitude of time and money when dealing with
internationalization of apps). But it seems politics
raises its ugly head again especially with the "Han Unification"
aspect, which in my mind is quite senseless.

And I think the Han Unification is a hugely important aspect
because there are so many characters - I personally thank
God every day I wake up for the 26-character alphabet.

For the opponents of Han Unification, I attribute a lot of
the opposition to one of two things:

1) A misunderstanding: "Hey, they aren't going to have
my representation of 'Qi' as a character! #$#@#@!!"

or

2) What I label the "Cooties" factor (I was going to label it the
"French Factor", but thought I'd get too many flames from Francophiles
so I'll stick with "Cooties" factor).

As in:

"I don't want to have my [insert country with pride/animosity]
bundled together with [insert neighboring country with reciprocal
pride/animosity]'s ideogram! #$#@#@!!"

So like two girls on the playground (no offense meant to girls
or playgrounds), these people just pull each other's hair.


>On my system though, Qi, Ki, qi all result in different codes.
>
>Anyway, if it says 20000 characters are covered by the Unicode standard
>so far, then the question is, does that mean 20000 graphical
>representations of characters or does it mean actual characters ?

That would be 20,000 codepoints or actual characters. But any one
codepoint could have tens of different glyphs associated with it.



>Clearly, where different graphical representations are represented by
>the same code, all the characters that are "ancient versions" of
>existing characters don't really need to be explicitly covered. Where
>they are represented by different codes, they would need to be covered.

Correct. As long as it is determined/agreed that the character,
"Qi" is a specific code point, it doesn't really matter (in my opinion)
that we unify all of the physical representations of that character
over the centuries or between countries that use the same/similar
ideogram as part of their orthography into one codepoint.

Others may, and frequently do, argue that this is misguided
for a number of reasons (usually political).


>It would seem that part of the work involved in encoding is to decide
>which characters are considered the same and which are considered to be
>stand-alone. A task that may at least in some cases turn out to be
>difficult because scholars may have different opinions.

Yes, this has been a big issue - in my mind mostly political.

>That is my principle understanding too, but it seems to me that it is
>not always clear what constitutes "the same character". For example, on
>my system Qi, Ki and qi produce different codes. Thus, if I have a text
>with the character Qi in it, the character does not change from its
>Traditional representation to a Japanese representation when I change it
>from a Chinese to a Japanese font. Although, I see how this could be
>achieved even if the codes are different.

May I ask how you input your characters? Assuming that "Qi" was
one code point for a discrete character, that was shared
in China, Taiwan, Japan and Korea, then changing the font should
determine how the character is rendered on your monitor.
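
One way to check whether the three forms really are one code point is
to print their values. The sketch below assumes that the three
characters in question are the traditional, Japanese and simplified
forms of "qi"; that mapping is my guess, since only the romanizations
appear in this thread. If the guess is right, the three are encoded
separately, which would explain why swapping fonts does not turn one
into another:

```python
# Assumed characters: the thread only gives "Qi", "Ki" and "qi",
# so these three escapes are a guess at what was actually typed.
forms = {
    "traditional Chinese form": "\u6c23",
    "Japanese form": "\u6c17",
    "simplified Chinese form": "\u6c14",
}

# Map each label to its code point written in U+XXXX notation.
code_points = {label: "U+%04X" % ord(ch) for label, ch in forms.items()}

# True when no two forms share a code point.
distinct = len(set(forms.values())) == len(forms)

for label, cp in code_points.items():
    print(label, cp)
print("all distinct:", distinct)
```

A font can only change the glyph drawn for a given code point, so
distinct values here mean a font change alone cannot convert one form
into another.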


>Do you have the three writing systems installed on your system ? I'd be
>interested to learn if it is any different on yours ...

I was able to see clearly the 3 different 'Qi' characters you input.

On this computer, I'm running Windows 98 (hold your groans
please) and have installed the Japanese, Korean, Simp Chinese
and Trad Chinese MS IME software and I'm also running a
Vietnamese IME (I think VNI is on this computer).

Sincerely,


James Santagata
  A U D I E N C E T R A X
Audience Management Systems
  http://www.audiencetrax.com
Received on Fri Jun 21 02:25:33 2002