(keitai-l) Re: OT: Kanji, Hanzi and Unicode

From: Benjamin Kowarsch <benjk_at_mac.com>
Date: 06/21/02
Message-Id: <4C931678-850B-11D6-9648-003065FB21DC@mac.com>
On Friday, June 21, 2002, at 08:28 , James Santagata wrote:

>> Depends on how you interpret the meaning of "character". What I meant
>> to say was that the number of graphical representations that need to
>> be encoded or dealt with increases.
>
>
> You are right about semantics and the import of how a "character"
> is defined. As a former US President once exclaimed, "it all
> depends on what 'is' is."

:-) I knew that this one would come up ;-)

> So, for the roman character "a", that would be one code point,
> but the glyphs could carry the representation of various styles like
> a block letter, cursive style writing and so on.

That is because the Roman writing system is very simple and very rational.

> In life, though, it always seems that great ideas face three roadblocks
> - technical, monetary and political.
>
> The technical aspects of this are pretty straightforward, and it isn't
> going to really break anyone's bank (I think it actually saves
> orders of magnitude of time and money when dealing with
> internationalization of apps). But it seems politics
> raises its ugly head again especially with the "Han Unification"
> aspect, which in my mind is quite senseless.

Although I agree with you that there are lots of politics that get in
the way and should not be allowed to, I can also see that the technical
aspects of some writing systems are not that straightforward - that is,
if "technical" includes "functional".

It is one thing to define a technical specification so that it is
"technically straightforward" - it is quite another to define it such
that it also fulfils the user requirements.

For us Westerners, reading and writing is a very rational thing. We
therefore approach the task of implementing writing systems on machines
on a very rational basis. Users and engineers are therefore very likely
to be in agreement over the requirements: simple, rational, minimal
resources.

For users of some other writing systems, particularly those which use
pictograms and ideograms rather than phonograms, the user requirements
may not be so well aligned with the objective of ease of technical
implementation.

From a Western viewpoint, we are likely to say "this and that character
- they are really the same", but from an Oriental viewpoint this may not
be the case, even in a rational sense. Old and new characters are often
used within the same writing system, and this frequently carries a
subtle difference in meaning. From that viewpoint, the argument that two
characters which are supposedly the same are in fact different may not
always be so easily dismissed, and it may well be a legitimate practical
requirement to support them with different codes.

Basically, what I am saying here is "who are we to tell them how to
classify their characters?" - but please take this with a grain of salt,
because Western rationality has in the past made significant
contributions to the classification and ordering of Chinese characters.
Nelson's and Halpern's classification systems and dictionaries are proof
of this, as they are well appreciated even by Oriental scholars.

In any event, this shows that the task is not as easy as it may seem
from a pure engineering point of view, even if politics were left out of
the equation. Obviously, in the face of such genuine difficulties,
politics adds complications of a kind one could easily do without.

> And I think the Han Unification is a hugely important aspect
> because there are so many characters - I personally thank
> God everyday I wake up for the 26 character alphabet.

:-)

> For the opponents of  Han Unification, I attribute a lot of
> the opposition to one of two things:
>
> 1) A misunderstanding: "Hey, they aren't going to have
> my representation of 'Qi' as a character! #$#@#@!!"

You may call that the Talibanisation of Unicode ;-)

> or
>
> 2) What I label the "Cooties" factor (I was going to label it the
> "French Factor", but thought I'd get too many flames from Francophiles
> so I'll stick with "Cooties" factor).
>
> As in:
>
> "I don't want to have my [insert country with pride/animosity]
> bundled together with [insert neighboring country with reciprocal
> pride/animosity]'s ideogram! #$#@#@!!"

Ah, but that's the "Japan-Korea factor", also known in its more violent
form as the "Arab-Israel factor" (please note: the order is strictly
Roman-alphabetical).

a "French Factor" would be quite the opposite, as in

"We need to *exclude* all their characters in *our* standard and replace 
them with new ones we invent specifically for that purpose; then we need 
to make sure all of our own characters are mandatorily *included* in 
*anyone else's* standard." (Note: I do consider myself a Francophile)

>> Anyway, if it says 20000 characters are covered by the Unicode standard
>> so far, then the question is, does that mean 20000 graphical
>> representations of characters or does it mean actual characters ?
>
> That would be 20,000 codepoints or actual characters. But any one
> codepoint could have tens of different glyphs associated with it.

This is OK if it is presented that way, but in a newspaper article it
would probably be reduced even beyond layman's terms and simply turn out
as either

- "The Unicode standard covers (number of codes) characters"; or
- "The Unicode standard covers (number of glyphs) characters"; or
- "While proponents of the Unicode standard claim (number of codes plus 
number of glyphs) characters have been standardised so far, there are 
critics who say ...."

In other words, we are back at the question of what the meaning of 'is' 
is ;-)
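
To make the ambiguity a bit more concrete, here is a tiny Python sketch.
(To the best of my knowledge, 骨 "bone" is one of the frequently cited
cases where a single unified code is drawn differently by Simplified
Chinese and Japanese fonts - so take the specific example with a grain
of salt.)

    # One Han-unified codepoint: whether it counts as one "character" or
    # as several depends on whether you count codes or glyphs.
    bone = "\u9AA8"   # 骨 - one code, but Simplified Chinese and Japanese
                      # fonts typically draw it slightly differently
    print(bone, "U+%04X" % ord(bone))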

>> Clearly, where different graphical representations are represented by
>> the same code, all the characters that are "ancient versions" of
>> existing characters don't really need to be explicitly covered. Where
>> they are represented by different codes, they would need to be covered.
>
> Correct. As long as it is determined/agreed that the character,
> "Qi" is a specific code point, it doesn't really matter (in my opinion)
> that we unify all of the physical representations of that character
> over the centuries or between countries that use the same/similar
> ideogram as part of their orthography into one codepoint.

Emphasis on "determined/agreed" and "same/similar" ;-)

> Others may, and frequently do argue that this is misguided
> for a number of reasons (usually political).
>
>> It would seem that part of the work involved in encoding is to decide
>> which characters are considered the same and which are considered to be
>> stand-alone. A task that may at least in some cases turn out to be
>> difficult because scholars may have different opinions.
>
> Yes, this has been a big issue - in my mind mostly political.

Well, perhaps they should define and agree on a standard process for
determining whether characters are to be considered same/similar or
different ;-) Hopefully that would filter out the political arguments as
disqualified, but it would probably do the exact opposite ;-)

>> That is my principle understanding too, but it seems to me that it is
>> not always clear what constitutes "the same character". For example, on
>> my system Qi, Ki and qi produce different codes. Thus, if I have a text
>> with the character Qi in it, the character does not change from its
>> Traditional representation to a Japanese representation when I change 
>> it
>> from a Chinese to a Japanese font. Although, I see how this could be
>> achieved even if the codes are different.
>
> May I ask how you input your characters? Assuming that "Qi" was
> one code point for a discrete character, that was shared
> in China, Taiwan, Japan and Korea, then changing the font should
> determine how the character is rendered on your monitor.

The Japanese "Ki" I produced using romanised front-end input processors, 
ie you type "ki" and it turns into hiragana ki, then you hit the space 
bar and it converts into a Kanji for Ki, if it's not the one you are 
looking for you keep hitting space until the one you're after shows up. 
Similar with the Chinese FEPs, but I am often having trouble with those 
and then I look them up by radical method from a table, ie you choose 
Radical input, then radical stroke count, remaining stroke count et 
voila you get a list of characters (often still too large to find your 
desired character instantly but ...) from which you choose the desired 
one by clicking on it and it will be inserted at the cursor position.

In other words, I didn't specify the codes; the input FEPs did that for
me.
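
For illustration, here is what the FEPs actually stored on my behalf -
the three "Qi" variants are separate codes in Unicode. (A small Python
sketch; the code values below are my understanding of the standard, so
treat them as an example rather than gospel.)

    # Three "Qi" variants, three distinct codepoints - the FEP picks the
    # code, the user never has to know it.
    qi_traditional = "\u6C23"   # 氣 - Traditional Chinese
    qi_japanese    = "\u6C17"   # 気 - Japanese shinjitai
    qi_simplified  = "\u6C14"   # 气 - Simplified Chinese

    for ch in (qi_traditional, qi_japanese, qi_simplified):
        print(ch, "U+%04X" % ord(ch))

    # Because these are different codes (not different glyphs of one
    # code), switching the font does not turn one into another.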


>> Do you have the three writing systems installed on your system ? I'd be
>> interested to learn if it is any different on yours ...
>
> I was able to see clearly the 3 different 'Qi' characters you input.

That's because you have support for all three script systems installed.
If one of them were missing, that particular character would not be
displayed and would probably show up as garbled Roman text instead.
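
The garbling itself is easy to reproduce. A rough Python sketch -
Shift-JIS for the Japanese text and a Windows Latin codepage for the
reader without Japanese support are merely example encodings:

    # A reader without Japanese support effectively misinterprets the
    # Japanese bytes as Western characters.
    japanese = "気"                          # the Japanese "Ki"
    raw = japanese.encode("shift_jis")       # the bytes as transmitted
    print(raw.decode("cp1252", "replace"))   # comes out as Roman gibberish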

> On this computer, I'm running Windows 98 (hold your groans
> please) and have installed the Japanese, Korean, Simp Chinese
> and Trad Chinese MS IME software and I'm also running a
> Vietnamese IME (I think VNI is on this computer).

This looks to me as if they have a system there which is similar to the
one the old MacOS used, where you had to install so-called Language Kits
for each language. A language kit consisted of the fonts and display
capabilities plus one or more input methods. It worked, but it was a
somewhat bolted-on thing which sometimes caused funny side effects. For
example, Japanese text in window titles was always preceded by a
katakana "ME" and had a trailing katakana "MO". I think those were the
katakana with codes identical to the opening and closing quotes in Roman
scripts. There was even a shareware utility called "MEMO Busters" to get
rid of it ;-)
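
The coincidence is easy to check. A small Python sketch - assuming the
Roman side used the classic MacRoman encoding and the Japanese side
Shift-JIS half-width katakana, which is my recollection rather than a
verified fact:

    # The same byte values are curly quotes in MacRoman but half-width
    # katakana ME and MO in Shift-JIS.
    for b in (0xD2, 0xD3):
        mac  = bytes([b]).decode("mac_roman")   # opening / closing quote
        sjis = bytes([b]).decode("shift_jis")   # half-width katakana
        print(hex(b), mac, sjis)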

On the new OSX which I am using on this machine, this is all pretty much
integrated and you don't need to add language kits anymore. I have not
come across any side effects, and the integration seems very well done
and robust. Support for the most common languages is present unless you
explicitly remove it, and all you have to do is tick the boxes for the
desired languages in the language preference pane within System
Preferences.

However, as this OS has only been around for about a year (in this
form), there are still a number of languages missing - I don't think
they have Vietnamese yet.

What I particularly like is the way in which applications separate
program code from the text for menus and dialogs. In OSX an application
is actually a directory (an inheritance from NeXT), and that directory
has a subdirectory branch for the code and another for resources. Within
the resource subtree there is one language resource subdirectory per
language, and the files in it contain all the text of the application -
they are just plain vanilla text files. You can edit them and change the
names of menus and the text in dialogs as you please. This requires no
technical expertise (other than using a text editor), and no
recompilation or linking is required. So, you can do your own
localisation by copying an original language resource folder, e.g.
English.lproj, naming the copy after the target language, e.g.
Japanese.lproj, and then editing the new language resources and simply
translating everything in there (a sketch of the layout follows below).
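
Roughly, the layout looks like this (a simplified sketch from memory;
the application name and the strings are made up for illustration):

    MyApp.app/                         <- the application "directory"
        Contents/
            MacOS/                     <- the program code
            Resources/
                English.lproj/
                    Localizable.strings
                Japanese.lproj/        <- copy of English.lproj, translated
                    Localizable.strings

and a line from such a strings file before and after translation:

    /* English.lproj/Localizable.strings */
    "Quit" = "Quit";

    /* Japanese.lproj/Localizable.strings */
    "Quit" = "終了";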

When you are done, you log out and back in again, et voilà, your
application is now Japanese (provided your user preference is set so
that Japanese is higher up on the list than whatever other language has
a resource file - the first language on your preferred list to match a
resource will be the one used for display).

This is very neat, and I have done it a few times for Japanese users who
wanted to use software that didn't come with Japanese resources (mostly
shareware products).

The notable exception seems to be Microsoft apps. They must have somehow
worked around this mechanism - maybe they coded a rule into their
packages that says "if you are an English package, then ignore any
non-English language resource files".

It seems that the benefits which Unicode, and with it the other
multi-lingual techniques in its entourage, have brought are not always
appreciated. Why sell one multi-lingual package if you can sell one
package per language and charge a multi-lingual user twice or more? ;-)

regards
benjamin
Received on Fri Jun 21 14:38:15 2002