(keitai-l) Kanji, Hanzi and Unicode

From: Benjamin Kowarsch <benjk_at_mac.com>
Date: 06/20/02
Message-Id: <3A536194-8421-11D6-84DC-003065FB21DC@mac.com>
On Thursday, June 20, 2002, at 02:01 , Curt Sampson wrote:

> On Tue, 18 Jun 2002, john yee wrote:
>
>> I read a bit about Chinese in the full unicode spec (I don't remember
>> the term...). Apparently it doesn't support entire language, just
>> something like 10k+ (characters? brush strokes?).
>
> Last I checked, Unicode supported over 20,000 kanji (though a few
> of those are non-Chinese kanji), and another 6,500 were slated for
> inclusion in the next revision of the standard.

The Chinese call them Hanzi (in Mandarin it sounds a bit more like 
hanzu or hanze), though that has no bearing on the number of Chinese 
characters.

> This is far from the full number of characters that have ever been
> used (50-70,000 comes to mind as an estimate I've heard),

I was taught a figure of about 50000 Hanzi, which comes from a dictionary 
in which Chinese scholars aimed to document the evolution of the 
characters and list every character ever in use. I have always assumed 
this means that a large share of those entries are earlier forms of 
characters still in use today, rather than abandoned "stand-alone" 
characters. That would make the actual number of distinct characters (as 
opposed to variant forms) far lower. In any event, there is no need to 
have all of them encoded in the Unicode standard, unless some Chinese 
scholars want to create an electronic version of that dictionary, listing 
every form of Hanzi there has ever been. In that case it would probably 
make more sense to include the forms that are not in Unicode as graphics 
rather than as characters in a font.

>  and new ones are being invented all the time.

I don't think that many new characters are being invented. In fact, 
writing reforms in China and Japan have aimed to cut down the number of 
characters. That's why there are simplified Chinese (used in Mainland 
China) and traditional Chinese (used in Hong Kong and Taiwan) character 
sets, which, ironically, from a Unicode point of view have increased the 
number of characters that need to be encoded.

For example, the character for spirit, "Ki", has three different 
representations: one simplified, one traditional and one Japanese. 
From a Unicode point of view those are three different characters, but 
they are actually one:

Traditional Chinese "Qi" : 氣
Japanese "Ki" : 気
Simplified Chinese "qi" : 气
(PS: You need all three encodings and the matching fonts installed to 
see the characters)
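
To check that Unicode really does treat these as three separate 
characters, here is a minimal Python sketch. The names VARIANTS and 
describe are made up for illustration; the only library used is 
Python's standard unicodedata module.

```python
import unicodedata

# The three forms quoted above, keyed by the labels used in this message.
VARIANTS = {
    "Traditional Chinese": "氣",
    "Japanese": "気",
    "Simplified Chinese": "气",
}


def describe(ch):
    """Return (code point, Unicode name, UTF-8 byte length) for one character."""
    code_point = f"U+{ord(ch):04X}"
    name = unicodedata.name(ch)
    utf8_length = len(ch.encode("utf-8"))
    return code_point, name, utf8_length


for label, ch in VARIANTS.items():
    code_point, name, utf8_length = describe(ch)
    print(f"{label}: {code_point} {name} ({utf8_length} bytes in UTF-8)")

# Three distinct code points, so a plain string comparison treats them
# as different characters even though they share one origin.
distinct = {describe(ch)[0] for ch in VARIANTS.values()}
print(len(distinct), "distinct code points")
```

Each form gets its own code point, so software that needs to treat them 
as "the same" character has to map between them itself.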

> But many of these are used rarely or not at all. For most circumstances,
> 10,000 kanji is adequate.

Absolutely; for most circumstances 4000-5000 Kanji are adequate. 
Simplified Chinese will require fewer and Traditional Chinese more, but 
fewer than 10000 will be adequate. I guess that the 20000, plus the 
additional 6500 you quoted, stem from replication, because there are 
three writing systems (or actually four, as the Koreans also use them 
alongside Hangul).

In any event, Unicode is definitely good news for everyone who has to 
deal with non-Roman writing systems. Well, that is, if vendors actually 
make use of it; some seem fearful ...

http://www.theregister.co.uk/content/39/25742.html

regards
benjamin
Received on Thu Jun 20 10:42:43 2002