(keitai-l) Re: The economy of Japanese text?? Kill me now. (Or kill me later.)

From: Michael Turner <leap_at_gol.com> Date: 12/05/01 Message-ID: <006401c17d7e$8c9c7660$c74fd8cb@phobos>

Hopeless tangling with my good friend Christian here, about
non-terribly-keitai-related subjects.  But skip to the end--
I *am* on topic, finally.

Christian Molstrom wrote:

> I'm lost.  Michael, were you on one of your late night crack
> binges again?

No, I wrapped up this particular crack binge rather early in the
evening.  I had to get on to the glue-sniffing, y'know?  Then the
real hard stuff: sake.

> You can't compare writing systems without factoring out grammar
> and phonology.

Well, I was purposely trying to factor *in* grammar and
phonology to make a point: no writing system is dramatically
more economical than another, when it comes to conveying
a meaning of any appreciable complexity in the languages
that use it.  I used full text samples on nicely-laid-out
pages because I couldn't think of a more meaningful
measure.  I was expecting more variation, and was
surprised to see Japanese and English come out almost
the same.

> ....The example above, Canada, is about as close as you
> get to a pure head-to-head comparison of writing systems.

Not sure what your measure of purity is.  (Afraid to ask.)

SAY something about Canada.  Preferably a lot.  In both
languages, using equivalent levels of address.  Lay the text
out in interfaces optimized for the respective writing
systems for those languages. Then we'll have something
to compare.

[uncomprehended reasoning skipped.]
> There is no logical impossibility that a writing system for
> language X be spatially compact while its grammar and
> orthography be verbose.

No, it's not *logically* impossible.  But it's not logically impossible
for two stores across the street from each other to have prices
for the same item differing by a factor of six.  It's just vanishingly
rare, for some reason.

Why is it that we don't see very space-inefficient writing
systems, used for very wordy languages, combining to make
software localization for that locale virtually impossible?
As it is, we don't even see very space-inefficient writing
systems, nor do we see very wordy languages (except in their
elaborate honorific modes.)  German gets a little scary,
if you're doing anything very columnar in layout, but that's
about it, in my experience.

There are economizing forces at work on both languages
and writing systems.  (Real writing systems, anyway.
Nobody "wrote" all that florid Mayan "script"--it was chiseled
in stone in the spare time between planting and harvest.)  It
takes effort and resources to communicate.

> Don't mean to split hairs here, but what are we trying to compare,
> writing systems or communication systems?

Writing systems are textual communication standards for
languages.  So let's compare them in actual performance:
that is, human communication.

You don't like this?

OK, then forget "Canada", let's go for the purest test possible,
AND do it over a large set of data to smooth out statistical
variations.

Let's take a long random string of bits, pretend it's compressed
text (not knowing which language), decompress it, and then
interpret it as ASCII *and* SJIS, and see how much space is
taken up on the screen by the two resulting batches of garbage.
(How do you think I generated this reply, anyway? ;-)

After all, from an information-theoretic point of view, they'll
have just about the same amount of pure information.  (Or
pure noise, if you want to get theological about it.)
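A minimal sketch of that test in Python, under stated assumptions: the
"decompress" step is skipped (random bytes stand in for the decompressed
text), undecodable bytes are replaced with U+FFFD, and screen space is
approximated as terminal cells, counting 2 for East Asian Wide or
Fullwidth characters and 1 for everything else. The names display_width
and garbage_widths are made up for illustration.

```python
import random
import unicodedata


def display_width(text):
    """Approximate width in cells: 2 for Wide/Fullwidth characters, else 1."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


def garbage_widths(n_bytes, seed=0):
    """Decode the same random bytes as ASCII and as Shift_JIS; return both widths."""
    rng = random.Random(seed)
    data = bytes(rng.getrandbits(8) for _ in range(n_bytes))
    # Bytes that are invalid in either encoding become U+FFFD (an assumption
    # of this sketch, not part of the test as described).
    as_ascii = data.decode("ascii", errors="replace")
    as_sjis = data.decode("shift_jis", errors="replace")
    return display_width(as_ascii), display_width(as_sjis)


if __name__ == "__main__":
    ascii_cells, sjis_cells = garbage_widths(10_000)
    print("ASCII cells:", ascii_cells)
    print("SJIS cells: ", sjis_cells)
```

The cell count is only a rough proxy for space on a real screen, which
also depends on font and layout.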

You don't like this test either?

Well, as I said in my post, there's just no pleasing some people.

> Now even if we are going to be mushy with our categories and
> say we are comparing the spatial characteristics of the two
> languages in general, then Japanese will win out in many, if not most, 
> cases.  Kanji obey an entirely different (and more efficient) spatial
> logic than devices for spelling.  I think you will not find many
> examples where the kanji takes up more space than the English
> word, supposing that a typical kanji is roughly the same space
> as 2 or 3 roman characters.  But then again there is no point
> in comparing love with 愛 since love could have been l'amour.

If you go word by word, you can find all kinds of Japanese
miracles.  Especially if you pick your words carefully.
(Do you know there's a word for "in the palm of one's
hand"?  "Tanagokoro".  And a SINGLE CHARACTER for
it?  I love this one.)

Go sentence by sentence, however, and these differences
dwindle.  Do as I did, and measure the most useful units
of meaning in running text (paragraphs), and somehow
they come out about the same. At least they do in media
that have been optimized for text layout for the respective
languages, please note.  But why compare any other way?

If it's not about meaning, what is it about?

> I feel like I am missing something here. 

Well, try reading my crack-addled post again.  I really
was saying something there, drug-induced or not.
Paul Lester thought so, anyway.  (Hmm, a guy who
asks god@heaven about the fate of his dead pet
catfish?  Maybe I'm grasping at straws. :-)

> か and ka - - sure the kana is a lot bigger than
> a single letter, but it is not equivalent
> to one letter.  It is a complex (actually two) phoneme.

Yeah, but look at a common use of "ka": question marking.
In English text, that's just one character.  (In English
speech, it's not even phonetic--it's one of several
possible tones, and understood from context.)

Voila!  English in roman characters wins!!  By a factor
of TWO!

You can prove anything with isolated examples.

(Though in this case I didn't, because a space
should follow the '?', shouldn't it :-( )

> Another very good point brought up earlier by
> Curt S. is that the roman set requires spaces
> between words.  Dead, wasteful, space
> where nothing happens.

SOMETHING is happening, because English used to
be written without spaces--and in larger characters,
taking up more space on the page.

Break sentences (meanings) up, spatially, in words,
and suddenly you can cue off word-*shape* much
more easily.  And you can read text written in
smaller characters, because you don't need to
see each character perfectly.

I'll say that again (while you squint):

Yeo dan't noed te sae auch cheroctur porfuctly

(OCR preclassifiers have been using
this for a decade or more.)
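
The squint test can be sketched as a crude word-shape signature in
Python. The three letter classes here (ascender, descender, x-height)
are an assumption for illustration, not how any particular OCR
preclassifier works; word_shape is a made-up name.

```python
ASCENDERS = set("bdfhklt")   # letters that rise above the x-height
DESCENDERS = set("gjpqy")    # letters that drop below the baseline


def word_shape(text):
    """Map each letter to A (tall), D (descending) or x; keep other characters."""
    out = []
    for ch in text:
        if ch.isupper() or ch in ASCENDERS:
            out.append("A")
        elif ch in DESCENDERS:
            out.append("D")
        elif ch.isalpha():
            out.append("x")
        else:
            out.append(ch)  # spaces and punctuation pass through unchanged
    return "".join(out)


intended = "You don't need to see each character perfectly"
garbled = "Yeo dan't noed te sae auch cheroctur porfuctly"

if __name__ == "__main__":
    print(word_shape(intended))
    print(word_shape(garbled))
    print(word_shape(intended) == word_shape(garbled))
```

Both sentences reduce to the same signature, because every swapped
letter stays within its shape class and the word breaks are unchanged.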

Japanese doesn't need this breakup (so much,
anyway) because the alternation of kana and
kanji help distinguish beginnings and endings.

Chinese doesn't need word-spacing so much
(despite being all kanji) because it's an isolating
language--it's almost like each character IS a
word, already.

And anyway, in both Japanese and Chinese,
the shape-recognition mostly takes place
*within* the character rectangle.  You *do*
need to see each character pretty well,
in these writing systems.  So the characters
tend to be bigger, naturally.

You have to look at the whole perceptual
and linguistic picture with writing systems.
Atomistic analysis doesn't work.  If it worked,
why did I find what I found with my Emacs
parallel corpora experiment?

OK--to be keitai-related here, because I
promised:

The real question is NOT "is Japanese text more
economical [for both input and display] than
English text."

The real question is: "is Japanese text, as
it's handled in Japan on keitai, more economical,
given the tiny screens and thumb-driven input
systems?"

For display, I'd say yes, but only as an historical
accident related to how *uneconomical* Japanese
is in certain respects, giving rise to a wider
set of "idiomatic" abbreviations.

One of these is stroke count.  The other is
Confucian bureaucracy.  In Kanji, the stroke
count will be higher than in roman characters,
for an equivalent meaning.  Confucian bureaucracy
means, yes, more blanks to fill in, and more
"meta-data" crammed onto the page.

So, for example, "name".  8 strokes in English
writing.  But *15* for Japanese "名前".  These
aren't even terribly complex kanji, as kanji go.
But, "名" often suffices, in context.  6 strokes.

Japanese has tons of these.  They're great
for keitai screens.  But they're not enough--
it's idiomatic, not systematic, abbreviation.
Why do they have all those emoji and other
icons on i-mode phones, if you can do so
much with plain kanji?

On the input side: maybe.  But note that
interfaces like Tegic's are pretty fast, too.
It's just that there's not much precedent
for them.

There was kanji henkan in the wapuro world
for two decades before keitai.  In the West,
we've never really had an equivalent interface.
It's not apples and oranges--it's more
like we, in (from) the West, have never had
apples in the first place.

-michael turner
leap@gol.com
