Hironori Sakamoto <hsaka@mth.biglobe.ne.jp> wrote:
> >> > * UTF-8 is not used as display coding system.
> >> This should be added.
>
> Because w3m is a text browser, it needs to know how many columns
> (1 or 2, ...) a character occupies before displaying it.
Yes. Supporting characters of different width in a character cell
application is a pain. I was under the impression, though, that
this was also required for some of the various Asian character sets
w3m already supports. Handling multibyte characters in the first
place is already difficult, but this is again just a variation on
what's already required for CJK support.
> But Unicode does not define character width.
While I'm by no means an expert on Unicode, I'm quite certain that
it indeed does define character width. In fact, Markus Kuhn's "UTF-8
and Unicode FAQ for Unix/Linux" (see below) lists a C function to
implement such a check.
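
To give a rough idea of what such a check involves, here is a minimal
Python sketch. It is not Markus Kuhn's C function and has nothing to do
with w3m's code; it merely leans on the East Asian Width property and
the combining class exposed by Python's standard unicodedata module.
The names column_width and string_width and the simple three-way rule
are my own assumptions. A real implementation has to cover more cases
(control characters, zero-width format characters, and so on).

```python
import unicodedata


def column_width(ch: str) -> int:
    """Guess how many character cells a single character occupies.

    Hypothetical helper for illustration only:
    0 for combining marks, 2 for East Asian Wide/Fullwidth, 1 otherwise.
    """
    if unicodedata.combining(ch) != 0:
        return 0  # combining mark: drawn over the preceding cell
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide or fullwidth: takes two cells
    return 1      # everything else is treated as one cell


def string_width(text: str) -> int:
    """Sum the per-character widths of a string."""
    return sum(column_width(ch) for ch in text)
```

Under this rule string_width("Lem") is 3, a CJK ideograph such as
U+65E5 counts as 2, and a combining acute accent (U+0301) counts as 0.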
> For example, most Europeans and Americans think the characters in
> the code range 0xa0-0xff of Unicode/ISO-8859-1 use 1 column.
Yes.
> BUT!, on the character sets of Japan/Korea/China these use 2 columns.
Hmmm.
But this is a characteristic of those character sets, then. These
characters could be single width on a UTF-8 display and double
width on a CJK one.
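
In other words, the width of these characters can be treated as a
setting of the display rather than a fixed property of the character.
Unicode classes a number of characters in the 0xa0-0xff range, e.g.
U+00B1 PLUS-MINUS SIGN and U+00D7 MULTIPLICATION SIGN, as "ambiguous"
in East Asian width. The Python sketch below is only an illustration
of that idea; the function name and the cjk_display flag are
assumptions of mine, not anything w3m or xterm provides.

```python
import unicodedata


def ambiguous_aware_width(ch: str, cjk_display: bool) -> int:
    """Width of one character, letting the display decide ambiguous cases.

    Hypothetical helper: cjk_display=True mimics a CJK terminal that
    renders East Asian Ambiguous characters in two cells.
    """
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return 2                        # always double width
    if eaw == "A":
        return 2 if cjk_display else 1  # depends on the display
    return 1                            # narrow, halfwidth or neutral


# Show both interpretations for two ambiguous Latin-1 characters.
for ch in ("\u00b1", "\u00d7"):
    print(f"U+{ord(ch):04X}",
          ambiguous_aware_width(ch, cjk_display=False),
          ambiguous_aware_width(ch, cjk_display=True))
```

The same code point thus comes out as one column on a UTF-8 display
and as two on a CJK one, without the character set itself changing.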
> Therefore, I think UTF-8 as display coding system is not useful for w3m.
I respectfully disagree.
Let me give you a European perspective on this. Europe is linguistically
quite diverse. Nearly all of its languages require extensions to
the basic Latin alphabet, i.e. characters with accents of various
shapes, as well as occasional additional characters. There are also
some separate alphabets such as Greek and Cyrillic. As a
consequence, a whole lot of 8-bit character sets have been designed,
e.g. the ISO 8859-x series. I'm sure you already knew all this.
Now, there is a key limitation. All of those 8-bit character sets
are useful for writing in one or at most in a few languages. However,
there are many combinations of languages for which no character
sets exist. In various branches of the humanities, multilingual
documents are common. However, there is *no* 8-bit character set
that can display a document containing e.g. both French and Polish
text, or German and Russian, etc. The *only* display encoding that
combines all these character sets is Unicode/UTF-8. Somewhere under my
homepage I mention the authors Henri Charrière and Stanisław
Lem on the same HTML page. The only way to display this without
substitution characters is by way of UTF-8.
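
The limitation is easy to demonstrate. The following Python sketch
(mine, purely for illustration; the helper name encodable_in is
hypothetical) tries to encode the two names in ISO 8859-1, ISO 8859-2
and UTF-8 using the standard codecs. It assumes the French name is
spelled with an e-grave (U+00E8), which ISO 8859-2 lacks, while the
Polish l-with-stroke (U+0142) is missing from ISO 8859-1.

```python
def encodable_in(text: str, charset: str) -> bool:
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Both authors on one line, as on a single HTML page.
names = "Henri Charri\u00e8re, Stanis\u0142aw Lem"

for charset in ("iso-8859-1", "iso-8859-2", "utf-8"):
    print(charset, encodable_in(names, charset))
```

Each name fits one of the two 8-bit sets on its own, but only UTF-8
can hold both at once.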
Another aspect of this is that from a European point of view support
for only a *subset* of Unicode will suffice to solve the most
pressing issues. In fact, some subsets already have been formally
defined. Few people in this part of the world need CJK characters,
and combining characters are only required for scientific applications
(mathematics, International Phonetic Alphabet).
Of course I cannot demand that you implement UTF-8 display capability
in w3m, but I assure you that it would indeed be very useful, even
if limited.
> I think that once I get a terminal emulator which fully supports
> Unicode, and full Unicode fonts, I may start supporting display
> with UTF-8.
Well, what is "full" Unicode support? Double-width characters?
Combining characters? Mixing left-to-right and right-to-left writing?
Automatic linking and breaking of ligatures?
I'll conclude this lengthy message by pointing you to some resources
regarding Unicode/UTF-8 under Unix/X11.
[1] As I mentioned previously, generic xterm is growing Unicode/UTF-8
support. You can get the current version at
http://www.clark.net/pub/dickey/xterm/xterm.html
[2] As a perfect companion to the updated xterm, Markus Kuhn has
created extended Unicode versions of the classic "-misc-fixed-*"
X11 fonts:
http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html
[3] For FreeBSD, I also have proper ports of both the new xterm
and the ucs-fonts, which unfortunately haven't been committed yet.
Interested parties can get them from my home page:
http://home.pages.de/~naddy/unix/index.html#FreeBSD
[4] An excellent resource is Markus Kuhn's "UTF-8 and Unicode
FAQ for Unix/Linux":
http://www.cl.cam.ac.uk/~mgk25/unicode.html
[5] Another is Bruno Haible's "Unicode HOWTO":
ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html
[6] Finally, the linux-utf8 mailing list at majordomo@nl.linux.org
is very helpful for discussing implementation issues. Despite the
Linux name it is also applicable to (at least) other Unix operating
systems.
-- Christian "naddy" Weisgerber naddy@mips.rhein-neckar.de