summaryrefslogtreecommitdiffstats
path: root/man7/charsets.7
diff options
context:
space:
mode:
Diffstat (limited to 'man7/charsets.7')
-rw-r--r--man7/charsets.7322
1 files changed, 322 insertions, 0 deletions
diff --git a/man7/charsets.7 b/man7/charsets.7
new file mode 100644
index 000000000..83fb41f90
--- /dev/null
+++ b/man7/charsets.7
@@ -0,0 +1,322 @@
+.\" Copyright (c) 1996 Eric S. Raymond <esr@thyrsus.com>
+.\" and Andries Brouwer <aeb@cwi.nl>
+.\"
+.\" This is free documentation; you can redistribute it and/or
+.\" modify it under the terms of the GNU General Public License as
+.\" published by the Free Software Foundation; either version 2 of
+.\" the License, or (at your option) any later version.
+.\"
+.\" This is combined from many sources, including notes by aeb and
+.\" research by esr. Portions derive from a writeup by Roman Czyborra.
+.\"
+.\" Last changed by David Starner <dstarner98@aasaa.ofe.org>.
+.TH CHARSETS 7 2001-05-07 "Linux" "Linux Programmer's Manual"
+.SH NAME
+charsets \- programmer's view of character sets and internationalization
+.SH DESCRIPTION
+Linux is an international operating system. Various of its utilities
+and device drivers (including the console driver) support multilingual
+character sets including Latin-alphabet letters with diacritical
+marks, accents, ligatures, and entire non-Latin alphabets including
+Greek, Cyrillic, Arabic, and Hebrew.
+.LP
+This manual page presents a programmer's-eye view of different
+character-set standards and how they fit together on Linux. Standards
+discussed include ASCII, ISO 8859, KOI8-R, Unicode, ISO 2022 and
+ISO 4873. The primary emphasis is on character sets actually used as
+locale character sets, not the myriad others that can be found in data
+from other systems.
+.LP
+A complete list of charsets used in a officially supported locale in glibc
+2.2.3 is: ISO-8859-{1,2,3,5,6,7,8,9,13,15}, CP1251, UTF-8, EUC-{KR,JP,TW},
+KOI8-{R,U}, GB2312, GB18030, GBK, BIG5, BIG5-HKSCS and TIS-620 (in no
+particular order.) (Romanian may be switching to ISO-8859-16.)
+
+.SH ASCII
+ASCII (American Standard Code For Information Interchange) is the original
+7-bit character set, originally designed for American English. It is
+currently described by the ECMA-6 standard.
+.LP
+Various ASCII variants replacing the dollar sign with other currency
+symbols and replacing punctuation with non-English alphabetic characters
+to cover German, French, Spanish and others in 7 bits exist. All are
+deprecated; GNU libc doesn't support locales whose character sets aren't
+true supersets of ASCII. (These sets are also known as ISO-646, a close
+relative of ASCII that permitted replacing these characters.)
+.LP
+As Linux was written for hardware designed in the US, it natively
+supports ASCII.
+
+.SH ISO 8859
+ISO 8859 is a series of 15 8-bit character sets all of which have US
+ASCII in their low (7-bit) half, invisible control characters in
+positions 128 to 159, and 96 fixed-width graphics in positions 160-255.
+.LP
+Of these, the most important is ISO 8859-1 (Latin-1). It is natively
+supported in the Linux console driver, fairly well supported in X11R6,
+and is the base character set of HTML.
+.LP
+Console support for the other 8859 character sets is available under
+Linux through user-mode utilities (such as
+.BR setfont (8))
+.\" // some distributions still have the deprecated consolechars
+that modify keyboard bindings and the EGA graphics
+table and employ the "user mapping" font table in the console
+driver.
+.LP
+Here are brief descriptions of each set:
+.TP
+8859-1 (Latin-1)
+Latin-1 covers most Western European languages such as Albanian, Catalan,
+Danish, Dutch, English, Faroese, Finnish, French, German, Galician,
+Irish, Icelandic, Italian, Norwegian, Portuguese, Spanish, and
+Swedish. The lack of the ligatures Dutch ij, French oe and old-style
+,,German`` quotation marks is considered tolerable.
+.TP
+8859-2 (Latin-2)
+Latin-2 supports most Latin-written Slavic and Central European
+languages: Croatian, Czech, German, Hungarian, Polish, Rumanian,
+Slovak, and Slovene.
+.TP
+8859-3 (Latin-3)
+Latin-3 is popular with authors of Esperanto, Galician, and Maltese.
+(Turkish is now written with 8859-9 instead.)
+.TP
+8859-4 (Latin-4)
+Latin-4 introduced letters for Estonian, Latvian, and Lithuanian. It
+is essentially obsolete; see 8859-10 (Latin-6) and 8859-13 (Latin-7).
+.TP
+8859-5
+Cyrillic letters supporting Bulgarian, Byelorussian, Macedonian,
+Russian, Serbian and Ukrainian. Ukrainians read the letter `ghe'
+with downstroke as `heh' and would need a ghe with upstroke to write a
+correct ghe. See the discussion of KOI8-R below.
+.TP
+8859-6
+Supports Arabic. The 8859-6 glyph table is a fixed font of separate
+letter forms, but a proper display engine should combine these
+using the proper initial, medial, and final forms.
+.TP
+8859-7
+Supports Modern Greek.
+.TP
+8859-8
+Supports modern Hebrew without niqud (punctuation signs). Niqud
+and full-fledged Biblical Hebrew are outside the scope of this
+character set; under Linux, UTF-8 is the preferred encoding for
+these.
+.TP
+8859-9 (Latin-5)
+This is a variant of Latin-1 that replaces Icelandic letters with
+Turkish ones.
+.TP
+8859-10 (Latin-6)
+Latin 6 adds the last Inuit (Greenlandic) and Sami (Lappish) letters
+that were missing in Latin 4 to cover the entire Nordic area. RFC
+1345 listed a preliminary and different `latin6'. Skolt Sami still
+needs a few more accents than these.
+.TP
+8859-11
+This only exists as a rejected draft standard. The draft standard
+was identical to TIS-620, which is used under Linux for Thai.
+.TP
+8859-12
+This set does not exist. While Vietnamese has been suggested for this
+space, it does not fit within the 96 (non-combining) characters ISO
+8859 offers. UTF-8 is the preferred character set for Vietnamese use
+under Linux.
+.TP
+8859-13 (Latin-7)
+Supports the Baltic Rim languages; in particular, it includes Latvian
+characters not found in Latin-4.
+.TP
+8859-14 (Latin-8)
+This is the Celtic character set, covering Gaelic and Welsh.
+This charset also contains the dotted characters needed for Old Irish.
+.TP
+8859-15 (Latin-9)
+This adds the Euro sign and French and Finnish letters that were missing in
+Latin-1.
+.TP
+8859-16 (Latin-10)
+This set covers many of the languages covered by 8859-2, and supports
+Romanian more completely then that set does.
+.SH KOI8-R
+KOI8-R is a non-ISO character set popular in Russia. The lower half
+is US ASCII; the upper is a Cyrillic character set somewhat better
+designed than ISO 8859-5. KOI8-U is a common character set, based off
+KOI8-R, that has better support for Ukrainian. Neither of these sets
+are ISO-2022 compatible, unlike the ISO-8859 series.
+.LP
+Console support for KOI8-R is available under Linux through user-mode
+utilities that modify keyboard bindings and the EGA graphics table,
+and employ the "user mapping" font table in the console driver.
+
+.\" Thanks to Tomohiro KUBOTA for the following sections about
+.\" national standards.
+.SH JIS X 0208
+JIS X 0208 is a Japanese national standard character set. Though
+there are some more Japanese national standard character sets (like
+JIS X 0201, JIS X 0212, and JIS X 0213), this is the most important
+one. Characters are mapped into a 94x94 two-byte matrix,
+whose each byte is in the range 0x21-0x7e. Note that JIS X 0208
+is a character set, not an encoding. This means that JIS X 0208
+itself is not used for expressing text data. JIS X 0208 is used
+as a component to construct encodings such as EUC-JP, Shift_JIS,
+and ISO-2022-JP. EUC-JP is the most important encoding for Linux
+and includes US ASCII and JIS X 0208. In EUC-JP, JIS X 0208
+characters are expressed in two bytes, each of which is the
+JIS X 0208 code plus 0x80.
+
+.SH KS X 1001
+KS X 1001 is a Korean national standard character set. Just as
+JIS X 0208, characters are mapped into a 94x94 two-byte matrix.
+KS X 1001 is used like JIS X 0208, as a component
+to construct encodings such as EUC-KR, Johab, and ISO-2022-KR.
+EUC-KR is the most important encoding for Linux and includes
+US ASCII and KS X 1001. KS C 5601 is an older name for KS X 1001.
+
+.SH GB 2312
+GB 2312 is a mainland Chinese national standard character set used
+to express simplified Chinese. Just like JIS X 0208, characters are
+mapped into a 94x94 two-byte matrix used to construct EUC-CN. EUC-CN
+is the most important encoding for Linux and includes US ASCII and
+GB 2312. Note that EUC-CN is often called as GB, GB 2312, or CN-GB.
+
+.SH Big5
+Big5 is a popular character set in Taiwan to express traditional
+Chinese. (Big5 is both a character set and an encoding.) It is a
+superset of US ASCII. Non-ASCII characters are expressed in two
+bytes. Bytes 0xa1-0xfe are used as leading bytes for two-byte
+characters. Big5 and its extension is widely used in Taiwan and Hong
+Kong. It is not ISO 2022-compliant.
+
+.SH TIS 620
+TIS 620 is a Thai national standard character set and a superset
+of US ASCII. Like ISO 8859 series, Thai characters are mapped into
+0xa1-0xfe. TIS 620 is the only commonly used character set under
+Linux besides UTF-8 to have combining characters.
+
+.SH UNICODE
+Unicode (ISO 10646) is a standard which aims to unambiguously represent every
+character in every human language. Unicode's structure permits 20.1 bits
+to encode every character. Since most computers don't include 20.1-bit
+integers, Unicode is usually encoded as 32-bit integers internally and
+either a series of 16-bit integers (UTF-16) (needing two 16-bit integers
+only when encoding certain rare characters) or a series of 8-bit bytes
+(UTF-8). Information on Unicode is available at <http://www.unicode.com>.
+.LP
+Linux represents Unicode using the 8-bit Unicode Transformation Format
+(UTF-8). UTF-8 is a variable length encoding of Unicode. It uses 1
+byte to code 7 bits, 2 bytes for 11 bits, 3 bytes for 16 bits, 4 bytes
+for 21 bits, 5 bytes for 26 bits, 6 bytes for 31 bits.
+.LP
+Let 0,1,x stand for a zero, one, or arbitrary bit. A byte 0xxxxxxx
+stands for the Unicode 00000000 0xxxxxxx which codes the same symbol
+as the ASCII 0xxxxxxx. Thus, ASCII goes unchanged into UTF-8, and
+people using only ASCII do not notice any change: not in code, and not
+in file size.
+.LP
+A byte 110xxxxx is the start of a 2-byte code, and 110xxxxx 10yyyyyy
+is assembled into 00000xxx xxyyyyyy. A byte 1110xxxx is the start
+of a 3-byte code, and 1110xxxx 10yyyyyy 10zzzzzz is assembled
+into xxxxyyyy yyzzzzzz.
+(When UTF-8 is used to code the 31-bit ISO 10646
+then this progression continues up to 6-byte codes.)
+.LP
+For most people who use ISO-8859 character sets, this means that the
+characters outside of ASCII are now coded with two bytes. This tends
+to expand ordinary text files by only one or two percent. For Russian
+or Greek users, this expands ordinary text files by 100%, since text in
+those languages is mostly outside of ASCII. For Japanese users this means
+that the 16-bit codes now in common use will take three bytes. While there
+are algorithmic conversions from some character sets (esp. ISO-8859-1) to
+Unicode, general conversion requires carrying around conversion tables,
+which can be quite large for 16-bit codes.
+.LP
+Note that UTF-8 is self-synchronizing: 10xxxxxx is a tail, any other
+byte is the head of a code. Note that the only way ASCII bytes occur
+in a UTF-8 stream, is as themselves. In particular, there are no
+embedded NULs or '/'s that form part of some larger code.
+.LP
+Since ASCII, and, in particular, NUL and '/', are unchanged, the
+kernel does not notice that UTF-8 is being used. It does not care at
+all what the bytes it is handling stand for.
+.LP
+Rendering of Unicode data streams is typically handled through
+`subfont' tables which map a subset of Unicode to glyphs. Internally
+the kernel uses Unicode to describe the subfont loaded in video RAM.
+This means that in UTF-8 mode one can use a character set with 512
+different symbols. This is not enough for Japanese, Chinese and
+Korean, but it is enough for most other purposes.
+.LP
+At the current time, the console driver does not handle combining
+characters. So Thai, Sioux and any other script needing combining
+characters can't be handled on the console.
+
+.SH "ISO 2022 AND ISO 4873"
+The ISO 2022 and 4873 standards describe a font-control model
+based on VT100 practice. This model is (partially) supported
+by the Linux kernel and by
+.BR xterm (1).
+It is popular in Japan and Korea.
+.LP
+There are 4 graphic character sets, called G0, G1, G2 and G3,
+and one of them is the current character set for codes with
+high bit zero (initially G0), and one of them is the current
+character set for codes with high bit one (initially G1).
+Each graphic character set has 94 or 96 characters, and is
+essentially a 7-bit character set. It uses codes either
+040-0177 (041-0176) or 0240-0377 (0241-0376).
+G0 always has size 94 and uses codes 041-0176.
+.LP
+Switching between character sets is done using the shift functions
+^N (SO or LS1), ^O (SI or LS0), ESC n (LS2), ESC o (LS3),
+ESC N (SS2), ESC O (SS3), ESC ~ (LS1R), ESC } (LS2R), ESC | (LS3R).
+The function LS\fIn\fP makes character set G\fIn\fP the current one
+for codes with high bit zero.
+The function LS\fIn\fPR makes character set G\fIn\fP the current one
+for codes with high bit one.
+The function SS\fIn\fP makes character set G\fIn\fP (\fIn\fP=2 or 3)
+the current one for the next character only (regardless of the value
+of its high order bit).
+.LP
+A 94-character set is designated as G\fIn\fP character set
+by an escape sequence ESC ( xx (for G0), ESC ) xx (for G1),
+ESC * xx (for G2), ESC + xx (for G3), where xx is a symbol
+or a pair of symbols found in the ISO 2375 International
+Register of Coded Character Sets.
+For example, ESC ( @ selects the ISO 646 character set as G0,
+ESC ( A selects the UK standard character set (with pound
+instead of number sign), ESC ( B selects ASCII (with dollar
+instead of currency sign), ESC ( M selects a character set
+for African languages, ESC ( ! A selects the Cuban character
+set, etc. etc.
+.LP
+A 96-character set is designated as G\fIn\fP character set
+by an escape sequence ESC - xx (for G1), ESC . xx (for G2)
+or ESC / xx (for G3).
+For example, ESC - G selects the Hebrew alphabet as G1.
+.LP
+A multibyte character set is designated as G\fIn\fP character set
+by an escape sequence ESC $ xx or ESC $ ( xx (for G0),
+ESC $ ) xx (for G1), ESC $ * xx (for G2), ESC $ + xx (for G3).
+For example, ESC $ ( C selects the Korean character set for G0.
+The Japanese character set selected by ESC $ B has a more
+recent version selected by ESC & @ ESC $ B.
+.LP
+ISO 4873 stipulates a narrower use of character sets, where G0
+is fixed (always ASCII), so that G1, G2 and G3
+can only be invoked for codes with the high order bit set.
+In particular, ^N and ^O are not used anymore, ESC ( xx
+can be used only with xx=B, and ESC ) xx, ESC * xx, ESC + xx
+are equivalent to ESC - xx, ESC . xx, ESC / xx, respectively.
+
+.SH "SEE ALSO"
+.BR console (4),
+.BR console_codes (4),
+.BR console_ioctl (4),
+.BR ascii (7),
+.BR iso_8859-1 (7),
+.BR unicode (7),
+.BR utf-8 (7)