summaryrefslogtreecommitdiffstats
path: root/man7/unicode.7
diff options
context:
space:
mode:
Diffstat (limited to 'man7/unicode.7')
-rw-r--r--man7/unicode.7246
1 files changed, 0 insertions, 246 deletions
diff --git a/man7/unicode.7 b/man7/unicode.7
deleted file mode 100644
index 6bdf0bad8..000000000
--- a/man7/unicode.7
+++ /dev/null
@@ -1,246 +0,0 @@
-.\" Copyright (C) Markus Kuhn, 1995, 2001
-.\"
-.\" SPDX-License-Identifier: GPL-2.0-or-later
-.\"
-.\" 1995-11-26 Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>
-.\" First version written
-.\" 2001-05-11 Markus Kuhn <mgk25@cl.cam.ac.uk>
-.\" Update
-.\"
-.TH unicode 7 (date) "Linux man-pages (unreleased)"
-.SH NAME
-unicode \- universal character set
-.SH DESCRIPTION
-The international standard ISO/IEC 10646 defines the
-Universal Character Set (UCS).
-UCS contains all characters of all other character set standards.
-It also guarantees "round-trip compatibility";
-in other words,
-conversion tables can be built such that no information is lost
-when a string is converted from any other encoding to UCS and back.
-.P
-UCS contains the characters required to represent practically all
-known languages.
-This includes not only the Latin, Greek, Cyrillic,
-Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese,
-Japanese and Korean Han ideographs as well as scripts such as
-Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati,
-Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo,
-Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian,
-Ogham, Myanmar, Sinhala, Thaana, Yi, and others.
-For scripts not yet
-covered, research on how to best encode them for computer usage is
-still going on and they will be added eventually.
-This might
-eventually include not only Hieroglyphs and various historic
-Indo-European languages, but even some selected artistic scripts such
-as Tengwar, Cirth, and Klingon.
-UCS also covers a large number of
-graphical, typographical, mathematical, and scientific symbols,
-including those provided by TeX, Postscript, APL, MS-DOS, MS-Windows,
-Macintosh, OCR fonts, as well as many word processing and publishing
-systems, and more are being added.
-.P
-The UCS standard (ISO/IEC 10646) describes a
-31-bit character set architecture
-consisting of 128 24-bit
-.IR groups ,
-each divided into 256 16-bit
-.I planes
-made up of 256 8-bit
-.I rows
-with 256
-.I column
-positions, one for each character.
-Part 1 of the standard (ISO/IEC 10646-1)
-defines the first 65534 code positions (0x0000 to 0xfffd), which form
-the
-.I Basic Multilingual Plane
-(BMP), that is plane 0 in group 0.
-Part 2 of the standard (ISO/IEC 10646-2)
-adds characters to group 0 outside the BMP in several
-.I "supplementary planes"
-in the range 0x10000 to 0x10ffff.
-There are no plans to add characters
-beyond 0x10ffff to the standard, therefore of the entire code space,
-only a small fraction of group 0 will ever be actually used in the
-foreseeable future.
-The BMP contains all characters found in the
-commonly used other character sets.
-The supplemental planes added by
-ISO/IEC 10646-2 cover only more exotic characters for special scientific,
-dictionary printing, publishing industry, higher-level protocol and
-enthusiast needs.
-.P
-The representation of each UCS character as a 2-byte word is referred
-to as the UCS-2 form (only for BMP characters),
-whereas UCS-4 is the representation of each character by a 4-byte word.
-In addition, there exist two encoding forms UTF-8
-for backward compatibility with ASCII processing software and UTF-16
-for the backward-compatible handling of non-BMP characters up to
-0x10ffff by UCS-2 software.
-.P
-The UCS characters 0x0000 to 0x007f are identical to those of the
-classic US-ASCII
-character set and the characters in the range 0x0000 to 0x00ff
-are identical to those in
-ISO/IEC\~8859-1 (Latin-1).
-.SS Combining characters
-Some code points in UCS
-have been assigned to
-.IR "combining characters" .
-These are similar to the nonspacing accent keys on a typewriter.
-A combining character just adds an accent to the previous character.
-The most important accented characters have codes of their own in UCS,
-however, the combining character mechanism allows us to add accents
-and other diacritical marks to any character.
-The combining characters
-always follow the character which they modify.
-For example, the German
-character Umlaut-A ("Latin capital letter A with diaeresis") can
-either be represented by the precomposed UCS code 0x00c4, or
-alternatively as the combination of a normal "Latin capital letter A"
-followed by a "combining diaeresis": 0x0041 0x0308.
-.P
-Combining characters are essential for instance for encoding the Thai
-script or for mathematical typesetting and users of the International
-Phonetic Alphabet.
-.SS Implementation levels
-As not all systems are expected to support advanced mechanisms like
-combining characters, ISO/IEC 10646-1 specifies the following three
-.I implementation levels
-of UCS:
-.TP 0.9i
-Level 1
-Combining characters and Hangul Jamo
-(a variant encoding of the Korean script, where a Hangul syllable
-glyph is coded as a triplet or pair of vowel/consonant codes) are not
-supported.
-.TP
-Level 2
-In addition to level 1, combining characters are now allowed for some
-languages where they are essential (e.g., Thai, Lao, Hebrew,
-Arabic, Devanagari, Malayalam).
-.TP
-Level 3
-All UCS characters are supported.
-.P
-The Unicode 3.0 Standard
-published by the Unicode Consortium
-contains exactly the UCS Basic Multilingual Plane
-at implementation level 3, as described in ISO/IEC 10646-1:2000.
-Unicode 3.1 added the supplemental planes of ISO/IEC 10646-2.
-The Unicode standard and
-technical reports published by the Unicode Consortium provide much
-additional information on the semantics and recommended usages of
-various characters.
-They provide guidelines and algorithms for
-editing, sorting, comparing, normalizing, converting, and displaying
-Unicode strings.
-.SS Unicode under Linux
-Under GNU/Linux, the C type
-.I wchar_t
-is a signed 32-bit integer type.
-Its values are always interpreted
-by the C library as UCS
-code values (in all locales), a convention that is signaled by the GNU
-C library to applications by defining the constant
-.B __STDC_ISO_10646__
-as specified in the ISO C99 standard.
-.P
-UCS/Unicode can be used just like ASCII in input/output streams,
-terminal communication, plaintext files, filenames, and environment
-variables in the ASCII compatible UTF-8 multibyte encoding.
-To signal the use of UTF-8 as the character
-encoding to all applications, a suitable
-.I locale
-has to be selected via environment variables (e.g.,
-"LANG=en_GB.UTF-8").
-.P
-The
-.B nl_langinfo(CODESET)
-function returns the name of the selected encoding.
-Library functions such as
-.BR wctomb (3)
-and
-.BR mbsrtowcs (3)
-can be used to transform the internal
-.I wchar_t
-characters and strings into the system character encoding and back
-and
-.BR wcwidth (3)
-tells how many positions (0\[en]2) the cursor is advanced by the
-output of a character.
-.SS Private Use Areas (PUA)
-In the Basic Multilingual Plane,
-the range 0xe000 to 0xf8ff will never be assigned to any characters by
-the standard and is reserved for private usage.
-For the Linux
-community, this private area has been subdivided further into the
-range 0xe000 to 0xefff which can be used individually by any end-user
-and the Linux zone in the range 0xf000 to 0xf8ff where extensions are
-coordinated among all Linux users.
-The registry of the characters
-assigned to the Linux zone is maintained by LANANA and the registry
-itself is
-.I Documentation/admin\-guide/unicode.rst
-in the Linux kernel sources
-.\" commit 9d85025b0418163fae079c9ba8f8445212de8568
-(or
-.I Documentation/unicode.txt
-before Linux 4.10).
-.P
-Two other planes are reserved for private usage, plane 15
-(Supplementary Private Use Area-A, range 0xf0000 to 0xffffd)
-and plane 16 (Supplementary Private Use Area-B, range
-0x100000 to 0x10fffd).
-.SS Literature
-.IP \[bu] 3
-Information technology \[em] Universal Multiple-Octet Coded Character
-Set (UCS) \[em] Part 1: Architecture and Basic Multilingual Plane.
-International Standard ISO/IEC 10646-1, International Organization
-for Standardization, Geneva, 2000.
-.IP
-This is the official specification of UCS.
-Available from
-.UR http://www.iso.ch/
-.UE .
-.IP \[bu]
-The Unicode Standard, Version 3.0.
-The Unicode Consortium, Addison-Wesley,
-Reading, MA, 2000, ISBN 0-201-61633-5.
-.IP \[bu]
-S.\& Harbison, G.\& Steele. C: A Reference Manual. Fourth edition,
-Prentice Hall, Englewood Cliffs, 1995, ISBN 0-13-326224-3.
-.IP
-A good reference book about the C programming language.
-The fourth
-edition covers the 1994 Amendment 1 to the ISO C90 standard, which
-adds a large number of new C library functions for handling wide and
-multibyte character encodings, but it does not yet cover ISO C99,
-which improved wide and multibyte character support even further.
-.IP \[bu]
-Unicode Technical Reports.
-.RS
-.UR http://www.unicode.org\:/reports/
-.UE
-.RE
-.IP \[bu]
-Markus Kuhn: UTF-8 and Unicode FAQ for UNIX/Linux.
-.RS
-.UR http://www.cl.cam.ac.uk\:/\[ti]mgk25\:/unicode.html
-.UE
-.RE
-.IP \[bu]
-Bruno Haible: Unicode HOWTO.
-.RS
-.UR http://www.tldp.org\:/HOWTO\:/Unicode\-HOWTO.html
-.UE
-.RE
-.\" .SH AUTHOR
-.\" Markus Kuhn <mgk25@cl.cam.ac.uk>
-.SH SEE ALSO
-.BR locale (1),
-.BR setlocale (3),
-.BR charsets (7),
-.BR utf\-8 (7)