summaryrefslogtreecommitdiffstats
path: root/man/man7/utf-8.7
diff options
context:
space:
mode:
Diffstat (limited to 'man/man7/utf-8.7')
-rw-r--r--man/man7/utf-8.7205
1 files changed, 205 insertions, 0 deletions
diff --git a/man/man7/utf-8.7 b/man/man7/utf-8.7
new file mode 100644
index 000000000..d6f539ab2
--- /dev/null
+++ b/man/man7/utf-8.7
@@ -0,0 +1,205 @@
+.\" Copyright (C) Markus Kuhn, 1996, 2001
+.\"
+.\" SPDX-License-Identifier: GPL-2.0-or-later
+.\"
+.\" 1995-11-26 Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>
+.\" First version written
+.\" 2001-05-11 Markus Kuhn <mgk25@cl.cam.ac.uk>
+.\" Update
+.\"
+.TH UTF-8 7 (date) "Linux man-pages (unreleased)"
+.SH NAME
+UTF-8 \- an ASCII compatible multibyte Unicode encoding
+.SH DESCRIPTION
+The Unicode 3.0 character set occupies a 16-bit code space.
+The most obvious
+Unicode encoding (known as UCS-2)
+consists of a sequence of 16-bit words.
+Such strings can contain\[em]as part of many 16-bit characters\[em]bytes
+such as \[aq]\e0\[aq] or \[aq]/\[aq], which have a
+special meaning in filenames and other C library function arguments.
+In addition, the majority of UNIX tools expect ASCII files and can't
+read 16-bit words as characters without major modifications.
+For these reasons,
+UCS-2 is not a suitable external encoding of Unicode
+in filenames, text files, environment variables, and so on.
+The ISO/IEC 10646 Universal Character Set (UCS),
+a superset of Unicode, occupies an even larger code
+space\[em]31\ bits\[em]and the obvious
+UCS-4 encoding for it (a sequence of 32-bit words) has the same problems.
+.P
+The UTF-8 encoding of Unicode and UCS
+does not have these problems and is the common way in which
+Unicode is used on UNIX-style operating systems.
+.SS Properties
+The UTF-8 encoding has the following nice properties:
+.IP \[bu] 3
+UCS
+characters 0x00000000 to 0x0000007f (the classic US-ASCII
+characters) are encoded simply as bytes 0x00 to 0x7f (ASCII
+compatibility).
+This means that files and strings which contain only
+7-bit ASCII characters have the same encoding under both
+ASCII
+and
+UTF-8.
+.IP \[bu]
+All UCS characters greater than 0x7f are encoded as a multibyte sequence
+consisting only of bytes in the range 0x80 to 0xfd, so no ASCII
+byte can appear as part of another character and there are no
+problems with, for example, \[aq]\e0\[aq] or \[aq]/\[aq].
+.IP \[bu]
+The lexicographic sorting order of UCS-4 strings is preserved.
+.IP \[bu]
+All possible 2\[ha]31 UCS codes can be encoded using UTF-8.
+.IP \[bu]
+The bytes 0xc0, 0xc1, 0xfe, and 0xff are never used in the UTF-8 encoding.
+.IP \[bu]
+The first byte of a multibyte sequence which represents a single non-ASCII
+UCS character is always in the range 0xc2 to 0xfd and indicates how long
+this multibyte sequence is.
+All further bytes in a multibyte sequence
+are in the range 0x80 to 0xbf.
+This allows easy resynchronization and
+makes the encoding stateless and robust against missing bytes.
+.IP \[bu]
+UTF-8 encoded UCS characters may be up to six bytes long, however the
+Unicode standard specifies no characters above 0x10ffff, so Unicode characters
+can be only up to four bytes long in
+UTF-8.
+.SS Encoding
+The following byte sequences are used to represent a character.
+The sequence to be used depends on the UCS code number of the character:
+.TP
+0x00000000 \- 0x0000007F:
+.RI 0 xxxxxxx
+.TP
+0x00000080 \- 0x000007FF:
+.RI 110 xxxxx
+.RI 10 xxxxxx
+.TP
+0x00000800 \- 0x0000FFFF:
+.RI 1110 xxxx
+.RI 10 xxxxxx
+.RI 10 xxxxxx
+.TP
+0x00010000 \- 0x001FFFFF:
+.RI 11110 xxx
+.RI 10 xxxxxx
+.RI 10 xxxxxx
+.RI 10 xxxxxx
+.TP
+0x00200000 \- 0x03FFFFFF:
+.RI 111110 xx
+.RI 10 xxxxxx
+.RI 10 xxxxxx
+.RI 10 xxxxxx
+.RI 10 xxxxxx
+.TP
+0x04000000 \- 0x7FFFFFFF:
+.RI 1111110 x
+.RI 10 xxxxxx
+.RI 10 xxxxxx
+.RI 10 xxxxxx
+.RI 10 xxxxxx
+.RI 10 xxxxxx
+.P
+The
+.I xxx
+bit positions are filled with the bits of the character code number in
+binary representation, most significant bit first (big-endian).
+Only the shortest possible multibyte sequence
+which can represent the code number of the character can be used.
+.P
+The UCS code values 0xd800\[en]0xdfff (UTF-16 surrogates) as well as 0xfffe and
+0xffff (UCS noncharacters) should not appear in conforming UTF-8 streams.
+According to RFC 3629 no point above U+10FFFF should be used,
+which limits characters to four bytes.
+.SS Example
+The Unicode character 0xa9 = 1010 1001 (the copyright sign) is encoded
+in UTF-8 as
+.P
+.RS
+11000010 10101001 = 0xc2 0xa9
+.RE
+.P
+and character 0x2260 = 0010 0010 0110 0000 (the "not equal" symbol) is
+encoded as:
+.P
+.RS
+11100010 10001001 10100000 = 0xe2 0x89 0xa0
+.RE
+.SS Application notes
+Users have to select a UTF-8 locale, for example with
+.P
+.RS
+export LANG=en_GB.UTF-8
+.RE
+.P
+in order to activate the UTF-8 support in applications.
+.P
+Application software that has to be aware of the used character
+encoding should always set the locale with for example
+.P
+.RS
+setlocale(LC_CTYPE, "")
+.RE
+.P
+and programmers can then test the expression
+.P
+.RS
+strcmp(nl_langinfo(CODESET), "UTF-8") == 0
+.RE
+.P
+to determine whether a UTF-8 locale has been selected and whether
+therefore all plaintext standard input and output, terminal
+communication, plaintext file content, filenames, and environment
+variables are encoded in UTF-8.
+.P
+Programmers accustomed to single-byte encodings
+such as US-ASCII or ISO/IEC\~8859
+have to be aware that two assumptions made so far are no longer valid
+in UTF-8 locales.
+Firstly, a single byte does not necessarily correspond any
+more to a single character.
+Secondly, since modern terminal emulators in UTF-8
+mode also support Chinese, Japanese, and Korean
+double-width characters as well as nonspacing combining characters,
+outputting a single character does not necessarily advance the cursor
+by one position as it did in ASCII.
+Library functions such as
+.BR mbsrtowcs (3)
+and
+.BR wcswidth (3)
+should be used today to count characters and cursor positions.
+.P
+The official ESC sequence to switch from an ISO/IEC\~2022
+encoding scheme (as used for instance by VT100 terminals) to
+UTF-8 is ESC % G
+("\ex1b%G").
+The corresponding return sequence from
+UTF-8 to ISO/IEC\~2022 is ESC % @ ("\ex1b%@").
+Other ISO/IEC\~2022 sequences (such as
+for switching the G0 and G1 sets) are not applicable in UTF-8 mode.
+.SS Security
+The Unicode and UCS standards require that producers of UTF-8
+shall use the shortest form possible, for example, producing a two-byte
+sequence with first byte 0xc0 is nonconforming.
+Unicode 3.1 has added the requirement that conforming programs must not accept
+non-shortest forms in their input.
+This is for security reasons: if
+user input is checked for possible security violations, a program
+might check only for the ASCII
+version of "/../" or ";" or NUL and overlook that there are many
+non-ASCII ways to represent these things in a non-shortest UTF-8
+encoding.
+.SS Standards
+ISO/IEC 10646-1:2000, Unicode 3.1, RFC\ 3629, Plan 9.
+.\" .SH AUTHOR
+.\" Markus Kuhn <mgk25@cl.cam.ac.uk>
+.SH SEE ALSO
+.BR locale (1),
+.BR nl_langinfo (3),
+.BR setlocale (3),
+.BR charsets (7),
+.BR unicode (7)