summaryrefslogtreecommitdiffstats
path: root/man7/utf-8.7
diff options
context:
space:
mode:
Diffstat (limited to 'man7/utf-8.7')
-rw-r--r--man7/utf-8.7285
1 files changed, 285 insertions, 0 deletions
diff --git a/man7/utf-8.7 b/man7/utf-8.7
new file mode 100644
index 000000000..e364c2e05
--- /dev/null
+++ b/man7/utf-8.7
@@ -0,0 +1,285 @@
+.\" Hey Emacs! This file is -*- nroff -*- source.
+.\"
+.\" Copyright (C) Markus Kuhn, 1996, 2001
+.\"
+.\" This is free documentation; you can redistribute it and/or
+.\" modify it under the terms of the GNU General Public License as
+.\" published by the Free Software Foundation; either version 2 of
+.\" the License, or (at your option) any later version.
+.\"
+.\" The GNU General Public License's references to "object code"
+.\" and "executables" are to be interpreted as the output of any
+.\" document formatting or typesetting system, including
+.\" intermediate and printed output.
+.\"
+.\" This manual is distributed in the hope that it will be useful,
+.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
+.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+.\" GNU General Public License for more details.
+.\"
+.\" You should have received a copy of the GNU General Public
+.\" License along with this manual; if not, write to the Free
+.\" Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111,
+.\" USA.
+.\"
+.\" 1995-11-26 Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>
+.\" First version written
+.\" 2001-05-11 Markus Kuhn <mgk25@cl.cam.ac.uk>
+.\" Update
+.\"
+.TH UTF-8 7 2001-05-11 "GNU" "Linux Programmer's Manual"
+.SH NAME
+UTF-8 \- an ASCII compatible multi-byte Unicode encoding
+.SH DESCRIPTION
+The
+.B Unicode 3.0
+character set occupies a 16-bit code space. The most obvious
+Unicode encoding (known as
+.BR UCS-2 )
+consists of a sequence of 16-bit words. Such strings can contain as
+parts of many 16-bit characters bytes like '\\0' or '/' which have a
+special meaning in filenames and other C library function parameters.
+In addition, the majority of UNIX tools expects ASCII files and can't
+read 16-bit words as characters without major modifications. For these
+reasons,
+.B UCS-2
+is not a suitable external encoding of
+.B Unicode
+in filenames, text files, environment variables, etc. The
+.BR "ISO 10646 Universal Character Set (UCS)" ,
+a superset of Unicode, occupies even a 31-bit code space and the obvious
+.B UCS-4
+encoding for it (a sequence of 32-bit words) has the same problems.
+
+The
+.B UTF-8
+encoding of
+.B Unicode
+and
+.B UCS
+does not have these problems and is the common way in which
+.B Unicode
+is used on Unix-style operating systems.
+.SH PROPERTIES
+The
+.B UTF-8
+encoding has the following nice properties:
+.TP 0.2i
+*
+.B UCS
+characters 0x00000000 to 0x0000007f (the classic
+.B US-ASCII
+characters) are encoded simply as bytes 0x00 to 0x7f (ASCII
+compatibility). This means that files and strings which contain only
+7-bit ASCII characters have the same encoding under both
+.B ASCII
+and
+.BR UTF-8 .
+.TP
+*
+All
+.B UCS
+characters > 0x7f are encoded as a multi-byte sequence
+consisting only of bytes in the range 0x80 to 0xfd, so no ASCII
+byte can appear as part of another character and there are no
+problems with e.g. '\\0' or '/'.
+.TP
+*
+The lexicographic sorting order of
+.B UCS-4
+strings is preserved.
+.TP
+*
+All possible 2^31 UCS codes can be encoded using
+.BR UTF-8 .
+.TP
+*
+The bytes 0xfe and 0xff are never used in the
+.B UTF-8
+encoding.
+.TP
+*
+The first byte of a multi-byte sequence which represents a single non-ASCII
+.B UCS
+character is always in the range 0xc0 to 0xfd and indicates how long
+this multi-byte sequence is. All further bytes in a multi-byte sequence
+are in the range 0x80 to 0xbf. This allows easy resynchronization and
+makes the encoding stateless and robust against missing bytes.
+.TP
+*
+.B UTF-8
+encoded
+.B UCS
+characters may be up to six bytes long, however the
+.B Unicode
+standard specifies no characters above 0x10ffff, so Unicode characters
+can only be up to four bytes long in
+.BR UTF-8 .
+.SH ENCODING
+The following byte sequences are used to represent a character. The
+sequence to be used depends on the UCS code number of the character:
+.TP 0.4i
+0x00000000 - 0x0000007F:
+.RI 0 xxxxxxx
+.TP
+0x00000080 - 0x000007FF:
+.RI 110 xxxxx
+.RI 10 xxxxxx
+.TP
+0x00000800 - 0x0000FFFF:
+.RI 1110 xxxx
+.RI 10 xxxxxx
+.RI 10 xxxxxx
+.TP
+0x00010000 - 0x001FFFFF:
+.RI 11110 xxx
+.RI 10 xxxxxx
+.RI 10 xxxxxx
+.RI 10 xxxxxx
+.TP
+0x00200000 - 0x03FFFFFF:
+.RI 111110 xx
+.RI 10 xxxxxx
+.RI 10 xxxxxx
+.RI 10 xxxxxx
+.RI 10 xxxxxx
+.TP
+0x04000000 - 0x7FFFFFFF:
+.RI 1111110 x
+.RI 10 xxxxxx
+.RI 10 xxxxxx
+.RI 10 xxxxxx
+.RI 10 xxxxxx
+.RI 10 xxxxxx
+.PP
+The
+.I xxx
+bit positions are filled with the bits of the character code number in
+binary representation. Only the shortest possible multi-byte sequence
+which can represent the code number of the character can be used.
+.PP
+The
+.B UCS
+code values 0xd800\(en0xdfff (UTF-16 surrogates) as well as 0xfffe and
+0xffff (UCS non-characters) should not appear in conforming
+.B UTF-8
+streams.
+.SH EXAMPLES
+The
+.B Unicode
+character 0xa9 = 1010 1001 (the copyright sign) is encoded
+in UTF-8 as
+.PP
+.RS
+11000010 10101001 = 0xc2 0xa9
+.RE
+.PP
+and character 0x2260 = 0010 0010 0110 0000 (the "not equal" symbol) is
+encoded as:
+.PP
+.RS
+11100010 10001001 10100000 = 0xe2 0x89 0xa0
+.RE
+.SH "APPLICATION NOTES"
+Users have to select a
+.B UTF-8
+locale, for example with
+.PP
+.RS
+export LANG=en_GB.UTF-8
+.RE
+.PP
+in order to activate the
+.B UTF-8
+support in applications.
+.PP
+Application software that has to be aware of the used character
+encoding should always set the locale with for example
+.PP
+.RS
+setlocale(LC_CTYPE, "")
+.RE
+.PP
+and programmers can then test the expression
+.PP
+.RS
+strcmp(nl_langinfo(CODESET), "UTF-8") == 0
+.RE
+.PP
+to determine whether a
+.B UTF-8
+locale has been selected and whether
+therefore all plaintext standard input and output, terminal
+communication, plaintext file content, filenames and environment
+variables are encoded in
+.BR UTF-8 .
+.PP
+Programmers accustomed to single-byte encodings such as
+.B US-ASCII
+or
+.B ISO 8859
+have to be aware that two assumptions made so far are no longer valid
+in
+.B UTF-8
+locales. Firstly, a single byte does not necessarily correspond any
+more to a single character. Secondly, since modern terminal emulators
+in
+.B UTF-8
+mode also support Chinese, Japanese, and Korean
+.B double-width characters
+as well as non-spacing
+.BR "combining characters" ,
+outputting a single character does not necessarily advance the cursor
+by one position as it did in
+.BR ASCII .
+Library functions such as
+.BR mbsrtowcs (3)
+and
+.BR wcswidth (3)
+should be used today to count characters and cursor positions.
+.PP
+The official ESC sequence to switch from an
+.B ISO 2022
+encoding scheme (as used for instance by VT100 terminals) to
+.B UTF-8
+is ESC % G
+("\\x1b%G"). The corresponding return sequence from
+.B UTF-8
+to ISO 2022 is ESC % @ ("\\x1b%@"). Other ISO 2022 sequences (such as
+for switching the G0 and G1 sets) are not applicable in UTF-8 mode.
+.PP
+It can be hoped that in the foreseeable future,
+.B UTF-8
+will replace
+.B ASCII
+and
+.B ISO 8859
+at all levels as the common character encoding on POSIX systems,
+leading to a significantly richer environment for handling plain text.
+.SH SECURITY
+The
+.BR Unicode " and " UCS
+standards require that producers of
+.B UTF-8
+shall use the shortest form possible, e.g., producing a two-byte
+sequence with first byte 0xc0 is non-conforming.
+.B Unicode 3.1
+has added the requirement that conforming programs must not accept
+non-shortest forms in their input. This is for security reasons: if
+user input is checked for possible security violations, a program
+might check only for the
+.B ASCII
+version of "/../" or ";" or NUL and overlook that there are many
+.RB non- ASCII
+ways to represent these things in a non-shortest
+.B UTF-8
+encoding.
+.SH STANDARDS
+ISO/IEC 10646-1:2000, Unicode 3.1, RFC 2279, Plan 9.
+.SH AUTHOR
+Markus Kuhn <mgk25@cl.cam.ac.uk>
+.SH "SEE ALSO"
+.BR nl_langinfo (3),
+.BR setlocale (3),
+.BR charsets (7),
+.BR unicode (7)