diff options
Diffstat (limited to 'man1p/lex.1p')
-rw-r--r-- | man1p/lex.1p | 954 |
1 files changed, 954 insertions, 0 deletions
diff --git a/man1p/lex.1p b/man1p/lex.1p new file mode 100644 index 000000000..6d84cb9fa --- /dev/null +++ b/man1p/lex.1p @@ -0,0 +1,954 @@ +.\" Copyright (c) 2001-2003 The Open Group, All Rights Reserved +.TH "LEX" P 2003 "IEEE/The Open Group" "POSIX Programmer's Manual" +.\" lex +.SH NAME +lex \- generate programs for lexical tasks (\fBDEVELOPMENT\fP) +.SH SYNOPSIS +.LP +\fBlex\fP \fB[\fP\fB-t\fP\fB][\fP\fB-n|-v\fP\fB][\fP\fIfile\fP \fB...\fP\fB]\fP\fB\fP +.SH DESCRIPTION +.LP +The \fIlex\fP utility shall generate C programs to be used in lexical +processing of character input, and that can be used as an +interface to \fIyacc\fP. The C programs shall be generated from \fIlex\fP +source code and +conform to the ISO\ C standard. Usually, the \fIlex\fP utility shall +write the program it generates to the file +\fBlex.yy.c\fP; the state of this file is unspecified if \fIlex\fP +exits with a non-zero exit status. See the EXTENDED +DESCRIPTION section for a complete description of the \fIlex\fP input +language. +.SH OPTIONS +.LP +The \fIlex\fP utility shall conform to the Base Definitions volume +of IEEE\ Std\ 1003.1-2001, Section 12.2, Utility Syntax Guidelines. +.LP +The following options shall be supported: +.TP 7 +\fB-n\fP +Suppress the summary of statistics usually written with the \fB-v\fP +option. If no table sizes are specified in the \fIlex\fP +source code and the \fB-v\fP option is not specified, then \fB-n\fP +is implied. +.TP 7 +\fB-t\fP +Write the resulting program to standard output instead of \fBlex.yy.c\fP. +.TP 7 +\fB-v\fP +Write a summary of \fIlex\fP statistics to the standard output. (See +the discussion of \fIlex\fP table sizes in Definitions in lex .) If +the \fB-t\fP option is specified and \fB-n\fP is not specified, this +report shall +be written to standard error. If table sizes are specified in the +\fIlex\fP source code, and if the \fB-n\fP option is not +specified, the \fB-v\fP option may be enabled. +.sp +.SH OPERANDS +.LP +The following operand shall be supported: +.TP 7 +\fIfile\fP +A pathname of an input file. If more than one such \fIfile\fP is specified, +all files shall be concatenated to produce a +single \fIlex\fP program. If no \fIfile\fP operands are specified, +or if a \fIfile\fP operand is \fB'-'\fP , the standard +input shall be used. +.sp +.SH STDIN +.LP +The standard input shall be used if no \fIfile\fP operands are specified, +or if a \fIfile\fP operand is \fB'-'\fP . See +INPUT FILES. +.SH INPUT FILES +.LP +The input files shall be text files containing \fIlex\fP source code, +as described in the EXTENDED DESCRIPTION section. +.SH ENVIRONMENT VARIABLES +.LP +The following environment variables shall affect the execution of +\fIlex\fP: +.TP 7 +\fILANG\fP +Provide a default value for the internationalization variables that +are unset or null. (See the Base Definitions volume of +IEEE\ Std\ 1003.1-2001, Section 8.2, Internationalization Variables +for +the precedence of internationalization variables used to determine +the values of locale categories.) +.TP 7 +\fILC_ALL\fP +If set to a non-empty string value, override the values of all the +other internationalization variables. +.TP 7 +\fILC_COLLATE\fP +.sp +Determine the locale for the behavior of ranges, equivalence classes, +and multi-character collating elements within regular +expressions. If this variable is not set to the POSIX locale, the +results are unspecified. +.TP 7 +\fILC_CTYPE\fP +Determine the locale for the interpretation of sequences of bytes +of text data as characters (for example, single-byte as +opposed to multi-byte characters in arguments and input files), and +the behavior of character classes within regular expressions. +If this variable is not set to the POSIX locale, the results are unspecified. +.TP 7 +\fILC_MESSAGES\fP +Determine the locale that should be used to affect the format and +contents of diagnostic messages written to standard +error. +.TP 7 +\fINLSPATH\fP +Determine the location of message catalogs for the processing of \fILC_MESSAGES +\&.\fP +.sp +.SH ASYNCHRONOUS EVENTS +.LP +Default. +.SH STDOUT +.LP +If the \fB-t\fP option is specified, the text file of C source code +output of \fIlex\fP shall be written to standard +output. +.LP +If the \fB-t\fP option is not specified: +.IP " *" 3 +Implementation-defined informational, error, and warning messages +concerning the contents of \fIlex\fP source code input shall +be written to either the standard output or standard error. +.LP +.IP " *" 3 +If the \fB-v\fP option is specified and the \fB-n\fP option is not +specified, \fIlex\fP statistics shall also be written to +either the standard output or standard error, in an implementation-defined +format. These statistics may also be generated if table +sizes are specified with a \fB'%'\fP operator in the \fIDefinitions\fP +section, as long as the \fB-n\fP option is not +specified. +.LP +.SH STDERR +.LP +If the \fB-t\fP option is specified, implementation-defined informational, +error, and warning messages concerning the contents +of \fIlex\fP source code input shall be written to the standard error. +.LP +If the \fB-t\fP option is not specified: +.IP " 1." 4 +Implementation-defined informational, error, and warning messages +concerning the contents of \fIlex\fP source code input shall +be written to either the standard output or standard error. +.LP +.IP " 2." 4 +If the \fB-v\fP option is specified and the \fB-n\fP option is not +specified, \fIlex\fP statistics shall also be written to +either the standard output or standard error, in an implementation-defined +format. These statistics may also be generated if table +sizes are specified with a \fB'%'\fP operator in the \fIDefinitions\fP +section, as long as the \fB-n\fP option is not +specified. +.LP +.SH OUTPUT FILES +.LP +A text file containing C source code shall be written to \fBlex.yy.c\fP, +or to the standard output if the \fB-t\fP option is +present. +.SH EXTENDED DESCRIPTION +.LP +Each input file shall contain \fIlex\fP source code, which is a table +of regular expressions with corresponding actions in the +form of C program fragments. +.LP +When \fBlex.yy.c\fP is compiled and linked with the \fIlex\fP library +(using the \fB-l\ l\fP operand with \fIc99\fP), the resulting program +shall read character input from the standard input and shall +partition it into strings that match the given expressions. +.LP +When an expression is matched, these actions shall occur: +.IP " *" 3 +The input string that was matched shall be left in \fIyytext\fP as +a null-terminated string; \fIyytext\fP shall either be an +external character array or a pointer to a character string. As explained +in Definitions in lex , +the type can be explicitly selected using the \fB%array\fP or \fB%pointer\fP +declarations, but the default is +implementation-defined. +.LP +.IP " *" 3 +The external \fBint\fP \fIyyleng\fP shall be set to the length of +the matching string. +.LP +.IP " *" 3 +The expression's corresponding program fragment, or action, shall +be executed. +.LP +.LP +During pattern matching, \fIlex\fP shall search the set of patterns +for the single longest possible match. Among rules that +match the same number of characters, the rule given first shall be +chosen. +.LP +The general format of \fIlex\fP source shall be: +.sp +.RS +.nf + +\fIDefinitions\fP +\fB%%\fP +\fIRules\fP +\fB%%\fP +\fIUser\fPSubroutines +.fi +.RE +.LP +The first \fB"%%"\fP is required to mark the beginning of the rules +(regular expressions and actions); the second +\fB"%%"\fP is required only if user subroutines follow. +.LP +Any line in the \fIDefinitions\fP section beginning with a <blank> +shall be assumed to be a C program fragment and shall +be copied to the external definition area of the \fBlex.yy.c\fP file. +Similarly, anything in the \fIDefinitions\fP section +included between delimiter lines containing only \fB"%{"\fP and \fB"%}"\fP +shall also be copied unchanged to the external +definition area of the \fBlex.yy.c\fP file. +.LP +Any such input (beginning with a <blank> or within \fB"%{"\fP and +\fB"%}"\fP delimiter lines) appearing at the +beginning of the \fIRules\fP section before any rules are specified +shall be written to \fBlex.yy.c\fP after the declarations of +variables for the \fIyylex\fP() function and before the first line +of code in \fIyylex\fP(). Thus, user variables local to +\fIyylex\fP() can be declared here, as well as application code to +execute upon entry to \fIyylex\fP(). +.LP +The action taken by \fIlex\fP when encountering any input beginning +with a <blank> or within \fB"%{"\fP and +\fB"%}"\fP delimiter lines appearing in the \fIRules\fP section but +coming after one or more rules is undefined. The presence +of such input may result in an erroneous definition of the \fIyylex\fP() +function. +.SS Definitions in lex +.LP +\fIDefinitions\fP appear before the first \fB"%%"\fP delimiter. Any +line in this section not contained between \fB"%{"\fP +and \fB"%}"\fP lines and not beginning with a <blank> shall be assumed +to define a \fIlex\fP substitution string. The +format of these lines shall be: +.sp +.RS +.nf + +\fIname substitute\fP +.fi +.RE +.LP +If a \fIname\fP does not meet the requirements for identifiers in +the ISO\ C standard, the result is undefined. The string +\fIsubstitute\fP shall replace the string { \fIname\fP} when it is +used in a rule. The \fIname\fP string shall be recognized in +this context only when the braces are provided and when it does not +appear within a bracket expression or within double-quotes. +.LP +In the \fIDefinitions\fP section, any line beginning with a \fB'%'\fP +(percent sign) character and followed by an +alphanumeric word beginning with either \fB's'\fP or \fB'S'\fP shall +define a set of start conditions. Any line beginning +with a \fB'%'\fP followed by a word beginning with either \fB'x'\fP +or \fB'X'\fP shall define a set of exclusive start +conditions. When the generated scanner is in a \fB%s\fP state, patterns +with no state specified shall be also active; in a +\fB%x\fP state, such patterns shall not be active. The rest of the +line, after the first word, shall be considered to be one or +more <blank>-separated names of start conditions. Start condition +names shall be constructed in the same way as definition +names. Start conditions can be used to restrict the matching of regular +expressions to one or more states as described in Regular Expressions +in lex . +.LP +Implementations shall accept either of the following two mutually-exclusive +declarations in the \fIDefinitions\fP section: +.TP 7 +\fB%array\fP +Declare the type of \fIyytext\fP to be a null-terminated character +array. +.TP 7 +\fB%pointer\fP +Declare the type of \fIyytext\fP to be a pointer to a null-terminated +character string. +.sp +.LP +The default type of \fIyytext\fP is implementation-defined. If an +application refers to \fIyytext\fP outside of the scanner +source file (that is, via an \fBextern\fP), the application shall +include the appropriate \fB%array\fP or \fB%pointer\fP +declaration in the scanner source file. +.LP +Implementations shall accept declarations in the \fIDefinitions\fP +section for setting certain internal table sizes. The +declarations are shown in the following table. +.sp +.ce 1 +\fBTable: Table Size Declarations in \fIlex\fP\fP +.TS C +center; l2 l2 l. +\fBDeclaration\fP \fBDescription\fP \fBMinimum Value\fP +%\fBp\fP \fIn\fP Number of positions 2500 +%\fBn\fP \fIn\fP Number of states 500 +%\fBa\fP \fIn\fP Number of transitions 2000 +%\fBe\fP \fIn\fP Number of parse tree nodes 1000 +%\fBk\fP \fIn\fP Number of packed character classes 1000 +%\fBo\fP \fIn\fP Size of the output array 3000 +.TE +.LP +In the table, \fIn\fP represents a positive decimal integer, preceded +by one or more <blank>s. The exact meaning of these +table size numbers is implementation-defined. The implementation shall +document how these numbers affect the \fIlex\fP utility and +how they are related to any output that may be generated by the implementation +should limitations be encountered during the +execution of \fIlex\fP. It shall be possible to determine from this +output which of the table size values needs to be modified to +permit \fIlex\fP to successfully generate tables for the input language. +The values in the column Minimum Value represent the +lowest values conforming implementations shall provide. +.SS Rules in lex +.LP +The rules in \fIlex\fP source files are a table in which the left +column contains regular expressions and the right column +contains actions (C program fragments) to be executed when the expressions +are recognized. +.sp +.RS +.nf + +\fIERE action +ERE action\fP\fB... +\fP +.fi +.RE +.LP +The extended regular expression (ERE) portion of a row shall be separated +from \fIaction\fP by one or more <blank>s. A +regular expression containing <blank>s shall be recognized under one +of the following conditions: +.IP " *" 3 +The entire expression appears within double-quotes. +.LP +.IP " *" 3 +The <blank>s appear within double-quotes or square brackets. +.LP +.IP " *" 3 +Each <blank> is preceded by a backslash character. +.LP +.SS User Subroutines in lex +.LP +Anything in the user subroutines section shall be copied to \fBlex.yy.c\fP +following \fIyylex\fP(). +.SS Regular Expressions in lex +.LP +The \fIlex\fP utility shall support the set of extended regular expressions +(see the Base Definitions volume of +IEEE\ Std\ 1003.1-2001, Section 9.4, Extended Regular Expressions), +with the following additions and exceptions to the syntax: +.TP 7 +\fB"..."\fP +Any string enclosed in double-quotes shall represent the characters +within the double-quotes as themselves, except that +backslash escapes (which appear in the following table) shall be recognized. +Any backslash-escape sequence shall be terminated by +the closing quote. For example, \fB"\\01"\fP \fB"1"\fP represents +a single string: the octal value 1 followed by the character +\fB'1'\fP . +.TP 7 +<\fIstate\fP>\fIr\fP,\ <\fIstate1,state2,\fP...>\fIr\fP +.sp +The regular expression \fIr\fP shall be matched only when the program +is in one of the start conditions indicated by \fIstate\fP, +\fIstate1\fP, and so on; see Actions in lex . (As an exception to +the typographical conventions of +the rest of this volume of IEEE\ Std\ 1003.1-2001, in this case <\fIstate\fP> +does not represent a metavariable, but +the literal angle-bracket characters surrounding a symbol.) The start +condition shall be recognized as such only at the beginning +of a regular expression. +.TP 7 +\fIr\fP/\fIx\fP +The regular expression \fIr\fP shall be matched only if it is followed +by an occurrence of regular expression \fIx\fP ( +\fIx\fP is the instance of trailing context, further defined below). +The token returned in \fIyytext\fP shall only match +\fIr\fP. If the trailing portion of \fIr\fP matches the beginning +of \fIx\fP, the result is unspecified. The \fIr\fP expression +cannot include further trailing context or the \fB'$'\fP (match-end-of-line) +operator; \fIx\fP cannot include the \fB'^'\fP +(match-beginning-of-line) operator, nor trailing context, nor the +\fB'$'\fP operator. That is, only one occurrence of trailing +context is allowed in a \fIlex\fP regular expression, and the \fB'^'\fP +operator only can be used at the beginning of such an +expression. +.TP 7 +{\fIname\fP} +When \fIname\fP is one of the substitution symbols from the \fIDefinitions\fP +section, the string, including the enclosing +braces, shall be replaced by the \fIsubstitute\fP value. The \fIsubstitute\fP +value shall be treated in the extended regular +expression as if it were enclosed in parentheses. No substitution +shall occur if { \fIname\fP} occurs within a bracket expression +or within double-quotes. +.sp +.LP +Within an ERE, a backslash character shall be considered to begin +an escape sequence as specified in the table in the Base +Definitions volume of IEEE\ Std\ 1003.1-2001, Chapter 5, File Format +Notation ( +\fB'\\\\'\fP , \fB'\\a'\fP , \fB'\\b'\fP , \fB'\\f'\fP , \fB'\\n'\fP +, \fB'\\r'\fP , \fB'\\t'\fP , \fB'\\v'\fP ). In +addition, the escape sequences in the following table shall be recognized. +.LP +A literal <newline> cannot occur within an ERE; the escape sequence +\fB'\\n'\fP can be used to represent a +<newline>. A <newline> shall not be matched by a period operator. +.br +.sp +.ce 1 +\fBTable: Escape Sequences in \fIlex\fP\fP +.TS C +center; l1 lw(30)1 lw(30). +\fBEscape\fP T{ +.na +\fB\ \fP +.ad +T} T{ +.na +\fB\ \fP +.ad +T} +\fBSequence\fP T{ +.na +\fBDescription\fP +.ad +T} T{ +.na +\fBMeaning\fP +.ad +T} +\\\fIdigits\fP T{ +.na +A backslash character followed by the longest sequence of one, two, or three octal-digit characters (01234567). If all of the digits are 0 (that is, representation of the NUL character), the behavior is undefined. +.ad +T} T{ +.na +The character whose encoding is represented by the one, two, or three-digit octal integer. If the size of a byte on the system is greater than nine bits, the valid escape sequence used to represent a byte is implementation-defined. Multi-byte characters require multiple, concatenated escape sequences of this type, including the leading \fB'\\'\fP for each byte. +.ad +T} +\\x\fIdigits\fP T{ +.na +A backslash character followed by the longest sequence of hexadecimal-digit characters (01234567abcdefABCDEF). If all of the digits are 0 (that is, representation of the NUL character), the behavior is undefined. +.ad +T} T{ +.na +The character whose encoding is represented by the hexadecimal integer. +.ad +T} +\\c T{ +.na +A backslash character followed by any character not described in this table or in the table in the Base Definitions volume of IEEE\ Std\ 1003.1-2001, Chapter 5, File Format Notation ( \fB'\\\\'\fP , \fB'\\a'\fP , \fB'\\b'\fP , \fB'\\f'\fP , \fB'\\n'\fP , \fB'\\r'\fP , \fB'\\t'\fP , \fB'\\v'\fP ). +.ad +T} T{ +.na +The character \fB'c'\fP , unchanged. +.ad +T} +.TE +.TP 7 +\fBNote:\fP +If a \fB'\\x'\fP sequence needs to be immediately followed by a hexadecimal +digit character, a sequence such as +\fB"\\x1"\fP \fB"1"\fP can be used, which represents a character containing +the value 1, followed by the character +\fB'1'\fP . +.sp +.LP +The order of precedence given to extended regular expressions for +\fIlex\fP differs from that specified in the Base Definitions +volume of IEEE\ Std\ 1003.1-2001, Section 9.4, Extended Regular +Expressions. The order of precedence for \fIlex\fP shall be as shown +in the following table, from high to low. +.TP 7 +\fBNote:\fP +The escaped characters entry is not meant to imply that these are +operators, but they are included in the table to show their +relationships to the true operators. The start condition, trailing +context, and anchoring notations have been omitted from the +table because of the placement restrictions described in this section; +they can only appear at the beginning or ending of an +ERE. +.sp +.sp +.sp +.ce 1 +\fBTable: ERE Precedence in \fIlex\fP\fP +.TS C +center; l2 l. +\fBExtended Regular Expression\fP \fBPrecedence\fP +collation-related bracket symbols [= =] [: :] [. .] +escaped characters \\<\fIspecial character\fP> +bracket expression [ ] +quoting "..." +grouping ( ) +definition {\fIname\fP} +single-character RE duplication * + ? +concatenation \ +interval expression {m,n} +alternation | +.TE +.LP +The ERE anchoring operators \fB'^'\fP and \fB'$'\fP do not appear +in the table. With \fIlex\fP regular expressions, these +operators are restricted in their use: the \fB'^'\fP operator can +only be used at the beginning of an entire regular expression, +and the \fB'$'\fP operator only at the end. The operators apply to +the entire regular expression. Thus, for example, the pattern +\fB"(^abc)|(def$)"\fP is undefined; it can instead be written as two +separate rules, one with the regular expression +\fB"^abc"\fP and one with \fB"def$"\fP , which share a common action +via the special \fB'|'\fP action (see below). If the +pattern were written \fB"^abc|def$"\fP , it would match either \fB"abc"\fP +or \fB"def"\fP on a line by itself. +.LP +Unlike the general ERE rules, embedded anchoring is not allowed by +most historical \fIlex\fP implementations. An example of +embedded anchoring would be for patterns such as \fB"(^|\ )foo(\ |$)"\fP +to match \fB"foo"\fP when it exists as a +complete word. This functionality can be obtained using existing \fIlex\fP +features: +.sp +.RS +.nf + +\fB^foo/[ \\n] | +" foo"/[ \\n] /* Found foo as a separate word. */ +\fP +.fi +.RE +.LP +Note also that \fB'$'\fP is a form of trailing context (it is equivalent +to \fB"/\\n"\fP ) and as such cannot be used with +regular expressions containing another instance of the operator (see +the preceding discussion of trailing context). +.LP +The additional regular expressions trailing-context operator \fB'/'\fP +can be used as an ordinary character if presented +within double-quotes, \fB"/"\fP ; preceded by a backslash, \fB"\\/"\fP +; or within a bracket expression, \fB"[/]"\fP . The +start-condition \fB'<'\fP and \fB'>'\fP operators shall be special +only in a start condition at the beginning of a +regular expression; elsewhere in the regular expression they shall +be treated as ordinary characters. +.SS Actions in lex +.LP +The action to be taken when an ERE is matched can be a C program fragment +or the special actions described below; the program +fragment can contain one or more C statements, and can also include +special actions. The empty C statement \fB';'\fP shall be a +valid action; any string in the \fBlex.yy.c\fP input that matches +the pattern portion of such a rule is effectively ignored or +skipped. However, the absence of an action shall not be valid, and +the action \fIlex\fP takes in such a condition is +undefined. +.LP +The specification for an action, including C statements and special +actions, can extend across several lines if enclosed in +braces: +.sp +.RS +.nf + +\fIERE\fP \fB<\fP\fIone or more blanks\fP\fB> {\fP \fIprogram statement + program statement\fP \fB} +\fP +.fi +.RE +.LP +The default action when a string in the input to a \fBlex.yy.c\fP +program is not matched by any expression shall be to copy the +string to the output. Because the default behavior of a program generated +by \fIlex\fP is to read the input and copy it to the +output, a minimal \fIlex\fP source program that has just \fB"%%"\fP +shall generate a C program that simply copies the input to +the output unchanged. +.LP +Four special actions shall be available: +.sp +.RS +.nf + +\fB| ECHO; REJECT; BEGIN +\fP +.fi +.RE +.TP 7 +\fB|\fP +The action \fB'|'\fP means that the action for the next rule is the +action for this rule. Unlike the other three actions, +\fB'|'\fP cannot be enclosed in braces or be semicolon-terminated; +the application shall ensure that it is specified alone, with +no other actions. +.TP 7 +\fBECHO;\fP +Write the contents of the string \fIyytext\fP on the output. +.TP 7 +\fBREJECT;\fP +Usually only a single expression is matched by a given string in the +input. \fBREJECT\fP means "continue to the next +expression that matches the current input", and shall cause whatever +rule was the second choice after the current rule to be +executed for the same input. Thus, multiple rules can be matched and +executed for one input string or overlapping input strings. +For example, given the regular expressions \fB"xyz"\fP and \fB"xy"\fP +and the input \fB"xyz"\fP , usually only the regular +expression \fB"xyz"\fP would match. The next attempted match would +start after \fBz.\fP If the last action in the +\fB"xyz"\fP rule is \fBREJECT\fP, both this rule and the \fB"xy"\fP +rule would be executed. The \fBREJECT\fP action may be +implemented in such a fashion that flow of control does not continue +after it, as if it were equivalent to a \fBgoto\fP to another +part of \fIyylex\fP(). The use of \fBREJECT\fP may result in somewhat +larger and slower scanners. +.TP 7 +\fBBEGIN\fP +The action: +.sp +.RS +.nf + +\fBBEGIN\fP \fInewstate\fP\fB; +\fP +.fi +.RE +.LP +switches the state (start condition) to \fInewstate\fP. If the string +\fInewstate\fP has not been declared previously as a +start condition in the \fIDefinitions\fP section, the results are +unspecified. The initial state is indicated by the digit +\fB'0'\fP or the token \fBINITIAL\fP. +.sp +.LP +The functions or macros described below are accessible to user code +included in the \fIlex\fP input. It is unspecified whether +they appear in the C code output of \fIlex\fP, or are accessible only +through the \fB-l\ l\fP operand to \fIc99\fP (the \fIlex\fP library). +.TP 7 +\fBint\ \fP \fIyylex\fP(\fBvoid\fP) +.sp +Performs lexical analysis on the input; this is the primary function +generated by the \fIlex\fP utility. The function shall return +zero when the end of input is reached; otherwise, it shall return +non-zero values (tokens) determined by the actions that are +selected. +.TP 7 +\fBint\ \fP \fIyymore\fP(\fBvoid\fP) +.sp +When called, indicates that when the next input string is recognized, +it is to be appended to the current value of \fIyytext\fP +rather than replacing it; the value in \fIyyleng\fP shall be adjusted +accordingly. +.TP 7 +\fBint\ \fP \fIyyless\fP(\fBint\ \fP \fIn\fP) +.sp +Retains \fIn\fP initial characters in \fIyytext\fP, NUL-terminated, +and treats the remaining characters as if they had not been +read; the value in \fIyyleng\fP shall be adjusted accordingly. +.TP 7 +\fBint\ \fP \fIinput\fP(\fBvoid\fP) +.sp +Returns the next character from the input, or zero on end-of-file. +It shall obtain input from the stream pointer \fIyyin\fP, +although possibly via an intermediate buffer. Thus, once scanning +has begun, the effect of altering the value of \fIyyin\fP is +undefined. The character read shall be removed from the input stream +of the scanner without any processing by the scanner. +.TP 7 +\fBint\ \fP \fIunput\fP(\fBint\ \fP \fIc\fP) +.sp +Returns the character \fB'c'\fP to the input; \fIyytext\fP and \fIyyleng\fP +are undefined until the next expression is +matched. The result of using \fIunput\fP() for more characters than +have been input is unspecified. +.sp +.LP +The following functions shall appear only in the \fIlex\fP library +accessible through the \fB-l\ l\fP operand; they can +therefore be redefined by a conforming application: +.TP 7 +\fBint\ \fP \fIyywrap\fP(\fBvoid\fP) +.sp +Called by \fIyylex\fP() at end-of-file; the default \fIyywrap\fP() +shall always return 1. If the application requires +\fIyylex\fP() to continue processing with another source of input, +then the application can include a function \fIyywrap\fP(), +which associates another file with the external variable \fBFILE *\fP +\fIyyin\fP and shall return a value of zero. +.TP 7 +\fBint\ \fP \fImain\fP(\fBint\ \fP \fIargc\fP, \fBchar *\fP\fIargv\fP[]) +.sp +Calls \fIyylex\fP() to perform lexical analysis, then exits. The user +code can contain \fImain\fP() to perform +application-specific operations, calling \fIyylex\fP() as applicable. +.sp +.LP +Except for \fIinput\fP(), \fIunput\fP(), and \fImain\fP(), all external +and static names generated by \fIlex\fP shall begin +with the prefix \fByy\fP or \fBYY\fP. +.SH EXIT STATUS +.LP +The following exit values shall be returned: +.TP 7 +\ 0 +Successful completion. +.TP 7 +>0 +An error occurred. +.sp +.SH CONSEQUENCES OF ERRORS +.LP +Default. +.LP +\fIThe following sections are informative.\fP +.SH APPLICATION USAGE +.LP +Conforming applications are warned that in the \fIRules\fP section, +an ERE without an action is not acceptable, but need not be +detected as erroneous by \fIlex\fP. This may result in compilation +or runtime errors. +.LP +The purpose of \fIinput\fP() is to take characters off the input stream +and discard them as far as the lexical analysis is +concerned. A common use is to discard the body of a comment once the +beginning of a comment is recognized. +.LP +The \fIlex\fP utility is not fully internationalized in its treatment +of regular expressions in the \fIlex\fP source code or +generated lexical analyzer. It would seem desirable to have the lexical +analyzer interpret the regular expressions given in the +\fIlex\fP source according to the environment specified when the lexical +analyzer is executed, but this is not possible with the +current \fIlex\fP technology. Furthermore, the very nature of the +lexical analyzers produced by \fIlex\fP must be closely tied to +the lexical requirements of the input language being described, which +is frequently locale-specific anyway. (For example, writing +an analyzer that is used for French text is not automatically useful +for processing other languages.) +.SH EXAMPLES +.LP +The following is an example of a \fIlex\fP program that implements +a rudimentary scanner for a Pascal-like syntax: +.sp +.RS +.nf + +\fB%{ +/* Need this for the call to atof() below. */ +#include <math.h> +/* Need this for printf(), fopen(), and stdin below. */ +#include <stdio.h> +%} +.sp + +DIGIT [0-9] +ID [a-z][a-z0-9]* +.sp + +%% +.sp + +{DIGIT}+ { + printf("An integer: %s (%d)\\n", yytext, + atoi(yytext)); + } +.sp + +{DIGIT}+"."{DIGIT}* { + printf("A float: %s (%g)\\n", yytext, + atof(yytext)); + } +.sp + +if|then|begin|end|procedure|function { + printf("A keyword: %s\\n", yytext); + } +.sp + +{ID} printf("An identifier: %s\\n", yytext); +.sp + +"+"|"-"|"*"|"/" printf("An operator: %s\\n", yytext); +.sp + +"{"[^}\\n]*"}" /* Eat up one-line comments. */ +.sp + +[ \\t\\n]+ /* Eat up white space. */ +.sp + +\&. printf("Unrecognized character: %s\\n", yytext); +.sp + +%% +.sp + +int main(int argc, char *argv[]) +{ + ++argv, --argc; /* Skip over program name. */ + if (argc > 0) + yyin = fopen(argv[0], "r"); + else + yyin = stdin; +.sp + + yylex(); +} +\fP +.fi +.RE +.SH RATIONALE +.LP +Even though the \fB-c\fP option and references to the C language are +retained in this description, \fIlex\fP may be +generalized to other languages, as was done at one time for EFL, the +Extended FORTRAN Language. Since the \fIlex\fP input +specification is essentially language-independent, versions of this +utility could be written to produce Ada, Modula-2, or Pascal +code, and there are known historical implementations that do so. +.LP +The current description of \fIlex\fP bypasses the issue of dealing +with internationalized EREs in the \fIlex\fP source code or +generated lexical analyzer. If it follows the model used by \fIawk\fP +(the source code is +assumed to be presented in the POSIX locale, but input and output +are in the locale specified by the environment variables), then +the tables in the lexical analyzer produced by \fIlex\fP would interpret +EREs specified in the \fIlex\fP source in terms of the +environment variables specified when \fIlex\fP was executed. The desired +effect would be to have the lexical analyzer interpret +the EREs given in the \fIlex\fP source according to the environment +specified when the lexical analyzer is executed, but this is +not possible with the current \fIlex\fP technology. +.LP +The description of octal and hexadecimal-digit escape sequences agrees +with the ISO\ C standard use of escape sequences. See +the RATIONALE for \fIed\fP for a discussion of bytes larger than 9 +bits being represented by octal values. +Hexadecimal values can represent larger bytes and multi-byte characters +directly, using as many digits as required. +.LP +There is no detailed output format specification. The observed behavior +of \fIlex\fP under four different historical +implementations was that none of these implementations consistently +reported the line numbers for error and warning messages. +Furthermore, there was a desire that \fIlex\fP be allowed to output +additional diagnostic messages. Leaving message formats +unspecified avoids these formatting questions and problems with internationalization. +.LP +Although the \fB%x\fP specifier for \fIexclusive\fP start conditions +is not historical practice, it is believed to be a +minor change to historical implementations and greatly enhances the +usability of \fIlex\fP programs since it permits an +application to obtain the expected functionality with fewer statements. +.LP +The \fB%array\fP and \fB%pointer\fP declarations were added as a compromise +between historical systems. The System V-based +\fIlex\fP copies the matched text to a \fIyytext\fP array. The \fIflex\fP +program, supported in BSD and GNU systems, uses a +pointer. In the latter case, significant performance improvements +are available for some scanners. Most historical programs should +require no change in porting from one system to another because the +string being referenced is null-terminated in both cases. (The +method used by \fIflex\fP in its case is to null-terminate the token +in place by remembering the character that used to come right +after the token and replacing it before continuing on to the next +scan.) Multi-file programs with external references to +\fIyytext\fP outside the scanner source file should continue to operate +on their historical systems, but would require one of the +new declarations to be considered strictly portable. +.LP +The description of EREs avoids unnecessary duplication of ERE details +because their meanings within a \fIlex\fP ERE are the +same as that for the ERE in this volume of IEEE\ Std\ 1003.1-2001. +.LP +The reason for the undefined condition associated with text beginning +with a <blank> or within \fB"%{"\fP and +\fB"%}"\fP delimiter lines appearing in the \fIRules\fP section is +historical practice. Both the BSD and System V \fIlex\fP +copy the indented (or enclosed) input in the \fIRules\fP section (except +at the beginning) to unreachable areas of the +\fIyylex\fP() function (the code is written directly after a \fIbreak\fP +statement). In some cases, the System V \fIlex\fP generates an error +message or a syntax error, depending on the form of indented +input. +.LP +The intention in breaking the list of functions into those that may +appear in \fBlex.yy.c\fP \fIversus\fP those that only +appear in \fBlibl.a\fP is that only those functions in \fBlibl.a\fP +can be reliably redefined by a conforming application. +.LP +The descriptions of standard output and standard error are somewhat +complicated because historical \fIlex\fP implementations +chose to issue diagnostic messages to standard output (unless \fB-t\fP +was given). IEEE\ Std\ 1003.1-2001 allows this +behavior, but leaves an opening for the more expected behavior of +using standard error for diagnostics. Also, the System V behavior +of writing the statistics when any table sizes are given is allowed, +while BSD-derived systems can avoid it. The programmer can +always precisely obtain the desired results by using either the \fB-t\fP +or \fB-n\fP options. +.LP +The OPERANDS section does not mention the use of \fB-\fP as a synonym +for standard input; not all historical implementations +support such usage for any of the \fIfile\fP operands. +.LP +A description of the \fItranslation table\fP was deleted from early +proposals because of its relatively low usage in historical +applications. +.LP +The change to the definition of the \fIinput\fP() function that allows +buffering of input presents the opportunity for major +performance gains in some applications. +.LP +The following examples clarify the differences between \fIlex\fP regular +expressions and regular expressions appearing +elsewhere in this volume of IEEE\ Std\ 1003.1-2001. For regular expressions +of the form \fB"r/x"\fP , the string +matching \fIr\fP is always returned; confusion may arise when the +beginning of \fIx\fP matches the trailing portion of \fIr\fP. +For example, given the regular expression \fB"a*b/cc"\fP and the input +\fB"aaabcc"\fP , \fIyytext\fP would contain the +string \fB"aaab"\fP on this match. But given the regular expression +\fB"x*/xy"\fP and the input \fB"xxxy"\fP , the token +\fBxxx\fP, not \fBxx\fP, is returned by some implementations because +\fBxxx\fP matches \fB"x*"\fP . +.LP +In the rule \fB"ab*/bc"\fP , the \fB"b*"\fP at the end of \fIr\fP +extends \fIr\fP's match into the beginning of the +trailing context, so the result is unspecified. If this rule were +\fB"ab/bc"\fP , however, the rule matches the text +\fB"ab"\fP when it is followed by the text \fB"bc"\fP . In this latter +case, the matching of \fIr\fP cannot extend into the +beginning of \fIx\fP, so the result is specified. +.SH FUTURE DIRECTIONS +.LP +None. +.SH SEE ALSO +.LP +\fIc99\fP , \fIed\fP , \fIyacc\fP +.SH COPYRIGHT +Portions of this text are reprinted and reproduced in electronic form +from IEEE Std 1003.1, 2003 Edition, Standard for Information Technology +-- Portable Operating System Interface (POSIX), The Open Group Base +Specifications Issue 6, Copyright (C) 2001-2003 by the Institute of +Electrical and Electronics Engineers, Inc and The Open Group. In the +event of any discrepancy between this version and the original IEEE and +The Open Group Standard, the original IEEE and The Open Group Standard +is the referee document. The original Standard can be obtained online at +http://www.opengroup.org/unix/online.html . |