Diffstat (limited to 'man7/tcp.7')
-rw-r--r-- | man7/tcp.7 | 711 |
1 files changed, 711 insertions, 0 deletions
diff --git a/man7/tcp.7 b/man7/tcp.7 new file mode 100644 index 000000000..d28b81876 --- /dev/null +++ b/man7/tcp.7 @@ -0,0 +1,711 @@ +.\" This man page is Copyright (C) 1999 Andi Kleen <ak@muc.de>. +.\" Permission is granted to distribute possibly modified copies +.\" of this page provided the header is included verbatim, +.\" and in case of nontrivial modification author and date +.\" of the modification is added to the header. +.\" +.\" 2.4 Updates by Nivedita Singhvi 4/20/02 <nivedita@us.ibm.com>. +.\" +.TH TCP 7 2003-08-21 "Linux Man Page" "Linux Programmer's Manual" +.SH NAME +tcp \- TCP protocol +.SH SYNOPSIS +.B #include <sys/socket.h> +.br +.B #include <netinet/in.h> +.br +.B #include <netinet/tcp.h> +.br +.B tcp_socket = socket(PF_INET, SOCK_STREAM, 0); +.SH DESCRIPTION +This is an implementation of the TCP protocol defined in +RFC793, RFC1122 and RFC2001 with the NewReno and SACK +extensions. It provides a reliable, stream oriented, full +duplex connection between two sockets on top of +.BR ip (7), +for both v4 and v6 versions. +TCP guarantees that the data arrives in order and +retransmits lost packets. It generates and checks a per +packet checksum to catch transmission errors. TCP does not +preserve record boundaries. + +A fresh TCP socket has no remote or local address and is not +fully specified. To create an outgoing TCP connection use +.BR connect (2) +to establish a connection to another TCP socket. +To receive new incoming connections +.BR bind (2) +the socket first to a local address and port and then call +.BR listen (2) +to put the socket into listening state. After that a new +socket for each incoming connection can be accepted +using +.BR accept (2). +A socket which has had +.B accept +or +.B connect +successfully called on it is fully specified and may +transmit data. Data cannot be transmitted on listening or +not yet connected sockets. + +Linux supports RFC1323 TCP high performance +extensions. 
These include Protection Against Wrapped +Sequence Numbers (PAWS), Window Scaling and +Timestamps. Window scaling allows the use +of large (> 64K) TCP windows in order to support links with high +latency or bandwidth. To make use of them, the send and +receive buffer sizes must be increased. +They can be set globally with the +.B net.ipv4.tcp_wmem +and +.B net.ipv4.tcp_rmem +sysctl variables, or on individual sockets by using the +.B SO_SNDBUF +and +.B SO_RCVBUF +socket options with the +.BR setsockopt (2) +call. + +The maximum sizes for socket buffers declared via the +.B SO_SNDBUF +and +.B SO_RCVBUF +mechanisms are limited by the global +.B net.core.rmem_max +and +.B net.core.wmem_max +sysctls. Note that TCP actually allocates twice the size of +the buffer requested in the +.BR setsockopt (2) +call, and so a succeeding +.BR getsockopt (2) +call will not return the same size of buffer as requested +in the +.BR setsockopt (2) +call. TCP uses this for administrative purposes and internal +kernel structures, and the sysctl variables reflect the +larger sizes compared to the actual TCP windows. +On individual connections, the socket buffer size must be +set prior to the +.B listen() +or +.B connect() +calls in order to have it take effect. See +.BR socket (7) +for more information. +.PP +TCP supports urgent data. Urgent data is used to signal the +receiver that some important message is part of the data +stream and that it should be processed as soon as possible. +To send urgent data specify the +.B MSG_OOB +option to +.BR send (2). +When urgent data is received, the kernel sends a +.B SIGURG +signal to the reading process or the process or process +group that has been set for the socket using the +.B SIOCSPGRP +or +.B FIOSETOWN +ioctls. 
When the +.B SO_OOBINLINE +socket option is enabled, urgent data is put into the normal +data stream (and can be tested for by the +.B SIOCATMARK +ioctl), +otherwise it can be received only when the +.B MSG_OOB +flag is set for +.BR recvmsg (2). + +Linux 2.4 introduced a number of changes for improved +throughput and scaling, as well as enhanced functionality. +Some of these features include support for zerocopy +.BR sendfile (2), +Explicit Congestion Notification, new +management of TIME_WAIT sockets, keep-alive socket options +and support for Duplicate SACK extensions. +.SH "ADDRESS FORMATS" +TCP is built on top of IP (see +.BR ip (7)). +The address formats defined by +.BR ip (7) +apply to TCP. TCP only supports point-to-point +communication; broadcasting and multicasting are not +supported. +.SH SYSCTLS +These variables can be accessed by the +.B /proc/sys/net/ipv4/* +files or with the +.BR sysctl (2) +interface. In addition, most IP sysctls also apply to TCP; see +.BR ip (7). +.TP +.B tcp_abort_on_overflow +Enable resetting connections if the listening service is too +slow and unable to keep up and accept them. It is not +enabled by default, which means that if an overflow occurred due +to a burst, the connection will recover. Enable this option +_only_ if you are really sure that the listening daemon +cannot be tuned to accept connections faster. Enabling this +option can harm the clients of your server. +.TP +.B tcp_adv_win_scale +Count buffering overhead as bytes/2^tcp_adv_win_scale +(if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale), +if it is <= 0. The default is 2. + +The socket receive buffer space is shared between the +application and kernel. TCP maintains part of the buffer as +the TCP window; this is the size of the receive window +advertised to the other end. The rest of the space is used +as the "application" buffer, used to isolate the network +from scheduling and application latencies. 
The +.B tcp_adv_win_scale +default value of 2 implies that the space +used for the application buffer is one fourth that of the +total. +.TP +.B tcp_app_win +This variable defines how many +bytes of the TCP window are reserved for buffering +overhead. + +max(window/2^tcp_app_win, mss) bytes in the window +are reserved for the application buffer. A value of 0 +implies that no amount is reserved. The default value is 31. +.TP +.B tcp_dsack +Enable RFC2883 TCP Duplicate SACK support. +It is enabled by default. +.TP +.B tcp_ecn +Enable RFC2481 Explicit Congestion Notification. It is not +enabled by default. When enabled, connectivity to some +destinations could be affected due to older, misbehaving +routers along the path causing connections to be dropped. +.TP +.B tcp_fack +Enable TCP Forward Acknowledgement support. It is enabled by +default. +.TP +.B tcp_fin_timeout +How many seconds to wait for a final FIN packet before the +socket is forcibly closed. This is strictly a violation of +the TCP specification, but required to prevent +denial-of-service (DoS) attacks. The default value in 2.4 +kernels is 60, down from 180 in 2.2. +.TP +.B tcp_keepalive_intvl +The number of seconds between TCP keep-alive probes. +The default value is 75 seconds. +.TP +.B tcp_keepalive_probes +The maximum number of TCP keep-alive probes to send +before giving up and killing the connection if +no response is obtained from the other end. +The default value is 9. +.TP +.B tcp_keepalive_time +The number of seconds a connection needs to be idle +before TCP begins sending out keep-alive probes. +Keep-alives are only sent when the +.B SO_KEEPALIVE +socket option is enabled. The default value is 7200 seconds +(2 hours). An idle connection is terminated after +approximately an additional 11 minutes (9 probes at intervals +of 75 seconds) when keep-alive is enabled. + +Note that underlying connection tracking mechanisms and +application timeouts may be much shorter. 
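The keep-alive timing described under tcp_keepalive_time comes down to simple arithmetic, which the following C sketch spells out. The helper names are hypothetical and not part of any kernel or libc interface; the only assumption is that an unresponsive peer is dropped after the idle time plus one interval per unanswered probe.

```c
#include <stdio.h>

/*
 * Hypothetical helper: seconds until an idle connection to an
 * unresponsive peer is terminated, given values in the style of
 * tcp_keepalive_time, tcp_keepalive_intvl and tcp_keepalive_probes.
 */
static long keepalive_worst_case_secs(long idle_time, long intvl, long probes)
{
    return idle_time + probes * intvl;
}

/* Hypothetical helper: print the idle phase and the probe phase. */
static void print_keepalive_budget(long idle_time, long intvl, long probes)
{
    long probe_phase = probes * intvl;

    printf("idle phase:  %ld s\n", idle_time);
    printf("probe phase: %ld s (%ld min %ld s)\n",
           probe_phase, probe_phase / 60, probe_phase % 60);
    printf("total:       %ld s\n",
           keepalive_worst_case_secs(idle_time, intvl, probes));
}
```

Called as print_keepalive_budget(7200, 75, 9), this reports a probe phase of 675 seconds, which is the "approximately 11 minutes" quoted above.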
+.TP +.B tcp_max_orphans +The maximum number of orphaned (not attached to any user file +handle) TCP sockets allowed in the system. When this number +is exceeded, the orphaned connection is reset and a warning +is printed. This limit exists only to prevent simple DoS +attacks. Lowering this limit is not recommended. Network +conditions might require you to increase the number of +orphans allowed, but note that each orphan can eat up to ~64K +of unswappable memory. The default initial value is set +equal to the kernel parameter NR_FILE. This initial default +is adjusted depending on the memory in the system. +.TP +.B tcp_max_syn_backlog +The maximum number of queued connection requests which have +still not received an acknowledgement from the connecting +client. If this number is exceeded, the kernel will begin +dropping requests. The default value of 256 is increased to +1024 when the memory present in the system is adequate or +greater (>= 128Mb), and reduced to 128 for those systems with +very low memory (<= 32Mb). It is recommended that if this +needs to be increased above 1024, TCP_SYNQ_HSIZE in +include/net/tcp.h be modified to keep +TCP_SYNQ_HSIZE*16<=tcp_max_syn_backlog, and the kernel be +recompiled. +.TP +.B tcp_max_tw_buckets +The maximum number of sockets in TIME_WAIT state allowed in +the system. This limit exists only to prevent simple DoS +attacks. The default value of NR_FILE*2 is adjusted +depending on the memory in the system. If this number is +exceeded, the socket is closed and a warning is printed. +.TP +.B tcp_mem +This is a vector of 3 integers: [low, pressure, high]. These +bounds are used by TCP to track its memory usage. The +defaults are calculated at boot time from the amount of +available memory. + +.I low +- TCP doesn't regulate its memory allocation when the number +of pages it has allocated globally is below this number. 
+ +.I pressure +- when the amount of memory allocated by TCP +exceeds this number of pages, TCP moderates its memory +consumption. This memory pressure state is exited +once the number of pages allocated falls below +the +.B low +mark. + +.I high +- the maximum number of pages, globally, that TCP +will allocate. This value overrides any other limits +imposed by the kernel. +.TP +.B tcp_orphan_retries +The maximum number of attempts made to probe the other +end of a connection which has been closed by our end. +The default value is 8. +.TP +.B tcp_reordering +The maximum a packet can be reordered in a TCP packet stream +without TCP assuming packet loss and going into slow start. +The default is 3. It is not advisable to change this number. +This is a packet reordering detection metric designed to +minimize unnecessary back off and retransmits provoked by +reordering of packets on a connection. +.TP +.B tcp_retrans_collapse +Try to send full-sized packets during retransmit. +This is enabled by default. +.TP +.B tcp_retries1 +The number of times TCP will attempt to retransmit a +packet on an established connection normally, +without the extra effort of getting the network +layers involved. Once we exceed this number of +retransmits, we first have the network layer +update the route if possible before each new retransmit. +The default is the RFC specified minimum of 3. +.TP +.B tcp_retries2 +The maximum number of times a TCP packet is retransmitted +in established state before giving up. The default +value is 15, which corresponds to a duration of +approximately between 13 to 30 minutes, depending +on the retransmission timeout. The RFC1122 specified +minimum limit of 100 seconds is typically deemed too +short. +.TP +.B tcp_rfc1337 +Enable TCP behaviour conformant with RFC 1337. +This is not enabled by default. When not enabled, +if a RST is received in TIME_WAIT state, we close +the socket immediately without waiting for the end +of the TIME_WAIT period. 
+.TP +.B tcp_rmem +This is a vector of 3 integers: [min, default, +max]. These parameters are used by TCP to regulate receive +buffer sizes. TCP dynamically adjusts the size of the +receive buffer from the defaults listed below, in the range +of these sysctl variables, depending on memory available +in the system. + +.I min +- minimum size of the receive buffer used by each TCP +socket. The default value is 4K, and is lowered to +PAGE_SIZE bytes in low memory systems. This value +is used to ensure that in memory pressure mode, +allocations below this size will still succeed. This is not +used to bound the size of the receive buffer declared +using +.B SO_RCVBUF +on a socket. + +.I default +- the default size of the receive buffer for a TCP socket. +This value overwrites the initial default buffer size from +the generic global +.B net.core.rmem_default +defined for all protocols. The default value is 87380 +bytes, and is lowered to 43689 in low memory systems. If +larger receive buffer sizes are desired, this value should +be increased (to affect all sockets). To employ large TCP +windows, the sysctl variable +.B net.ipv4.tcp_window_scaling +must be enabled (default). + +.I max +- the maximum size of the receive buffer used by +each TCP socket. This value does not override the global +.BR net.core.rmem_max . +This is not used to limit the size of the receive buffer +declared using +.B SO_RCVBUF +on a socket. +The default value of 87380*2 bytes is lowered to 87380 +in low memory systems. +.TP +.B tcp_sack +Enable RFC2018 TCP Selective Acknowledgements. +It is enabled by default. +.TP +.B tcp_stdurg +Enable the strict RFC793 interpretation of the TCP +urgent-pointer field. The default is to use the +BSD-compatible interpretation of the urgent-pointer, pointing +to the first byte after the urgent data. The RFC793 +interpretation is to have it point to the last byte of urgent +data. Enabling this option may lead to interoperability +problems. 
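The three-integer vectors used by tcp_rmem (and likewise by tcp_mem and tcp_wmem) can be read through the /proc interface mentioned under SYSCTLS. The C sketch below is one possible way to do that; the function names are hypothetical, and it assumes the file holds three whitespace-separated integers on one line.

```c
#include <stdio.h>

/*
 * Hypothetical helper: parse a "min default max" style triple.
 * Returns 0 on success, -1 if three integers could not be read.
 */
static int parse_triple(const char *text, long out[3])
{
    return sscanf(text, "%ld %ld %ld", &out[0], &out[1], &out[2]) == 3 ? 0 : -1;
}

/*
 * Hypothetical helper: read a three-integer sysctl from a /proc
 * path such as "/proc/sys/net/ipv4/tcp_rmem".
 * Returns 0 on success, -1 on any I/O or parse failure.
 */
static int read_triple(const char *path, long out[3])
{
    char buf[128];
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    if (fgets(buf, sizeof buf, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    return parse_triple(buf, out);
}
```

A caller would pass an array of three longs and, on success, find min, default and max in elements 0, 1 and 2.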
+.TP +.B tcp_synack_retries +The maximum number of times a SYN/ACK segment +for a passive TCP connection will be retransmitted. +This number should not be higher than 255. The default +value is 5. +.TP +.B tcp_syncookies +Enable TCP syncookies. The kernel must be compiled with +.BR CONFIG_SYN_COOKIES . +Send out syncookies when the syn backlog queue of a socket +overflows. The syncookies feature attempts to protect a +socket from a SYN flood attack. This should be used as a +last resort, if at all. This is a violation of the TCP +protocol, and conflicts with other areas of TCP such as TCP +extensions. It can cause problems for clients and relays. +It is not recommended as a tuning mechanism for heavily +loaded servers to help with overloaded or misconfigured +conditions. For recommended alternatives see +.BR tcp_max_syn_backlog , +.BR tcp_synack_retries , +.BR tcp_abort_on_overflow . +.TP +.B tcp_syn_retries +The maximum number of times initial SYNs for an active TCP +connection attempt will be retransmitted. This value should +not be higher than 255. The default value is 5, which +corresponds to approximately 180 seconds. +.TP +.B tcp_timestamps +Enable RFC1323 TCP timestamps. This is enabled +by default. +.TP +.B tcp_tw_recycle +Enable fast recycling of TIME-WAIT sockets. It is +not enabled by default. Enabling this option is not +recommended since this causes problems when working +with NAT (Network Address Translation). +.TP +.B tcp_window_scaling +Enable RFC1323 TCP window scaling. It is enabled by +default. This feature allows the use of a large window +(> 64K) on a TCP connection, should the other end support it. +Normally, the 16 bit window length field in the TCP header +limits the window size to less than 64K bytes. If larger +windows are desired, applications can increase the size of +their socket buffers and the window scaling option will be +employed. 
If +.B tcp_window_scaling +is disabled, TCP will not negotiate the use of window +scaling with the other end during connection setup. +.TP +.B tcp_wmem +This is a vector of 3 integers: [min, default, max]. These +parameters are used by TCP to regulate send buffer sizes. +TCP dynamically adjusts the size of the send buffer from the +default values listed below, in the range of these sysctl +variables, depending on memory available. + +.I min +- minimum size of the send buffer used by each TCP socket. +The default value is 4K bytes. +This value is used to ensure that in memory pressure mode, +allocations below this size will still succeed. This is not +used to bound the size of the send buffer declared +using +.B SO_SNDBUF +on a socket. + +.I default +- the default size of the send buffer for a TCP socket. +This value overwrites the initial default buffer size from +the generic global +.B net.core.wmem_default +defined for all protocols. The default value is 16K bytes. +If larger send buffer sizes are desired, this value +should be increased (to affect all sockets). To employ +large TCP windows, the sysctl variable +.B net.ipv4.tcp_window_scaling +must be enabled (default). + +.I max +- the maximum size of the send buffer used by +each TCP socket. This value does not override the global +.BR net.core.wmem_max . +This is not used to limit the size of the send buffer +declared using +.B SO_SNDBUF +on a socket. +The default value is 128K bytes. It is lowered to 64K +depending on the memory available in the system. +.SH "SOCKET OPTIONS" +To set or get a TCP socket option, call +.BR getsockopt (2) +to read or +.BR setsockopt (2) +to write the option with the option level argument set to +.BR SOL_TCP. +In addition, +most +.B SOL_IP +socket options are valid on TCP sockets. For more +information see +.BR ip (7). +.TP +.B TCP_CORK +If set, don't send out partial frames. All queued +partial frames are sent when the option is cleared again. 
+This is useful for prepending headers before calling +.BR sendfile (2), +or for throughput optimization. This option cannot be +combined with +.BR TCP_NODELAY . +This option should not be used in code intended to be +portable. +.TP +.B TCP_DEFER_ACCEPT +Allows a listener to be awakened only when data arrives on +the socket. Takes an integer value (seconds); this can +bound the maximum number of attempts TCP will make to +complete the connection. This option should not be used in +code intended to be portable. +.TP +.B TCP_INFO +Used to collect information about this socket. The kernel +returns a struct tcp_info as defined in the file +/usr/include/linux/tcp.h. This option should not be used in +code intended to be portable. +.TP +.B TCP_KEEPCNT +The maximum number of keepalive probes TCP should send +before dropping the connection. This option should not be +used in code intended to be portable. +.TP +.B TCP_KEEPIDLE +The time (in seconds) the connection needs to remain idle +before TCP starts sending keepalive probes, if the socket +option SO_KEEPALIVE has been set on this socket. This +option should not be used in code intended to be portable. +.TP +.B TCP_KEEPINTVL +The time (in seconds) between individual keepalive probes. +This option should not be used in code intended to be +portable. +.TP +.B TCP_LINGER2 +The lifetime of orphaned FIN_WAIT2 state sockets. This +option can be used to override the system wide sysctl +.B tcp_fin_timeout +on this socket. This is not to be confused with the +.BR socket (7) +level option +.BR SO_LINGER . +This option should not be used in code intended to be +portable. +.TP +.B TCP_MAXSEG +The maximum segment size for outgoing TCP packets. If this +option is set before connection establishment, it also +changes the MSS value announced to the other end in the +initial packet. Values greater than the (eventual) +interface MTU have no effect. TCP will also impose +its minimum and maximum bounds over the value provided. 
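The per-socket keepalive options described above (TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT) are typically set together with SO_KEEPALIVE. The following C sketch shows one way this might look; the function name set_keepalive is hypothetical, and the code passes IPPROTO_TCP as the option level, which has the same value as SOL_TCP on Linux.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: enable keepalive on a TCP socket and override
 * the system-wide timing for this socket only.
 * Returns 0 on success, -1 on the first failing setsockopt(2) call
 * (errno is left as set by that call).
 */
static int set_keepalive(int fd, int idle_secs, int intvl_secs, int probes)
{
    int on = 1;

    /* Keepalive probes are only sent when SO_KEEPALIVE is enabled. */
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;
    /* Idle time before the first probe (overrides tcp_keepalive_time). */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs, sizeof idle_secs) < 0)
        return -1;
    /* Interval between probes (overrides tcp_keepalive_intvl). */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_secs, sizeof intvl_secs) < 0)
        return -1;
    /* Unanswered probes before the connection is dropped. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof probes) < 0)
        return -1;
    return 0;
}
```

As the entries above note, these options are Linux-specific, so a helper like this does not belong in code intended to be portable.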
+.TP +.B TCP_NODELAY +If set, disable the Nagle algorithm. This means that segments +are always sent as soon as possible, even if there is only a +small amount of data. When not set, data is buffered until there +is a sufficient amount to send out, thereby avoiding the +frequent sending of small packets, which results in poor +utilization of the network. This option cannot be used +at the same time as the option +.BR TCP_CORK . +.TP +.B TCP_QUICKACK +Enable quickack mode if set or disable quickack +mode if cleared. In quickack mode, acks are sent +immediately, rather than delayed if needed in accordance +with normal TCP operation. This flag is not permanent, +it only enables a switch to or from quickack mode. +Subsequent operation of the TCP protocol will +once again enter/leave quickack mode depending on +internal protocol processing and factors such as +delayed ack timeouts occurring and data transfer. +This option should not be used in code intended to be +portable. +.TP +.B TCP_SYNCNT +Set the number of SYN retransmits that TCP should send before +aborting the attempt to connect. It cannot exceed 255. +This option should not be used in code intended to be +portable. +.TP +.B TCP_WINDOW_CLAMP +Bound the size of the advertised window to this value. The +kernel imposes a minimum size of SOCK_MIN_RCVBUF/2. +This option should not be used in code intended to be +portable. +.SH IOCTLS +These ioctls can be accessed using +.BR ioctl (2). +The correct syntax is: +.PP +.RS +.nf +.BI int " value"; +.IB error " = ioctl(" tcp_socket ", " ioctl_type ", &" value ");" +.fi +.RE +.TP +.B SIOCINQ +Returns the amount of queued unread data in the receive +buffer. Argument is a pointer to an integer. The socket +must not be in LISTEN state, otherwise an error (EINVAL) +is returned. +.TP +.B SIOCATMARK +Returns true when all urgent data has already been +received by the user program. This is used together with +.BR SO_OOBINLINE . 
+Argument is a pointer to an integer for the test result. +.TP +.B SIOCOUTQ +Returns the amount of unsent data in the socket send queue +in the passed integer value pointer. The socket must not +be in LISTEN state, otherwise an error (EINVAL) +is returned. +.SH "ERROR HANDLING" +When a network error occurs, TCP tries to resend the +packet. If it doesn't succeed after some time, either +.B ETIMEDOUT +or the last received error on this connection is reported. +.PP +Some applications require a quicker error notification. +This can be enabled with the +.B SOL_IP +level +.B IP_RECVERR +socket option. When this option is enabled, all incoming +errors are immediately passed to the user program. Use this +option with care \- it makes TCP less tolerant to routing +changes and other normal network conditions. +.SH NOTES +When an error occurs during connection setup in a +socket write, +.B SIGPIPE +is raised only when the +.B SO_KEEPALIVE +socket option is set. +.PP +TCP has no real out-of-band data; it has urgent data. In +Linux this means that if the other end sends newer out-of-band +data, the older urgent data is inserted as normal data into +the stream (even when +.B SO_OOBINLINE +is not set). This differs from BSD-based stacks. +.PP +Linux uses the BSD compatible interpretation of the urgent +pointer field by default. This violates RFC1122, but is +required for interoperability with other stacks. It can be +changed by the +.B tcp_stdurg +sysctl. +.SH ERRORS +.TP +.B EPIPE +The other end closed the socket unexpectedly or a read is +executed on a shut down socket. +.TP +.B ETIMEDOUT +The other end didn't acknowledge retransmitted data after +some time. +.TP +.B EAFNOTSUPPORT +Passed socket address type in +.I sin_family +was not +.BR AF_INET . +.PP +Any errors defined for +.BR ip (7) +or the generic socket layer may also be returned for TCP. +.SH BUGS +Not all errors are documented. +.br +IPv6 is not described. 
+.\" Only a single Linux kernel version is described +.\" Info for 2.2 was lost. Should be added again, +.\" or put into a separate page. +.SH VERSIONS +Support for Explicit Congestion Notification, zerocopy +sendfile, reordering support and some SACK extensions +(DSACK) were introduced in 2.4. +Support for forward acknowledgement (FACK), TIME_WAIT recycling, +per connection keepalive socket options and sysctls +were introduced in 2.3. + +The default values and descriptions for the sysctl variables +given above are applicable for the 2.4 kernel. +.SH AUTHORS +This man page was originally written by Andi Kleen. +It was updated for 2.4 by Nivedita Singhvi with input from +Alexey Kuznetsov's Documentation/networking/ip-sysctls.txt +document. +.SH "SEE ALSO" +.BR accept (2), +.BR bind (2), +.BR connect (2), +.BR getsockopt (2), +.BR listen (2), +.BR recvmsg (2), +.BR sendfile (2), +.BR sendmsg (2), +.BR socket (2), +.BR sysctl (2), +.BR ip (7), +.BR socket (7) +.sp +RFC793 for the TCP specification. +.br +RFC1122 for the TCP requirements and a description +of the Nagle algorithm. +.br +RFC1323 for TCP timestamp and window scaling options. +.br +RFC1644 for a description of TIME_WAIT assassination +hazards. +.br +RFC2481 for a description of Explicit Congestion +Notification. +.br +RFC2581 for TCP congestion control algorithms. +.br +RFC2018 and RFC2883 for SACK and extensions to SACK.
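As an illustration of the SIOCINQ and SIOCOUTQ ioctls described in the IOCTLS section of the page, the following C sketch wraps each call in a small function. The wrapper names are hypothetical; the sketch assumes the ioctl numbers come from <linux/sockios.h> and follows the calling convention given in the page (a pointer to an int).

```c
#include <linux/sockios.h>  /* SIOCINQ, SIOCOUTQ */
#include <netinet/in.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

/*
 * Hypothetical wrapper: store the number of unread bytes queued in
 * the receive buffer in *count. Returns 0 on success, -1 on error
 * (for example EINVAL when the socket is in LISTEN state).
 */
static int tcp_unread_bytes(int fd, int *count)
{
    return ioctl(fd, SIOCINQ, count);
}

/*
 * Hypothetical wrapper: store the amount of unsent data in the
 * socket send queue in *count. Returns 0 on success, -1 on error.
 */
static int tcp_unsent_bytes(int fd, int *count)
{
    return ioctl(fd, SIOCOUTQ, count);
}
```

Both wrappers leave errno as set by ioctl(2), so a caller can distinguish the EINVAL case for listening sockets from other failures.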