Diffstat (limited to 'man4/epoll.4')
-rw-r--r-- | man4/epoll.4 | 370 |
1 files changed, 370 insertions, 0 deletions
diff --git a/man4/epoll.4 b/man4/epoll.4
new file mode 100644
index 000000000..3357d9af2
--- /dev/null
+++ b/man4/epoll.4
@@ -0,0 +1,370 @@
+.\"
+.\" epoll by Davide Libenzi ( efficient event notification retrieval )
+.\" Copyright (C) 2003 Davide Libenzi
+.\"
+.\" This program is free software; you can redistribute it and/or modify
+.\" it under the terms of the GNU General Public License as published by
+.\" the Free Software Foundation; either version 2 of the License, or
+.\" (at your option) any later version.
+.\"
+.\" This program is distributed in the hope that it will be useful,
+.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
+.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+.\" GNU General Public License for more details.
+.\"
+.\" You should have received a copy of the GNU General Public License
+.\" along with this program; if not, write to the Free Software
+.\" Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+.\"
+.\" Davide Libenzi <davidel@xmailserver.org>
+.\"
+.\"
+.TH EPOLL 4 "2002-10-23" Linux "Linux Programmer's Manual"
+.SH NAME
+epoll \- I/O event notification facility
+.SH SYNOPSIS
+.B #include <sys/epoll.h>
+.SH DESCRIPTION
+.B epoll
+is a variant of
+.BR poll (2)
+that can be used either as Edge or Level Triggered interface and scales
+well to large numbers of watched fds. Three system calls are provided to
+set up and control an
+.B epoll
+set:
+.BR epoll_create (2),
+.BR epoll_ctl (2),
+.BR epoll_wait (2).
+
+An
+.B epoll
+set is connected to a file descriptor created by
+.BR epoll_create (2).
+Interest for certain file descriptors is then registered via
+.BR epoll_ctl (2).
+Finally, the actual wait is started by
+.BR epoll_wait (2).
+
+.SH NOTES
+The
+.B epoll
+event distribution interface is able to behave both as Edge Triggered
+( ET ) and Level Triggered ( LT ). The difference between the ET and LT
+event distribution mechanisms can be described as follows. Suppose that
+this scenario happens:
+.TP
+.B 1
+The file descriptor that represents the read side of a pipe (
+.B RFD
+) is added inside the
+.B epoll
+device.
+.TP
+.B 2
+Pipe writer writes 2 kB of data on the write side of the pipe.
+.TP
+.B 3
+A call to
+.BR epoll_wait (2)
+is done that will return
+.B RFD
+as ready file descriptor.
+.TP
+.B 4
+The pipe reader reads 1 kB of data from
+.BR RFD .
+.TP
+.B 5
+A call to
+.BR epoll_wait (2)
+is done.
+.PP
+
+If the
+.B RFD
+file descriptor has been added to the
+.B epoll
+interface using the
+.B EPOLLET
+flag, the call to
+.BR epoll_wait (2)
+done in step
+.B 5
+will probably hang despite the available data still present in the file
+input buffers, while the remote peer might be expecting a response based on the
+data it already sent. The reason for this is that Edge Triggered event
+distribution delivers events only when events happen on the monitored file.
+So, in step
+.B 5
+the caller might end up waiting for some data that is already present inside
+the input buffer. In the above example, an event on
+.B RFD
+will be generated because of the write done in
+.B 2
+, and the event is consumed in
+.BR 3 .
+Since the read operation done in
+.B 4
+does not consume the whole buffer data, the call to
+.BR epoll_wait (2)
+done in step
+.B 5
+might block indefinitely. The
+.B epoll
+interface, when used with the
+.B EPOLLET
+flag ( Edge Triggered )
+should use non-blocking file descriptors to avoid having a blocking
+read or write starve the task that is handling multiple file descriptors.
+The suggested way to use
+.B epoll
+as an Edge Triggered (
+.B EPOLLET
+) interface is below, and possible pitfalls to avoid follow.
+.RS
+.TP
+.B i
+with non-blocking file descriptors
+.TP
+.B ii
+by going to wait for an event only after
+.BR read (2)
+or
+.BR write (2)
+return EAGAIN
+.RE
+.PP
+On the contrary, when used as a Level Triggered interface,
+.B epoll
+is by all means a faster
+.BR poll (2),
+and can be used wherever the latter is used since it shares the
+same semantics. Since even with the Edge Triggered
+.B epoll
+multiple events can be generated upon receipt of multiple chunks of data,
+the caller has the option to specify the
+.B EPOLLONESHOT
+flag, to tell
+.B epoll
+to disable the associated file descriptor after the receipt of an event with
+.BR epoll_wait (2).
+When the
+.B EPOLLONESHOT
+flag is specified, it is the caller's responsibility to rearm the file descriptor using
+.BR epoll_ctl (2)
+with
+.BR EPOLL_CTL_MOD .
+
+.SH EXAMPLE FOR SUGGESTED USAGE
+
+While the usage of
+.B epoll
+when employed like a Level Triggered interface does have the same
+semantics as
+.BR poll (2),
+an Edge Triggered usage requires more clarification to avoid stalls
+in the application event loop. In this example, listener is a
+non-blocking socket on which
+.BR listen (2)
+has been called. The function do_use_fd() uses the new ready
+file descriptor until EAGAIN is returned by either
+.BR read (2)
+or
+.BR write (2).
+An event driven state machine application should, after having received
+EAGAIN, record its current state so that at the next call to do_use_fd()
+it will continue to
+.BR read (2)
+or
+.BR write (2)
+from where it stopped before.
+
+.nf
+struct epoll_event ev, *events;
+
+for(;;) {
+    nfds = epoll_wait(kdpfd, events, maxevents, -1);
+
+    for(n = 0; n < nfds; ++n) {
+        if(events[n].data.fd == listener) {
+            client = accept(listener, (struct sockaddr *) &local,
+                            &addrlen);
+            if(client < 0){
+                perror("accept");
+                continue;
+            }
+            setnonblocking(client);
+            ev.events = EPOLLIN | EPOLLET;
+            ev.data.fd = client;
+            if (epoll_ctl(kdpfd, EPOLL_CTL_ADD, client, &ev) < 0) {
+                fprintf(stderr, "epoll set insertion error: fd=%d\n",
+                        client);
+                return -1;
+            }
+        }
+        else
+            do_use_fd(events[n].data.fd);
+    }
+}
+.fi
+
+When used as an Edge triggered interface, for performance reasons, it is
+possible to add the file descriptor inside the epoll interface (
+.B EPOLL_CTL_ADD
+) once by specifying (
+.BR EPOLLIN | EPOLLOUT
+). This allows you to avoid
+continuously switching between
+.B EPOLLIN
+and
+.B EPOLLOUT
+by calling
+.BR epoll_ctl (2)
+with
+.BR EPOLL_CTL_MOD .
+
+.SH QUESTIONS AND ANSWERS (from linux-kernel)
+
+.RS
+.TP
+.B Q1
+What happens if you add the same fd to an epoll_set twice?
+.TP
+.B A1
+You will probably get EEXIST. However, it is possible that two
+threads may add the same fd twice. This is a harmless condition.
+.TP
+.B Q2
+Can two
+.B epoll
+sets wait for the same fd? If so, are events reported
+to both
+.B epoll
+sets fds?
+.TP
+.B A2
+Yes. However, it is not recommended. Yes, events would be reported to both.
+.TP
+.B Q3
+Is the
+.B epoll
+fd itself poll/epoll/selectable?
+.TP
+.B A3
+Yes.
+.TP
+.B Q4
+What happens if the
+.B epoll
+fd is put into its own fd set?
+.TP
+.B A4
+It will fail. However, you can add an
+.B epoll
+fd inside another epoll fd set.
+.TP
+.B Q5
+Can I send the
+.B epoll
+fd over a unix-socket to another process?
+.TP
+.B A5
+No.
+.TP
+.B Q6
+Will the close of an fd cause it to be removed from all
+.B epoll
+sets automatically?
+.TP
+.B A6
+Yes.
+.TP
+.B Q7
+If more than one event comes in between
+.BR epoll_wait (2)
+calls, are they combined or reported separately?
+.TP
+.B A7
+They will be combined.
+.TP
+.B Q8
+Does an operation on an fd affect the already collected but not yet reported
+events?
+.TP
+.B A8
+You can do two operations on an existing fd. Remove would be meaningless for
+this case. Modify will re-read available I/O.
+.TP
+.B Q9
+Do I need to continuously read/write an fd until EAGAIN when using the
+.B EPOLLET
+flag ( Edge Triggered behaviour ) ?
+.TP
+.B A9
+No, you don't. Receiving an event from
+.BR epoll_wait (2)
+should suggest to you that such file descriptor is ready for the requested I/O
+operation. You simply have to consider it ready until you receive the
+next EAGAIN. When and how you will use such file descriptor is entirely up
+to you. Also, the condition that the read/write I/O space is exhausted can
+be detected by checking the amount of data read from or written to the target
+file descriptor. For example, if you call
+.BR read (2)
+by asking to read a certain amount of data and
+.BR read (2)
+returns a lower number of bytes, you can be sure to have exhausted the read
+I/O space for such file descriptor. The same is valid when writing using the
+.BR write (2)
+function.
+.RE
+
+.SH POSSIBLE PITFALLS AND WAYS TO AVOID THEM
+.RS
+.TP
+.B o Starvation ( Edge Triggered )
+.PP
+If there is a large amount of I/O space, it is possible that by trying to drain
+it the other files will not get processed causing starvation. This
+is not specific to
+.BR epoll .
+.PP
+.PP
+The solution is to maintain a ready list and mark the file descriptor as ready
+in its associated data structure, thereby allowing the application to
+remember which files need to be processed but still round robin amongst
+all the ready files. This also supports ignoring subsequent events you
+receive for fd's that are already ready.
+.PP
+
+.TP
+.B o If using an event cache...
+.PP
+If you use an event cache or store all the fd's returned from
+.BR epoll_wait (2),
+then make sure to provide a way to mark its closure dynamically (i.e., caused by
+a previous event's processing). Suppose you receive 100 events from
+.BR epoll_wait (2),
+and in event #47 a condition causes event #13 to be closed.
+If you remove the structure and close() the fd for event #13, then your
+event cache might still say there are events waiting for that fd causing
+confusion.
+.PP
+.PP
+One solution for this is to call, during the processing of event 47,
+.BR epoll_ctl ( EPOLL_CTL_DEL )
+to delete fd 13 and close(), then mark its associated
+data structure as removed and link it to a cleanup list. If you find another
+event for fd 13 in your batch processing, you will discover the fd had been
+previously removed and there will be no confusion.
+.PP
+
+.RE
+.SH CONFORMING TO
+.BR epoll (4)
+is a new API introduced in Linux kernel 2.5.44.
+Its interface should be finalized in Linux kernel 2.5.66.
+.SH "SEE ALSO"
+.BR epoll_create (2),
+.BR epoll_ctl (2),
+.BR epoll_wait (2)
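The following is a minimal sketch, not part of the patch above, that replays the five-step pipe scenario from the NOTES section of the man page and then applies the suggested remedy (non-blocking fd, read until EAGAIN). The helper names set_nonblocking(), drain_fd() and et_pipe_demo() are hypothetical and chosen only for illustration. Step 5 uses a timeout of 0 instead of -1 so that the missing event can be observed without hanging.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: put fd into non-blocking mode (0 on success). */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical helper: read a non-blocking fd until read(2) reports
 * EAGAIN or end of file. Returns the bytes read, or -1 on a real error. */
static long drain_fd(int fd)
{
    char buf[512];
    long total = 0;

    assert(fd >= 0);
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;
            continue;
        }
        if (n == 0)
            return total;               /* end of file */
        if (errno == EINTR)
            continue;                   /* interrupted, retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;               /* read space exhausted */
        return -1;
    }
}

/* Replays steps 1-5 on a pipe registered with EPOLLET.
 * res[0]: return value of the first epoll_wait() (step 3)
 * res[1]: return value of the second epoll_wait() (step 5, timeout 0)
 * res[2]: bytes still readable after step 5, as counted by drain_fd()
 * Returns 0 if the scenario could be set up and run, -1 otherwise. */
static int et_pipe_demo(long res[3])
{
    int p[2], epfd, rc = -1;
    char chunk[2048], half[1024];
    struct epoll_event ev, got;

    if (pipe(p) < 0)
        return -1;
    epfd = epoll_create(1);
    if (epfd < 0)
        goto close_pipe;
    if (set_nonblocking(p[0]) < 0)
        goto done;

    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN | EPOLLET;
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) < 0)       /* step 1 */
        goto done;

    memset(chunk, 'x', sizeof chunk);
    if (write(p[1], chunk, sizeof chunk) != (ssize_t) sizeof chunk)
        goto done;                                           /* step 2 */

    res[0] = epoll_wait(epfd, &got, 1, 1000);                /* step 3 */
    if (read(p[0], half, sizeof half) != (ssize_t) sizeof half)
        goto done;                                           /* step 4 */
    res[1] = epoll_wait(epfd, &got, 1, 0);                   /* step 5 */
    res[2] = drain_fd(p[0]);       /* the remedy: read until EAGAIN */
    rc = 0;
done:
    close(epfd);
close_pipe:
    close(p[0]);
    close(p[1]);
    return rc;
}
```

Calling et_pipe_demo() from a small main() and printing the three results shows whether the second wait reported an event and how much data was left behind in the pipe. With a Level Triggered registration (no EPOLLET), the second wait would be expected to report RFD as ready again.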