summaryrefslogtreecommitdiffstats
path: root/man4/epoll.4
diff options
context:
space:
mode:
Diffstat (limited to 'man4/epoll.4')
-rw-r--r--man4/epoll.4370
1 files changed, 370 insertions, 0 deletions
diff --git a/man4/epoll.4 b/man4/epoll.4
new file mode 100644
index 000000000..3357d9af2
--- /dev/null
+++ b/man4/epoll.4
@@ -0,0 +1,370 @@
+.\"
+.\" epoll by Davide Libenzi ( efficient event notification retrieval )
+.\" Copyright (C) 2003 Davide Libenzi
+.\"
+.\" This program is free software; you can redistribute it and/or modify
+.\" it under the terms of the GNU General Public License as published by
+.\" the Free Software Foundation; either version 2 of the License, or
+.\" (at your option) any later version.
+.\"
+.\" This program is distributed in the hope that it will be useful,
+.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
+.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+.\" GNU General Public License for more details.
+.\"
+.\" You should have received a copy of the GNU General Public License
+.\" along with this program; if not, write to the Free Software
+.\" Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+.\"
+.\" Davide Libenzi <davidel@xmailserver.org>
+.\"
+.\"
+.TH EPOLL 4 "2002-10-23" Linux "Linux Programmer's Manual"
+.SH NAME
+epoll \- I/O event notification facility
+.SH SYNOPSIS
+.B #include <sys/epoll.h>
+.SH DESCRIPTION
+.B epoll
+is a variant of
+.BR poll (2)
+that can be used either as Edge or Level Triggered interface and scales
+well to large numbers of watched fds. Three system calls are provided to
+set up and control an
+.B epoll
+set:
+.BR epoll_create (2),
+.BR epoll_ctl (2),
+.BR epoll_wait (2).
+
+An
+.B epoll
+set is connected to a file descriptor created by
+.BR epoll_create (2).
+Interest for certain file descriptors is then registered via
+.BR epoll_ctl (2).
+Finally, the actual wait is started by
+.BR epoll_wait (2).
+
+.SH NOTES
+The
+.B epoll
+event distribution interface is able to behave both as Edge Triggered
+( ET ) and Level Triggered ( LT ). The difference between ET and LT
+event distribution mechanism can be described as follows. Suppose that
+this scenario happens :
+.TP
+.B 1
+The file descriptor that represents the read side of a pipe (
+.B RFD
+) is added inside the
+.B epoll
+device.
+.TP
+.B 2
+Pipe writer writes 2Kb of data on the write side of the pipe.
+.TP
+.B 3
+A call to
+.BR epoll_wait (2)
+is done that will return
+.B RFD
+as ready file descriptor.
+.TP
+.B 4
+The pipe reader reads 1Kb of data from
+.BR RFD .
+.TP
+.B 5
+A call to
+.BR epoll_wait (2)
+is done.
+.PP
+
+If the
+.B RFD
+file descriptor has been added to the
+.B epoll
+interface using the
+.B EPOLLET
+flag, the call to
+.BR epoll_wait (2)
+done in step
+.B 5
+will probably hang because of the available data still present in the file
+input buffers and the remote peer might be expecting a response based on the
+data it already sent. The reason for this is that Edge Triggered event
+distribution delivers events only when events happens on the monitored file.
+So, in step
+.B 5
+the caller might end up waiting for some data that is already present inside
+the input buffer. In the above example, an event on
+.B RFD
+will be generated because of the write done in
+.B 2
+, and the event is consumed in
+.BR 3 .
+Since the read operation done in
+.B 4
+does not consume the whole buffer data, the call to
+.BR epoll_wait (2)
+done in step
+.B 5
+might lock indefinitely. The
+.B epoll
+interface, when used with the
+.B EPOLLET
+flag ( Edge Triggered )
+should use non-blocking file descriptors to avoid having a blocking
+read or write starve the task that is handling multiple file descriptors.
+The suggested way to use
+.B epoll
+as an Edge Triggered (
+.B EPOLLET
+) interface is below, and possible pitfalls to avoid follow.
+.RS
+.TP
+.B i
+with non-blocking file descriptors
+.TP
+.B ii
+by going to wait for an event only after
+.BR read (2)
+or
+.BR write (2)
+return EAGAIN
+.RE
+.PP
+On the contrary, when used as a Level Triggered interface,
+.B epoll
+is by all means a faster
+.BR poll (2),
+and can be used wherever the latter is used since it shares the
+same semantics. Since even with the Edge Triggered
+.B epoll
+multiple events can be generated up on receival of multiple chunks of data,
+the caller has the option to specify the
+.B EPOLLONESHOT
+flag, to tell
+.B epoll
+to disable the associated file descriptor after the receival of an event with
+.BR epoll_wait (2).
+When the
+.B EPOLLONESHOT
+flag is specified, it is caller responsibility to rearm the file descriptor using
+.BR epoll_ctl (2)
+with
+.BR EPOLL_CTL_MOD .
+
+.SH EXAMPLE FOR SUGGESTED USAGE
+
+While the usage of
+.B epoll
+when employed like a Level Triggered interface does have the same
+semantics of
+.BR poll (2),
+an Edge Triggered usage requires more clarifiction to avoid stalls
+in the application event loop. In this example, listener is a
+non-blocking socket on which
+.BR listen (2)
+has been called. The function do_use_fd() uses the new ready
+file descriptor until EAGAIN is returned by either
+.BR read (2)
+or
+.BR write (2).
+An event driven state machine application should, after having received
+EAGAIN, record its current state so that at the next call to do_use_fd()
+it will continue to
+.BR read (2)
+or
+.BR write (2)
+from where it stopped before.
+
+.nf
+struct epoll_event ev, *events;
+
+for(;;) {
+ nfds = epoll_wait(kdpfd, events, maxevents, -1);
+
+ for(n = 0; n < nfds; ++n) {
+ if(events[n].data.fd == listener) {
+ client = accept(listener, (struct sockaddr *) &local,
+ &addrlen);
+ if(client < 0){
+ perror("accept");
+ continue;
+ }
+ setnonblocking(client);
+ ev.events = EPOLLIN | EPOLLET;
+ ev.data.fd = client;
+ if (epoll_ctl(kdpfd, EPOLL_CTL_ADD, client, &ev) < 0) {
+ fprintf(stderr, "epoll set insertion error: fd=%d\n",
+ client);
+ return -1;
+ }
+ }
+ else
+ do_use_fd(events[n].data.fd);
+ }
+}
+.fi
+
+When used as an Edge triggered interface, for performance reasons, it is
+possible to add the file descriptor inside the epoll interface (
+.B EPOLL_CTL_ADD
+) once by specifying (
+.BR EPOLLIN | EPOLLOUT
+). This allows you to avoid
+continuously switching between
+.B EPOLLIN
+and
+.B EPOLLOUT
+calling
+.BR epoll_ctl (2)
+with
+.BR EPOLL_CTL_MOD .
+
+.SH QUESTIONS AND ANSWERS (from linux-kernel)
+
+.RS
+.TP
+.B Q1
+What happens if you add the same fd to an epoll_set twice?
+.TP
+.B A1
+You will probably get EEXIST. However, it is possible that two
+threads may add the same fd twice. This is a harmless condition.
+.TP
+.B Q2
+Can two
+.B epoll
+sets wait for the same fd? If so, are events reported
+to both
+.B epoll
+sets fds?
+.TP
+.B A2
+Yes. However, it is not recommended. Yes it would be reported to both.
+.TP
+.B Q3
+Is the
+.B epoll
+fd itself poll/epoll/selectable?
+.TP
+.B A3
+Yes.
+.TP
+.B Q4
+What happens if the
+.B epoll
+fd is put into its own fd set?
+.TP
+.B A4
+It will fail. However, you can add an
+.B epoll
+fd inside another epoll fd set.
+.TP
+.B Q5
+Can I send the
+.B epoll
+fd over a unix-socket to another process?
+.TP
+.B A5
+No.
+.TP
+.B Q6
+Will the close of an fd cause it to be removed from all
+.B epoll
+sets automatically?
+.TP
+.B A6
+Yes.
+.TP
+.B Q7
+If more than one event comes in between
+.BR epoll_wait (2)
+calls, are they combined or reported separately?
+.TP
+.B A7
+They will be combined.
+.TP
+.B Q8
+Does an operation on an fd affect the already collected but not yet reported
+events?
+.TP
+.B A8
+You can do two operations on an existing fd. Remove would be meaningless for
+this case. Modify will re-read available I/O.
+.TP
+.B Q9
+Do I need to continuously read/write an fd until EAGAIN when using the
+.B EPOLLET
+flag ( Edge Triggered behaviour ) ?
+.TP
+.B A9
+No you don't. Receiving an event from
+.BR epoll_wait (2)
+should suggest to you that such file descriptor is ready for the requested I/O
+operation. You have simply to consider it ready until you will receive the
+next EAGAIN. When and how you will use such file descriptor is entirely up
+to you. Also, the condition that the read/write I/O space is exhausted can
+be detected by checking the amount of data read/write from/to the target
+file descriptor. For example, if you call
+.BR read (2)
+by asking to read a certain amount of data and
+.BR read (2)
+returns a lower number of bytes, you can be sure to have exhausted the read
+I/O space for such file descriptor. Same is valid when writing using the
+.BR write (2)
+function.
+.RE
+
+.SH POSSIBLE PITFALLS AND WAYS TO AVOID THEM
+.RS
+.TP
+.B o Starvation ( Edge Triggered )
+.PP
+If there is a large amount of I/O space, it is possible that by trying to drain
+it the other files will not get processed causing starvation. This
+is not specific to
+.BR epoll .
+.PP
+.PP
+The solution is to maintain a ready list and mark the file descriptor as ready
+in its associated data structure, thereby allowing the application to
+remember which files need to be processed but still round robin amongst
+all the ready files. This also supports ignoring subsequent events you
+receive for fd's that are already ready.
+.PP
+
+.TP
+.B o If using an event cache...
+.PP
+If you use an event cache or store all the fd's returned from
+.BR epoll_wait (2),
+then make sure to provide a way to mark its closure dynamically (ie- caused by
+a previous event's processing). Suppose you receive 100 events from
+.BR epoll_wait (2),
+and in eventi #47 a condition causes event #13 to be closed.
+If you remove the structure and close() the fd for event #13, then your
+event cache might still say there are events waiting for that fd causing
+confusion.
+.PP
+.PP
+One solution for this is to call, during the processing of event 47,
+.BR epoll_ctl ( EPOLL_CTL_DEL )
+to delete fd 13 and close(), then mark its associated
+data structure as removed and link it to a cleanup list. If you find another
+event for fd 13 in your batch processing, you will discover the fd had been
+previously removed and there will be no confusion.
+.PP
+
+.RE
+.SH CONFORMING TO
+.BR epoll (4)
+is a new API introduced in Linux kernel 2.5.44.
+Its interface should be finalized in Linux kernel 2.5.66.
+.SH "SEE ALSO"
+.BR epoll_create (2),
+.BR epoll_ctl (2),
+.BR epoll_wait (2)