UNIX Socket FAQ

A forum for questions and answers about network programming on Linux and all other Unix-like systems


#1 2010-02-10 12:04 PM

FooBarWidget
Member
Registered: 2010-02-10
Posts: 6

Re: What can cause a spontaneous EPIPE error without either end calling close or crash?

I have an application that consists of two processes (let's call them A and B), connected to each other through Unix sockets. Most of the time it works fine, but some users report the following behavior:
*  A sends a request to B. This works. A now starts reading the reply from B.
*  B sends a reply to A. The corresponding write() call fails with EPIPE, and as a result B close()s the socket. However, A did not close() the socket, nor did it crash.
*  A's read() call returns 0, indicating end-of-file. A thinks that B prematurely closed the connection.


Users have also reported variations of this behavior, e.g.:
*  A sends a request to B. This works partially, but before the entire request is sent, A's write() call fails with EPIPE, and as a result A close()s the socket. However, B did not close() the socket, nor did it crash.
*  B reads a partial request and then suddenly gets an EOF.


The problem is I cannot reproduce this behavior locally at all. I've tried OS X and Linux. The users are on a variety of systems, mostly OS X and Linux.

Things that I've already tried and considered:

    *  Double close() bugs (close() is called twice on the same file descriptor): probably not, as that would result in EBADF errors, and I haven't seen any.
    *  Increasing the maximum file descriptor limit. One user reported that this worked for him, the rest reported that it did not.

What else could cause behavior like this? I know for certain that neither A nor B calls close() on the socket prematurely, and I know for certain that neither of them has crashed, because both A and B were able to report the error. It is as if the kernel suddenly decided to pull the plug on the socket for some reason.


#2 2010-02-10 02:39 PM

RobSeace
Administrator
From: Boston, MA
Registered: 2002-06-12
Posts: 3,826

Hmmmm...  Well, if it were a TCP socket, I'd say it was probably just network flakiness
terminating the connection abnormally or something...  But, with a Unix domain socket,
that can't be it...

You say you're sure you don't close() the socket incorrectly on either end when it
happens, but what about shutdown()?  Do you call that at all in any circumstance on
either end?  Because, that could have the same effect as a close(), but still leave the
FD valid and open, as it seems to be in your case...
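
For illustration, a minimal sketch of the shutdown() case, written in Python for brevity (the application in this thread is C++). It uses a socketpair() inside one process as a stand-in for A and B, and the function name is made up for the example. It assumes the usual Linux behavior for AF_UNIX stream sockets; other systems may differ in detail.

```python
import errno
import os
import socket


def shutdown_then_write():
    """Show that shutdown() on one end makes the peer's send fail with EPIPE
    while the descriptor that was shut down stays open and valid."""
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        a.shutdown(socket.SHUT_RDWR)   # not close(): a's FD stays allocated
        os.fstat(a.fileno())           # would raise EBADF if the FD were closed

        try:
            b.send(b"reply")
            send_errno = 0
        except OSError as exc:         # Python ignores SIGPIPE, so we get the errno
            send_errno = exc.errno

        eof = b.recv(1) == b""         # b also sees end-of-file on read
        return send_errno, eof
    finally:
        a.close()
        b.close()
```

So a stray shutdown() would produce the reported symptoms (EPIPE on one side, EOF on read) without any EBADF ever showing up.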

Aside from that, my only real guess is some sort of subtle memory corruption going
on...  Eg: such that the variable holding your socket FD gets overwritten, and so you
try to write() to the wrong FD, or something...  Seems unlikely though that it'd get
overwritten with another valid, open FD, which just happens to be a pipe or socket
that has no reader, such that it'd generate the EPIPE error...

I'd say just add as much debug logging as you can, and see if users can duplicate
it and send you the logs...  Log all connects and disconnects (normal and abnormal),
the FDs in use at all times, etc...  And, when you get an EPIPE, log the FD, and try
to obtain as much info as you can about that open FD before throwing it away...
Do getsockname() and getpeername() on it, look it up in "/proc/self/fd/" (in fact,
maybe dump the whole set of currently open FDs from there), and cross-reference
the inode# for the socket from there with "/proc/net/unix" to pull up more info...  Do
getsockopt(SO_PEERCRED) to obtain PID and UID of your connecting peers, and
poke into their "/proc/<pid>/fd/"s, too (assuming your server has perms to peek in
there, anyway)...  Etc...  Basically, just try to log everything you can, and hopefully
something will stand out if/when someone duplicates the problem in the future...
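
A rough sketch of that kind of logging, in Python for brevity and Linux-specific where noted. The helper names describe_socket() and dump_open_fds() are hypothetical, not anything from the application discussed here. The inode number it records is the one to look for in "/proc/net/unix".

```python
import os
import socket
import struct


def describe_socket(sock):
    """Collect identifying details about a Unix domain socket for a log line."""
    info = {
        "fd": sock.fileno(),
        "sockname": sock.getsockname(),
        "inode": os.fstat(sock.fileno()).st_ino,
    }
    try:
        info["peername"] = sock.getpeername()
    except OSError as exc:              # e.g. ENOTCONN once the peer is gone
        info["peername_error"] = exc.errno
    if hasattr(socket, "SO_PEERCRED"):  # Linux-specific option
        # struct ucred is { pid_t pid; uid_t uid; gid_t gid; }
        raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED,
                              struct.calcsize("iII"))
        pid, uid, gid = struct.unpack("iII", raw)
        info.update(peer_pid=pid, peer_uid=uid, peer_gid=gid)
    return info


def dump_open_fds(proc_dir="/proc/self/fd"):
    """Map every open FD number to what /proc says it refers to (Linux only)."""
    fds = {}
    if not os.path.isdir(proc_dir):
        return fds
    for name in os.listdir(proc_dir):
        try:
            fds[int(name)] = os.readlink(os.path.join(proc_dir, name))
        except OSError:                 # the FD used for listing is already gone
            pass
    return fds
```

Calling both at the moment an EPIPE is caught, before the socket is closed, gives a snapshot that can be compared with the same dump taken when the connection was set up.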


#3 2010-02-10 02:42 PM

FooBarWidget
Member
Registered: 2010-02-10
Posts: 6

Nope, no shutdown() anywhere.

Memory corruption is not out of the question, but unlikely. I've tested stuff with Valgrind and I've never seen any EBADF errors.


#4 2010-02-10 02:51 PM

RobSeace
Administrator
From: Boston, MA
Registered: 2002-06-12
Posts: 3,826

> through Unix sockets

Just re-read that, and realized I mentally inserted a "domain" in there...  So, now I'm not
really sure if you actually mean Unix domain sockets or something else (TCP?)...  If
Unix domain (AF_UNIX/AF_LOCAL), then what I said above applies...  If TCP, then
the first likely candidate is network flakiness between the peers...  If they're on the
same host, connected through localhost/loopback, then it basically goes back to the
same situation as Unix domain sockets, since network issues are taken out of the
picture...  But, the info you can obtain about AF_INET sockets will differ from what you
can get about AF_UNIX sockets...  You can look those up in "/proc/net/tcp"...  You might
also want to getsockopt(SO_ERROR) before closing the connection...
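
A small sketch of the SO_ERROR check, in Python for brevity; the helper name is made up. Note that for a Unix domain socket the value may well be 0 even after an EPIPE, so this is likely to tell you more on TCP.

```python
import os
import socket


def pending_socket_error(sock):
    """Return the socket's pending error code (0 if none); reading it clears it."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)


def pending_socket_error_text(sock):
    """Same value, formatted for a log line."""
    err = pending_socket_error(sock)
    return "SO_ERROR=%d (%s)" % (err, os.strerror(err) if err else "none")
```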


#5 2010-02-10 03:00 PM

FooBarWidget
Member
Registered: 2010-02-10
Posts: 6

Yes I am talking about AF_LOCAL. All processes are running on localhost.


#6 2010-02-11 06:31 AM

i3839
Oddministrator
From: Amsterdam
Registered: 2003-06-07
Posts: 2,230

It really looks like a subtle bug in your code. It's a lot easier to help if we
see your code.


#7 2010-02-11 10:07 AM

FooBarWidget
Member
Registered: 2010-02-10
Posts: 6

The code is here: http://github.com/FooBarWidget/passenger
The bug reports:
http://code.google.com/p/phusion-passen … %20Summary
http://code.google.com/p/phusion-passen … %20Summary

Component A is the Apache module (most code in ext/apache2/Hooks.cpp) while component B is the ApplicationPoolServerExecutable (in ext/common/ApplicationPoolServerExecutable.cpp)


#8 2010-02-11 03:30 PM

RobSeace
Administrator
From: Boston, MA
Registered: 2002-06-12
Posts: 3,826
Website

Re: What can cause a spontaneous EPIPE error without either end calling close or crash?

Ugh...  More C++ code... ;-/

Well, from what I could see, the real guts of the actual low-level socket handling are
buried in "ext/common/MessageChannel.h"...  Is that right, or am I looking at the
wrong thing?

Anyway, one thing I don't like: for {read,write}Scalar() you use a 32-bit size header,
while for plain {read,write}() you use a 16-bit one...  It would only matter if reader and
writer disagreed on which method they should be using to read/write at the same
time, but still, I can't see much reason not to use the same sized header for both...
Also, you don't seem to be handling EINTR as non-fatal in any of your syscall read()'s
or write()'s...  And, why not have your read() call readRaw() like readScalar() does,
instead of rolling its own low-level syscall reading?

It's really hard to follow everything that's happening through all the layers of C++
classes and stuff, so I'm not sure what the real problem is...  I might try to take
another look and see if I can figure out WTF is going on, though...  I'm a straight C
coder myself though, so it hurts my damn brain to twist through all that wacky C++
abstraction and obfuscation... ;-)


#9 2010-02-11 04:22 PM

FooBarWidget
Member
Registered: 2010-02-10
Posts: 6

Low-level socket operations are actually handled in ext/oxt/system_calls.{cpp,hpp}. EINTR is handled; in fact, we have an entire system call interruption framework based on consistent EINTR handling throughout the entire process.

MessageChannel is for slightly higher level socket operations. It wraps a socket and allows you to send and receive messages that follow a certain protocol format, allowing one to concentrate on the messaging logic instead of having to deal with constructing and parsing messages all the time. The 16-bit header for array messages is to conserve space. No specific reason why readRaw() isn't used in read(), but as far as I know the logic is entirely correct.

The higher-level application logic that's relevant to this problem is in Hooks.cpp and ApplicationPoolServerExecutable.cpp.

So far I've identified at least two things that can cause EPIPE:

    *  Mac OS X kernel bugs, triggered by passing a client socket through a Unix domain socket. I'm working around this in the next version by directly connecting to the server in the source process.
    *  Safari keep-alive bugs.

There still seem to be other causes, and I'm trying to find out what they are.


#10 2010-02-12 02:33 PM

RobSeace
Administrator
From: Boston, MA
Registered: 2002-06-12
Posts: 3,826

> Low-level socket operations are actually handled in ext/oxt/system_calls.{cpp,hpp}.

Ah, I guess I didn't burrow far enough down the rabbit-hole of abstraction to find
that... ;-)  When I actually found stuff that seemed to be calling real C syscalls that
looked familiar to me, I stopped, and was happy to finally understand WTF it was I
was reading... ;-)

> The 16-bit header for array messages is to conserve space.

And, you really need 32-bit for the "scalar" one?  My only concern is that one side
might send in one format, while the other side tries to read the other format, and so
reads a bogus size header...  I have no idea if such a situation is truly possible in
the code, so it may not really be an issue in real life...  I'm just saying IF it were to
occur, it'd be a very BAD thing...  At the very least, it could certainly result in one side
terminating the connection because it doesn't think they sent enough data...
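
To make the concern concrete, here is a sketch in Python using a hypothetical layout (a big-endian length prefix followed directly by the payload). This is an assumption for illustration and not the actual wire format of the code under discussion; all function names are invented.

```python
import struct


def frame16(payload):
    """Prefix payload with a 16-bit big-endian length (hypothetical layout)."""
    return struct.pack("!H", len(payload)) + payload


def frame32(payload):
    """Prefix payload with a 32-bit big-endian length (hypothetical layout)."""
    return struct.pack("!I", len(payload)) + payload


def claimed_length16(data):
    """Length a reader expecting a 16-bit header would take from data."""
    return struct.unpack("!H", data[:2])[0]


def claimed_length32(data):
    """Length a reader expecting a 32-bit header would take from data."""
    return struct.unpack("!I", data[:4])[0]


wire = frame16(b"hello")
# A 32-bit reader swallows the real length plus the first two payload bytes
# and ends up waiting for far more data than was ever sent.
bogus = claimed_length32(wire)

wire = frame32(b"hello")
# A 16-bit reader sees only the two high-order zero bytes: an empty message,
# with the rest of the header left to be misparsed as the next one.
short = claimed_length16(wire)
```

In the first case the reader blocks or gives up on a "truncated" message; in the second it silently gets out of sync with the stream.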

On a side-note: while generally doing hton{s,l}()/ntoh{s,l}() on sent/received int values
is commendable and recommended, it's honestly not necessary when you're sending
them over a Unix domain socket, since by definition you're talking to the same host,
so the receiver MUST have the same exact endianness as the sender! ;-)  But, hey,
it probably isn't any perceivable overhead, and will be a good thing if you ever change
things so that sender and receiver are on different hosts talking over a TCP socket
or something...
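
A short sketch of the difference, in Python for brevity: struct's "!" prefix gives network (big-endian) order, which is what htonl() produces in C, while "=" keeps the host's native order. The variable names are only for illustration.

```python
import struct
import sys

value = 0x01020304

network = struct.pack("!I", value)   # big-endian, what htonl() produces
native = struct.pack("=I", value)    # host byte order, no conversion

# Both peers of a Unix domain socket run on the same host, so either encoding
# round-trips as long as reader and writer use the same one.
assert struct.unpack("!I", network)[0] == value
assert struct.unpack("=I", native)[0] == value

same_on_wire = (network == native)   # True only on a big-endian host
```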

> No specific reason why readRaw() isn't used in read(), but as far as I know the logic is entirely correct.

Yeah, I didn't see any problem...  It just worries me when I see code seemingly
duplicated needlessly...  Either there's some subtle difference in what it's doing that
prevents it from using the generalized function that everyone else is calling, or it's
just reinventing the wheel for no reason and risking introducing its own problems
(if not now, in the future, when a bug fix is made to the generalized code, but not
copied to the duplicate code)...

> Safari keep-alive bugs.

How would that cause EPIPE on your interprocess Unix domain socket?  Unless
that takes down one of your processes, anyway...  But, you claimed at the start of
this thread that you were certain both processes were alive and well at the time of
the EPIPEs...


#11 2010-02-12 03:14 PM

FooBarWidget
Member
Registered: 2010-02-10
Posts: 6

> And, you really need 32-bit for the "scalar" one?


Yes.

> My only concern is that one side
> might send in one format, while the other side tries to read the other format, and so
> reads a bogus size header...

Yes, that would be an issue, but it would be a problem regardless of the header size. All processes must read and write exactly the message format that the peer expects; anything else will result in things borking real badly.

But suppose that A sends an array message and B tries to read it as a scalar message. B's readScalar() will throw a proper exception upon seeing that there is less data than the 32-bit header it read indicates. It shouldn't cause EPIPE or anything like that.

> How would that cause EPIPE on your interprocess Unix domain socket? Unless
> that takes down one of your processes, anyway... But, you claimed at the start of
> this thread that you were certain both processes were alive and well at the time of
> the EPIPEs...

Process A is a web server. The Safari bug causes the browser to suddenly close its connection to the web server. When this happens, the web server closes its connection to B too, causing an EPIPE when B tries to send something back. Something similar happens if the user clicks Stop, but that is legitimate behavior.
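
This sequence can be reproduced in isolation. A minimal sketch in Python (for brevity; the real components are C++), with a socketpair() standing in for the A-to-B connection and a made-up function name, assuming Linux-style AF_UNIX stream behavior:

```python
import errno
import socket


def peer_gives_up():
    """A sends a request and then closes, as the web server does when the
    browser disconnects; B reads the request and fails on the reply."""
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        a.sendall(b"request")
        a.close()                      # A abandons the exchange

        request = b.recv(64)           # queued data is still delivered to B
        try:
            b.sendall(b"reply")
            reply_errno = 0
        except OSError as exc:
            reply_errno = exc.errno    # EPIPE: nobody is left to read
        return request, reply_errno
    finally:
        a.close()                      # harmless if already closed
        b.close()
```

From B's point of view nothing looks wrong until the reply is written, which matches the legitimate case described above.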

I'm concerned with the rest of the cases, where the browser didn't close the connection but the EPIPE error still occurs.


#12 2010-02-12 10:41 PM

RobSeace
Administrator
From: Boston, MA
Registered: 2002-06-12
Posts: 3,826

> B's readScalar() will throw a proper exception upon seeing that there is less data than its expected 32-bit header indicates.

Yeah, but in the opposite case (sending a 32-bit header, reading it as 16-bit), things
could get really wacky...  It might read out part of the message, then leave the rest to
parse out as a separate message, using part of the data as another length header...
Probably still not a real-world concern, in actual practice, but if it were ME, I'd probably
have gone with a fixed-size header for all messages, containing a length and a
message type...  Then, the receiver could always be sure it was reading the correct
message type that it's expecting, too... *shrug*
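
A sketch of what such a fixed-size header could look like, in Python for brevity. The field widths, the type codes and the function names are all assumptions chosen for the example, not a description of any existing protocol.

```python
import struct

HEADER = struct.Struct("!IH")          # 32-bit payload length, 16-bit message type
MSG_ARRAY = 1                          # hypothetical type codes
MSG_SCALAR = 2


def encode(msg_type, payload):
    """Build one message: fixed-size header followed by the payload."""
    return HEADER.pack(len(payload), msg_type) + payload


def decode(buf, expected_type):
    """Parse one message from the front of buf.

    Returns (payload, remaining bytes), or None if buf does not yet hold a
    complete message. Raises ValueError when the type is not the expected one.
    """
    if len(buf) < HEADER.size:
        return None
    length, msg_type = HEADER.unpack_from(buf)
    if msg_type != expected_type:
        raise ValueError("expected type %d, got %d" % (expected_type, msg_type))
    end = HEADER.size + length
    if len(buf) < end:
        return None
    return buf[HEADER.size:end], buf[end:]
```

With this layout a reader that expects the wrong kind of message fails immediately with a clear error instead of misreading the length.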

> It shouldn't cause EPIPE or anything like that.

Probably not, as long as you don't allow multiple outstanding unreplied-to messages
in the pipe...  If you did, then the sender could theoretically try to send one after the
receiver choked on the seemingly short message and closed its end...

> I'm concerned with the rest of the cases, where the browser didn't close the connection but the EPIPE error still occurs.

And, you're sure these cases actually exist, and aren't just users misreporting the
already identified cases?  If so, I'm really not sure...  It doesn't really sound
possible, barring kernel bugs...  The receiver simply must've closed (or shutdown)
its end for the sender to get EPIPE...  If you can verify through logging that both
are alive and neither has closed, then I'd just throw in the tons of extra debug
logging I mentioned to try to see WTF is going on...  Also, does it only happen on
a certain platform?  If so, kernel bug becomes more likely; if not, I've got to go with
some hidden close going on somewhere in your code, even if you don't think you're
doing one...  (And, the spurious EOF you originally mentioned the receiver getting
at the same time the sender got EPIPE could theoretically be possible if it opened
another socket/file/whatever in between the time it did the spurious close and when
it did the read(), since FDs get reused, so it's perfectly possible something else
could be residing at the old FD#...  If the code is multi-threaded, this becomes much
more likely, as you're sharing the same FD space among all threads, so one of
them could easily close and reopen something else over the top of one of your
FDs...  That's why a dump of all your open FDs and what they really represent
would possibly be enlightening, when this condition occurs...)
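
The FD reuse effect is easy to demonstrate. A minimal single-threaded sketch in Python (function name made up), where a pipe stands in for the socket and /dev/null for "something else" that gets opened afterwards:

```python
import os


def fd_number_reused():
    """Close a descriptor, open something unrelated, then read through the
    stale number as buggy code still holding it would."""
    r, w = os.pipe()
    os.close(r)                        # the "spurious" close
    new_fd = os.open(os.devnull, os.O_RDONLY)
    try:
        # POSIX hands out the lowest free descriptor number, so in a
        # single-threaded process the new file lands on the old number.
        stale_read = os.read(r, 1)     # really reads /dev/null: immediate EOF
        return r, new_fd, stale_read
    finally:
        os.close(new_fd)
        os.close(w)
```

No EBADF appears anywhere: the stale read succeeds and reports end-of-file, because the number now refers to a different open file.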
