Squeak socket problem ... help!


Squeak socket problem ... help!

Göran Krampe
 
Hi guys!

Long email but... work for hire here! :)

In short:

Ron and I (3DICC) have a problem with the Unix VM networking and I am
reaching out before burning too many hours on something one of you
C-Unix/Socket/VM guys can fix in an afternoon - and earn a buck for your
trouble.

In looong:

So... we are running a large deployment of Terf (yay!) that is "pushing"
the server side VMs a bit. The large load has caused us to experience
some issues.


Our Possibly Faulted Analysis So Far
====================================

One of our server side VMs, the one that unfortunately is a crucial
resource, locks up and doesn't respond on its most important listening
Socket port. The VM does not crash, however. We reboot it, because it's a
stressful situation with LOTS of users being affected, so we haven't
looked "inside".

Unfortunately the shell script starting the VMs wasn't capturing stderr
to a log file (argh! - it does now, so we will see if we get more info
later), so we have missed some info here. But Ron "got lucky" and saw
this on his terminal (the VM's stderr going to his terminal instead of a
log file):

"errno 9
select: Bad file descriptor"

It took us quite some time before we realized this was indeed Squeak
talking, and that it was from inside aio.c - a call from aioPoll():
        perror("select")

...ok. Some important background info:

Before this we hit the default ulimit of 1024 open files per user (duh!),
causing "Too many open files" errors, so we raised the limit sky high.
That did keep the system going - but in retrospect we think another issue
(related to SAML auth) caused tons of client requests to be spawned from
this VM, and that is what made us reach the limit in the first place. It
may also have been the factor (= many, many sockets) that in the end
caused the errno 9 described above - see below for reasoning.

After perusing the IMHO highly confusing :) Socket code (no offense of
course, it's probably trivial to a Unix C-guru) we at least understand
that the code uses select() and not a more modern poll() or epoll(). In
fact there is also a call to select() in sqUnixSocket.c, but... probably
not relevant.

Yeah, epoll() is not portable etc, we know, but frankly we only care for
Linux here.

Googling further shows us that select() has issues, I mean, yikes. The
thing I think we might be hitting here is that select() can't handle
file descriptors numbered 1024 or higher (FD_SETSIZE), as far as I can
understand the writing on the Internet, and to make it worse, passing
such descriptors seems to cause undefined behavior rather than a clean
error...
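
For reference, here is a minimal C sketch of why that is (an
illustration only, nothing to do with the VM's actual code): an fd_set
is a fixed bitmap of FD_SETSIZE bits, so FD_SET() on a descriptor
numbered 1024 or higher writes outside the structure.

/* Minimal illustration (not VM code): why select() breaks past FD_SETSIZE.
   An fd_set on glibc is a fixed 1024-bit bitmap, so FD_SET() with an
   fd >= FD_SETSIZE scribbles past the end of the structure. */
#include <stdio.h>
#include <sys/select.h>

static int safe_fd_set(int fd, fd_set *set)
{
    if (fd < 0 || fd >= FD_SETSIZE) {
        fprintf(stderr, "fd %d cannot be used with select() (FD_SETSIZE=%d)\n",
                fd, FD_SETSIZE);
        return -1;                 /* refuse instead of corrupting memory */
    }
    FD_SET(fd, set);
    return 0;
}

int main(void)
{
    fd_set readFds;
    FD_ZERO(&readFds);
    safe_fd_set(0, &readFds);      /* fine: stdin */
    safe_fd_set(2000, &readFds);   /* plain FD_SET here would be undefined behavior */
    return 0;
}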

Now... again, the Internet seems to imply that the "usual" cause of
"errno 9" is doing a select() on an fd that has already been closed. A
typical bug causing this is accidentally closing an fd twice - and thus,
if you are unlucky, the second close hits an fd number that has already
been reused and is therefore open again, closing someone else's
descriptor. Oops.
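
To make that concrete, a tiny hypothetical C sketch of the double-close
hazard (the file and flow below are made up purely for illustration):

/* Sketch of the double-close hazard: the second close() of an
   already-closed fd normally fails with EBADF, but if the number has
   been reused in the meantime it silently closes someone else's
   descriptor instead. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/null", O_RDONLY);
    close(fd);                                /* first close: fine */

    if (close(fd) < 0)                        /* second close: EBADF... */
        printf("second close failed: %s\n", strerror(errno));

    int reused = open("/dev/null", O_RDONLY); /* likely gets the same number */
    close(fd);                                /* ...unless the number was reused:
                                                 now we just closed 'reused' */
    printf("fd=%d reused=%d\n", fd, reused);
    return 0;
}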

But it seems unreasonable to think that *such* a bug exists in this code
after all these years. And I am simply guessing it's the >1024 fd problem
biting us, but yeah, we don't know. And I also guess it was that SAML
issue, in combination with raised ulimits, that made us even get over 1024.

Things we are planning:

- Come up with a test case showing it blowing up. Will try to do that
next week.
- Start looking at how to use poll() or epoll() instead, because we need
to be SOLID here and we can't afford the 1024 limit (a rough poll()
sketch follows below).
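
Here is a rough idea of what the poll() route looks like, purely as an
illustration (the waitForReadable helper name is made up, this is not a
patch to aio.c): poll() takes an array of struct pollfd, so there is no
FD_SETSIZE ceiling, only the per-process descriptor limit applies.

/* Illustrative sketch: waiting for readability with poll() instead of
   select(). No FD_SETSIZE ceiling; only the process ulimit applies. */
#include <poll.h>
#include <stdio.h>

/* Wait up to timeoutMs for any of nfds descriptors to become readable. */
static int waitForReadable(int *fds, int nfds, int timeoutMs)
{
    struct pollfd pfds[nfds];
    for (int i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }
    int ready = poll(pfds, (nfds_t)nfds, timeoutMs);
    if (ready < 0)
        perror("poll");
    else
        for (int i = 0; i < nfds; i++)
            if (pfds[i].revents & (POLLIN | POLLERR | POLLHUP))
                printf("fd %d is ready (revents 0x%x)\n",
                       pfds[i].fd, pfds[i].revents);
    return ready;
}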

So... anyone interested? Any brilliant thoughts? AFAICT we can rest
assured that there is a bug here somewhere, because otherwise we
wouldn't be able to get "errno 9 Bad file descriptor", right?

regards, Göran

PS. Googling this in relation to the Squeak VM didn't show any recent
hits, only some from Stephen Pair fixing a bug in X11 code etc.

Re: Squeak socket problem ... help!

Eliot Miranda-2
 
Hi Both,

On Thu, Oct 9, 2014 at 3:05 PM, Göran Krampe <[hidden email]> wrote:

Hi guys!

Long email but... work for hire here! :)

In short:

Ron and I (3DICC) have a problem with the Unix VM networking and I am reaching out before burning too many hours on something one of you C-Unix/Socket/VM guys can fix in an afternoon - and earn a buck for your trouble.

Cool.  This is likely easy to fix.  Your image is running out of file descriptors.  Track open and close calls, e.g. add logging around at least StandardFileStream>>#primOpen:writable:, AsyncFile>>#primOpen:forWrite:semaIndex:, Socket>>#primAcceptFrom:receiveBufferSize:sendBufSize:semaIndex:readSemaIndex:writeSemaIndex: and their associated close calls and see what's being opened without being closed.  It should be easy to track down, but may be more difficult to fix.

Good luck!


In looong:

So... we are running a large deployment of Terf (yay!) that is "pushing" the server side VMs a bit. The large load has caused us to experience some issues.


Our Possibly Faulted Analysis So Far
====================================

One of our server side VMs, the one that unfortunately is a crucial resource, locks up and doesn't respond on its most important listening Socket port. VM does not crash however. We reboot it, because its a stressful situation with LOTS of users being affected, so we haven't looked "inside".

Unfortunately the shell script starting the VMs wasn't catching stderr to a log file (Argh! Now it does though so we will see if we get more info later) so we have missed some info here but Ron "got lucky" and saw this on his terminal (stderr of the VM going to his terminal instead of log file):

"errno 9
select: Bad file descriptor"

It took us quite some time before we realized this was indeed Squeak talking, and that it was from inside aio.c - a call from aioPoll():
        perror("select")

...ok. Some important background info:

Before this we hit the default ulimit of 1024 per user (duh!), causing "Too Many Files Open", so we raised them silly high. That did make the system handle itself - but in retrospect we think another issue (related to SAML auth) caused tons of client requests getting spawned from this VM and thus is what made us reach the limit in the first place. It may also have been the factor (=many many sockets) that in the end caused the errno 9 described above - see below for reasoning.

After perusing the IMHO highly confusing :) Socket code (no offense of course, its probably trivial to a Unix C-guru) at least we understand that the code uses select() and not a more modern poll() or epoll(). In fact there is also a call to select() in sqUnixSocket.c, but... probably not relevant.

Yeah, epoll() is not portable etc, we know, but frankly we only care for Linux here.

Googling shows us further that select() has issues, I mean, yikes. And the thing I think we might be hitting here is the fact that select() doesn't handle more than 1024 file descriptors!!! (as far as I can understand the writing on the Internet) and to make it worse, it seems to be able to go bananas if you push it there...

Now... again, Internet seems to imply that the "usual" cause of "errno 9" is doing a select on an fd that has already been closed. Typical bug causing this is accidentally closing an fd twice - and thus, if you are unlucky, accidentally the second time closing the fd when it actually has already managed to get reused and thus is open agian. Oops.

But it seems unreasonable to think that *such* a bug exists in this code after all these years. And I am simply guessing its the >1024 fd problem biting us, but yeah, we don't know. And I also guess it was that SAML issue, in combination with raised ulimits, that made us even get over 1024.

Things we are planning:

- Come up with a test case showing it blowing up. Will try to do that next week.
- Start looking at how to use poll() or epoll() instead, because we need to be SOLID here and we can't afford the 1024 limit.

So... anyone interested? Any brilliant thoughts? AFAICT we can rest assured that there is a bug here somewhere, because otherwise we wouldn't be able to get "errno 9 Bad file descriptor", right?

regards, Göran

PS. Googling this in relation to the Squeak VM didn't show any recent hits, only some from Stephen Pair fixing a bug in X11 code etc.



--
best,
Eliot

Re: Squeak socket problem ... help!

David T. Lewis
 
On Thu, Oct 09, 2014 at 03:27:05PM -0700, Eliot Miranda wrote:

>  
> Hi Both,
>
> On Thu, Oct 9, 2014 at 3:05 PM, Göran Krampe <[hidden email]> wrote:
>
> >
> > Hi guys!
> >
> > Long email but... work for hire here! :)
> >
> > In short:
> >
> > Ron and I (3DICC) have a problem with the Unix VM networking and I am
> > reaching out before burning too many hours on something one of you
> > C-Unix/Socket/VM guys can fix in an afternoon - and earn a buck for your
> > trouble.
> >
>
> Cool.  This is likely easy to fix.  Your image is running out of file
> descriptors.  Track open and close calls, e.g. add logging around at
> least StandardFileStream>>#primOpen:writable:
> , AsyncFile>>#primOpen:forWrite:semaIndex:,
> Socket>>#primAcceptFrom:receiveBufferSize:sendBufSize:semaIndex:readSemaIndex:writeSemaIndex:
> and their associated close calls and see what's being opened without being
> closed.  It shoudl be easy to track=down, but may be more difficult to fix.
>
> Good luck!

I agree with what Eliot is saying and would add a few thoughts:

- Don't fix the wrong problem (DFtWP). Unless you have some reason to
believe that this server application would realistically have a need to
handle anything close to a thousand concurrent TCP sessions, don't fix
it by raising the per-process file handle limit, and don't fix it by
reimplementing the socket listening code.

- It is entirely possible that no one before you has ever tried to run
a server application with the per-process file handle limit bumped up
above the default 1024. So if that configuration does not play nicely
with the select() mechanism, you may well be the first to have encountered
this as an issue. But see above, don't fix it if it ain't broke.

- Most "out of file descriptor" problems involve resource leaks (as Eliot
is suggesting), and in those cases you will see a gradual increase in file
descriptors in /proc/<vmpid>/fd/ over time. Eventually you run out of
descriptors and something horrible happens.

- Before doing anything else, you must confirm if this is a resource leak,
with file descriptor use continuously increasing (which is what Eliot and
I are both assuming to be the case here), or if it is a real resource
issue in which your server has a legitimate need to maintain a very large
number of TCP connections concurrently. Given that you have a running
application with real users, you will probably want to do this with something
like a shell script keeping track of the /proc/<pid>/fd/ directory for
the running VM. (In squeaksource.com, there is an undiagnosed file handle
leak similar to what I think you are experiencing. My kludgy workaround is
a process in the image that uses OSProcess to count entries in /proc/<pid>/fd/
and restart the image when the file descriptor situation becomes dire).

- Finding file (socket) handle leaks is tricky, and if you have customers
depending on this, you probably do not have the luxury of fixing it right.
Is there any way to periodically restart the server image without causing
pain to the customer? If so, consider a kludgy workaround like I did for
squeaksource. Monitor the VM process for file handle leaks (see the small
fd-counting sketch after this list) and restart it proactively rather than
waiting for a catastrophic failure. You can do this all from within the
image; I will dig out my squeaksource hack if you think it may be of any
help.

- Sorry to repeat myself but this is by far the most important point: DFtWP.
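
A rough standalone sketch of that kind of monitor, counting the entries
in /proc/<pid>/fd (the pid and threshold below are made up for
illustration):

/* Standalone sketch: count the entries in /proc/<pid>/fd and complain
   when a threshold is crossed. Pid and threshold are illustrative. */
#include <dirent.h>
#include <stdio.h>

static int countOpenFds(long pid)
{
    char path[64];
    snprintf(path, sizeof path, "/proc/%ld/fd", pid);
    DIR *dir = opendir(path);
    if (!dir)
        return -1;
    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL)
        if (entry->d_name[0] != '.')      /* skip "." and ".." */
            count++;
    closedir(dir);
    return count;
}

int main(void)
{
    long vmPid = 12345;                   /* hypothetical VM pid */
    int fds = countOpenFds(vmPid);
    if (fds >= 0 && fds > 900)            /* arbitrary "getting dire" threshold */
        fprintf(stderr, "VM %ld has %d open fds - time to restart it\n",
                vmPid, fds);
    return 0;
}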

Dave

>
>
> > In looong:
> >
> > So... we are running a large deployment of Terf (yay!) that is "pushing"
> > the server side VMs a bit. The large load has caused us to experience some
> > issues.
> >
> >
> > Our Possibly Faulted Analysis So Far
> > ====================================
> >
> > One of our server side VMs, the one that unfortunately is a crucial
> > resource, locks up and doesn't respond on its most important listening
> > Socket port. VM does not crash however. We reboot it, because its a
> > stressful situation with LOTS of users being affected, so we haven't looked
> > "inside".
> >
> > Unfortunately the shell script starting the VMs wasn't catching stderr to
> > a log file (Argh! Now it does though so we will see if we get more info
> > later) so we have missed some info here but Ron "got lucky" and saw this on
> > his terminal (stderr of the VM going to his terminal instead of log file):
> >
> > "errno 9
> > select: Bad file descriptor"
> >
> > It took us quite some time before we realized this was indeed Squeak
> > talking, and that it was from inside aio.c - a call from aioPoll():
> >         perror("select")
> >
> > ...ok. Some important background info:
> >
> > Before this we hit the default ulimit of 1024 per user (duh!), causing
> > "Too Many Files Open", so we raised them silly high. That did make the
> > system handle itself - but in retrospect we think another issue (related to
> > SAML auth) caused tons of client requests getting spawned from this VM and
> > thus is what made us reach the limit in the first place. It may also have
> > been the factor (=many many sockets) that in the end caused the errno 9
> > described above - see below for reasoning.
> >
> > After perusing the IMHO highly confusing :) Socket code (no offense of
> > course, its probably trivial to a Unix C-guru) at least we understand that
> > the code uses select() and not a more modern poll() or epoll(). In fact
> > there is also a call to select() in sqUnixSocket.c, but... probably not
> > relevant.
> >
> > Yeah, epoll() is not portable etc, we know, but frankly we only care for
> > Linux here.
> >
> > Googling shows us further that select() has issues, I mean, yikes. And the
> > thing I think we might be hitting here is the fact that select() doesn't
> > handle more than 1024 file descriptors!!! (as far as I can understand the
> > writing on the Internet) and to make it worse, it seems to be able to go
> > bananas if you push it there...
> >
> > Now... again, Internet seems to imply that the "usual" cause of "errno 9"
> > is doing a select on an fd that has already been closed. Typical bug
> > causing this is accidentally closing an fd twice - and thus, if you are
> > unlucky, accidentally the second time closing the fd when it actually has
> > already managed to get reused and thus is open agian. Oops.
> >
> > But it seems unreasonable to think that *such* a bug exists in this code
> > after all these years. And I am simply guessing its the >1024 fd problem
> > biting us, but yeah, we don't know. And I also guess it was that SAML
> > issue, in combination with raised ulimits, that made us even get over 1024.
> >
> > Things we are planning:
> >
> > - Come up with a test case showing it blowing up. Will try to do that next
> > week.
> > - Start looking at how to use poll() or epoll() instead, because we need
> > to be SOLID here and we can't afford the 1024 limit.
> >
> > So... anyone interested? Any brilliant thoughts? AFAICT we can rest
> > assured that there is a bug here somewhere, because otherwise we wouldn't
> > be able to get "errno 9 Bad file descriptor", right?
> >
> > regards, Göran
> >
> > PS. Googling this in relation to the Squeak VM didn't show any recent
> > hits, only some from Stephen Pair fixing a bug in X11 code etc.
> >
>
>
>
> --
> best,
> Eliot


Re: Squeak socket problem ... help!

Göran Krampe
 
Hi guys!

On 10/10/2014 01:28 AM, David T. Lewis wrote:

>>> Ron and I (3DICC) have a problem with the Unix VM networking and I am
>>> reaching out before burning too many hours on something one of you
>>> C-Unix/Socket/VM guys can fix in an afternoon - and earn a buck for your
>>> trouble.
>>
>> Cool.  This is likely easy to fix.  Your image is running out of file
>> descriptors.  Track open and close calls, e.g. add logging around at
>> least StandardFileStream>>#primOpen:writable:
>> , AsyncFile>>#primOpen:forWrite:semaIndex:,
>> Socket>>#primAcceptFrom:receiveBufferSize:sendBufSize:semaIndex:readSemaIndex:writeSemaIndex:
>> and their associated close calls and see what's being opened without being
>> closed.  It shoudl be easy to track=down, but may be more difficult to fix.
>>
>> Good luck!

Aha. Soo... am I understanding this correctly - we are probably leaking
fds and when we go above 1024 this makes select() go bonkers and
eventually leads to the "Bad file descriptor" error?

> I agree with what Eliot is saying and would add a few thoughts:

> - Don't fix the wrong problem (DFtWP). Unless you have some reason to
> believe that this server application would realistically have a need to
> handle anything close to a thousand concurrent TCP sessions, don't fix
> it by raising the per-process file handle limit, and don't fix it by
> reimplementing the socket listening code.

We haven't done the exact numbers, but we could probably hit several
hundreds concurrent at least. 1024 seemed a bit "over the top" though :)

The system in question is meant to serve more than 1000 concurrent
users, so we are in fact moving into this territory. We have been up to
around 600 so far.

> - It is entirely possible that no one before you has ever tried to run
> a server application with the per-process file handle limit bumped up
> above the default 1024. So if that configuration does not play nicely
> with the select() mechanism, you may well be the first to have encountered
> this as an issue. But see above, don't fix it if it ain't broke.

Well, it most probably *is* broke - I mean - I haven't read anywhere
that our Socket code is limited to 1024 concurrent sockets and that
going above that limit causes the Socket code to stop working? :)

But I agree - I don't want to touch that code if we can simply avoid
this bug by making sure we stay below 1024.

But it sounds broke to me, nevertheless. ;)

> - Most "out of file descriptor" problems involve resource leaks (as Eliot
> is suggesting), and in those cases you will see a gradual increase in file
> descriptors in /proc/<vmpid>/fd/ over time. Eventually you run out of
> descriptors and something horrible happens.

We will start looking at that and other tools too.


> - Sorry to repeat myself but this is by far the most important point: DFtWP.

Sure :). This is why I posted - to get your input. And I have a
suspicion that the SAML issue I mentioned may be the code leaking, we
will start looking.

regards, Göran

Re: Squeak socket problem ... help!

David T. Lewis
 
On Fri, Oct 10, 2014 at 02:09:08AM +0200, Göran Krampe wrote:

>
> Hi guys!
>
> On 10/10/2014 01:28 AM, David T. Lewis wrote:
> >>>Ron and I (3DICC) have a problem with the Unix VM networking and I am
> >>>reaching out before burning too many hours on something one of you
> >>>C-Unix/Socket/VM guys can fix in an afternoon - and earn a buck for your
> >>>trouble.
> >>
> >>Cool.  This is likely easy to fix.  Your image is running out of file
> >>descriptors.  Track open and close calls, e.g. add logging around at
> >>least StandardFileStream>>#primOpen:writable:
> >>, AsyncFile>>#primOpen:forWrite:semaIndex:,
> >>Socket>>#primAcceptFrom:receiveBufferSize:sendBufSize:semaIndex:readSemaIndex:writeSemaIndex:
> >>and their associated close calls and see what's being opened without being
> >>closed.  It shoudl be easy to track=down, but may be more difficult to
> >>fix.
> >>
> >>Good luck!
>
> Aha. Soo... am I understanding this correctly - we are probably leaking
> fds and when we go above 1024 this makes select() go bonkers and
> eventually leads to the "Bad file descriptor" error?
>
> >I agree with what Eliot is saying and would add a few thoughts:
>
> >- Don't fix the wrong problem (DFtWP). Unless you have some reason to
> >believe that this server application would realistically have a need to
> >handle anything close to a thousand concurrent TCP sessions, don't fix
> >it by raising the per-process file handle limit, and don't fix it by
> >reimplementing the socket listening code.
>
> We haven't done the exact numbers, but we could probably hit several
> hundreds concurrent at least. 1024 seemed a bit "over the top" though :)
>
> The system in question is meant to serve more than 1000 concurrent
> users, so we are in fact moving into this territory. We have been up to
> around 600 so far.
>
> >- It is entirely possible that no one before you has ever tried to run
> >a server application with the per-process file handle limit bumped up
> >above the default 1024. So if that configuration does not play nicely
> >with the select() mechanism, you may well be the first to have encountered
> >this as an issue. But see above, don't fix it if it ain't broke.
>
> Well, it most probably *is* broke - I mean - I haven't read anywhere
> that our Socket code is limited to 1024 concurrent sockets and that
> going above that limit causes the Socket code to stop working? :)
>
> But I agree - I don't want to touch that code if we can simply avoid
> this bug by making sure we stay below 1024.
>
> But it sounds broke to me, nevertheless. ;)

Indeed it probably is.

>
> >- Most "out of file descriptor" problems involve resource leaks (as Eliot
> >is suggesting), and in those cases you will see a gradual increase in file
> >descriptors in /proc/<vmpid>/fd/ over time. Eventually you run out of
> >descriptors and something horrible happens.
>
> We will start looking at that and other tools too.
>
>
> >- Sorry to repeat myself but this is by far the most important point:
> >DFtWP.
>
> Sure :). This is why I posted - to get your input. And I have a
> suspicion that the SAML issue I mentioned may be the code leaking, we
> will start looking.
>
> regards, Göran

Cool. Please keep us posted on progress. In the unlikely event that free
advice turns out to be worth anything, there is plenty more where that
came from ;-)

Dave


Re: Squeak socket problem ... help!

johnmci
 
If I recall correctly, socket file handles are managed by the weak object logic.
In the past I've already said: "Finalization will run later and will likely kill you in a production environment."

Given a high rate of socket creation, do you know when the socket instance goes away, and when the file descriptor is destroyed - via cleanup of the weak object, or a forced GC cycle?

On Thu, Oct 9, 2014 at 5:21 PM, David T. Lewis <[hidden email]> wrote:

On Fri, Oct 10, 2014 at 02:09:08AM +0200, Göran Krampe wrote:
>
> Hi guys!
>
> On 10/10/2014 01:28 AM, David T. Lewis wrote:
> >>>Ron and I (3DICC) have a problem with the Unix VM networking and I am
> >>>reaching out before burning too many hours on something one of you
> >>>C-Unix/Socket/VM guys can fix in an afternoon - and earn a buck for your
> >>>trouble.
> >>
> >>Cool.  This is likely easy to fix.  Your image is running out of file
> >>descriptors.  Track open and close calls, e.g. add logging around at
> >>least StandardFileStream>>#primOpen:writable:
> >>, AsyncFile>>#primOpen:forWrite:semaIndex:,
> >>Socket>>#primAcceptFrom:receiveBufferSize:sendBufSize:semaIndex:readSemaIndex:writeSemaIndex:
> >>and their associated close calls and see what's being opened without being
> >>closed.  It shoudl be easy to track=down, but may be more difficult to
> >>fix.
> >>
> >>Good luck!
>
> Aha. Soo... am I understanding this correctly - we are probably leaking
> fds and when we go above 1024 this makes select() go bonkers and
> eventually leads to the "Bad file descriptor" error?
>
> >I agree with what Eliot is saying and would add a few thoughts:
>
> >- Don't fix the wrong problem (DFtWP). Unless you have some reason to
> >believe that this server application would realistically have a need to
> >handle anything close to a thousand concurrent TCP sessions, don't fix
> >it by raising the per-process file handle limit, and don't fix it by
> >reimplementing the socket listening code.
>
> We haven't done the exact numbers, but we could probably hit several
> hundreds concurrent at least. 1024 seemed a bit "over the top" though :)
>
> The system in question is meant to serve more than 1000 concurrent
> users, so we are in fact moving into this territory. We have been up to
> around 600 so far.
>
> >- It is entirely possible that no one before you has ever tried to run
> >a server application with the per-process file handle limit bumped up
> >above the default 1024. So if that configuration does not play nicely
> >with the select() mechanism, you may well be the first to have encountered
> >this as an issue. But see above, don't fix it if it ain't broke.
>
> Well, it most probably *is* broke - I mean - I haven't read anywhere
> that our Socket code is limited to 1024 concurrent sockets and that
> going above that limit causes the Socket code to stop working? :)
>
> But I agree - I don't want to touch that code if we can simply avoid
> this bug by making sure we stay below 1024.
>
> But it sounds broke to me, nevertheless. ;)

Indeed it probably is.

>
> >- Most "out of file descriptor" problems involve resource leaks (as Eliot
> >is suggesting), and in those cases you will see a gradual increase in file
> >descriptors in /proc/<vmpid>/fd/ over time. Eventually you run out of
> >descriptors and something horrible happens.
>
> We will start looking at that and other tools too.
>
>
> >- Sorry to repeat myself but this is by far the most important point:
> >DFtWP.
>
> Sure :). This is why I posted - to get your input. And I have a
> suspicion that the SAML issue I mentioned may be the code leaking, we
> will start looking.
>
> regards, Göran

Cool. Please keep us posted on progress. In the unlikely event that free
advice turns out to be worth anything, there is plenty more where that
came from ;-)

Dave




--
===========================================================================
John M. McIntosh <[hidden email]>  https://www.linkedin.com/in/smalltalk
===========================================================================

Re: Squeak socket problem ... help!

Eliot Miranda-2
In reply to this post by David T. Lewis
 


On Thu, Oct 9, 2014 at 4:28 PM, David T. Lewis <[hidden email]> wrote:

On Thu, Oct 09, 2014 at 03:27:05PM -0700, Eliot Miranda wrote:
>
> Hi Both,
>
> On Thu, Oct 9, 2014 at 3:05 PM, Göran Krampe <[hidden email]> wrote:
>
> >
> > Hi guys!
> >
> > Long email but... work for hire here! :)
> >
> > In short:
> >
> > Ron and I (3DICC) have a problem with the Unix VM networking and I am
> > reaching out before burning too many hours on something one of you
> > C-Unix/Socket/VM guys can fix in an afternoon - and earn a buck for your
> > trouble.
> >
>
> Cool.  This is likely easy to fix.  Your image is running out of file
> descriptors.  Track open and close calls, e.g. add logging around at
> least StandardFileStream>>#primOpen:writable:
> , AsyncFile>>#primOpen:forWrite:semaIndex:,
> Socket>>#primAcceptFrom:receiveBufferSize:sendBufSize:semaIndex:readSemaIndex:writeSemaIndex:
> and their associated close calls and see what's being opened without being
> closed.  It shoudl be easy to track=down, but may be more difficult to fix.
>
> Good luck!

I agree with what Eliot is saying and would add a few thoughts:

- Don't fix the wrong problem (DFtWP). Unless you have some reason to
believe that this server application would realistically have a need to
handle anything close to a thousand concurrent TCP sessions, don't fix
it by raising the per-process file handle limit, and don't fix it by
reimplementing the socket listening code.

- It is entirely possible that no one before you has ever tried to run
a server application with the per-process file handle limit bumped up
above the default 1024. So if that configuration does not play nicely
with the select() mechanism, you may well be the first to have encountered
this as an issue. But see above, don't fix it if it ain't broke.

- Most "out of file descriptor" problems involve resource leaks (as Eliot
is suggesting), and in those cases you will see a gradual increase in file
descriptors in /proc/<vmpid>/fd/ over time. Eventually you run out of
descriptors and something horrible happens.

- Before doing anything else, you must confirm if this is a resource leak,
with file descriptor use continuously increasing (which is what Eliot and
I are both assuming to be the case here), or if it is a real resource
issue in which your server has a legitimate need to maintain a very large
number of TCP connections concurrently. Given that you have a running
application with real users, you will probably want to do this with something
like a shell script keeping track of the /proc/<pid>/fd/ directory for
the running VM. (In squeaksource.com, there is an undiagnosed file handle
leak similar to what I think you are experiencing. My kludgy workaround is
a process in the image that uses OSProcess to count entries in /proc/<pid>/fd/
and restart the image when the file descriptor situation becomes dire).

- Finding file (socket) handle leaks is trickly, and if you have customers
depending on this, you probably do not have the luxury of fixing it right.
Is there any way to periodically restart the server image without causing
pain to the customer? If so, consider a kludgy workaround like I did for
squeaksource. Monitor the VM process for file handle leaks and restart
it proactively rather than waiting for a catastrophic failure. You can
do this all from within the image, I will dig out my squeaksource hack if
you think it may be of any help.

- Sorry to repeat myself but this is by far the most important point: DFtWP.

Great message David.  You've nailed it.
 

Dave

>
>
> > In looong:
> >
> > So... we are running a large deployment of Terf (yay!) that is "pushing"
> > the server side VMs a bit. The large load has caused us to experience some
> > issues.
> >
> >
> > Our Possibly Faulted Analysis So Far
> > ====================================
> >
> > One of our server side VMs, the one that unfortunately is a crucial
> > resource, locks up and doesn't respond on its most important listening
> > Socket port. VM does not crash however. We reboot it, because its a
> > stressful situation with LOTS of users being affected, so we haven't looked
> > "inside".
> >
> > Unfortunately the shell script starting the VMs wasn't catching stderr to
> > a log file (Argh! Now it does though so we will see if we get more info
> > later) so we have missed some info here but Ron "got lucky" and saw this on
> > his terminal (stderr of the VM going to his terminal instead of log file):
> >
> > "errno 9
> > select: Bad file descriptor"
> >
> > It took us quite some time before we realized this was indeed Squeak
> > talking, and that it was from inside aio.c - a call from aioPoll():
> >         perror("select")
> >
> > ...ok. Some important background info:
> >
> > Before this we hit the default ulimit of 1024 per user (duh!), causing
> > "Too Many Files Open", so we raised them silly high. That did make the
> > system handle itself - but in retrospect we think another issue (related to
> > SAML auth) caused tons of client requests getting spawned from this VM and
> > thus is what made us reach the limit in the first place. It may also have
> > been the factor (=many many sockets) that in the end caused the errno 9
> > described above - see below for reasoning.
> >
> > After perusing the IMHO highly confusing :) Socket code (no offense of
> > course, its probably trivial to a Unix C-guru) at least we understand that
> > the code uses select() and not a more modern poll() or epoll(). In fact
> > there is also a call to select() in sqUnixSocket.c, but... probably not
> > relevant.
> >
> > Yeah, epoll() is not portable etc, we know, but frankly we only care for
> > Linux here.
> >
> > Googling shows us further that select() has issues, I mean, yikes. And the
> > thing I think we might be hitting here is the fact that select() doesn't
> > handle more than 1024 file descriptors!!! (as far as I can understand the
> > writing on the Internet) and to make it worse, it seems to be able to go
> > bananas if you push it there...
> >
> > Now... again, Internet seems to imply that the "usual" cause of "errno 9"
> > is doing a select on an fd that has already been closed. Typical bug
> > causing this is accidentally closing an fd twice - and thus, if you are
> > unlucky, accidentally the second time closing the fd when it actually has
> > already managed to get reused and thus is open agian. Oops.
> >
> > But it seems unreasonable to think that *such* a bug exists in this code
> > after all these years. And I am simply guessing its the >1024 fd problem
> > biting us, but yeah, we don't know. And I also guess it was that SAML
> > issue, in combination with raised ulimits, that made us even get over 1024.
> >
> > Things we are planning:
> >
> > - Come up with a test case showing it blowing up. Will try to do that next
> > week.
> > - Start looking at how to use poll() or epoll() instead, because we need
> > to be SOLID here and we can't afford the 1024 limit.
> >
> > So... anyone interested? Any brilliant thoughts? AFAICT we can rest
> > assured that there is a bug here somewhere, because otherwise we wouldn't
> > be able to get "errno 9 Bad file descriptor", right?
> >
> > regards, Göran
> >
> > PS. Googling this in relation to the Squeak VM didn't show any recent
> > hits, only some from Stephen Pair fixing a bug in X11 code etc.
> >
>
>
>
> --
> best,
> Eliot




--
best,
Eliot

Re: Squeak socket problem ... help!

Eliot Miranda-2
In reply to this post by Göran Krampe
 


On Thu, Oct 9, 2014 at 5:09 PM, Göran Krampe <[hidden email]> wrote:

Hi guys!

On 10/10/2014 01:28 AM, David T. Lewis wrote:
Ron and I (3DICC) have a problem with the Unix VM networking and I am
reaching out before burning too many hours on something one of you
C-Unix/Socket/VM guys can fix in an afternoon - and earn a buck for your
trouble.

Cool.  This is likely easy to fix.  Your image is running out of file
descriptors.  Track open and close calls, e.g. add logging around at
least StandardFileStream>>#primOpen:writable:
, AsyncFile>>#primOpen:forWrite:semaIndex:,
Socket>>#primAcceptFrom:receiveBufferSize:sendBufSize:semaIndex:readSemaIndex:writeSemaIndex:
and their associated close calls and see what's being opened without being
closed.  It shoudl be easy to track=down, but may be more difficult to fix.

Good luck!

Aha. Soo... am I understanding this correctly - we are probably leaking fds and when we go above 1024 this makes select() go bonkers and eventually leads to the "Bad file descriptor" error?

I'm not sure, but that you're needing to up the per-process file handle limit is worrying.  That you should diagnose first.  If you solve it (and it's likely to be easy; you can do things like maintain sets of open sockets, etc, and close the least recently used when reaching some high-tide mark, etc) then I suspect the select problems will go away too.

I agree with what Eliot is saying and would add a few thoughts:

- Don't fix the wrong problem (DFtWP). Unless you have some reason to
believe that this server application would realistically have a need to
handle anything close to a thousand concurrent TCP sessions, don't fix
it by raising the per-process file handle limit, and don't fix it by
reimplementing the socket listening code.

We haven't done the exact numbers, but we could probably hit several hundreds concurrent at least. 1024 seemed a bit "over the top" though :)

But each connexion could have a few sockets, and then there may be file connexions etc.  Best take a look at what you have there and sanity-check.
 

The system in question is meant to serve more than 1000 concurrent users, so we are in fact moving into this territory. We have been up to around 600 so far.

- It is entirely possible that no one before you has ever tried to run
a server application with the per-process file handle limit bumped up
above the default 1024. So if that configuration does not play nicely
with the select() mechanism, you may well be the first to have encountered
this as an issue. But see above, don't fix it if it ain't broke.

Well, it most probably *is* broke - I mean - I haven't read anywhere that our Socket code is limited to 1024 concurrent sockets and that going above that limit causes the Socket code to stop working? :)

"Broke" is the wrong word.  You're running into a soft resource limit that you can raise.  But you should only raise it if you know the server code is correct, because raising the limit when incorrect code causes a gradual increase in open file descriptors will simply postpone the inevitable.  And the longer the server appears to run healthily the more mysterious and annoying the crash may appear :-).


But I agree - I don't want to touch that code if we can simply avoid this bug by making sure we stay below 1024.

You may have to go above 1024 when you have >= 1024 users (or 512 or whatever). But you should understand the relationship between connexions and open file descriptors and know that there are no leaks before you up the limit.

But it sounds broke to me, nevertheless. ;)

If it is broke it is the OS that is broke.  But not really.  Having a soft limit is a great way to find problems (leaks) while providing flexibility.  With no limit the system runs until catastrophe, probably bringing the OS down with it.  This is a bit like Smalltalk catching infinite recursion through the low space condition; by the time the recursion is stopped there's precious little resource to do anything about it (not to mention that the stack will be huge).  So only lift the limit when you know the code is correct and you know you need more resources to serve the anticipated number of clients.


- Most "out of file descriptor" problems involve resource leaks (as Eliot
is suggesting), and in those cases you will see a gradual increase in file
descriptors in /proc/<vmpid>/fd/ over time. Eventually you run out of
descriptors and something horrible happens.

We will start looking at that and other tools too.


- Sorry to repeat myself but this is by far the most important point: DFtWP.

Sure :). This is why I posted - to get your input. And I have a suspicion that the SAML issue I mentioned may be the code leaking, we will start looking.

regards, Göran



--
best,
Eliot

Re: Squeak socket problem ... help!

Göran Krampe
 
Hi!

On 10/10/2014 02:50 AM, Eliot Miranda wrote:
> On Thu, Oct 9, 2014 at 5:09 PM, Göran Krampe <[hidden email]
>     Well, it most probably *is* broke - I mean - I haven't read anywhere
>     that our Socket code is limited to 1024 concurrent sockets and that
>     going above that limit causes the Socket code to stop working? :)
>
>
> "Broke" is the wrong word.  You're running into a soft resource limit
> that you can raise.  But you should only raise it if you know the server

No, see below. When I say 1024 I am not referring to the soft limit. I
am referring to the fact that select() cannot handle more than 1024 file
descriptors; let me quote from the net:

"The limitation comes from the way select() works, most concretely the
data type used to represent the list of file descriptors, fd_set. Let's
take a look at fd_set in /usr/include/sys/select.h

You will see a definition pretty much like:

typedef struct  {
     long int fds_bits[32];
} fd_set;

I removed a bunch of macros to make the code clearer. As you see, you
have a static array of 32 long ints. This is just a bit map, where each
bit represents a file descriptor number: 32 longs of 32 bits each, on
32-bit platforms, is 1024 bits. So if you do FD_SET(10, &fd), to add
file descriptor 10 to an fd_set, it will just set the 10th bit in the
bit map. What happens then if you do FD_SET(2000, &fd)? You guessed
right: unpredictable. Maybe some sort of array overflow (FD_SET is, at
least on my system, implemented using the assembly instruction btsl, bit
test and set)."


>     But it sounds broke to me, nevertheless. ;)
>
> If it is broke it is the OS that is broke.  But not really.  Having a
> soft limit is a great way to find problems (leaks) while providing
> flexibility.  With no limit the system runs until catastrophe, probably

But... I am not sure if you are misunderstanding or not. We did raise
the limit, but then we got "Bad file descriptor", not "Too many open
files". What I mean by "broke" is the fact that select() cannot handle
more than 1024 file descriptors *regardless* of the ulimit.

So I think we cannot go above 1024 if we still use select(). This is
the bug I am talking about.
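
For completeness, since we only care about Linux: a minimal epoll sketch
(illustrative only - the aioWatchRead/aioWaitReady helper names are made
up, this is not a patch to aio.c). epoll registers descriptors once and
has no FD_SETSIZE ceiling; only the per-process descriptor limit applies.

/* Minimal Linux-only sketch (illustrative, not a patch to aio.c). */
#include <sys/epoll.h>
#include <stdio.h>

static int epollFd = -1;

int aioWatchRead(int fd)                  /* hypothetical helper */
{
    if (epollFd < 0)
        epollFd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    return epoll_ctl(epollFd, EPOLL_CTL_ADD, fd, &ev);
}

int aioWaitReady(int timeoutMs)           /* hypothetical helper */
{
    struct epoll_event events[64];
    int n = epoll_wait(epollFd, events, 64, timeoutMs);
    if (n < 0)
        perror("epoll_wait");
    for (int i = 0; i < n; i++)
        printf("fd %d ready\n", events[i].data.fd);
    return n;
}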

regards, Göran

RE: Squeak socket problem ... help!

Ron Teitelbaum
In reply to this post by David T. Lewis

Hi All!

Thanks for your help!!

Trying to reproduce the error I ran 1000 clients against our test cluster of
7 servers.

The fd count hovers around 250 and never made it above 490 (there were 4
service providers so the main server should only handle around 250 people).
David, thanks for the pointers on how to find the fd count.  Very cool.
Since the fd count seemed to be well within proper limits and went both up
and down, I think you are correct that we may be looking at the wrong thing.
Could it be that we are closing a socket twice?  (see 0xffa5e0cc M
Socket>destroy 0xd9ba02d4: a(n) Socket)  Göran mentioned finding something
that said on a very busy server you may try closing an already closed socket
whose fd has already been reused.  On a not-so-busy system you can get away
with it, so you may not see this error until you put a very large load on it.

I captured the error:

This repeated about 1000 times (no idea why it would repeat so many times):

errno 9
select: Bad file descriptor

Then:

Segmentation fault Fri Oct 10 02:46:17 2014


Squeak VM version: 4.0-2798 #1 Wed Nov 13 12:31:45 UTC 2013 gcc 4.4.7
[Production ITHB VM]
Built from: CoInterpreter VMMaker.oscog-eem.496 uuid:
f0401045-f2f7-470e-9940-3535be9c0334 Nov 13 2013
With: StackToRegisterMappingCogit VMMaker.oscog-eem.496 uuid:
f0401045-f2f7-470e-9940-3535be9c0334 Nov 13 2013
Revision: VM: r2798 http://www.squeakvm.org/svn/squeak/branches/Cog
Plugins: r2545 http://squeakvm.org/svn/squeak/trunk/platforms/Cross/plugins
Build host: Linux cogvm 2.6.32-358.el6.x86_64 #1 SMP Fri Feb 22 00:31:26 UTC
2013 x86_64 x86_64 x86_64 GNU/Linux
plugin path: /home/openqwaq/server/bin/forums/../qwaqvm/plugins [default:
/home/openqwaq/server/bin/cogvm/lib/squeak/4.0-2798/]


C stack backtrace & registers:
        eax 0xffa191a4 ebx 0xffa190c0 ecx 0xffa19158 edx 0xffa1910c
        edi 0xffa18f90 esi 0xffa18f90 ebp 0xffa19028 esp 0xffa19074
        eip 0xffa19288
*[0xffa19288]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak[0x805e
54e]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak[0x805e
84f]
[0x5ad410]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak(sqSock
etCloseConnection+0x26)[0x811f0a6]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak(sqSock
etDestroy+0x40)[0x811fad0]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak[0x811c
9f8]
[0xd7444848]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak(interp
ret+0x76e)[0x8084eee]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak(main+0
x2b4)[0x805efa4]
/lib/libc.so.6(__libc_start_main+0xe6)[0x2f4d36]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak[0x805b
d41]
[0xffa18c80]


Smalltalk stack dump:
0xffa5e0cc M Socket>destroy 0xd9ba02d4: a(n) Socket
0xffa5e0e4 M QServiceProviderClient>destroy 0xd9c33dd4: a(n)
QServiceProviderClient
0xffa5e0fc M QServiceProviderClient>connectionClosed 0xd9c33dd4: a(n)
QServiceProviderClient
0xffa5e114 M [] in QServiceProviderClient>processRequest 0xd9c33dd4: a(n)
QServiceProviderClient
0xffa5e138 M Socket>waitForDataFor:ifClosed:ifTimedOut: 0xd9ba02d4: a(n)
Socket
0xffa5e164 M QServiceProviderClient>processRequest 0xd9c33dd4: a(n)
QServiceProviderClient
0xffa5e17c M [] in QServiceProviderClient>on:in: 0xd9c33dd4: a(n)
QServiceProviderClient
0xffa5e194 M [] in QSafeProcessLoop>run 0xd9c35efc: a(n) QSafeProcessLoop
0xffa5e1b0 M BlockClosure>on:do: 0xda118824: a(n) BlockClosure
0xffa5e1d0 M QSafeProcessLoop>run 0xd9c35efc: a(n) QSafeProcessLoop
0xd9c646dc s [] in QSafeProcessLoop(Object)>fork:at:
0xd9c36020 s [] in BlockClosure>newProcess

Most recent primitives
primSocketConnectionStatus:
primSocketSendDone:
primSocket:sendData:startIndex:count:
millisecondClockValue
primitiveWait
millisecondClockValue
millisecondClockValue
primitiveEnterCriticalSection
primitiveEnterCriticalSection
at:put:
findNextUnwindContextUpTo:
tempAt:
tempAt:put:
tempAt:
terminateTo:
primitiveExitCriticalSection
findNextUnwindContextUpTo:
terminateTo:
primitiveExitCriticalSection
new:
at:put:
at:put:
updateCrc:from:to:in:
updateCrc:from:to:in:
bitXor:
at:put:
at:put:
at:put:
at:put:
at:put:
at:put:
at:put:
at:put:
replaceFrom:to:with:startingAt:
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
=
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
replaceFrom:to:with:startingAt:
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
=
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
signal
primitiveEnterCriticalSection
primitiveEnterCriticalSection
findNextUnwindContextUpTo:
tempAt:
tempAt:put:
tempAt:
terminateTo:
primitiveExitCriticalSection
findNextUnwindContextUpTo:
terminateTo:
at:put:
at:put:
primitiveExitCriticalSection
primSocketConnectionStatus:
primSocketSendDone:
primSocketConnectionStatus:
primSocketSendDone:
primSocket:sendData:startIndex:count:
millisecondClockValue
primitiveWait
millisecondClockValue
primitiveWait
primitiveWait
primitiveWait
primitiveWait
primitiveWait
primitiveWait
millisecondClockValue
millisecondClockValue
basicNew:
at:put:
basicNew:
integerAt:put:size:signed:
basicNew:
primitiveARC4Process:from:to:m:x:y:
findSubstring:in:startingAt:matchTable:
basicNew
stringHash:initialHash:
hashBytes:startingWith:
basicNew
at:put:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
stringHash:initialHash:
basicNew
stringHash:initialHash:
primitiveEnterCriticalSection
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
at:put:
at:put:
signal
millisecondClockValue
millisecondClockValue
primitiveEnterCriticalSection
primitiveExitCriticalSection
primitiveEnterCriticalSection
at:put:
findNextUnwindContextUpTo:
tempAt:
tempAt:put:
tempAt:
terminateTo:
primitiveExitCriticalSection
findNextUnwindContextUpTo:
terminateTo:
primitiveExitCriticalSection
new:
at:put:
at:put:
updateCrc:from:to:in:
updateCrc:from:to:in:
bitXor:
at:put:
at:put:
at:put:
at:put:
at:put:
at:put:
at:put:
at:put:
replaceFrom:to:with:startingAt:
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
=
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
replaceFrom:to:with:startingAt:
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
=
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
primitiveEnterCriticalSection
findNextUnwindContextUpTo:
tempAt:
tempAt:put:
tempAt:
terminateTo:
primitiveExitCriticalSection
findNextUnwindContextUpTo:
terminateTo:
at:put:
at:put:
primitiveExitCriticalSection
primSocketConnectionStatus:
primSocketSendDone:
primSocketConnectionStatus:
primSocketSendDone:
primSocket:sendData:startIndex:count:
millisecondClockValue
primitiveWait
basicNew
basicNew
primitiveEnterCriticalSection
new:
replaceFrom:to:with:startingAt:
at:put:
new:
replaceFrom:to:with:startingAt:
at:put:
primitiveExitCriticalSection
basicNew
basicNew
primitiveEnterCriticalSection
new:
replaceFrom:to:with:startingAt:
at:put:
new:
replaceFrom:to:with:startingAt:
at:put:
primitiveExitCriticalSection
basicNew
millisecondClockValue
primitiveWait
primitiveWait
primitiveWait
signal
primitiveWait
signal
primitiveWait
signal
primitiveWait
signal
primitiveWait
signal
primitiveWait
signal
signal
primitiveWait
primitiveEnterCriticalSection
replaceFrom:to:with:startingAt:
primSocketConnectionStatus:
primitiveExitCriticalSection
suspend
primSocketReceiveDataAvailable:
primSocketReceiveDataAvailable:
primSocketConnectionStatus:
basicNew:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
basicNew
millisecondClockValue
=
digitMultiply:neg:
<
basicNew
pointsTo:
basicNew:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
basicNew:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
basicNew:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
primitiveWait
at:put:
signal
signal
basicNew
primitiveWait
at:put:
signal
signal
primSocketConnectionStatus:
primSocketDestroy:

stack page bytes 4096 available headroom 3300 minimum unused headroom 2164

        (Segmentation fault)

> From: David T. Lewis
> Sent: Thursday, October 09, 2014 8:22 PM
>
> On Fri, Oct 10, 2014 at 02:09:08AM +0200, Göran Krampe wrote:
> >
> > Hi guys!

Hi Eliot, hope you are happy and well!

> >
> > On 10/10/2014 01:28 AM, David T. Lewis wrote:
> > >>>Ron and I (3DICC) have a problem with the Unix VM networking and I
> > >>>am reaching out before burning too many hours on something one of
> > >>>you C-Unix/Socket/VM guys can fix in an afternoon - and earn a buck
> > >>>for your trouble.
> > >>
> > >>Cool.  This is likely easy to fix.  Your image is running out of
> > >>file descriptors.  Track open and close calls, e.g. add logging
> > >>around at least StandardFileStream>>#primOpen:writable:
> > >>, AsyncFile>>#primOpen:forWrite:semaIndex:,
> >
> >>Socket>>#primAcceptFrom:receiveBufferSize:sendBufSize:semaIndex:rea
> dSemaIndex:writeSemaIndex:
> > >>and their associated close calls and see what's being opened without
> > >>being closed.  It shoudl be easy to track=down, but may be more
> > >>difficult to fix.
> > >>
> > >>Good luck!
> >
> > Aha. Soo... am I understanding this correctly - we are probably
> > leaking fds and when we go above 1024 this makes select() go bonkers
> > and eventually leads to the "Bad file descriptor" error?
> >
> > >I agree with what Eliot is saying and would add a few thoughts:
> >
> > >- Don't fix the wrong problem (DFtWP). Unless you have some reason to
> > >believe that this server application would realistically have a need
> > >to handle anything close to a thousand concurrent TCP sessions, don't
> > >fix it by raising the per-process file handle limit, and don't fix it
> > >by reimplementing the socket listening code.
> >
> > We haven't done the exact numbers, but we could probably hit several
> > hundreds concurrent at least. 1024 seemed a bit "over the top" though
> > :)
> >
> > The system in question is meant to serve more than 1000 concurrent
> > users, so we are in fact moving into this territory. We have been up
> > to around 600 so far.
> >
> > >- It is entirely possible that no one before you has ever tried to
> > >run a server application with the per-process file handle limit
> > >bumped up above the default 1024. So if that configuration does not
> > >play nicely with the select() mechanism, you may well be the first to
> > >have encountered this as an issue. But see above, don't fix it if it
ain't

> broke.
> >
> > Well, it most probably *is* broke - I mean - I haven't read anywhere
> > that our Socket code is limited to 1024 concurrent sockets and that
> > going above that limit causes the Socket code to stop working? :)
> >
> > But I agree - I don't want to touch that code if we can simply avoid
> > this bug by making sure we stay below 1024.
> >
> > But it sounds broke to me, nevertheless. ;)
>
> Indeed it probably is.
>
> >
> > >- Most "out of file descriptor" problems involve resource leaks (as
> > >Eliot is suggesting), and in those cases you will see a gradual
> > >increase in file descriptors in /proc/<vmpid>/fd/ over time.

Thank you for the pointer.  That's a pretty cool directory to look at!

> > >Eventually you run out of descriptors and something horrible happens.
> >
> > We will start looking at that and other tools too.
> >
> >
> > >- Sorry to repeat myself but this is by far the most important point:
> > >DFtWP.
> >
> > Sure :). This is why I posted - to get your input. And I have a
> > suspicion that the SAML issue I mentioned may be the code leaking, we
> > will start looking.
> >
> > regards, Göran
>
> Cool. Please keep us posted on progress. In the unlikely event that free
> advice turns out to be worth anything, there is plenty more where that
> came from ;-)
>
> Dave
>

Ron Teitelbaum



Re: Squeak socket problem ... help!

Max Leske
In reply to this post by Göran Krampe
 
Just my two cents:

We ran into the “errno 9 select: Bad file descriptor” error two weeks back while playing around with systemd and socket activation.
In our case it was the following:
1. send request to socket -> image starts through socket activation
2. image initializes itself with the special systemd socket handle and keeps track of it (because we want to reuse it after restarting the image)
3. save the image
4. errno 9

What we found was that when the image was being saved, stderr would be closed and then reopened by the system, as was the systemd socket (which was a different problem). Unfortunately we had always assumed a fixed fd number for the systemd socket, namely 3. Now, when the image reopened stderr it could happen that another socket had taken fd 2 and stderr would be on fd 3, and thus we tried to open stderr as a TCP socket.

What I want to say is, just really make sure that you’re not doing something funny with sockets (e.g. fixed fd number) before you look at the VM.
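
For what it's worth, the usual way to avoid hard-coding fd 3 is to honour the LISTEN_PID/LISTEN_FDS environment variables that systemd sets (the convention sd_listen_fds() implements). A minimal hand-rolled C sketch, for illustration only:

/* Sketch of the systemd socket-activation convention: instead of assuming
   the passed socket is always fd 3, check LISTEN_PID/LISTEN_FDS
   (this is what sd_listen_fds() does for you). */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SD_LISTEN_FDS_START 3   /* first fd passed by systemd */

static int activatedFdCount(void)
{
    const char *pidStr = getenv("LISTEN_PID");
    const char *fdsStr = getenv("LISTEN_FDS");
    if (!pidStr || !fdsStr)
        return 0;                             /* not socket-activated */
    if ((pid_t)strtol(pidStr, NULL, 10) != getpid())
        return 0;                             /* env meant for another process */
    return (int)strtol(fdsStr, NULL, 10);     /* fds 3 .. 3+n-1 are the sockets */
}

int main(void)
{
    int n = activatedFdCount();
    for (int i = 0; i < n; i++)
        printf("systemd passed listening fd %d\n", SD_LISTEN_FDS_START + i);
    return 0;
}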

Good luck!


Cheers,
Max

On 10.10.2014, at 05:56, [hidden email] wrote:

Date: Thu, 9 Oct 2014 23:56:03 -0400
From: "Ron Teitelbaum" <[hidden email]>
Subject: RE: [Vm-dev] Squeak socket problem ... help!
To: "'Squeak Virtual Machine Development Discussion'"
<[hidden email]>
Message-ID: <[hidden email]>
Content-Type: text/plain; charset="iso-8859-1"

Hi All!

Thanks for your help!!

Trying to reproduce the error I ran 1000 clients against our test cluster of
7 servers.

The fd size hovers around 250 and never made it above 490 (there were 4
service providers so the main server should only handle around 250 people).
David, Thanks for the pointers on how to find the fd size.  Very cool.
Since the fd size seemed to be well within proper limits and went both up
and down I think you are correct we may be looking at the wrong thing.
Could be that we are closing a socket twice?  (see 0xffa5e0cc M
Socket>destroy 0xd9ba02d4: a(n) Socket)  Göran mentioned finding something
that said on a very busy server you may try closing an already closed socket
that had already been reused.  On not so busy system you can get away with
it so you may not see this error until you put a very large load on it.

I captured the error: 

This repeated about 1000 times (no idea why it would repeat so many times):

errno 9
select: Bad file descriptor

Then:

Segmentation fault Fri Oct 10 02:46:17 2014


Squeak VM version: 4.0-2798 #1 Wed Nov 13 12:31:45 UTC 2013 gcc 4.4.7
[Production ITHB VM]
Built from: CoInterpreter VMMaker.oscog-eem.496 uuid:
f0401045-f2f7-470e-9940-3535be9c0334 Nov 13 2013
With: StackToRegisterMappingCogit VMMaker.oscog-eem.496 uuid:
f0401045-f2f7-470e-9940-3535be9c0334 Nov 13 2013
Revision: VM: r2798 http://www.squeakvm.org/svn/squeak/branches/Cog
Plugins: r2545 http://squeakvm.org/svn/squeak/trunk/platforms/Cross/plugins
Build host: Linux cogvm 2.6.32-358.el6.x86_64 #1 SMP Fri Feb 22 00:31:26 UTC
2013 x86_64 x86_64 x86_64 GNU/Linux
plugin path: /home/openqwaq/server/bin/forums/../qwaqvm/plugins [default:
/home/openqwaq/server/bin/cogvm/lib/squeak/4.0-2798/]


C stack backtrace & registers:
       eax 0xffa191a4 ebx 0xffa190c0 ecx 0xffa19158 edx 0xffa1910c
       edi 0xffa18f90 esi 0xffa18f90 ebp 0xffa19028 esp 0xffa19074
       eip 0xffa19288
*[0xffa19288]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak[0x805e
54e]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak[0x805e
84f]
[0x5ad410]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak(sqSock
etCloseConnection+0x26)[0x811f0a6]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak(sqSock
etDestroy+0x40)[0x811fad0]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak[0x811c
9f8]
[0xd7444848]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak(interp
ret+0x76e)[0x8084eee]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak(main+0
x2b4)[0x805efa4]
/lib/libc.so.6(__libc_start_main+0xe6)[0x2f4d36]
/home/openqwaq/server/bin/forums/../qwaqvm/lib/squeak/4.0-2798/squeak[0x805b
d41]
[0xffa18c80]


Smalltalk stack dump:
0xffa5e0cc M Socket>destroy 0xd9ba02d4: a(n) Socket
0xffa5e0e4 M QServiceProviderClient>destroy 0xd9c33dd4: a(n)
QServiceProviderClient
0xffa5e0fc M QServiceProviderClient>connectionClosed 0xd9c33dd4: a(n)
QServiceProviderClient
0xffa5e114 M [] in QServiceProviderClient>processRequest 0xd9c33dd4: a(n)
QServiceProviderClient
0xffa5e138 M Socket>waitForDataFor:ifClosed:ifTimedOut: 0xd9ba02d4: a(n)
Socket
0xffa5e164 M QServiceProviderClient>processRequest 0xd9c33dd4: a(n)
QServiceProviderClient
0xffa5e17c M [] in QServiceProviderClient>on:in: 0xd9c33dd4: a(n)
QServiceProviderClient
0xffa5e194 M [] in QSafeProcessLoop>run 0xd9c35efc: a(n) QSafeProcessLoop
0xffa5e1b0 M BlockClosure>on:do: 0xda118824: a(n) BlockClosure
0xffa5e1d0 M QSafeProcessLoop>run 0xd9c35efc: a(n) QSafeProcessLoop
0xd9c646dc s [] in QSafeProcessLoop(Object)>fork:at:
0xd9c36020 s [] in BlockClosure>newProcess

Most recent primitives
primSocketConnectionStatus:
primSocketSendDone:
primSocket:sendData:startIndex:count:
millisecondClockValue
primitiveWait
millisecondClockValue
millisecondClockValue
primitiveEnterCriticalSection
primitiveEnterCriticalSection
at:put:
findNextUnwindContextUpTo:
tempAt:
tempAt:put:
tempAt:
terminateTo:
primitiveExitCriticalSection
findNextUnwindContextUpTo:
terminateTo:
primitiveExitCriticalSection
new:
at:put:
at:put:
updateCrc:from:to:in:
updateCrc:from:to:in:
bitXor:
at:put:
at:put:
at:put:
at:put:
at:put:
at:put:
at:put:
at:put:
replaceFrom:to:with:startingAt:
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
=
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
replaceFrom:to:with:startingAt:
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
=
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
signal
primitiveEnterCriticalSection
primitiveEnterCriticalSection
findNextUnwindContextUpTo:
tempAt:
tempAt:put:
tempAt:
terminateTo:
primitiveExitCriticalSection
findNextUnwindContextUpTo:
terminateTo:
at:put:
at:put:
primitiveExitCriticalSection
primSocketConnectionStatus:
primSocketSendDone:
primSocketConnectionStatus:
primSocketSendDone:
primSocket:sendData:startIndex:count:
millisecondClockValue
primitiveWait
millisecondClockValue
primitiveWait
primitiveWait
primitiveWait
primitiveWait
primitiveWait
primitiveWait
millisecondClockValue
millisecondClockValue
basicNew:
at:put:
basicNew:
integerAt:put:size:signed:
basicNew:
primitiveARC4Process:from:to:m:x:y:
findSubstring:in:startingAt:matchTable:
basicNew
stringHash:initialHash:
hashBytes:startingWith:
basicNew
at:put:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
stringHash:initialHash:
basicNew
stringHash:initialHash:
primitiveEnterCriticalSection
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
at:put:
at:put:
signal
millisecondClockValue
millisecondClockValue
primitiveEnterCriticalSection
primitiveExitCriticalSection
primitiveEnterCriticalSection
at:put:
findNextUnwindContextUpTo:
tempAt:
tempAt:put:
tempAt:
terminateTo:
primitiveExitCriticalSection
findNextUnwindContextUpTo:
terminateTo:
primitiveExitCriticalSection
new:
at:put:
at:put:
updateCrc:from:to:in:
updateCrc:from:to:in:
bitXor:
at:put:
at:put:
at:put:
at:put:
at:put:
at:put:
at:put:
at:put:
replaceFrom:to:with:startingAt:
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
=
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
replaceFrom:to:with:startingAt:
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
=
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
at:put:
primitiveExitCriticalSection
primitiveEnterCriticalSection
primitiveEnterCriticalSection
findNextUnwindContextUpTo:
tempAt:
tempAt:put:
tempAt:
terminateTo:
primitiveExitCriticalSection
findNextUnwindContextUpTo:
terminateTo:
at:put:
at:put:
primitiveExitCriticalSection
primSocketConnectionStatus:
primSocketSendDone:
primSocketConnectionStatus:
primSocketSendDone:
primSocket:sendData:startIndex:count:
millisecondClockValue
primitiveWait
basicNew
basicNew
primitiveEnterCriticalSection
new:
replaceFrom:to:with:startingAt:
at:put:
new:
replaceFrom:to:with:startingAt:
at:put:
primitiveExitCriticalSection
basicNew
basicNew
primitiveEnterCriticalSection
new:
replaceFrom:to:with:startingAt:
at:put:
new:
replaceFrom:to:with:startingAt:
at:put:
primitiveExitCriticalSection
basicNew
millisecondClockValue
primitiveWait
primitiveWait
primitiveWait
signal
primitiveWait
signal
primitiveWait
signal
primitiveWait
signal
primitiveWait
signal
primitiveWait
signal
signal
primitiveWait
primitiveEnterCriticalSection
replaceFrom:to:with:startingAt:
primSocketConnectionStatus:
primitiveExitCriticalSection
suspend
primSocketReceiveDataAvailable:
primSocketReceiveDataAvailable:
primSocketConnectionStatus:
basicNew:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
basicNew
millisecondClockValue
=
digitMultiply:neg:
<
basicNew
pointsTo:
basicNew:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
basicNew:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
basicNew:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
primitiveWait
at:put:
signal
signal
basicNew
primitiveWait
at:put:
signal
signal
primSocketConnectionStatus:
primSocketDestroy:

stack page bytes 4096 available headroom 3300 minimum unused headroom 2164

       (Segmentation fault)

From: David T. Lewis
Sent: Thursday, October 09, 2014 8:22 PM

On Fri, Oct 10, 2014 at 02:09:08AM +0200, Göran Krampe wrote:

Hi guys!

Hi Eliot, hope you are happy and well!


On 10/10/2014 01:28 AM, David T. Lewis wrote:
Ron and I (3DICC) have a problem with the Unix VM networking and I
am reaching out before burning too many hours on something one of
you C-Unix/Socket/VM guys can fix in an afternoon - and earn a buck
for your trouble.

Cool.  This is likely easy to fix.  Your image is running out of
file descriptors.  Track open and close calls, e.g. add logging
around at least StandardFileStream>>#primOpen:writable:
, AsyncFile>>#primOpen:forWrite:semaIndex:,

Socket>>#primAcceptFrom:receiveBufferSize:sendBufSize:semaIndex:rea
dSemaIndex:writeSemaIndex:
and their associated close calls and see what's being opened without
being closed.  It should be easy to track down, but may be more
difficult to fix.

Good luck!

Aha. Soo... am I understanding this correctly - we are probably
leaking fds and when we go above 1024 this makes select() go bonkers
and eventually leads to the "Bad file descriptor" error?

I agree with what Eliot is saying and would add a few thoughts:

- Don't fix the wrong problem (DFtWP). Unless you have some reason to
believe that this server application would realistically have a need
to handle anything close to a thousand concurrent TCP sessions, don't
fix it by raising the per-process file handle limit, and don't fix it
by reimplementing the socket listening code.

We haven't done the exact numbers, but we could probably hit several
hundred concurrent at least. 1024 seemed a bit "over the top" though
:)

The system in question is meant to serve more than 1000 concurrent
users, so we are in fact moving into this territory. We have been up
to around 600 so far.

- It is entirely possible that no one before you has ever tried to
run a server application with the per-process file handle limit
bumped up above the default 1024. So if that configuration does not
play nicely with the select() mechanism, you may well be the first to
have encountered this as an issue. But see above, don't fix it if it
ain't
broke.

Well, it most probably *is* broke - I mean - I haven't read anywhere
that our Socket code is limited to 1024 concurrent sockets and that
going above that limit causes the Socket code to stop working? :)

But I agree - I don't want to touch that code if we can simply avoid
this bug by making sure we stay below 1024.

But it sounds broke to me, nevertheless. ;)

Indeed it probably is.


- Most "out of file descriptor" problems involve resource leaks (as
Eliot is suggesting), and in those cases you will see a gradual
increase in file descriptors in /proc/<vmpid>/fd/ over time.

Thank you for the pointer.  That's a pretty cool directory to look at!

Eventually you run out of descriptors and something horrible happens.

We will start looking at that and other tools too.
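
A minimal C sketch of that kind of fd counting (illustrative only, not from the thread): scanning /proc/self/fd from inside the process gives the same information as looking at /proc/<vmpid>/fd/ from outside, as David suggests.

#include <dirent.h>
#include <stdio.h>

int main(void)
{
    DIR *dir = opendir("/proc/self/fd");
    if (dir == NULL) { perror("opendir"); return 1; }

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL)
        if (entry->d_name[0] != '.')       /* skip "." and ".." */
            count++;
    closedir(dir);

    printf("open file descriptors: %d\n", count - 1);  /* -1 for opendir's own fd */
    return 0;
}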


- Sorry to repeat myself but this is by far the most important point:
DFtWP.

Sure :). This is why I posted - to get your input. And I have a
suspicion that the SAML issue I mentioned may be the code leaking, we
will start looking.

regards, Göran

Cool. Please keep us posted on progress. In the unlikely event that free
advice turns out to be worth anything, there is plenty more where that
came from ;-)

Dave


Ron Teitelbaum

Reply | Threaded
Open this post in threaded view
|

Re: Squeak socket problem ... help!

Max Leske
In reply to this post by Göran Krampe
 
Disclaimer:

I just realized I’m not 100% sure the scenario I described was the actual cause of errno 9. It might have been another thing we did, where we tried to reuse the memory handle for the socket (don’t judge… we duly facepalmed ourselves when we realized how silly that was :) ).

Still, the basic idea remains the same: make sure you don’t mess with sockets.

Reply | Threaded
Open this post in threaded view
|

Re: Squeak socket problem ... help!

Göran Krampe
In reply to this post by Ron Teitelbaum
 
Hi guys!

Long post but an interesting one. Summary for the impatient:

- Ron seems to prove that:
        - We never reach 1024 fds so no leak.
        - But select() still fails, so we probably have an obscure close() bug
with fds.

- And I prove that select() is indeed broken above 1024.
- And some other interesting details like why "Socket new" creates *two*
instances of Socket etc.

Details follows:

On 10/10/2014 05:56 AM, Ron Teitelbaum wrote:
> Hi All!
>
> Thanks for your help!!
>
> Trying to reproduce the error I ran 1000 clients against our test cluster of
> 7 servers.

Hehe, this is fairly interesting in itself - Ron is firing up a whole
slew of machines on EC2 and is in fact running 1000 Terf clients on ...
50 servers, I think, and they actually run under Wine on Ubuntu :)

> The fd size hovers around 250 and never made it above 490 (there were 4
> service providers so the main server should only handle around 250 people).

Very interesting. Thus, this is NOT the 1024 limitation of select()
then, nor are we leaking. Unfortunately - since it implies a harder bug
to find.

> David, Thanks for the pointers on how to find the fd size.  Very cool.
> Since the fd size seemed to be well within proper limits and went both up
> and down I think you are correct we may be looking at the wrong thing.

Well, David seems to have been both right and wrong - I guessed the
wrong thing: it's probably not the 1024-limit of select() that is the
*cause* of our problem - but it's not a leak either, more likely an obscure
close bug. Right? :)

NOTE: After writing this whole post I am not sure. Given that I can
reproduce the errno 9 spewage by simply crossing the 1024 limit (see
below), it would still be nice to verify that this is NOT what is happening
when you throw 1000 users at it. You did check the fd count using
David's instructions - can we be sure that we are well below the limit?


Select() *is* broken
====================

Further, just so we don't drop this ball, there *is* a problem with
select() if we raise the ulimit above 1024. If I first run this trivial
code (Apache answers on localhost:80) with the default limit of 1024:

| local |
local := NetNameResolver addressForName: 'localhost' timeout: 20.
(1 to: 1040) collect: [:n | Socket new connectTo: local port: 80]

And do an inspect on that - it will fail when it gets to 1017 (some
descriptors already taken I presume) due to the default ulimit of 1024 -
as expected. It fails a bit inconsistently with different errors, but
anyway, nothing really surprising.

So... then I raised my limits so that "ulimit -S" reports 2048 for my
user. Then tried again, and behold!!! When I pass 1017 my stderr starts
spewing out:

errno 9
select: Bad file descriptor
errno 9
select: Bad file descriptor
errno 9
select: Bad file descriptor
errno 9
select: Bad file descriptor
errno 9
select: Bad file descriptor

...continuously, so yes, it fails on every aioPoll().

And in Squeak a debugger pops up about 45 seconds later with a
connection timeout - also expected - since the next Socket to open
never connects: select() isn't working anymore, so Squeak never gets
notified about the connection being made.

Now that the errno 9 is spewing continuously the VM is still running and
Squeak "works". But if I try to connect somewhere - I just timeout.


Testing in Pharo 2.0 with pharo VM
==================================

So let's try in Pharo 2.0. Test code:

| local |
local := NetNameResolver addressForName: 'localhost' timeout: 20.
(1 to: 1017) collect: [:n | Socket new connectTo: local port: 80]

Note that 1016 works, over and over after a Smalltalk garbageCollect, no
problem. But if we try it with 1017 we get a smack in the face:

./pharo shared/Pharo2.0.image
*** buffer overflow detected ***: /home/gokr/pharo/pharo2.0/bin/pharo
terminated
======= Backtrace: =========
/lib/i386-linux-gnu/libc.so.6(+0x696de)[0xf757f6de]
/lib/i386-linux-gnu/libc.so.6(__fortify_fail+0x6b)[0xf761213b]
/lib/i386-linux-gnu/libc.so.6(+0xfafca)[0xf7610fca]
/lib/i386-linux-gnu/libc.so.6(+0xfc02a)[0xf761202a]
/home/gokr/pharo/pharo2.0/bin/pharo(aioEnable+0x31)[0x80978d1]
/home/gokr/pharo/pharo2.0/bin/pharo(sqSocketConnectToPort+0xd6)[0x80f38b6]
/home/gokr/pharo/pharo2.0/bin/pharo[0x80f1518]
[0xb70bd770]
[0xb716fbd9]
[0xb716fb16]
[0xb716d53a]
[0xb70bd700]
[0xb70bd5c0]
.... etc

That's interesting? So trying again with the latest Cog from
mirandabanda...

Testing in Pharo 2.0 with Cog from Eliot
========================================

Aha, so here we first need to execute (too few semaphores, only 256 and
I think we need about 3 per Socket?):

VirtualMachine maxExternalSemaphoresSilently: 8192

...and then we get the same behavior as using our oldish Cog, namely it
starts spewing out the "errno 9 select: Bad file descriptor" over and
over and the next Socket connection times out.


So... back to you Eliot - am I still wrong? :) :)


Extra Oddity
============

Socket allInstances size => 2038
(Socket allInstances select: [:e | e isConnected]) size => 2032
(Socket allInstances collect: [:e | e socketHandle]) asSet size => 1019

...what? After some investigation it turns out that "Socket new" will
create *two* instances of Socket. This has to do with the WeakRegistry
used by the Socket class :)

executor
        "Return an object which can act as executor for finalization of the receiver"
        ^self shallowCopy actAsExecutor

...so each Socket gets a shallow copy of itself as the executor. Ok,
that explains it - but IMHO it sure ain't pretty.

> Could be that we are closing a socket twice?  (see 0xffa5e0cc M
> Socket>destroy 0xd9ba02d4: a(n) Socket)  Göran mentioned finding something
> that said on a very busy server you may try closing an already closed socket
> that had already been reused.  On not so busy system you can get away with
> it so you may not see this error until you put a very large load on it.
>
> I captured the error:
>
> This repeated about 1000 times (no idea why it would repeat so many times):
>
> errno 9
> select: Bad file descriptor

It repeats because we call select() many times per second and the bad
(closed) fd is still in the set we pass in, so it barfs every time. And
since it barfs, it no longer does its job, so this one bad fd blocks
everything - and once this hits, the VM goes unresponsive when it comes
to file handles.

I guess one could "clean out" the bad fd if this happens - but that is
of course a hack, and I am not sure how to do it.
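
A rough sketch of what such a "clean out" hack could look like (illustrative only, not actual aio.c code and not the VM's real data structures): before calling select(), probe every registered descriptor with fcntl(F_GETFD) and drop the ones that report EBADF, so one stale fd no longer poisons every poll.

#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>

static void dropStaleDescriptors(fd_set *set, int maxFd)
{
    for (int fd = 0; fd <= maxFd; fd++)
        if (FD_ISSET(fd, set) && fcntl(fd, F_GETFD) == -1 && errno == EBADF)
            FD_CLR(fd, set);   /* closed behind our back - stop selecting on it */
}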

Then exactly why the segfault comes, who knows. :)

And oh, the SAML auth that we thought was leaking - no. It just uses
WebClient httpGet and doesn't hold onto anything. Some trivial tests
show that they should be GCed properly - although yes, John's remark
about GC coming "later" is also interesting.

regards, Göran
Reply | Threaded
Open this post in threaded view
|

Re: Squeak socket problem ... help!

Göran Krampe
 
On 10/10/2014 11:46 AM, Göran Krampe wrote:

> Hi guys!
>
> Long post but an interesting one. Summary for the impatient:
>
> - Ron seems to prove that:
>      - We never reach 1024 fds so no leak.
>      - But select() still fails, so we probably have an obscure close()
> bug with fds.
>
> - And I prove that select() is indeed broken above 1024.

Aha!

It seems Ragnar Hojland Espinosa has already fixed this 11 years ago:

http://lists.squeakfoundation.org/pipermail/squeak-dev/2003-May/057750.html

I also found code from David Shaffer that fixes the listening to not
miss stuff, also interesting:

http://lists.squeakfoundation.org/pipermail/squeak-dev/2005-August/093575.html

...and the patch he mentions:

http://lists.squeakfoundation.org/pipermail/squeak-dev/2005-August/093638.html


regards, Göran




Reply | Threaded
Open this post in threaded view
|

Re: Squeak socket problem ... help!

David T. Lewis
 
On Fri, Oct 10, 2014 at 01:15:21PM +0200, Göran Krampe wrote:

>
> On 10/10/2014 11:46 AM, Göran Krampe wrote:
> >Hi guys!
> >
> >Long post but an interesting one. Summary for the impatient:
> >
> >- Ron seems to prove that:
> >     - We never reach 1024 fds so no leak.
> >     - But select() still fails, so we probably have an obscure close()
> >bug with fds.
> >
> >- And I prove that select() is indeed broken above 1024.
>
> Aha!
>
> It seems Ragnar Hojland Espinosa has already fixed this 11 years ago:
>
> http://lists.squeakfoundation.org/pipermail/squeak-dev/2003-May/057750.html
>
> I also found code from David Shaffer that fixes the listening to not
> miss stuff, also interesting:
>
> http://lists.squeakfoundation.org/pipermail/squeak-dev/2005-August/093575.html
>
> ...and the patch he mentions:
>
> http://lists.squeakfoundation.org/pipermail/squeak-dev/2005-August/093638.html
>
>
> regards, Göran
>

Nice detective work! I can't look right now but over the weekend I'll check to
make sure we capture these in Mantis and get the fixes into the oscog/trunk
repositories as needed.

Dave

Reply | Threaded
Open this post in threaded view
|

Re: Squeak socket problem ... help!

Ron Teitelbaum
In reply to this post by Max Leske
 
Thanks Max!

Ron

From: [hidden email]
[mailto:[hidden email]] On Behalf Of Max Leske
Sent: Friday, October 10, 2014 1:13 AM
To: [hidden email]
Subject: [Vm-dev] Re: Squeak socket problem ... help!



Reply | Threaded
Open this post in threaded view
|

Re: Squeak socket problem ... help!

David T. Lewis
In reply to this post by Göran Krampe
 
On Fri, Oct 10, 2014 at 01:15:21PM +0200, Göran Krampe wrote:

>
> On 10/10/2014 11:46 AM, Göran Krampe wrote:
> >Hi guys!
> >
> >Long post but an interesting one. Summary for the impatient:
> >
> >- Ron seems to prove that:
> >     - We never reach 1024 fds so no leak.
> >     - But select() still fails, so we probably have an obscure close()
> >bug with fds.
> >
> >- And I prove that select() is indeed broken above 1024.
>
> Aha!
>
> It seems Ragnar Hojland Espinosa has already fixed this 11 years ago:
>
> http://lists.squeakfoundation.org/pipermail/squeak-dev/2003-May/057750.html

I entered Mantis 7822: "Unix aio VM support for 1024+ descriptors" to track
this issue:

  http://bugs.squeak.org/view.php?id=7822

>
> I also found code from David Shaffer that fixes the listening to not
> miss stuff, also interesting:
>
> http://lists.squeakfoundation.org/pipermail/squeak-dev/2005-August/093575.html
>
> ...and the patch he mentions:
>
> http://lists.squeakfoundation.org/pipermail/squeak-dev/2005-August/093638.html
>

I entered Mantis 7823: "Patch for socket accept problem on unix VM" to track
this issue:

  http://bugs.squeak.org/view.php?id=7823

Both of these issues will require some follow up work in order to get them
included in the VMs. If further progress is made, please keep the Mantis issues
updated and/or report progress to the vm-dev list.

Dave

Reply | Threaded
Open this post in threaded view
|

Re: Squeak socket problem ... help!

Levente Uzonyi-2
 
On Sat, 11 Oct 2014, David T. Lewis wrote:

>
> On Fri, Oct 10, 2014 at 01:15:21PM +0200, Göran Krampe wrote:
>>
>> On 10/10/2014 11:46 AM, Göran Krampe wrote:
>>> Hi guys!
>>>
>>> Long post but an interesting one. Summary for the impatient:
>>>
>>> - Ron seems to prove that:
>>>     - We never reach 1024 fds so no leak.
>>>     - But select() still fails, so we probably have an obscure close()
>>> bug with fds.
>>>
>>> - And I prove that select() is indeed broken above 1024.
>>
>> Aha!
>>
>> It seems Ragnar Hojland Espinosa has already fixed this 11 years ago:
>>
>> http://lists.squeakfoundation.org/pipermail/squeak-dev/2003-May/057750.html
>
> I entered Mantis 7822: "Unix aio VM support for 1024+ descriptors" to track
> this issue:
>
>  http://bugs.squeak.org/view.php?id=7822

I think this is a great summary of the current status on unix systems:
http://daniel.haxx.se/docs/poll-vs-select.html
I think the best would be to use libevent or libev, but adding support for
epoll (on linux only) is probably a lot easier to do.

>
>>
>> I also found code from David Shaffer that fixes the listening to not
>> miss stuff, also interesting:
>>
>> http://lists.squeakfoundation.org/pipermail/squeak-dev/2005-August/093575.html
>>
>> ...and the patch he mentions:
>>
>> http://lists.squeakfoundation.org/pipermail/squeak-dev/2005-August/093638.html
>>
>
> I entered Mantis 7823: "Patch for socket accept problem on unix VM" to track
> this issue:
>
>  http://bugs.squeak.org/view.php?id=7823

This is pretty interesting. Now I know where those spurious 504 responses
come from.


Levente

>
> Both of these issues will require some follow up work in order to get them
> included in the VMs. If further progress is made, please keep the Mantis issues
> updated and/or report progress to the vm-dev list.
>
> Dave
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Squeak socket problem ... help!

Göran Krampe
 
Hi Levente!

On 10/12/2014 12:52 PM, Levente Uzonyi wrote:

>>> It seems Ragnar Hojland Espinosa has already fixed this 11 years ago:
>>>
>>> http://lists.squeakfoundation.org/pipermail/squeak-dev/2003-May/057750.html
>>>
>>
>> I entered Mantis 7822: "Unix aio VM support for 1024+ descriptors" to
>> track
>> this issue:
>>
>>  http://bugs.squeak.org/view.php?id=7822
>
> I think this is a great summary of the current status on unix systems:
> http://daniel.haxx.se/docs/poll-vs-select.html
> I think the best would be to use libevent or libev, but adding support
> for epoll (on linux only) is probably a lot easier to do.

And it seems that Ragnar has already done it for epoll (see post below)
- I need to find his email and check with him. I agree on the
libev/libevent thing, but as you say - that requires more changes
presumably.

I am not sure it will fix our issue - which seems to be that a socket
suddenly gets closed twice (or something) - but who knows.

regards, Göran
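
For reference, a minimal sketch of the kind of Linux-only epoll loop Levente suggests as a replacement for select() (the names here are illustrative, not the actual aio.c API). epoll has no FD_SETSIZE-style ceiling, so descriptors above 1023 are not a problem.

/* Illustrative only. */
#include <stdio.h>
#include <sys/epoll.h>

#define MAX_EVENTS 64

int main(void)
{
    int epollFd = epoll_create1(0);
    if (epollFd < 0) { perror("epoll_create1"); return 1; }

    /* Registering a descriptor (e.g. a listening socket 'listenFd') would be:
         struct epoll_event ev = { .events = EPOLLIN, .data.fd = listenFd };
         epoll_ctl(epollFd, EPOLL_CTL_ADD, listenFd, &ev);                      */

    struct epoll_event events[MAX_EVENTS];
    int ready = epoll_wait(epollFd, events, MAX_EVENTS, 0);   /* wait up to 0 ms */
    for (int i = 0; i < ready; i++)
        printf("fd %d is ready\n", events[i].data.fd);
    return 0;
}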