I was able to work on getting Magma 1.2 going in Pharo. It was quite easy to get the code loaded and functioning in Pharo 1.1.1, Pharo 1.2, and Pharo 1.3.

But something seems to have changed in Pharo's networking from 1.1.1 to 1.2. All Magma functionality seems to work fine for low-volume activity. However, when the test suite gets to the HA test cases (at the end), one of the images performing heavy networking activity consistently gets very slow and bogged down for some reason, causing the clients to time out and disrupting the test suite. Fortunately, it happens in the same place in the test suite every time.

The UI of the image in question becomes VERY sluggish, but MessageTally spyAllOn: didn't reveal anything useful. What is it doing? I did verify that the Magma server in that image is still functioning; clients were committing, but I had to increase their timeouts from 10 to 45 seconds to avoid timeouts.

Unfortunately, after two days of wrangling in Pharo (because I'm an old Squeak dog) I could not nail the problem down, but I have one suspect. A couple of times, I caught a process seemingly hung up in NetworkNameResolver, trying to resolve an IP from 'localhost'.

This exact set of Magma packages is rock-solid on Pharo 1.1.1 and Squeak, but that doesn't mean the problem for sure lies in Pharo 1.2; maybe a networking bug in 1.1.1 is allowing Magma to "misuse" the network and get away with it, and Pharo 1.2 is now more strict? I don't know. I would just like to ask the experts here who know everything that went into Pharo 1.2 for help, so hopefully we can get to the bottom of it.

Thanks,
Chris
Two thoughts:

(1) Gary recently mentioned a delay fix that IIRC was in Squeak but had not yet made it into Pharo. It might be central to the problem?

(2) Not to say that the test should take so long to run, but the network layer should not be timing out at all. Decisions of when to give up should be left to the application and indirectly the user, assuming the machine is attended (the app will "know" that, which the network layer cannot). Attempts to connect or do I/O should do just that until they are told to stop. Servers should listen and accept connections until they are stopped.

Bill

________________________________________
From: [hidden email] [[hidden email]] On Behalf Of Chris Muller [[hidden email]]
Sent: Sunday, April 17, 2011 4:48 PM
To: [hidden email]; magma
Subject: [Pharo-project] Networking change in Pharo 1.2?
In reply to this post by Chris Muller-4
On Sun, Apr 17, 2011 at 2:48 PM, Chris Muller <[hidden email]> wrote:
> Unfortunately, two days of wrangling in Pharo (because I'm an old
> Squeak dog) I could not nail the problem down; but I have one
> suspect.. A couple of times, I caught a process seemingly hung up in
> NetworkNameResolver; trying to resolve an IP from 'localhost'.

probably related to this:
http://lists.squeakfoundation.org/pipermail/magma/2010-September/001594.html
In reply to this post by Chris Muller-4
On Apr 17, 2011, at 10:48 PM, Chris Muller wrote:

> Unfortunately, two days of wrangling in Pharo (because I'm an old
> Squeak dog) I could not nail the problem down; but I have one
> suspect.. A couple of times, I caught a process seemingly hung up in
> NetworkNameResolver; trying to resolve an IP from 'localhost'.

The only change to NetNameResolver was this:

http://code.google.com/p/pharo/issues/detail?id=1853

Socket in general did not see many changes:

http://code.google.com/p/pharo/issues/list?can=1&q=milestone%3D1.2+Socket

--
Marcus Denker -- http://www.marcusdenker.de
INRIA Lille -- Nord Europe. Team RMoD.
In reply to this post by Chris Muller-4
On Apr 17, 2011, at 11:01 PM, Schwab,Wilhelm K wrote:

> Two thoughts: (1) Gary recently mentioned a delay fix that IIRC was in Squeak but had not yet made it into Pharo. It might be central to the problem??

Gary's fix was not in Squeak... It is now for testing in 1.2.2a and 1.3

Marcus

--
Marcus Denker -- http://www.marcusdenker.de
INRIA Lille -- Nord Europe. Team RMoD.
In reply to this post by Chris Muller-4
On 17.04.2011 22:48, Chris Muller wrote:
> However, when the test-suite gets to the HA test cases (at
> the end), one of the images performing heavy networking activity,
> consistently gets very slow and bogged down for some reason; causing
> the clients to timeout and disrupting the test suite.

Which VM did you run these tests on?

IIRC, Cog has a hard limit on how many external semaphores are available, and each Socket consumes 3 of those. So if you are running on Cog, the problem when under heavy load may be that there simply aren't enough free external semaphores to create enough sockets...

Cheers,
Henry
This is the VM I used:
3.9-7 #1 Sun Feb 6 18:58:21 PST 2011 gcc 4.1.2
Croquet Closure Cog VM [CoInterpreter VMMaker-oscog.47]
Linux mcqfes 2.6.18-128.el5 #1 SMP Wed Jan 21 10:44:23 EST 2009 i686 i686 i386 GNU/Linux
plugin path: /opt/4dst/thirdparty/squeak/lib/squeak/3.9-7/ [default: /opt/4dst/thirdparty/squeak/lib/squeak/3.9-7/]

However, I use this same VM when I run the test in Pharo 1.1.1 and it's solid.

 - Chris

On Mon, Apr 18, 2011 at 3:23 AM, Henrik Sperre Johansen <[hidden email]> wrote:
> Which VM did you run these tests on?
> IIRC, Cog has a hard limit on how many external semaphores are available,
> and each Socket consumes 3 of those.
> So if you are running on Cog, the problem when under heavy load may be that
> there simpy aren't enough free external semaphores to create enough
> sockets...
>
> Cheers,
> Henry
In reply to this post by Marcus Denker-4
> The only change to NetNameResolver was this:
>
> http://code.google.com/p/pharo/issues/detail?id=1853

Reverting this change fixed it.
In reply to this post by Marcus Denker-4
On Apr 18, 2011, at 7:14 PM, Chris Muller wrote:

>> The only change to NetNameResolver was this:
>>
>> http://code.google.com/p/pharo/issues/detail?id=1853
>
> Reverting this change fixed it.

Thanks! I have opened the issue again for 1.2.2 and 1.3.

Marcus

--
Marcus Denker -- http://www.marcusdenker.de
INRIA Lille -- Nord Europe. Team RMoD.
Thank you too; it was a bruising problem and I'm glad to have it identified.

On Mon, Apr 18, 2011 at 12:21 PM, Marcus Denker <[hidden email]> wrote:
> Thanks! I have opend the issue again for 1.2.2 and 1.3
Yes.

Now, do you have a description of how we can reproduce your problem? The fix was fixing something, and it would be good to understand what the deeper problem is.

Stef

On Apr 19, 2011, at 4:20 AM, Chris Muller wrote:
> Thank you too; it was a bruising problem I'm glad to have it identified.
Stef,
What was it fixing? There might be a better solution. I found myself trying to swim in Linux, Pharo and machines with multiple interfaces (wired and wireless) almost simultaneously. I still haven't really figured it out, but *if* #localHostAddress makes sense at all (#localHostAddresses might be a more meaningful message), it should probably raise an error if there is not a unique result. #localHostAddress:ifNone:ifMany: would put the sender in control. For your problem, #localHostOrLoopBackAddress would be another option; at least the sender would be knowingly accepting the "risk" of getting the loopback address.

Bill

________________________________________
From: [hidden email] [[hidden email]] On Behalf Of Stéphane Ducasse [[hidden email]]
Sent: Tuesday, April 19, 2011 3:27 AM
To: [hidden email]; [hidden email]
Cc: magma
Subject: Re: [Pharo-project] Networking change in Pharo 1.2?
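Bill's #localHostAddress:ifNone:ifMany: idea could look roughly like the following. This is only a hedged workspace sketch: the block named pick, the candidate list and the example addresses are all assumptions made for illustration, and none of it is existing NetNameResolver API.

```smalltalk
"Hypothetical sketch: 'pick' plays the role of the proposed selector,
 applied to a list of candidate addresses supplied by the caller."
| pick loopback lan wifi |
loopback := #[127 0 0 1].
lan := #[192 168 1 10].     "made-up example addresses"
wifi := #[192 168 1 11].

pick := [:candidates :noneBlock :manyBlock |
    candidates isEmpty
        ifTrue: [noneBlock value]
        ifFalse: [
            candidates size = 1
                ifTrue: [candidates first]
                ifFalse: [manyBlock value: candidates]]].

"No interface up: the sender knowingly falls back to the loopback address."
pick value: #() value: [loopback] value: [:all | all first].

"Exactly one candidate: it is answered directly."
pick value: (Array with: lan) value: [loopback] value: [:all | all first].

"Several candidates: the sender decides, here by taking the first one."
pick value: (Array with: lan with: wifi) value: [loopback] value: [:all | all first].
```

Raising an error in the last block, instead of answering the first candidate, would give the stricter behaviour Bill describes for the ambiguous case.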
In reply to this post by Stéphane Ducasse
Just bench it:
Before change:

    [ NetNameResolver localHostAddress ] bench    " '34,000 per second.' "

After change:

    [ NetNameResolver localHostAddress ] bench    " '31 per second.' "

In just looking at the reason given for making the change, it says this is to satisfy an _exceptional_ case; e.g., the case where "no network connection is available."

Then I look at the new code called by #localHostAddress and it becomes obvious why:

    isConnected
        "Dirty, but avoids fixing the plugin bug"
        [NetNameResolver addressForName: 'www.esug.org'.] on: NameLookupFailure do: [:ex| ^false].
        ^true

A hard-coded nslookup to 'www.esug.org' wrapped in an exception-handler? Wow!

If this isn't enough, you can run the Magma test-suite to see the effect on a real-world networking application.

I recommend the Pharo crew revert this change and consider a different approach.

 - Chris

On Tue, Apr 19, 2011 at 2:27 AM, Stéphane Ducasse <[hidden email]> wrote:
> yes
> Now do you have a description how we can reproduce your problem because the fix was fixing something.
> and it would be good to understand what is the deeper problem.
> Stef
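As one possible "different approach" (an assumption for illustration, not the fix the Pharo team adopted), the result of an expensive connectivity probe could be remembered for a few seconds so the lookup is not repeated on every call. In the workspace sketch below, probe merely stands in for the DNS lookup, and the names probe, ttlMs, lastCheck and lastResult are hypothetical.

```smalltalk
"Hypothetical sketch: cache the answer of a slow probe for ttlMs milliseconds.
 Only Time and plain blocks are used; nothing here is NetNameResolver API."
| probe calls ttlMs lastCheck lastResult isConnected |
calls := 0.
probe := [calls := calls + 1. true].    "stand-in for the real lookup"
ttlMs := 5000.
lastCheck := nil.

isConnected := [
    | now |
    now := Time millisecondClockValue.
    (lastCheck isNil or: [now - lastCheck > ttlMs])
        ifTrue: [
            lastResult := probe value.
            lastCheck := now].
    lastResult].

1000 timesRepeat: [isConnected value].
calls    "number of times the slow probe actually ran"
```

Provided the loop finishes within the five-second window, calls ends up as 1, so the slow probe is paid once rather than 1000 times. The trade-off is that a change in connectivity can go unnoticed for up to ttlMs milliseconds.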
In reply to this post by Stéphane Ducasse
On Apr 19, 2011, at 5:16 PM, Chris Muller wrote:

> Then I look at the new code called by #localHostAddress and becomes obvious why:
>
> isConnected
>     "Dirty, but avoids fixing the plugin bug"
>     [NetNameResolver addressForName: 'www.esug.org'.] on:
> NameLookupFailure do: [:ex| ^false].
>     ^true
>
> A hard-coded nslookup to 'www.esug.org' wrapped in an exception-handler? Wow!

Ups.... (shamefully looking somewhere else, as I harvested the change...)

We will fix it.

Marcus

--
Marcus Denker -- http://www.marcusdenker.de
INRIA Lille -- Nord Europe. Team RMoD.
In reply to this post by Henrik Sperre Johansen
Hi Henrik,
On Mon, Apr 18, 2011 at 1:23 AM, Henrik Sperre Johansen <[hidden email]> wrote:
> IIRC, Cog has a hard limit on how many external semaphores are available, and each Socket consumes 3 of those.

Not so. The limit is soft. It can be accessed using Smalltalk vmParameterAt: 49. It defaults to 256 entries. It maxes out at 64k entries because the value set via vmParameterAt: 49 put: X persists in a short in the image header. I expect 20k sockets to be sufficient for a while, right?

> So if you are running on Cog, the problem when under heavy load may be that there simpy aren't enough free external semaphores to create enough sockets...
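To put numbers on the figures above, here is a small workspace sketch. It assumes Henry's figure of 3 semaphores per Socket, that the "short" is an unsigned 16-bit value (maximum 65535), and that nothing else uses the table, so the results are rough upper bounds only. Parameter 49 is the one Eliot names; on a non-Cog VM it may be absent or mean something else.

```smalltalk
"Rough socket ceilings implied by the table size (3 semaphores per Socket assumed)."
256 // 3.      "default table: 85"
65535 // 3.    "largest unsigned 16-bit value: 21845"

"Reading the current limit and the ceiling it implies on this VM."
(Smalltalk vmParameterAt: 49) // 3.

"Raising the limit; 1024 is an arbitrary illustrative value."
Smalltalk vmParameterAt: 49 put: 1024.
```

The 21845 figure is consistent with Eliot's estimate of about 20k sockets.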
In reply to this post by Chris Muller-3
Ok thanks for the analysis.
We should really be able to collect such information and build regression tests. Right now we have tests checking for simple behavior, but I would like to capture the regression you spotted. I do not know what the "avoids fixing the plugin bug" was, but I would like to know that too.

Stef

> Just bench it:
>
> Before change:
>
> [ NetNameResolver localHostAddress ] bench " '34,000 per second.' "
>
> After change:
>
> [ NetNameResolver localHostAddress ] bench " '31 per second.' "
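A regression test for the slowdown Chris measured might be sketched as below. The class name, the category and the one-second budget are assumptions for illustration. At 34,000 calls per second, 100 calls take about 3 ms; at 31 per second they take about 3.2 seconds, so a one-second budget separates the two cases.

```smalltalk
"Sketch only: class name and category are made up for this example."
| testClass |
testClass := TestCase subclass: #NetNameResolverSpeedTest
    instanceVariableNames: ''
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Network-Tests-Sketch'.

"Install one test method; the source is given as a string so the whole
 snippet can be evaluated in a workspace."
testClass
    compile: 'testLocalHostAddressIsCheap
    "100 calls should fit comfortably inside a one second budget."
    | elapsed |
    elapsed := Time millisecondsToRun: [
        100 timesRepeat: [NetNameResolver localHostAddress]].
    self assert: elapsed < 1000'
    classified: 'tests'.

"Run it."
(testClass selector: #testLocalHostAddressIsCheap) run.
```

Like any timing-based test it can fail on a heavily loaded machine, so the budget would need tuning before it went into a build.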
In reply to this post by Marcus Denker-4
> Ups.... (shamefuly looking somewhere else, as I harvested the change...)
No problem Marcus. I think it would be difficult to do a better job than what you do.

Cheers,
Alexandre
--
_,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:
Alexandre Bergel http://www.bergel.eu
^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;.
+1
but we should write some specific tests.

On Apr 19, 2011, at 10:40 PM, Alexandre Bergel wrote:
> No problem Marcus. I think it would be difficult to do a better job than what you do.
In reply to this post by Eliot Miranda-2
On 19.04.2011 20:19, Eliot Miranda wrote:
> Hi Henrik,

Ah, absolutely :D That's what I get for skimming readmes. I think it'd be good to update the comment, though? No specific mention is made that it can be set after startup (although that is frowned upon), or how to do so. Currently it reads:

"Another significant change is in the external semaphore table support code. This is now lock-free at the cost of having to specify a maximum number of external semaphores at start-up (default 256)."

I guess having it accessible from the image is one interpretation of that line; personally I thought it meant that you could use some parameter when launching the executable :)

Also, it's currently possible to register more than this limit in current images (Smalltalk registerExternalObject:) without an error. Am I correct in my reading of the code that when this happens, they will never be signaled? If so, we'd probably want to make some changes to ExternalSemaphoreTable :)

Cheers,
Henry
On 20.04.2011 01:13, Henrik Sperre Johansen wrote:
Eheh, I guess so. I did it as part of testing and forgot about it, then when I wanted to publish the slice, I got stuck on a Socket semaphore which never signaled, and had to wait the entire timeout period before it proceeded. :)

Cheers,
Henry