This seems to be happening a lot. Is there something that can be done to
alleviate the problem?

thanks,
Rob
2007/12/22, Rob Withers <[hidden email]>:
> This seems to be happening a lot. Is there something that can be done to
> alleviate the problem?

Yeah, fix the VM.

Cheers
Philippe
Philippe Marschall wrote:
> 2007/12/22, Rob Withers <[hidden email]>:
>> This seems to be happening a lot. Is there something that can be done to
>> alleviate the problem?
>
> Yeah, fix the VM.

What do you think is broken to cause those problems?

Cheers,
  - Andreas
2007/12/23, Andreas Raab <[hidden email]>:
> Philippe Marschall wrote:
> > 2007/12/22, Rob Withers <[hidden email]>:
> >> This seems to be happening a lot. Is there something that can be done to
> >> alleviate the problem?
> >
> > Yeah, fix the VM.
>
> What do you think is broken to cause those problems?

Basically the stuff that made you choose Gemstone over Squeak.
Semaphores for example (ok, not actually the VM per se); other usual
suspects are the scheduler, Sockets and the GC. See the stack trace
Lukas sent earlier. It's also not uncommon to have hundreds of
processes hanging on the same Semaphore >> #critical:. The block has
terminated but the semaphore doesn't get released; it's not as if we
do something fancy like terminating a process. And sometimes the image
simply freezes and stops reacting. An easy way to make use of the
second CPU wouldn't hurt either.

Honestly, I'm sick of searching for which exact combination of patches
I have to apply to which image, and which GC tweaking I have to apply
to which VM with which patches collected from some posts or forks. Is
it asking too much that this stuff simply works?

Cheers
Philippe
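For readers who want to confirm the symptom Philippe describes, a doIt
along these lines (a rough sketch, not the diagnostic Andreas refers to
later in the thread) tallies how many processes are currently parked on
each Semaphore:

    "Group blocked processes by the semaphore they wait on; a long run
     of waiters on one semaphore is the 'hundreds of processes hanging
     on the same #critical:' symptom."
    | waiters |
    waiters := Process allInstances select:
        [:each | each suspendingList isKindOf: Semaphore].
    (waiters collect: [:each | each suspendingList]) asBag sortedCounts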
Most of those issues are not bugs of the VM but of code in the image.
Adrian

On Dec 23, 2007, at 12:14, Philippe Marschall wrote:
> Semaphores for example (ok, not actually the VM per se); other usual
> suspects are the scheduler, Sockets and the GC. [...]
it would be a good case for a bounty. I'm sure ESUG would participate.
Stef

On 23 déc. 07, at 12:35, Adrian Lienhard wrote:
> Most of those issues are not bugs of the VM but of code in the image.
> [...]
In reply to this post by Philippe Marschall
Philippe Marschall wrote:
> Basically the stuff that made you choose Gemstone over Squeak.

I don't think you understand what made me choose Gemstone. There were
really two reasons for it: The first one is that Squeaksource doesn't
have a viable database solution for our loads. Gemstone does, and it
works great. But the second one is just as important: Gemstone is a
vendor, a company that I can turn to if anything goes wrong and ask to
fix it in return for money. Given that we're ramping up on people, the
latter is perhaps more important than the former, since I don't know
exactly how well Gemstone scales - but I *do* know that if we outgrow
the box we're using now I can ask them to help us fix it.

Cheers,
  - Andreas
In reply to this post by Philippe Marschall
Philippe Marschall wrote:
> 2007/12/22, Rob Withers <[hidden email]>:
>> This seems to be happening a lot. Is there something that can be done to
>> alleviate the problem?
>
> Yeah, fix the VM.

How about stepping out of the reality distortion field and fixing
SqueakSource so it actually scales and is production quality?

Michael
2007/12/23, Michael Rueger <[hidden email]>:
> Philippe Marschall wrote:
> > 2007/12/22, Rob Withers <[hidden email]>:
> >> This seems to be happening a lot. Is there something that can be done to
> >> alleviate the problem?
> >
> > Yeah, fix the VM.
>
> How about stepping out of the reality distortion field and fixing
> SqueakSource so it actually scales and is production quality?

So which parts do we need to fix to make the Semaphore, Socket and
image freezing problems go away?

As for scaling and production quality: do you seriously expect me to do
this for free in my spare time?

We fixed the performance problems and now run seriously faster than
source.impara.de while being much bigger.

Philippe
Philippe Marschall wrote:
> So which parts do we need to fix to make the Semaphore, Socket and
> image freezing problems go away?

For semaphores I'd recommend the fixes that I've posted over the year.
For sockets I am not aware of any evidence that indicates a socket
issue (we had a few issues that at first looked socket-related but
turned out not to be), but I'd like to hear any evidence that points to
sockets as the cause of problems. As far as I can tell the socket
implementation is very robust right now. For image freezes - in
particular in Squeaksource - you probably need to fix the concurrency
issues in Squeaksource itself. The last time I checked, the code was
not robust enough by far against concurrent modifications (parallel
commits etc).

> As for scaling and production quality: do you seriously expect me to
> do this for free in my spare time?

That depends on whether or not you seriously expect, for example, the
VM people to fix the VM problems in their spare time for free. If the
answer is yes, then the answer is yes.

> We fixed the performance problems and now run seriously faster than
> source.impara.de while being much bigger.

That's great to hear. I wish you would have told me a couple of months
ago how to achieve that, when I was asking (repeatedly) the same
questions.

Cheers,
  - Andreas
Well in someone's spare time someone might review the page list below
and Rob Withers' comments at

http://lists.squeakfoundation.org/pipermail/squeak-dev/2000-July/021307.html

original code at

http://www.smalltalkconsulting.com/html/OTNotes4.html

Grab the socket test suite and rework it for the latest socket
implementation. This suite was used by Ian and me a few years back to
beat on the Unix socket implementation; it also, if I recall, uncovered
a Socket issue in the beta version of NetBSD Ian was using.

On Dec 23, 2007, at 12:08 PM, Andreas Raab wrote:
> For semaphores I'd recommend the fixes that I've posted over the year.
> [...]

--
========================================================================
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
========================================================================
In reply to this post by Andreas.Raab
> > So which parts do we need to fix to make the Semaphore, Socket and
> > image freezing problems go away?
>
> For semaphores I'd recommend the fixes that I've posted over the year.

I loaded all your semaphore related patches a couple of months ago and
squeaksource.com ran quietly and happily up to a few weeks ago. Then
suddenly we got many processes hanging in Semaphore>>#critical:.

> For image freezes - in particular in Squeaksource - you probably need
> to fix the concurrency issues in Squeaksource itself.

What kind of concurrency issues in squeaksource.com itself could cause
these problems? I know that the code is far from perfect, but I must
also point out that we didn't lose a single one of the more than 71'000
versions during the past 4 years. We also never experienced a corrupted
data model.

I wonder how it can happen that semaphores suddenly become blocked?
Might this be related to an image save happening while we are within a
critical section?

Cheers,
Lukas

--
Lukas Renggli
http://www.lukas-renggli.ch
Lukas Renggli wrote:
>>> So which parts do we need to fix to make the Semaphore, Socket and
>>> image freezing problems go away?
>> For semaphores I'd recommend the fixes that I've posted over the year.
>
> I loaded all your semaphore related patches a couple of months ago and
> squeaksource.com ran quietly and happily up to a few weeks ago. Then
> suddenly we got many processes hanging in Semaphore>>#critical:.

If you could send a couple of complete stack dumps from the affected
image, that might be interesting. There is a possibility you were
affected by the problem of primitiveSuspend (which we discussed
earlier), but that's difficult to tell from a stack dump. It is much
easier if you can go into the image and check whether the doIt I sent
comes up empty or not.

>> For image freezes - in particular in Squeaksource - you probably need
>> to fix the concurrency issues in Squeaksource itself.
>
> What kind of concurrency issues in squeaksource.com itself could cause
> these problems? I know that the code is far from perfect, but I must
> also point out that we didn't lose a single one of the more than 71'000
> versions during the past 4 years. We also never experienced a corrupted
> data model.

What we've experienced was basically that after the first commit, when
our image went to saving the data model in a reference stream (via
SSFileSystem; takes about two minutes or so), a second commit would
wreak havoc on the system. You can probably simulate this by generating
enough load from different clients on the network, with or without
SSFileSystem. And I don't like the idea of saving the image very much,
because it's probably not feasible to save multiple versions of that
image, which ultimately means that any data corruption kills the whole
data model.

> I wonder how it can happen that semaphores suddenly become blocked?
> Might this be related to an image save happening while we are within a
> critical section?

Interesting thought. It may be possible for some strange things to
happen if Seaside doesn't take precautions against accepting
connections while in the midst of a save. The problem is that the image
save/startup runs with whatever priority it's being issued at, so if
there's another process running at the same time, there is a chance
this process interrupts the image save, with the potential for strange
things happening. Here is one way in which I could see this happening:
a critical lock is held by a process waiting for network traffic to
occur when the image is saved. When the image is restored later on,
that socket is no longer valid, but the process could still wait on the
semaphore, blocking the critical section for all other users.

Cheers,
  - Andreas
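The scenario in that last paragraph can be reduced to a few lines. In
the sketch below, readSemaphore merely stands in for the semaphore a
Socket registers with the VM; this is an illustration of the failure
mode, not code from any of the systems being discussed:

    | lock readSemaphore |
    lock := Semaphore forMutualExclusion.
    readSemaphore := Semaphore new.  "stand-in for a socket's read semaphore"

    "The worker takes the section lock, then blocks waiting for network input."
    [lock critical: [readSemaphore wait]] fork.

    "If the image is saved at this point and later restarted, the VM rebuilds
     its external objects array, so nothing ever signals readSemaphore again.
     The worker never leaves #critical:, never signals lock, and every
     subsequent  lock critical: [...]  caller blocks behind it."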
In reply to this post by Lukas Renggli
2007/12/23, Lukas Renggli <[hidden email]>:
> > > So which parts do we need to fix to make the Semaphore, Socket and
> > > image freezing problems go away?
> >
> > For semaphores I'd recommend the fixes that I've posted over the year.
>
> I loaded all your semaphore related patches a couple of months ago and
> squeaksource.com ran quietly and happily up to a few weeks ago. Then
> suddenly we got many processes hanging in Semaphore>>#critical:.
>
> > For image freezes - in particular in Squeaksource - you probably need
> > to fix the concurrency issues in Squeaksource itself.
>
> What kind of concurrency issues in squeaksource.com itself could cause
> these problems?

We have concurrent, unsynchronized writing access to shared data. Until
now we have been very lucky to get away with this without any problems.
It's certainly not the right way to do it.

Cheers
Philippe
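The textbook in-image remedy for this is to funnel every write to the
shared model through one lock. A minimal sketch of the pattern follows;
the variable names are placeholders and none of this is actual
SqueakSource code:

    | modelLock versions |
    modelLock := Semaphore forMutualExclusion.
    versions := OrderedCollection new.

    "Commits arriving on different request-handling processes are now
     serialized instead of interleaving their writes to the shared state."
    [modelLock critical: [versions add: 'Project-author.1.mcz']] fork.
    [modelLock critical: [versions add: 'Project-author.2.mcz']] fork.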
In reply to this post by Andreas.Raab
2007/12/24, Andreas Raab <[hidden email]>:
> Lukas Renggli wrote:
> > What kind of concurrency issues in squeaksource.com itself could cause
> > these problems? I know that the code is far from perfect, but I must
> > also point out that we didn't lose a single one of the more than 71'000
> > versions during the past 4 years. We also never experienced a corrupted
> > data model.
>
> What we've experienced was basically that after the first commit, when
> our image went to saving the data model in a reference stream (via
> SSFileSystem; takes about two minutes or so), a second commit would
> wreak havoc on the system. [...]

We don't use reference streams anymore. We are at the point where it
takes more than 30 minutes to write the model to disk. We only save the
image. We are aware how suboptimal this is, but until now we have been
very lucky to get away with it.

Cheers
Philippe
In reply to this post by Andreas.Raab
2007/12/23, Andreas Raab <[hidden email]>:
> Philippe Marschall wrote:
> > So which parts do we need to fix to make the Semaphore, Socket and
> > image freezing problems go away?
>
> For semaphores I'd recommend the fixes that I've posted over the year.
> [...]
>
> > As for scaling and production quality: do you seriously expect me to
> > do this for free in my spare time?
>
> That depends on whether or not you seriously expect, for example, the
> VM people to fix the VM problems in their spare time for free. If the
> answer is yes, then the answer is yes.

Well, I can honestly say that SqS is not production quality. It has no
serious persistence (at least the main installation, running on Squeak)
and we make no guarantees in this regard. The "storage" leaves several
things to be desired. We do write the .mcz files to disk and back them
up, so there is a limit to the damage a broken image can cause. If you
are uneasy with this, don't use it. It has several stability issues
which we believe are not due to bugs in our code but in the Squeak
kernel/VM. But we never pretended otherwise; we never said there are no
issues. We never said "rock stable, no known bugs for years". If you
ask on this list whether Squeak is production ready, how many of the VM
maintainers are that frank and say no?

> > We fixed the performance problems and now run seriously faster than
> > source.impara.de while being much bigger.
>
> That's great to hear. I wish you would have told me a couple of months
> ago how to achieve that, when I was asking (repeatedly) the same
> questions.

What I was talking about is pure rendering performance. You get this by
loading the latest version; this was true several months ago as it is
now. If you use the Impara fork, well, talk to the Impara guys. From
the description of your problems I got the impression that the issues
you faced had much more to do with "persistence" than the issues we
face (general stability). As for persistence, there is a Magma backend,
which I pointed you at. AFAIK this has seen no action, which I also
mentioned.

Cheers
Philippe
Maybe I should repeat it: since Andreas pointed to Gemstone as a way to
pay for service, maybe it would be time to collect some money to get
someone working on these problems: making SS robust and fixing what
should be fixed in the VM/Kernel. ESUG is really ready to spend money
on that.

Stef

On 24 déc. 07, at 00:36, Philippe Marschall wrote:
> Well, I can honestly say that SqS is not production quality. [...]
In reply to this post by Andreas.Raab
> > I loaded all your semaphore related patches a couple of months ago and
> > squeaksource.com ran quietly and happily up to a few weeks ago. Then
> > suddenly we got many processes hanging in Semaphore>>#critical:.
>
> If you could send a couple of complete stack dumps from the affected
> image, that might be interesting. There is a possibility you were
> affected by the problem of primitiveSuspend (which we discussed
> earlier), but that's difficult to tell from a stack dump. It is much
> easier if you can go into the image and check whether the doIt I sent
> comes up empty or not.

The doIt you sent comes out empty; I've never seen a case where it
actually returned a process. For the stack dumps I've got only the
attached screenshot from the process browser that I took on December 5,
roughly a month after loading your patches.

> What we've experienced was basically that after the first commit, when
> our image went to saving the data model in a reference stream (via
> SSFileSystem; takes about two minutes or so), a second commit would
> wreak havoc on the system. You can probably simulate this by generating
> enough load from different clients on the network, with or without
> SSFileSystem. And I don't like the idea of saving the image very much,
> because it's probably not feasible to save multiple versions of that
> image, which ultimately means that any data corruption kills the whole
> data model.

We save the image every hour, which only takes a couple of seconds. We
also recently fixed some bugs that caused it to block for minutes
afterwards.

> Interesting thought. It may be possible for some strange things to
> happen if Seaside doesn't take precautions against accepting
> connections while in the midst of a save. [...]

Current versions of the Kom server adapter for Seaside stop listening
while saving the image, but I have to check if this is also the case
with the version of Seaside used in squeaksource.com.

Cheers,
Lukas

--
Lukas Renggli
http://www.lukas-renggli.ch

Picture 1.png (17K)
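An hourly snapshot loop of the kind mentioned above can be as small as
the following sketch; the actual squeaksource.com saving code is not
shown in the thread, so this only illustrates the general shape:

    "Save the image once an hour from a low-priority background process."
    [[true] whileTrue: [
        (Delay forSeconds: 3600) wait.
        Smalltalk snapshot: true andQuit: false]]
            forkAt: Processor userBackgroundPriority.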
Lukas Renggli wrote:
> The doIt you sent comes out empty; I've never seen a case where it
> actually returned a process. For the stack dumps I've got only the
> attached screenshot from the process browser that I took on December 5,
> roughly a month after loading your patches.

Unfortunately, the screenshot doesn't show much of interest - the best
thing to do for forensic analysis is to cover *all* processes when the
system is locked up. What we have done in our server VMs is to hook up
the USR1 signal to the VM's printAllStacks() function, so that we can
simply get a stack dump via kill -USR1 <pid_of_squeak>. I don't know if
this is in the standard Unix VMs (I have no idea how portable the code
is), but it's a must-have feature for running a Linux server with
Squeak.

> Current versions of the Kom server adapter for Seaside stop listening
> while saving the image, but I have to check if this is also the case
> with the version of Seaside used in squeaksource.com.

That's a likely cause of problems. Also, you probably need to make sure
all requests are finished before saving the image - otherwise some of
them may rely on network activity to wake up.

Cheers,
  - Andreas
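Where rebuilding the VM with the USR1 hook is not an option, a crude
in-image substitute is to walk every process's context chain and log
it, for example from a watchdog process or an admin URL. A sketch (not
the VM's printAllStacks, and far less reliable when the image is truly
wedged):

    | report context |
    report := String streamContents: [:out |
        Process allInstances do: [:process |
            out nextPutAll: process printString; cr.
            context := process suspendedContext.
            [context isNil] whileFalse: [
                out tab; nextPutAll: context printString; cr.
                context := context sender]]].
    Transcript show: report; cr.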
Sockets register their semaphores in the external object table so that
they can be signaled by the socket plugin. When the image starts after
booting the VM, the external object table is cleared and replaced by a
fresh, empty array. So, if there are any processes left in the saved
image waiting on such semaphores to be signaled via an external event
(such as sockets), they will never wake up, because there is no one who
can signal them.

To get around the problem, at startup we should do something like:

    Socket allInstancesDo: [:s | s signalAndClearSemaphores].

That way any process which was waiting on those semaphores has a chance
to get past the lock and die.

As for a graceful shutdown/startup, to prevent the process saving the
image from being interrupted by a process handling network requests, I
think the best way is to suspend all active processes except the one
saving the image, and then resume them after the save is done. But
lately, when I tried to do this myself, I found that it's not possible
due to a bug in #resume: (see the other discussion about
suspend/resume). So we need to wait until that is fixed, or find a way
around it.

On 24/12/2007, Andreas Raab <[hidden email]> wrote:
> Unfortunately, the screenshot doesn't show much of interest - the best
> thing to do for forensic analysis is to cover *all* processes when the
> system is locked up. [...]

--
Best regards,
Igor Stasenko AKA sig.
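Wiring such a cleanup into image startup could look roughly like the
sketch below. The selector #signalAndClearSemaphores is the one
proposed above and is an assumption; it may not exist under that name
in a stock image, so treat the whole thing as a sketch rather than a
ready-made patch:

    "Register a class whose class-side #startUp runs at every image start."
    Object subclass: #SocketSemaphoreCleanup
        instanceVariableNames: ''
        classVariableNames: ''
        poolDictionaries: ''
        category: 'ServerPatches'.

    (Smalltalk at: #SocketSemaphoreCleanup) class
        compile: 'startUp
            "Unblock processes still waiting on semaphores of sockets that
             died with the previous VM session (selector is an assumption)."
            Socket allInstancesDo: [:each | each signalAndClearSemaphores]'.

    Smalltalk addToStartUpList: (Smalltalk at: #SocketSemaphoreCleanup).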