Magma has been stable in 5.2 for a long time under an older VM, all tests pass. But by changing ONLY the VM (not the image) to the new release-candidate, it fails the forward-recovery test. This test tests the scenario of a server failure during mid-write. Unless I change StandardFileStream>>#flush as in Files-cmm.182, the recovery data which Magma relies on #flush to ensure is preserved is, in fact, not preserved. It appears to be a breakage of the contract which causes the test to fail. This functionality is important to avoid corrupting databases.
I saw a discussion on the Cuis list in which someone was asserting that flush is no longer necessary(!!), and made a vague reference to a "thread on squeak-dev" which I never found. I hope this is just an oversight, otherwise I'll have to rely something like Files-cmm.182, which is half the speed of the old #flush. Best, Chris |
Hi Chris,
Do you expect #flush to write the changes to disk? Levente On Sun, 12 Jan 2020, Chris Muller wrote: > Magma has been stable in 5.2 for a long time under an older VM, all tests pass. But by changing ONLY the VM (not the image) to the new release-candidate, it fails the forward-recovery test. This test tests the scenario of a > server failure during mid-write. Unless I change StandardFileStream>>#flush as in Files-cmm.182, the recovery data which Magma relies on #flush to ensure is preserved is, in fact, not preserved. It appears to be a breakage > of the contract which causes the test to fail. This functionality is important to avoid corrupting databases. > I saw a discussion on the Cuis list in which someone was asserting that flush is no longer necessary(!!), and made a vague reference to a "thread on squeak-dev" which I never found. > > I hope this is just an oversight, otherwise I'll have to rely something like Files-cmm.182, which is half the speed of the old #flush. > > Best, > Chris > > |
A wise plumber once taught me that flushing harder is almost never
the right thing to do ;-) On a Unix (OS X) VM, #flush calls fflush() in the C runtime library. This flushes the stream data out of your program data space (the VM) but it may or may not write it out to the storage medium. In addition to #flush, we also have #sync which calls fsync(). This is a lower level system call on Linux (and probably other unices such as OS X), and it performs a "flush" out to the actual storage medium. You will definitely want to read the man pages (man 3 fflush, and man 2 fsync) on whatever system you are using to get a better description of the differences. With respect to VM changes, there is very little chance that the VM (FilePlugin actually) has changed recently. But is it certainly possible that the compilers and runtime libraries have changed over the last year or so, and it is possible that the actual runtime behavior of fflush() will be different as a result. If so, I would not regard that as a VM problem per se, it is really going be a difference in the runtime environment. As a point of comparison, I find that my CommandShell has accumulated a few bugs over the years as a result of differences in the runtime environments on Linux and other Unix systems. Some of these I have addressed, and some I haven't. But it is not a defect in the VM, more likely it is some poor assumptions in my original implementation that became visible as the runtime C library changed. HTH, Dave On Mon, Jan 13, 2020 at 01:15:15AM +0100, Levente Uzonyi wrote: > Hi Chris, > > Do you expect #flush to write the changes to disk? > > Levente > > On Sun, 12 Jan 2020, Chris Muller wrote: > > >Magma has been stable in 5.2 for a long time under an older VM, all tests > >pass.?? But by changing ONLY the VM (not the image) to the new > >release-candidate, it fails the forward-recovery test.?? This test tests > >the scenario of a > >server failure during mid-write.?? Unless I change > >StandardFileStream>>#flush as in Files-cmm.182,??the recovery data which > >Magma relies on #flush to ensure is preserved is, in fact, not > >preserved.?? It appears to be a breakage > >of the contract which causes the test to fail.?? This functionality is > >important to avoid corrupting databases. > >I saw a discussion on the Cuis list in which someone was asserting that > >flush is no longer necessary(!!), and made a vague reference to a "thread > >on squeak-dev" which I never found. > > > >I hope this is just an oversight, otherwise I'll have to rely something > >like Files-cmm.182, which is half the speed of the old #flush. > > > >Best, > >?? Chris > > > > > |
In reply to this post by Levente Uzonyi
The test case proves that it does. The comment of the method is, "When writing, flush the current buffer out to disk." I know what filesystem I deploy to, but Squeak appears to be silently ignoring this (rather important) expectation about #flush, that its own comment presents. - Chris On Sun, Jan 12, 2020 at 6:15 PM Levente Uzonyi <[hidden email]> wrote: Hi Chris, |
In reply to this post by David T. Lewis
Hi Dave, With respect to VM changes, there is very little chance that the I do hope you're right, but when you say, the "runtime environment", do you mean statically-linked stuff? Because, in terms of the dynamic linkages, the test is running both times in the same environment. If I use the old VM, it works. If I use the new VM, I have to change #flush or it won't work. - Chris
|
In reply to this post by Chris Muller-4
Hi Chris, On Mon, Jan 13, 2020 at 12:44 PM Chris Muller <[hidden email]> wrote:
We're talking linux right? So the only change I see since Alistair's commit in 2018 is Nicolas' change that has to be harmless: Aeolus.oscogvm$ git log platforms/Cross/plugins/FilePlugin/sqFilePluginBasicPrims.c commit e0d70a91864eca8f72947fe36d1865785482ed2e Author: Nicolas Cellier <[hidden email]> Date: Thu Dec 27 12:10:34 2018 +0100 Workaround S_ISFIFO to let MSVC compile FilePlugin Note: there is a possibility to create named pipe in windows But named pipes cannot be intermixed with regular files. They are mounted on special named pipe file system (NPFS) So I think that answering false to the query is a good solution. commit 827ebe0ffbbf56abdee94fbdd519f00f7ddc0a27 Author: AlistairGrant <[hidden email]> Date: Thu Nov 22 20:58:04 2018 +0100 sqFilePluginBasicPrims.c: remove unused variable position and Nicolas' change is only this: diff --git a/platforms/Cross/plugins/FilePlugin/sqFilePluginBasicPrims.c b/platforms/Cross/plugins/FilePlugin/sqFilePluginBasicPrims.c index 0f2d6fe..df604fb 100755 --- a/platforms/Cross/plugins/FilePlugin/sqFilePluginBasicPrims.c +++ b/platforms/Cross/plugins/FilePlugin/sqFilePluginBasicPrims.c @@ -40,6 +40,12 @@ #include <limits.h> #include <stdio.h> +#ifdef _MSC_VER +#ifndef S_ISFIFO +#define S_ISFIFO(x) 0 +#endif +#endif + #include "sqMemoryAccess.h" #include "FilePlugin.h" /* must be included after sq.h */ So I don't see where to look for a change in flush behavior. AFAICT there hasn't been one.
_,,,^..^,,,_ best, Eliot |
In reply to this post by Chris Muller-3
On Mon, Jan 13, 2020 at 12:53 PM Chris Muller <[hidden email]> wrote:
Can you provide the version info for both please?
_,,,^..^,,,_ best, Eliot |
Hi Eliot, > We're talking linux right? Yes. > So the only change I see since Alistair's commit in 2018 is Nicolas' change that has to be harmless: Thanks for checking those recent commits, I was hoping we'd be able to pinpoint an obvious cause. Hm, so it could be something else with FileStream besides #flush that's causing the test to fail.. Hm.
The old I'm testing with is: /usr/local/lib/squeak/5.0-201811272342/squeak Unix built on Nov 27 2018 23:46:03 Compiler: 4.8.4 platform sources revision VM: 201811272342 https://github.com/OpenSmalltalk/opensmalltalk-vm.git Date: Tue Nov 27 15:42:06 2018 CommitHash: 476f7060 While the new one is the one Fabio posted the other day. platform sources revision VM: 201912311458 https://github.com/OpenSmalltalk/opensmalltalk-vm.git Date: Tue Dec 31 15:58:24 2019 CommitHash: f219b72 The issue is easily recreatable by running Magma's test suite, but the failing test is complex and exercises the full DB stack, so I wouldn't expect you to try to debug anything that way. With the new VM, I just tried a manual StandardFileStream in an inspector, did a #nextPutAll:, and verified the file contents in another window weren't there until after I did #flush. So, it does appear that flush worked at least in that simple scenario. I may have to dig deeper into the failing test to see if it's a different issue -- now I'm suspecting extra "buffer" bytes on the end of hte file or something which is throwing off the computed hash... I'll try digging deeper and see if I can better isolate the problem into something you can deal with.. Best, Chris |
In reply to this post by Chris Muller-4
Hi Chris,
#flush calls fflush(), #sync calls #fflush() and then #fsync(). The former does not write data to disk, the latter does. And the latter is obviously a lot slower. If that explanation is not clear, this might be better: https://stackoverflow.com/a/2340641 On Mon, 13 Jan 2020, Chris Muller wrote: > The test case proves that it does. Can you give us a small snippet which reproduces the bug and can be executed in a Trunk image? > > The comment of the method is, > > "When writing, flush the current buffer out to disk." That comment is wrong. That might have been true 21 years ago, even though that's not very likely either. The comment on StandardFileStream >> #flush is correct, but could be more verbose: "Flush pending changes". Levente > > I know what filesystem I deploy to, but Squeak appears to be silently ignoring this (rather important) expectation about #flush, that its own > comment presents. > > - Chris > > > > On Sun, Jan 12, 2020 at 6:15 PM Levente Uzonyi <[hidden email]> wrote: > Hi Chris, > > Do you expect #flush to write the changes to disk? > > Levente > > On Sun, 12 Jan 2020, Chris Muller wrote: > > > Magma has been stable in 5.2 for a long time under an older VM, all tests pass. But by changing ONLY the VM (not the image) to the > new release-candidate, it fails the forward-recovery test. This test tests the scenario of a > > server failure during mid-write. Unless I change StandardFileStream>>#flush as in Files-cmm.182, the recovery data which Magma > relies on #flush to ensure is preserved is, in fact, not preserved. It appears to be a breakage > > of the contract which causes the test to fail. This functionality is important to avoid corrupting databases. > > I saw a discussion on the Cuis list in which someone was asserting that flush is no longer necessary(!!), and made a vague reference > to a "thread on squeak-dev" which I never found. > > > > I hope this is just an oversight, otherwise I'll have to rely something like Files-cmm.182, which is half the speed of the old > #flush. > > > > Best, > > Chris > > > > > > > |
In reply to this post by Chris Muller-3
Chris, FWIW, if it's the thread I think it is, you got the wrong impression from the discussion. The assertion was that the flush/close/reopen dance on the changes file was no longer needed, *not* that flush was no longer necessary. The flush was the part that we kept! (i.e. the argument was that it's no longer the 90's and that we should trust flush to behave as expected) Thanks, Phil On Sun, Jan 12, 2020 at 6:18 PM Chris Muller <[hidden email]> wrote:
|
In reply to this post by Levente Uzonyi
Oh, okay, #sync does what I *thought* #flush did. I would never discover some of these things if you weren't so generous with your expertise, thanks. I think #sync is what I want since I want it to be safe from sudden power outage, not just a process kill. I did try changing it to #sync, but it still failed the same. I then went back and tried my #reopen a second time, this time it failed in the same way too! So either it got lucky before, or I err'd in my testing somehow, but I'm actually glad that it seems to be failing consistently now. There's definitely something different with the new VM, but it's not flush. Sorry for the false alarm. I'll have to keep digging. - Chris On Mon, Jan 13, 2020 at 4:39 PM Levente Uzonyi <[hidden email]> wrote: Hi Chris, |
In reply to this post by Phil B
I did indeed have the wrong idea from only having skimmed that thread, thanks for clarifying. I'm glad to know flush wasn't on the chopping block afterall. :) - Chris On Mon, Jan 13, 2020 at 4:44 PM Phil B <[hidden email]> wrote:
|
In reply to this post by Chris Muller-3
> I think #sync is what I want since I want it to be safe from sudden power outage, not just a process kill. Wow, perhaps not, this is really hammering the HD, solid busy light, on an SSD! On Mon, Jan 13, 2020 at 10:56 PM Chris Muller <[hidden email]> wrote:
|
In reply to this post by Chris Muller-3
Has anything changed in the VM related to process scheduling? Assuming you're doing your data store writes and transaction logging via processes, that would be one of the first places I'd look. On Mon, Jan 13, 2020 at 11:57 PM Chris Muller <[hidden email]> wrote:
|
On Jan 14, 2020, at 12:07 AM, Phil B <[hidden email]> wrote:
|
A random thought without proper understanding of the problem: could the problem or change in behavior be related to the linux build recently switching to clang rather than GCC (at least I believe it used to be GCC)? These lowlevel primitives would likely be in the libc that the compiler links to, no?
Best, Tom ________________________________________ From: Squeak-dev <[hidden email]> on behalf of Eliot Miranda <[hidden email]> Sent: Tuesday, January 14, 2020 9:37:03 AM To: The general-purpose Squeak developers list Subject: Re: [squeak-dev] new VM appears not to be flushing On Jan 14, 2020, at 12:07 AM, Phil B <[hidden email]> wrote: Has anything changed in the VM related to process scheduling? Assuming you're doing your data store writes and transaction logging via processes, that would be one of the first places I'd look. No, the scheduler/interrupt code is unchanged. Let’s let Chris stabilize his tests and then find out if and by which versions of the vm they are affected by. I’m not convinced there’s a signal there yet. On Mon, Jan 13, 2020 at 11:57 PM Chris Muller <[hidden email]<mailto:[hidden email]>> wrote: Oh, okay, #sync does what I *thought* #flush did. I would never discover some of these things if you weren't so generous with your expertise, thanks. I think #sync is what I want since I want it to be safe from sudden power outage, not just a process kill. I did try changing it to #sync, but it still failed the same. I then went back and tried my #reopen a second time, this time it failed in the same way too! So either it got lucky before, or I err'd in my testing somehow, but I'm actually glad that it seems to be failing consistently now. There's definitely something different with the new VM, but it's not flush. Sorry for the false alarm. I'll have to keep digging. - Chris On Mon, Jan 13, 2020 at 4:39 PM Levente Uzonyi <[hidden email]<mailto:[hidden email]>> wrote: Hi Chris, #flush calls fflush(), #sync calls #fflush() and then #fsync(). The former does not write data to disk, the latter does. And the latter is obviously a lot slower. If that explanation is not clear, this might be better: https://stackoverflow.com/a/2340641 On Mon, 13 Jan 2020, Chris Muller wrote: > The test case proves that it does. Can you give us a small snippet which reproduces the bug and can be executed in a Trunk image? > > The comment of the method is, > > "When writing, flush the current buffer out to disk." That comment is wrong. That might have been true 21 years ago, even though that's not very likely either. The comment on StandardFileStream >> #flush is correct, but could be more verbose: "Flush pending changes". Levente > > I know what filesystem I deploy to, but Squeak appears to be silently ignoring this (rather important) expectation about #flush, that its own > comment presents. > > - Chris > > > > On Sun, Jan 12, 2020 at 6:15 PM Levente Uzonyi <[hidden email]<mailto:[hidden email]>> wrote: > Hi Chris, > > Do you expect #flush to write the changes to disk? > > Levente > > On Sun, 12 Jan 2020, Chris Muller wrote: > > > Magma has been stable in 5.2 for a long time under an older VM, all tests pass. But by changing ONLY the VM (not the image) to the > new release-candidate, it fails the forward-recovery test. This test tests the scenario of a > > server failure during mid-write. Unless I change StandardFileStream>>#flush as in Files-cmm.182, the recovery data which Magma > relies on #flush to ensure is preserved is, in fact, not preserved. It appears to be a breakage > > of the contract which causes the test to fail. This functionality is important to avoid corrupting databases. > > I saw a discussion on the Cuis list in which someone was asserting that flush is no longer necessary(!!), and made a vague reference > to a "thread on squeak-dev" which I never found. > > > > I hope this is just an oversight, otherwise I'll have to rely something like Files-cmm.182, which is half the speed of the old > #flush. > > > > Best, > > Chris > > > > > > > |
After a two-day rabbit-hole adventure, I finally found the cause of the failing test. I'll spare myself having to type the long story and you having to see it and just say, it came down to NetNameResolver localHostName having gotten fixed in the VM, and reporting the hostname in the new VM, instead of the IP string, as in the old. This caused the test to take a different path which I've now accounted for. The Magma suite now passes in 5.2 with the new VM. Now I can begin debugging the issues it's having in Squeak 5.3! :/ Thanks again for all y'all's help. - Chris On Tue, Jan 14, 2020 at 2:38 AM Beckmann, Tom <[hidden email]> wrote: A random thought without proper understanding of the problem: could the problem or change in behavior be related to the linux build recently switching to clang rather than GCC (at least I believe it used to be GCC)? These lowlevel primitives would likely be in the libc that the compiler links to, no? |
Thanks for the follow up note. Flushing streams in conjunction with
multiple process synchronization can be tricky to debug, so I'm happy to hear that the issue was not as it first appeared. Dave On Wed, Jan 15, 2020 at 06:04:04PM -0600, Chris Muller wrote: > After a two-day rabbit-hole adventure, I finally found the cause of the > failing test. I'll spare myself having to type the long story and you > having to see it and just say, it came down to > > NetNameResolver localHostName > > having gotten fixed in the VM, and reporting the hostname in the new VM, > instead of the IP string, as in the old. This caused the test to take a > different path which I've now accounted for. The Magma suite now passes in > 5.2 with the new VM. Now I can begin debugging the issues it's having in > Squeak 5.3! :/ > > Thanks again for all y'all's help. > > - Chris > > On Tue, Jan 14, 2020 at 2:38 AM Beckmann, Tom < > [hidden email]> wrote: > > > A random thought without proper understanding of the problem: could the > > problem or change in behavior be related to the linux build recently > > switching to clang rather than GCC (at least I believe it used to be GCC)? > > These lowlevel primitives would likely be in the libc that the compiler > > links to, no? > > > > Best, > > Tom > > ________________________________________ > > From: Squeak-dev <[hidden email]> on > > behalf of Eliot Miranda <[hidden email]> > > Sent: Tuesday, January 14, 2020 9:37:03 AM > > To: The general-purpose Squeak developers list > > Subject: Re: [squeak-dev] new VM appears not to be flushing > > > > On Jan 14, 2020, at 12:07 AM, Phil B <[hidden email]> wrote: > > > > ??? > > Has anything changed in the VM related to process scheduling? Assuming > > you're doing your data store writes and transaction logging via processes, > > that would be one of the first places I'd look. > > > > No, the scheduler/interrupt code is unchanged. Let???s let Chris stabilize > > his tests and then find out if and by which versions of the vm they are > > affected by. I???m not convinced there???s a signal there yet. > > > > On Mon, Jan 13, 2020 at 11:57 PM Chris Muller <[hidden email]<mailto: > > [hidden email]>> wrote: > > Oh, okay, #sync does what I *thought* #flush did. I would never discover > > some of these things if you weren't so generous with your expertise, > > thanks. I think #sync is what I want since I want it to be safe from > > sudden power outage, not just a process kill. > > > > I did try changing it to #sync, but it still failed the same. I then went > > back and tried my #reopen a second time, this time it failed in the same > > way too! So either it got lucky before, or I err'd in my testing somehow, > > but I'm actually glad that it seems to be failing consistently now. > > > > There's definitely something different with the new VM, but it's not > > flush. Sorry for the false alarm. I'll have to keep digging. > > > > - Chris > > > > On Mon, Jan 13, 2020 at 4:39 PM Levente Uzonyi <[hidden email] > > <mailto:[hidden email]>> wrote: > > Hi Chris, > > > > #flush calls fflush(), #sync calls #fflush() and then #fsync(). > > The former does not write data to disk, the latter does. And the latter is > > obviously a lot slower. If that explanation is not clear, this might be > > better: https://stackoverflow.com/a/2340641 > > > > On Mon, 13 Jan 2020, Chris Muller wrote: > > > > > The test case proves that it does. > > > > Can you give us a small snippet which reproduces the bug and can be > > executed in a Trunk image? > > > > > > > > The comment of the method is, > > > > > > "When writing, flush the current buffer out to disk." > > > > That comment is wrong. That might have been true 21 years ago, even > > though that's not very likely either. > > The comment on StandardFileStream >> #flush is correct, but could be more > > verbose: "Flush pending changes". > > > > > > Levente > > > > > > > > I know what filesystem I deploy to, but Squeak appears to be silently > > ignoring this (rather important) expectation about #flush, that its own > > > comment presents. > > > > > > - Chris > > > > > > > > > > > > On Sun, Jan 12, 2020 at 6:15 PM Levente Uzonyi <[hidden email] > > <mailto:[hidden email]>> wrote: > > > Hi Chris, > > > > > > Do you expect #flush to write the changes to disk? > > > > > > Levente > > > > > > On Sun, 12 Jan 2020, Chris Muller wrote: > > > > > > > Magma has been stable in 5.2 for a long time under an older VM, > > all tests pass. But by changing ONLY the VM (not the image) to the > > > new release-candidate, it fails the forward-recovery test. This > > test tests the scenario of a > > > > server failure during mid-write. Unless I change > > StandardFileStream>>#flush as in Files-cmm.182, the recovery data which > > Magma > > > relies on #flush to ensure is preserved is, in fact, not > > preserved. It appears to be a breakage > > > > of the contract which causes the test to fail. This > > functionality is important to avoid corrupting databases. > > > > I saw a discussion on the Cuis list in which someone was > > asserting that flush is no longer necessary(!!), and made a vague reference > > > to a "thread on squeak-dev" which I never found. > > > > > > > > I hope this is just an oversight, otherwise I'll have to rely > > something like Files-cmm.182, which is half the speed of the old > > > #flush. > > > > > > > > Best, > > > > Chris > > > > > > > > > > > > > > > > > > > > > > > > > > |
Free forum by Nabble | Edit this page |