[OpenSmalltalk/opensmalltalk-vm] Reproduceable Segmentation fault while saving images (#444)

alistairgrant

Re: [OpenSmalltalk/opensmalltalk-vm] Reproduceable Segmentation fault while saving images (#444)

Hi Clément,

On Thu, 28 Nov 2019 at 22:36, Clément Béra <[hidden email]> wrote:
>
> Hi Alistair,
>
> I've just investigated the bug tonight and fixed it in VMMaker.oscog-cb.2595. I compiled a new VM from 2595 and I was able to run the 400 iterations of your script without any crashes.
> Thanks for the easy reproduction! Last year when I used the GC benchmarks provided by Feenk, with ~10GB workloads, for the DLS paper [1], I initially had an image crashing 9 times out of 10
> when going to 10GB. I fixed a few bugs in the production GC back then (mainly in segment management), which led the benchmarks to run successfully 99% of the time. But it was still crashing
> 1% of the time. Since I was benchmarking experimental GCs with various changes, I thought the bug did not happen in the production GC, but it turns out I was wrong. And you found a reliable way to
> reproduce it :-). So I could investigate. It's so much fun to do lemming debugging in the simulator.

We need to thank Juraj here, he was the one who produced the initial
version of the script which made all of this possible.

> The GC bug was basically that when Planning Compactor (Production Full GC compactor) decided to do a multiple pass compaction, if it managed to compact everything in one go then it would
> get confused and attempt to compact objects upward instead of downward (address wise) on the second attempt, and that's broken and corrupts memory.
>
> I started from this script:
>
> | aJson anArray |
> aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl.
> Array streamContents: [ :aStream |
> 400 timesRepeat: [
> aStream nextPutAll: (STON fromString: aJson contents).
> Smalltalk saveSession ] ].
>
>
> It makes me however very sad that you were not able to use the simulator to debug this issue, I used it and that's how I tracked down the bug in only a few hours. Tracking things down in lldb would have taken me weeks, and I would not have been able to do it since I work during the week :-).
>
> Therefore I'm going to explain my process for reproducing the bug in the simulator and understanding where the issue comes from. The mail is quite long, but it would be nice if you could track down such bugs quickly on your own next time using the simulator. Of course you can skip it if you're not interested. @Eliot, you may want to read it since I explain how I set up a Pharo 7 image for simulator debugging; that might come in handy for you at some point.
>
> 1] The first thing I did was to reproduce your bug, based on the script, on both the Cog and Stack VMs compiled from the OpenSmalltalk-VM repository. I initially started with Pharo 8, but for some reason that image is quite broken (formatter issue? Integrator gone wild?).

That was unlucky timing, there was a bad commit made. I think it's
largely tidied up now, still, using the current stable version isn't
necessarily bad :-)

Just for future reference: the first thing I tried was reproducing it
on the Pharo 8 minimal image (I did this before the formatter bug
appeared and kept the same image). The minimal image has a few
advantages:

- It's smaller, 14M vs. 54M, so less memory to keep track of (and the
simulator will be a bit faster)
- It doesn't have FreeType loaded, so that quickly ruled it out as an issue.
- I wasn't sure if there would be other FFI calls, so this just
reduced the chances.

> So I switched to Pharo 7 stable. It crashes on both VMs, so I knew the bug was unrelated to the JIT. Most bugs in the core VM (besides people misusing FFI, which is by far the most common VM bug reported) are either JIT or GC bugs. So we're tracking a GC bug.
> I then built an image which runs your script at start-up (Smalltalk snapshot: true andQuit: true followed by your script, I select all and run do-it).
>
> 2] Then I started the image in the simulator. The first thing I noticed is that Pharo 7 makes FFI calls in FreeType from start-up: even if you're not using text, or if you disable FreeType in the settings browser, Pharo performs FFI calls for FreeType in the background. The FreeType FFI calls are incorrectly implemented (the C stack references heap objects which are not pinned), therefore these calls corrupt the heap. Running a corrupted heap on the VM is undefined behavior, therefore any usage of Pharo 7 right now, whether you actually use text or not, whether FreeType is enabled or not in the settings, is undefined behavior. I saw Nicolas/Eliot complaining in the thread that this is not a VM bug; indeed, pinning objects is an image-side responsibility and it's not a VM bug. In addition, most reported bugs come from people misusing FFI, so I understand their answer. There was however another bug in the GC, but it's very hard for us to debug it if it's hidden behind image-corrupting bugs like the FreeType one here.
> So for that I made that change:
> FreeTypeSettings>>startUp: resuming
> "resuming ifTrue:[ self updateFreeType ]"
> saved, restarted the image, and ensured it was not corrupted (leak checker + swizzling in simulation).
>
> 3] Then I started the image in the simulator. It turns out that image start-up raises an error if libgit cannot be loaded, and then the start-up script is not executed because of the exception. So I made this change:
> LGitLibrary>>startUp: isImageStarting
> "isImageStarting ifTrue: [ self uniqueInstance initializeLibGit2 ]"

Also for future reference, I'm surprised you didn't hit an FFI call
trying to get the current working directory. Making the following
change in OSPlatform removes the FFI call:

currentWorkingDirectoryPathWithBuffer: aByteString
    <primitive: 'primitiveGetCurrentWorkingDirectory' module: 'UnixOSProcessPlugin' error: ec>
    ^ self primitiveFailed

(On Windows you need to use WinOSProcessorPlugin instead.)

> 4] It turns out ZnEasy does not work well in the simulator. So I preloaded the result of this line, aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl., into a global variable. The rest of the script remains the same. I can finally run your script in the simulator! Usually we simulate Squeak images and all these preliminary steps are not required. But! It is still easier to reproduce this bug than most bugs I have to deal with for Android at work; at least I don't need to buy an uncommon device from an obscure Chinese vendor to reproduce it :-).

I put the data into a file and loaded it :-)
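
One way such a file-based preload could look, as a minimal sketch rather than the code actually used in this thread. It assumes the JSON was saved beforehand as y77d-th95.json next to the image; the file name and the global name CachedJson are hypothetical.

```smalltalk
"Sketch only: the file name and the global name #CachedJson are assumptions."
| dataFile |
dataFile := FileLocator imageDirectory / 'y77d-th95.json'.

"Read the whole file once and park the text in a global,
so that the save loop needs no network access (ZnEasy) in the simulator."
Smalltalk at: #CachedJson put: (dataFile readStreamDo: [ :stream | stream upToEnd ]).

"The loop body can then parse the cached text instead of calling ZnEasy."
STON fromString: (Smalltalk at: #CachedJson)
```

Keeping the raw text (rather than the parsed objects) in the global means each iteration still allocates a fresh object graph, which is what puts pressure on the GC.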

> 5] To shorten simulation time, since the bug happened around the 60th save for me, I built a different script which snapshots the image to different image names.

We also updated the script to save to different files.
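
A rough sketch of what saving to a different file on each iteration might look like. The thread does not say which API either of us used, so the use of saveAs:, the assumption that it takes a base name without extension, and the Save<N> naming are all illustrative; the JSON literal is a stand-in for the real dataset.

```smalltalk
"Sketch only: saveAs: and the 'Save<N>' naming are assumptions,
and the JSON literal is a stand-in for the real dataset."
| json |
json := '[{"name":"sample","mass":"21"}]'.
Array streamContents: [ :aStream |
    1 to: 400 do: [ :index |
        aStream nextPutAll: (STON fromString: json).
        "Write Save1, Save2, ... so the last good snapshot survives a crash."
        Smalltalk saveAs: 'Save' , index printString ] ]
```

Each iteration writes a full image, so disk usage grows quickly. An image saved this way resumes inside the loop when it is reopened, which is what allows the simulator to pick up from a late snapshot.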

But did you actually get it to save the image in the simulator? I'm
just reproducing your work now but couldn't save an image due to a bug
in the FileAttributesPluginSimulator. I've got a fix and will commit
a bit later.

> With a crash at snapshot 59 (only change file written to disk), image 57 was the latest non corrupted image. I then started the simulator (The StackSimulator since we are debugging a GC bug, not the Cog simulator, simulation is faster and simpler). I used the standard script available in the workspace of the Cog dev image built from the guidelines. [2]
> | sis |
> sis := StackInterpreterSimulator newWithOptions: #(ObjectMemory Spur64BitMemoryManager).
> "sis desiredNumStackPages: 8." "Speeds up scavenging when simulating. Set to e.g. 64 for something like the real VM."
> "sis assertValidExecutionPointersAtEachStep: false." "Set this to true to turn on an expensive assert that checks for valid stack, frame pointers etc on each bytecode. Useful when you're adding new bytecodes or exotic execution primitives."
> sis openOn: 'Save57.image'.
> sis openAsMorph; run
> I then let the simulator simulate, went swimming for 1h, and came back 1h30 later (with commute time). The bug happened in the simulator at save 90, I don't know how long it took to reproduce, but < 1h30. Then I had an assertion failure in the compactor:
> self assert: (self validRelocationPlanInPass: finalPass) = 0.
> Good! From there I debugged using lemming debugging (technique described in [3], Section 3.2). When the assertion has failed, simulation is the clone. I went up in the debugger to the point where the clone was made, and restarted the same GC approximately 40 times during debugging because once the heap is corrupted you cannot know anymore what the problem is, but you need to trigger the problem to understand. 40 lemmings over that cliff :-) Good lemmings.
>
> Then I quickly figured out that the GC was performing two successive compactions, and that the second compaction is broken right at the start (tries to move objects upward). Then I looked at the glue code in-between the 2 compactions, and yeah, in the case where the first compaction has compacted everything, the variables are incorrectly set for the second compaction. I tried fixing the variables but it's not that easy, so instead I just aborted compaction in that case (See VMMaker.oscog-cb.2595).
>
> 6] I then compiled a VM from the sources to check that the Slang translator would not complain; it did not. I then built a Stack VM (the Cog VM seems to be broken on tip of tree due to ongoing work on ARMv8 support) and ran your script again. I was able to run the 400 iterations without a crash. The bug seems to be fixed!
>
> @Eliot now needs to fix tip of tree, generate the code and produce new VMs. ARMv8 support is quite exciting though, given that MacBooks do not support 32 bits any more and that the next MacBooks are rumoured to be on ARMv8. One wouldn't want to run the VM in a virtual box Intel image :-).
>
> Alistair, let me know if you have questions. I hope you can work with the simulator as efficiently as we can. If you've not seen it, there's this screencast where I showed how I used the simulator to debug JIT bugs [4]. Audio is not very good because my spoken English sucks, but it shows the main ideas.
>
> [1] https://www.researchgate.net/publication/336422106_Lazy_pointer_update_for_low_heap_compaction_pause_times
> [2] http://www.mirandabanda.org/cogblog/build-image/
> [3] https://www.researchgate.net/publication/328509577_Two_Decades_of_Smalltalk_VM_Development_Live_VM_Development_through_Simulation_Tools
> [4] https://clementbera.wordpress.com/2018/03/07/sista-vm-screencast/

You wrote in [3]:

"the slightest change in the heap
might change the bug; any variability in timing or user input
can result in a different heap and hence in the bug morphing
or going into hiding."

This was evident in this issue. While the script (fortunately) would
always produce a crash, small changes, such as how the initial JSON is
loaded, or the name of the image that it is saved to, caused fairly
large changes in the number of loops to trigger the crash.

Also, while trying to reproduce your debug steps above, the image I
have already has memory leaks, so it isn't hitting the "self assert:
(self validRelocationPlanInPass: finalPass) = 0" assertion.
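
As an illustration of the invariant behind the bug described above (a sliding compaction may only move objects toward lower addresses), here is a toy check over a made-up relocation plan. It is not the VM's validRelocationPlanInPass: implementation, whose details are not shown in this thread; the addresses and variable names are invented for the example.

```smalltalk
"Toy model only: each association maps an object's current address to its planned address."
| plan upwardMoves |
plan := {
    16r1000 -> 16r0800.
    16r2000 -> 16r0900.
    16r3000 -> 16r3400 }.   "this entry would move an object upward"

"Count the planned moves whose destination lies above the source."
upwardMoves := plan count: [ :move | move value > move key ].
upwardMoves   "a non-zero count means the plan is invalid for a downward-sliding compactor"
```

Checking for a count of zero mirrors the shape of the assertion quoted above, but only as an analogy.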

Thanks for the links, I'll keep reading.

Thanks again!
Alistair

> --
> Clément Béra
> https://clementbera.github.io/
> https://clementbera.wordpress.com/

Clément Béra

Re: [OpenSmalltalk/opensmalltalk-vm] Reproduceable Segmentation fault while saving images (#444)

Hi,

On Fri, Nov 29, 2019, 10:21 Alistair Grant <[hidden email]> wrote:

Hi Clément,

On Thu, 28 Nov 2019 at 22:36, Clément Béra <[hidden email]> wrote:
>
> Hi Alistair,
>
> I've just investigated the bug tonight and fixed it in VMMaker.oscog-cb.2595. I compiled a new VM from 2595 and I was able to run the 400 iterations of your script without any crashes.
> Thanks for the easy reproduction! Last year when I used the GC benchmarks provided by Feenk, with ~10GB workloads, for the DLS paper [1], I initially had an image crashing 9 times out of 10
> when going to 10GB. I fixed a few bugs in the production GC back then (mainly in segment management), which led the benchmarks to run successfully 99% of the time. But it was still crashing
> 1% of the time. Since I was benchmarking experimental GCs with various changes, I thought the bug did not happen in the production GC, but it turns out I was wrong. And you found a reliable way to
> reproduce it :-). So I could investigate. It's so much fun to do lemming debugging in the simulator.

We need to thank Juraj here, he was the one who produced the initial
version of the script which made all of this possible.

Thanks Juraj. Are you both Feenk people?

Are you mainly working on the VM Alistair? Or just having fun?

> The GC bug was basically that when Planning Compactor (Production Full GC compactor) decided to do a multiple pass compaction, if it managed to compact everything in one go then it would
> get confused and attempt to compact objects upward instead of downward (address wise) on the second attempt, and that's broken and corrupts memory.
>
> I started from this script:
>
> | aJson anArray |
> aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl.
> Array streamContents: [ :aStream |
> 400 timesRepeat: [
> aStream nextPutAll: (STON fromString: aJson contents).
> Smalltalk saveSession ] ].
>
>
> It makes me however very sad that you were not able to use the simulator to debug this issue, I used it and that's how I tracked down the bug in only a few hours. Tracking things down in lldb would have taken me weeks, and I would not have been able to do it since I work during the week :-).
>
> Therefore I'm going to explain my process for reproducing the bug in the simulator and understanding where the issue comes from. The mail is quite long, but it would be nice if you could track down such bugs quickly on your own next time using the simulator. Of course you can skip it if you're not interested. @Eliot, you may want to read it since I explain how I set up a Pharo 7 image for simulator debugging; that might come in handy for you at some point.
>
> 1] The first thing I did was to reproduce your bug, based on the script, on both the Cog and Stack VMs compiled from the OpenSmalltalk-VM repository. I initially started with Pharo 8, but for some reason that image is quite broken (formatter issue? Integrator gone wild?).

That was unlucky timing, there was a bad commit made. I think it's
largely tidied up now, still, using the current stable version isn't
necessarily bad :-)

Just for future reference: the first thing I tried was reproducing it
on the Pharo 8 minimal image (I did this before the formatter bug
appeared and kept the same image). The minimal image has a few
advantages:

- It's smaller, 14M vs. 54M, so less memory to keep track of (and the
simulator will be a bit faster)
- It doesn't have FreeType loaded, so that quickly ruled it out as an issue.
- I wasn't sure if there would be other FFI calls, so this just
reduced the chances.

Not having FreeType and LibGit would be nice indeed. The difference in simulation performance between 14MB and 54MB is not really an issue for me; the bug happened on a > 100MB heap and simulation is still fairly fast.

The problem is more about finding a reliable way to crash soon after start-up. In some cases I start the simulator and go to sleep, but if the next morning it hasn't crashed, well, too bad :-(.

In most cases we reproduce bugs using the Squeak REPL image. See:

https://github.com/OpenSmalltalk/opensmalltalk-vm/blob/Cog/image/buildspurtrunkreaderimage.sh

I suggest you try using the simulator on the Squeak REPL image; it's convenient, since you can run a few things and see what is going on. The REPL supports chunk format (put a ! after each do-it).
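
For illustration, input in chunk format might look like the following, with each do-it terminated by an exclamation mark. The expressions themselves are arbitrary examples and are not taken from this thread.

```smalltalk
3 + 4!
Smalltalk garbageCollect!
(Array new: 100000) size!
```

Note that in chunk format a literal exclamation mark inside an expression has to be doubled (!!), since a single one ends the chunk.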

You can build something similar from the minimal Pharo if you want to, but I doubt you'll catch bugs that you can't catch from the Squeak one.

> So I switched to Pharo 7 stable. It crashes on both VMs, so I knew the bug was unrelated to the JIT. Most bugs in the core VM (besides people misusing FFI, which is by far the most common VM bug reported) are either JIT or GC bugs. So we're tracking a GC bug.
> I then built an image which runs your script at start-up (Smalltalk snapshot: true andQuit: true followed by your script, I select all and run do-it).
>
> 2] Then I started the image in the simulator. The first thing I noticed is that Pharo 7 makes FFI calls in FreeType from start-up: even if you're not using text, or if you disable FreeType in the settings browser, Pharo performs FFI calls for FreeType in the background. The FreeType FFI calls are incorrectly implemented (the C stack references heap objects which are not pinned), therefore these calls corrupt the heap. Running a corrupted heap on the VM is undefined behavior, therefore any usage of Pharo 7 right now, whether you actually use text or not, whether FreeType is enabled or not in the settings, is undefined behavior. I saw Nicolas/Eliot complaining in the thread that this is not a VM bug; indeed, pinning objects is an image-side responsibility and it's not a VM bug. In addition, most reported bugs come from people misusing FFI, so I understand their answer. There was however another bug in the GC, but it's very hard for us to debug it if it's hidden behind image-corrupting bugs like the FreeType one here.
> So for that I made that change:
> FreeTypeSettings>>startUp: resuming
> "resuming ifTrue:[ self updateFreeType ]"
> saved, restarted the image, and ensured it was not corrupted (leak checker + swizzling in simulation).
>
> 3] Then I started the image in the simulator. It turns out that image start-up raises an error if libgit cannot be loaded, and then the start-up script is not executed because of the exception. So I made this change:
> LGitLibrary>>startUp: isImageStarting
> "isImageStarting ifTrue: [ self uniqueInstance initializeLibGit2 ]"

Also for future reference, I'm surprised you didn't hit an FFI call
trying to get the current working directory. Making the following
change in OSPlatform removes the FFI call:

currentWorkingDirectoryPathWithBuffer: aByteString
    <primitive: 'primitiveGetCurrentWorkingDirectory' module: 'UnixOSProcessPlugin' error: ec>
    ^ self primitiveFailed

(On Windows you need to use WinOSProcessorPlugin instead.)

Err. Maybe I forgot to write down a few steps here and commented out a few other methods... I fixed it and then wrote the mail, so I don't remember it all.

I think there was indeed something accessing the source or changes files and I commented something out in there.

I'll try to check the changes file later on.

I don't have access to my laptop right now since I'm at work, so I cannot check.

> 4] It turns out ZnEasy does not work well in the simulator. So I preloaded the result of this line, aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl., into a global variable. The rest of the script remains the same. I can finally run your script in the simulator! Usually we simulate Squeak images and all these preliminary steps are not required. But! It is still easier to reproduce this bug than most bugs I have to deal with for Android at work; at least I don't need to buy an uncommon device from an obscure Chinese vendor to reproduce it :-).

I put the data into a file and loaded it :-)

> 5] To shorten simulation time, since the bug happened around the 60th save for me, I built a different script which snapshots the image to different image names.

We also updated the script to save to different files.

But did you actually get it to save the image in the simulator? I'm
just reproducing your work now but couldn't save an image due to a bug
in the FileAttributesPluginSimulator. I've got a fix and will commit
a bit later.

Yes, running the script in the simulator generated around 30 images for me (Save57.image to Save90.image). I frequently save from the simulator (usually with a Squeak image though). It should work.

Then running the script again for 400 iterations on the VM I generated filled my local SSD :-).

I don't remember which API I used to save though; maybe we used different ones? I try to use snapshot:andQuit: as much as possible to avoid unexpected errors, but this time I renamed the image, and I don't remember how.

> With a crash at snapshot 59 (only change file written to disk), image 57 was the latest non corrupted image. I then started the simulator (The StackSimulator since we are debugging a GC bug, not the Cog simulator, simulation is faster and simpler). I used the standard script available in the workspace of the Cog dev image built from the guidelines. [2]
> | sis |
> sis := StackInterpreterSimulator newWithOptions: #(ObjectMemory Spur64BitMemoryManager).
> "sis desiredNumStackPages: 8." "Speeds up scavenging when simulating. Set to e.g. 64 for something like the real VM."
> "sis assertValidExecutionPointersAtEachStep: false." "Set this to true to turn on an expensive assert that checks for valid stack, frame pointers etc on each bytecode. Useful when you're adding new bytecodes or exotic execution primitives."
> sis openOn: 'Save57.image'.
> sis openAsMorph; run
> I then let the simulator simulate, went swimming for 1h, and came back 1h30 later (with commute time). The bug happened in the simulator at save 90, I don't know how long it took to reproduce, but < 1h30. Then I had an assertion failure in the compactor:
> self assert: (self validRelocationPlanInPass: finalPass) = 0.
> Good! From there I debugged using lemming debugging (technique described in [3], Section 3.2). When the assertion has failed, simulation is the clone. I went up in the debugger to the point where the clone was made, and restarted the same GC approximately 40 times during debugging because once the heap is corrupted you cannot know anymore what the problem is, but you need to trigger the problem to understand. 40 lemmings over that cliff :-) Good lemmings.
>
> Then I quickly figured out that the GC was performing two successive compactions, and that the second compaction is broken right at the start (tries to move objects upward). Then I looked at the glue code in-between the 2 compactions, and yeah, in the case where the first compaction has compacted everything, the variables are incorrectly set for the second compaction. I tried fixing the variables but it's not that easy, so instead I just aborted compaction in that case (See VMMaker.oscog-cb.2595).
>
> 6] I then compiled a VM from the sources to check that the Slang translator would not complain; it did not. I then built a Stack VM (the Cog VM seems to be broken on tip of tree due to ongoing work on ARMv8 support) and ran your script again. I was able to run the 400 iterations without a crash. The bug seems to be fixed!
>
> @Eliot now needs to fix tip of tree, generate the code and produce new VMs. ARMv8 support is quite exciting though, given that MacBooks do not support 32 bits any more and that the next MacBooks are rumoured to be on ARMv8. One wouldn't want to run the VM in a virtual box Intel image :-).
>
> Alistair, let me know if you have questions. I hope you can work with the simulator as efficiently as we can. If you've not seen it, there's this screencast where I showed how I used the simulator to debug JIT bugs [4]. Audio is not very good because my spoken English sucks, but it shows the main ideas.
>
> [1] https://www.researchgate.net/publication/336422106_Lazy_pointer_update_for_low_heap_compaction_pause_times
> [2] http://www.mirandabanda.org/cogblog/build-image/
> [3] https://www.researchgate.net/publication/328509577_Two_Decades_of_Smalltalk_VM_Development_Live_VM_Development_through_Simulation_Tools
> [4] https://clementbera.wordpress.com/2018/03/07/sista-vm-screencast/

You wrote in [3]:

"the slightest change in the heap
might change the bug; any variability in timing or user input
can result in a different heap and hence in the bug morphing
or going into hiding."

This was evident in this issue. While the script (fortunately) would
always produce a crash, small changes, such as how the initial JSON is
loaded, or the name of the image that it is saved to, caused fairly
large changes in the number of loops to trigger the crash.

Yeah, that's the main problem when debugging GCs in general. Pharo is less deterministic than Squeak for some reason (things happen in the background that make FFI calls). In both environments user events are a problem.

That's why lemming debugging is very handy. And that's why the OpenSmalltalk-VM development tools are far superior to those of other VMs I've dealt with. The back-in-time features that I used in C++ recently are very good though; in OpenSmalltalk-VM I guess the circular buffer of JIT simulation has a better ratio of time spent on tools to productivity and is enough for now.

And this is a crash. Performance pitfalls are even harder to track down IMO.

Also, while trying to reproduce your debug steps above, the image I
have already has memory leaks, so it isn't hitting the "self assert:
(self validRelocationPlanInPass: finalPass) = 0" assertion.

You have to start simulation on an image that is not already corrupted. Did you make sure to comment out the startUp: method in FreeTypeSettings? Disabling FreeType in the settings browser is not enough. Then you need to save and restart the image, and verify that it is not already corrupted.

If you're talking about starting simulation from the images saved by the script, I did not take the latest one, which crashed, because it was already corrupted. I used 57, although 58 had been saved and for 59 only the changes file was written. You can see at start-up whether swizzling and the initial GC find leaks.

Thanks for the links, I'll keep reading.

Thanks again!
Alistair

> --
> Clément Béra
> https://clementbera.github.io/
> https://clementbera.wordpress.com/

Clément Béra

Re: [OpenSmalltalk/opensmalltalk-vm] Reproduceable Segmentation fault while saving images (#444)

Alistair,

Indeed I also had an issue with workingDirectoryPath. This is the complete list of what I changed in the Pharo 7 image.

It's not clear to me whether everything is required (like the Zodiac change; I ended up using the global instead):

!FreeTypeSystemSettings class methodsFor: 'settings' stamp: 'cb 11/15/2019 21:22' prior: 25673466!
ft2LibraryVersion
    ^ Smalltalk ui theme
        newLabelIn: World
        for: self
        label: 'Available version: None'
        getEnabled: nil.! !
!FreeTypeSettings class methodsFor: 'system startup' stamp: 'cb 11/15/2019 21:24' prior: 25665613!
startUp: resuming
    "resuming ifTrue:[ self updateFreeType ]"! !
!LGitLibrary class methodsFor: 'system startup' stamp: 'cb 11/15/2019 21:24' prior: 31291584!
startUp: isImageStarting
    "isImageStarting ifTrue: [ self uniqueInstance initializeLibGit2 ]"! !
!OSPlatform methodsFor: 'accessing' stamp: 'cb 11/15/2019 21:36' prior: 53154052!
currentWorkingDirectoryPath
    "This method calls the method getPwdViaFFI with argument of a buffer size. By default it uses the defaultMaximumPathLength of each subclass as the buffer size."
    ^ '/' "self currentWorkingDirectoryPathWithBufferSize: self defaultMaximumPathLength"
! !
!ZdcPluginSSLSession methodsFor: 'initialization' stamp: 'cb 11/15/2019 21:39' prior: 66961330!
initialize
    "Initialize the receiver"
    "[ handle := self primitiveSSLCreate ]
        on: PrimitiveFailed
        do: [ :exception |
            ZdcPluginMissing signal ].
    self logging: false"
! !
!PharoFilesOpener methodsFor: 'public' stamp: 'cb 11/27/2019 22:56' prior: 54156796!
changesFileOrNilReadOnly: readOnly silent: silent
    | changesFile |
    changesFile := self openChanges: self changesName readOnly: readOnly.
    (changesFile isNil and: [ silent not ])
        ifTrue: [ self informProblemInChanges: self cannotLocateMsg ].
    ^ changesFile .
! !

On Fri, Nov 29, 2019 at 11:53 AM Clément Béra <[hidden email]> wrote:

Hi,

On Fri, Nov 29, 2019, 10:21 Alistair Grant <[hidden email]> wrote:
Hi Clément,

On Thu, 28 Nov 2019 at 22:36, Clément Béra <[hidden email]> wrote:
>
> Hi Alistair,
>
> I've just investigated the bug tonight and fixed it in VMMaker.oscog-cb.2595. I compiled a new VM from 2595 and I was able to run the 400 iterations of your script without any crashes.
> Thanks for the easy reproduction! Last year when I used the GC benchmarks provided by Feenk, with ~10Gb workloads, for the DLS paper [1], I initially had an image crashing 9 times out of 10
> when going to 10Gb. I fixed a few bugs on the production GC back then (mainly on segment management) which led the benchmarks to run successfully 99% of the times. But it was still crashing
> on 1%, since I was benchmarking on experimental GCs with various changes I thought the bug did not happen in the production GC, but it turns out I was wrong. And you found a reliable way to
> reproduce :-). So I could investigate. It's so fun to do lemming debugging in the simulator.

We need to thank Juraj here, he was the one who produced the initial
version of the script which made all of this possible.

Thanks Juraj. Are you both Feenk people?
Are you mainly working on the VM Alistair? Or just having fun?

> The GC bug was basically that when Planning Compactor (Production Full GC compactor) decided to do a multiple pass compaction, if it managed to compact everything in one go then it would
> get confused and attempt to compact objects upward instead of downward (address wise) on the second attempt, and that's broken and corrupts memory.
>
> I started from this script:
>
> | aJson anArray |
> aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl.
> Array streamContents: [ :aStream |
> 400 timesRepeat: [
> aStream nextPutAll: (STON fromString: aJson contents).
> Smalltalk saveSession ] ].
>
>
> It makes me however very sad that you were not able to use the simulator to debug this issue, I used it and that's how I tracked down the bug in only a few hours. Tracking things down in lldb would have taken me weeks, and I would not have been able to do it since I work during the week :-).
>
> Therefore I'm going to explain you my process to reproduce the bug in the simulator and to understand where the issue comes from. The mail is quite long, but it would be nice if you could track the bug quickly on your own next time using the simulator. Of course you can skip if you're not interested. @Eliot you may read since I explain how I set-up a Pharo 7 image for simulator debugging, that might come handy for you at some point.
>
> 1] The first thing I did was to reproduce your bug, based on the script, both on Cog and and Stack vm compiled from OpenSmalltalk-VM repository. I initially started with Pharo 8, but for some reason that image is quite broken (formatter issue? Integrator gone wild?).

That was unlucky timing, there was a bad commit made. I think it's
largely tidied up now, still, using the current stable version isn't
necessarily bad :-)

Just for future reference: the first thing I tried was reproducing it
on the Pharo 8 minimal image (I did this before the formatter bug
appeared and kept the same image). The minimal image has a few
advantages:

- It's smaller, 14M vs. 54M, so less memory to keep track of (and the
simulator will be a bit faster)
- It doesn't have FreeType loaded, so that quickly ruled it out as an issue.
- I wasn't sure if there would be other FFI calls, so this just
reduced the chances.

Not having FreeType and LibGit would be nice indeed. The difference between simulation performance 14Mb-54Mb is not really an issue for me, the bug happened on > 100Mb heap and simulation is still fairly fast.
The problem is more to find a reliable way to crash soon after start-up, in some cases I start the simulator, go to sleep, but if the next morning it hasn't crashed, well, too bad :-(.

In most cases we reproduce bugs using the Squeak REPL image. See:
https://github.com/OpenSmalltalk/opensmalltalk-vm/blob/Cog/image/buildspurtrunkreaderimage.sh
I suggest you try using the simulator on the squeak repl, it's convenient you can run a few things and see what is going on. The REPL support chunk format (Put a ! after each do it).
You can build something similar from the minimal Pharo if you want to, but I doubt you'll catch bugs that you can't catch from the Squeak one.

> So I switched to Pharo 7 stable. It crashes on both VMs, so I knew the bug was unrelated to the JIT. Most bugs in the core VM (besides people mis-using FFI, which is by far the most commonly reported VM bug) are either JIT or GC bugs. So we're tracking a GC bug.
> I then built an image which runs your script at start-up (Smalltalk snapshot: true andQuit: true followed by your script, I select all and run do-it).
>
> 2] Then I started the image in the simulator. The first thing I noticed is that Pharo 7 makes FFI calls in FreeType from start-up: even if you're not using text, or if you disable FreeType in the settings browser, Pharo performs FFI calls for FreeType in the background. The FreeType FFI calls are incorrectly implemented (the C stack references heap objects which are not pinned), therefore these calls corrupt the heap. Running a corrupted heap on the VM is undefined behavior, therefore any usage of Pharo 7 right now, whether you actually use text or not, whether FreeType is enabled in the settings or not, is undefined behavior. I saw Nicolas/Eliot complaining in the thread that this is not a VM bug; indeed, pinning objects is an image-side responsibility and it's not a VM bug. In addition, most reported bugs come from people mis-using FFI, so I understand their answer. There was however another bug in the GC, but it's very hard for us to debug it if it's hidden behind image-corrupting bugs like the FreeType one here.
> So for that I made that change:
> FreeTypeSettings>>startUp: resuming
> "resuming ifTrue:[ self updateFreeType ]"
> saved, restarted the image, and ensured it was not corrupted (leak checker + swizzling in simulation).
>
> 3] Then I started the image in the simulator. It turns out that image start-up raises an error if libgit cannot be loaded, and then the start-up script is not executed due to the exception. So I made this change:
> LibGitLibrary>>startUp: isImageStarting
> "isImageStarting ifTrue: [ self uniqueInstance initializeLibGit2 ]"

Also for future reference, I'm surprised you didn't hit an FFI call
trying to get the current working directory. Making the following
change in OSPlatform removes the FFI call:

currentWorkingDirectoryPathWithBuffer: aByteString
<primitive: 'primitiveGetCurrentWorkingDirectory' module:
'UnixOSProcessPlugin' error: ec>
^self primitiveFailed

(if on windows you need to use WinOSProcessorPlugin).

Err. Maybe I forgot to write down a few steps here and commented a few other methods... I fixed it and then wrote the mail, I don't remember it all.
I think indeed there was something accessing source or change files and I commented something in there.
I'll try to check the change file later on.
I don't have access to my laptop right now I'm at work so I cannot check.

> 4] Turns out ZnEasy does not work well in the simulator. So I preloaded the result of this line, aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl., into a global variable. The rest of the script remains the same. I can finally run your script in the simulator! Usually we simulate a Squeak image and all these preliminary steps are not required. But! It is still easier to reproduce this bug than most bugs I have to deal with for Android at work; at least I don't need to buy an uncommon device from an obscure Chinese vendor to reproduce it :-).

I put the data in to a file and loaded it :-)

> 5] To shortcut simulation time, since the bug happened around the 60th save for me, I built a different script which snapshots the image to different image names.

We also updated the script to save to different files.

But did you actually get it to save the image in the simulator? I'm
just reproducing your work now but couldn't save an image due to a bug
in the FileAttributesPluginSimulator. I've got a fix and will commit
a bit later.

Yes, running the script in the simulator generated me around 30 images (Save57.image to Save90.image). I frequently use saving from the simulator (usually Squeak image though). Should work.
Then running the script again to 400 iterations from the VM I generated filled my local SSD :-).
I don't remember which API I used though to save, maybe we used different ones? I try to use snapshot:andQuit: as much as possible to avoid unexpected errors, but this time I renamed, I don't remember how.

> With a crash at snapshot 59 (only change file written to disk), image 57 was the latest non corrupted image. I then started the simulator (The StackSimulator since we are debugging a GC bug, not the Cog simulator, simulation is faster and simpler). I used the standard script available in the workspace of the Cog dev image built from the guidelines. [2]
> | sis |
> sis := StackInterpreterSimulator newWithOptions: #(ObjectMemory Spur64BitMemoryManager).
> "sis desiredNumStackPages: 8." "Speeds up scavenging when simulating. Set to e.g. 64 for something like the real VM."
> "sis assertValidExecutionPointersAtEachStep: false." "Set this to true to turn on an expensive assert that checks for valid stack, frame pointers etc on each bytecode. Useful when you're adding new bytecodes or exotic execution primitives."
> sis openOn: 'Save57.image'.
> sis openAsMorph; run
> I then let the simulator simulate, went swimming for 1h, and came back 1h30 later (with commute time). The bug happened in the simulator at save 90, I don't know how long it took to reproduce, but < 1h30. Then I had an assertion failure in the compactor:
> self assert: (self validRelocationPlanInPass: finalPass) = 0.
> Good! From there I debugged using lemming debugging (technique described in [3], Section 3.2). When the assertion has failed, simulation is the clone. I went up in the debugger to the point where the clone was made, and restarted the same GC approximately 40 times during debugging because once the heap is corrupted you cannot know anymore what the problem is, but you need to trigger the problem to understand. 40 lemmings over that cliff :-) Good lemmings.
>
> Then I quickly figured out that the GC was performing two successive compactions, and that the second compaction is broken right at the start (tries to move objects upward). Then I looked at the glue code in-between the 2 compactions, and yeah, in the case where the first compaction has compacted everything, the variables are incorrectly set for the second compaction. I tried fixing the variables but it's not that easy, so instead I just aborted compaction in that case (See VMMaker.oscog-cb.2595).
>
> 6] I then compiled a VM from the sources to check that the Slang translator would not complain; it did not. I then built a Stack VM (the Cog VM seems to be broken on tip of tree due to ongoing work on ARMv8 support) and ran your script again. I was able to run the 400 iterations without a crash. The bug seems to be fixed!
>
> @Eliot now needs to fix tip of tree, generate the code and produce new VMs. ARMv8 support is quite exciting though, given that MacBooks do not support 32 bits any more and that the next MacBooks are rumoured to be on ARMv8. One wouldn't want to run the VM in a VirtualBox Intel image :-).
>
> Alistair, let me know if you have questions. I hope you can work with the simulator as efficiently as we can. If you've not seen it, there's this screencast where I showed how I used the simulator to debug JIT bugs [4]. Audio is not very good because my spoken English sucks, but it shows the main ideas.
>
> [1] https://www.researchgate.net/publication/336422106_Lazy_pointer_update_for_low_heap_compaction_pause_times
> [2] http://www.mirandabanda.org/cogblog/build-image/
> [3] https://www.researchgate.net/publication/328509577_Two_Decades_of_Smalltalk_VM_Development_Live_VM_Development_through_Simulation_Tools
> [4] https://clementbera.wordpress.com/2018/03/07/sista-vm-screencast/

You wrote in [3]:

"the slightest change in the heap
might change the bug; any variability in timing or user input
can result in a different heap and hence in the bug morphing
or going into hiding."

This was evident in this issue. While the script (fortunately) would
always produce a crash, small changes, such as how the initial JSON is
loaded, or the name of the image that it is saved to, caused fairly
large changes in the number of loops to trigger the crash.

Yeah, that's the main problem when debugging GC in general. Pharo is less deterministic than Squeak for some reason (things are happening in the background doing FFI calls). In both environments user events are a problem.
That's why lemming debugging is very handy. And that's why OpenSmalltalk-VM development tools are far superior to other VMs I've dealt with. The back-in-time features that I used in C++ recently are very good though, in OpenSmalltalk-VM
I guess the circular buffer of JIT simulation has a better time spent on tools/productivity ratio and is enough for now.

And this is a crash. Performance pitfalls are even harder to track down IMO.

Also, while trying to reproduce your debug steps above, the image I
have already has memory leaks, so it isn't hitting the "self assert:
(self validRelocationPlanInPass: finalPass) = 0" assertion.

You have to start simulation on an image that is not already corrupted. Did you make sure to comment out the startUp: method in FreeTypeSettings? Disabling FreeType in the settings browser is not enough. Then you need to save and restart the image, and verify that it is not already corrupted.
If you're talking about starting simulation from the saved images from the script, I did not take the latest which crashed because it was already corrupted, I used 57 while 58 was saved and 59 only changes were saved. You can see at start-up if swizzling and the initial GC find leaks.

Thanks for the links, I'll keep reading.

Thanks again!
Alistair

> --
> Clément Béra
> https://clementbera.github.io/
> https://clementbera.wordpress.com/

Clément Béra
https://clementbera.github.io/

https://clementbera.wordpress.com/

David T Lewis

Re: [OpenSmalltalk/opensmalltalk-vm] Reproduceable Segmentation fault while saving images (#444)

In reply to this post by David T Lewis

Closed #444.


alistairgrant

Re: [OpenSmalltalk/opensmalltalk-vm] Reproduceable Segmentation fault while saving images (#444)

In reply to this post by Clément Béra

Hi Clément,

(for anyone else reading this, the thread had become quite long, I've
chopped out quite a bit assuming the context is already familiar)

On Fri, 29 Nov 2019 at 11:53, Clément Béra <[hidden email]> wrote:
>
>>
> Thanks Juraj. Are you both Feenk people?
> Are you mainly working on the VM Alistair? Or just having fun?

Yes, we're both feenk people. :-)

I'm not focusing on the VM as such, but when we have issues with the
VM I'm one of the people that tend to look at it.

> Not having FreeType and LibGit would be nice indeed. The difference between simulation performance 14Mb-54Mb is not really an issue for me, the bug happened on > 100Mb heap and simulation is still fairly fast.
> The problem is more to find a reliable way to crash soon after start-up, in some cases I start the simulator, go to sleep, but if the next morning it hasn't crashed, well, too bad :-(.
>
> In most cases we reproduce bugs using the Squeak REPL image. See:
> https://github.com/OpenSmalltalk/opensmalltalk-vm/blob/Cog/image/buildspurtrunkreaderimage.sh
> I suggest you try using the simulator on the Squeak REPL; it's convenient, since you can run a few things and see what is going on. The REPL supports chunk format (put a ! after each do-it).
> You can build something similar from the minimal Pharo if you want to, but I doubt you'll catch bugs that you can't catch from the Squeak one.

I've actually done both of these in the past: used the Squeak REPL and
built a Pharo version.

> Err. Maybe I forgot to write down a few steps here and commented a few other methods... I fixed it and then wrote the mail, I don't remember it all.
> I think indeed there was something accessing source or change files and I commented something in there.
> I'll try to check the change file later on.
> I don't have access to my laptop right now I'm at work so I cannot check.

I saw the list of changes you made, thanks. I avoided the LibGit and
Zodiac issues by using the minimal image.

> Yes, running the script in the simulator generated me around 30 images (Save57.image to Save90.image). I frequently use saving from the simulator (usually Squeak image though). Should work.
> Then running the script again to 400 iterations from the VM I generated filled my local SSD :-).
> I don't remember which API I used though to save, maybe we used different ones? I try to use snapshot:andQuit: as much as possible to avoid unexpected errors, but this time I renamed, I don't remember how.

That's the difference alright: #saveImageInFileNamed: checks that the
parent directory exists first, which uses FileAttributesPlugin, while
#snapshot:andQuit: doesn't do those checks.
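For reference, a numbered-save variant of the reproduction script might look like the sketch below. It assumes the JSON text was stored beforehand in a global (the name #PreloadedJson is hypothetical) and that #saveImageInFileNamed: is sent to Smalltalk; the file names follow the SaveNN.image pattern mentioned in the thread.

```smalltalk
| json |
"Assumption: the JSON text was stored in this (hypothetically named) global beforehand."
json := Smalltalk at: #PreloadedJson.
Array streamContents: [ :aStream |
	1 to: 400 do: [ :i |
		aStream nextPutAll: (STON fromString: json).
		"This save path checks the parent directory first, and so goes through FileAttributesPlugin."
		Smalltalk saveImageInFileNamed: 'Save', i printString, '.image' ] ]
```

Keeping each iteration in its own file means the last image saved before the crash can be reopened in the simulator.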

> Yeah that's the main problem when debugging GC in general. Pharo is less deterministic than Squeak for some reason (things are happening in the background doing FFI calls).

I think this will be more to do with the fact that Pharo has
#processPreemptionYields true, while Squeak has it false. It means
that every IO and timer event can effectively change the active
process (if there are multiple at the same priority), so process
completion is much less deterministic.
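For anyone wanting to check this on their own image, the setting can be inspected and, as an experiment, changed. This is only a sketch: it assumes the Pharo accessors live on Smalltalk vm (in Squeak the equivalent messages are assumed to be sent to Smalltalk directly), so verify the selectors in your image first.

```smalltalk
"Inspect the current setting (expected to be true on a stock Pharo image, per the description above)."
Smalltalk vm processPreemptionYields.

"Experiment only: a preempted process keeps its place at the head of its run queue, as in Squeak."
Smalltalk vm processPreemptionYields: false.
```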

> In both environments user events are a problem.
> That's why lemming debugging is very handy. And that's why OpenSmalltalk-VM development tools are far superior to other VMs I've dealt with. The back-in-time features that I used in C++ recently are very good though, in OpenSmalltalk-VM
> I guess the circular buffer of JIT simulation has a better time spent on tools/productivity ratio and is enough for now.

Yep, I'll be reading your papers.

> And this is a crash. Performance pitfalls are even harder to track down IMO.
>
>>
>> Also, while trying to reproduce your debug steps above, the image I
>> have already has memory leaks, so it isn't hitting the "self assert:
>> (self validRelocationPlanInPass: finalPass) = 0" assertion.
>
>
> You have to start simulation on an image that is not already corrupted. Did you make sure to comment out the startUp: method in FreeTypeSettings? Disabling FreeType in the settings browser is not enough. Then you need to save and restart the image, and verify that it is not already corrupted.
> If you're talking about starting simulation from the saved images from the script, I did not take the latest which crashed because it was already corrupted, I used 57 while 58 was saved and 59 only changes were saved. You can see at start-up if swizzling and the initial GC find leaks.

This image didn't show any problems with validImage, but you're right,
I'll need to go one image back.

Thanks!
Alistair

Eliot Miranda-2

Re: [OpenSmalltalk/opensmalltalk-vm] Reproduceable Segmentation fault while saving images (#444)

In reply to this post by alistairgrant

Hi Alistair,

On Fri, Nov 29, 2019 at 1:21 AM Alistair Grant <[hidden email]> wrote:

Hi Clément,

On Thu, 28 Nov 2019 at 22:36, Clément Béra <[hidden email]> wrote:
>
> Hi Alistair,
>
> I've just investigated the bug tonight and fixed it in VMMaker.oscog-cb.2595. I compiled a new VM from 2595 and I was able to run the 400 iterations of your script without any crashes.
> Thanks for the easy reproduction! Last year when I used the GC benchmarks provided by Feenk, with ~10Gb workloads, for the DLS paper [1], I initially had an image crashing 9 times out of 10
> when going to 10Gb. I fixed a few bugs on the production GC back then (mainly on segment management) which led the benchmarks to run successfully 99% of the times. But it was still crashing
> on 1%, since I was benchmarking on experimental GCs with various changes I thought the bug did not happen in the production GC, but it turns out I was wrong. And you found a reliable way to
> reproduce :-). So I could investigate. It's so fun to do lemming debugging in the simulator.

We need to thank Juraj here, he was the one who produced the initial
version of the script which made all of this possible.

> The GC bug was basically that when Planning Compactor (Production Full GC compactor) decided to do a multiple pass compaction, if it managed to compact everything in one go then it would
> get confused and attempt to compact objects upward instead of downward (address wise) on the second attempt, and that's broken and corrupts memory.
>
> I started from this script:
>
> | aJson anArray |
> aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl.
> Array streamContents: [ :aStream |
> 400 timesRepeat: [
> aStream nextPutAll: (STON fromString: aJson contents).
> Smalltalk saveSession ] ].
>
>
> It makes me however very sad that you were not able to use the simulator to debug this issue, I used it and that's how I tracked down the bug in only a few hours. Tracking things down in lldb would have taken me weeks, and I would not have been able to do it since I work during the week :-).
>
> Therefore I'm going to explain my process for reproducing the bug in the simulator and understanding where the issue comes from. The mail is quite long, but it would be nice if you could track down such a bug quickly on your own next time using the simulator. Of course you can skip it if you're not interested. @Eliot you may want to read it, since I explain how I set up a Pharo 7 image for simulator debugging; that might come in handy for you at some point.
>
> 1] The first thing I did was to reproduce your bug, based on the script, on both the Cog and Stack VMs compiled from the OpenSmalltalk-VM repository. I initially started with Pharo 8, but for some reason that image is quite broken (formatter issue? Integrator gone wild?).

That was unlucky timing, there was a bad commit made. I think it's
largely tidied up now, still, using the current stable version isn't
necessarily bad :-)

Just for future reference: the first thing I tried was reproducing it
on the Pharo 8 minimal image (I did this before the formatter bug
appeared and kept the same image). The minimal image has a few
advantages:

- It's smaller, 14M vs. 54M, so less memory to keep track of (and the
simulator will be a bit faster)
- It doesn't have FreeType loaded, so that quickly ruled it out as an issue.
- I wasn't sure if there would be other FFI calls, so this just
reduced the chances.

What we should have done is run the test case using an assert VM with the leak checker turned on, running in gdb/lldb. This would have proved the bug was in GC on snapshot because:

- the leak check before GC on snapshot would have succeeded

- the leak check immediately after GC for snapshot, but before snapshot, would have failed

It may be that the leak check would not have failed, because in investigating this bug I added a bounds check before probing the leak map, so the leak map is only probed for pointers that are within the full extent of the heap (which, because the heap is segmented, may be a much larger range than the size of the heap). But know that the leak checker is a useful tool for pinpointing heap corruption and GC bugs. The leak checker is enabled by bitwise flags to apply to various GC activities (scavenge, full GC, become, and can be extended to be run on FFI call), and when enabled runs before and after each phase.

When running an assert VM under gdb/lldb one puts a breakpoint in warning, the routine that outputs assert failure messages, and then runs an image.

When running in the simulator asserts are always run, and the leak checker can be enabled by sending a message to the interpreter's objectMemory.
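A sketch of the simulator side, reusing the StackInterpreterSimulator script quoted elsewhere in this thread. The selector setCheckForLeaks: and the flag values are assumptions recalled from VMMaker, not taken from this thread; check them against the memory manager class in your VMMaker version before relying on them.

```smalltalk
| sis |
sis := StackInterpreterSimulator newWithOptions: #(ObjectMemory Spur64BitMemoryManager).
sis desiredNumStackPages: 8.
sis openOn: 'Save57.image'.
"Assumed flag bits: 1 = full GC, 2 = scavenge, 8 = become."
sis objectMemory setCheckForLeaks: 1 + 2 + 8.
sis openAsMorph; run
```

With the flags set, a leak found before or after a GC phase should show up as a failed check in the simulator rather than as a later, unrelated crash.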

> So I switched to Pharo 7 stable. It crashes on both VMs, so I knew the bug was unrelated to the JIT. Most bugs in the core VM (besides people mis-using FFI, which is by far the most commonly reported VM bug) are either JIT or GC bugs. So we're tracking a GC bug.
> I then built an image which runs your script at start-up (Smalltalk snapshot: true andQuit: true followed by your script, I select all and run do-it).
>
> 2] Then I started the image in the simulator. The first thing I noticed is that Pharo 7 makes FFI calls in FreeType from start-up: even if you're not using text, or if you disable FreeType in the settings browser, Pharo performs FFI calls for FreeType in the background. The FreeType FFI calls are incorrectly implemented (the C stack references heap objects which are not pinned), therefore these calls corrupt the heap. Running a corrupted heap on the VM is undefined behavior, therefore any usage of Pharo 7 right now, whether you actually use text or not, whether FreeType is enabled in the settings or not, is undefined behavior. I saw Nicolas/Eliot complaining in the thread that this is not a VM bug; indeed, pinning objects is an image-side responsibility and it's not a VM bug. In addition, most reported bugs come from people mis-using FFI, so I understand their answer. There was however another bug in the GC, but it's very hard for us to debug it if it's hidden behind image-corrupting bugs like the FreeType one here.
> So for that I made that change:
> FreeTypeSettings>>startUp: resuming
> "resuming ifTrue:[ self updateFreeType ]"
> saved, restarted the image, and ensured it was not corrupted (leak checker + swizzling in simulation).
>
> 3] Then I started the image in the simulator. It turns out that image start-up raises an error if libgit cannot be loaded, and then the start-up script is not executed due to the exception. So I made this change:
> LibGitLibrary>>startUp: isImageStarting
> "isImageStarting ifTrue: [ self uniqueInstance initializeLibGit2 ]"

Also for future reference, I'm surprised you didn't hit an FFI call
trying to get the current working directory. Making the following
change in OSPlatform removes the FFI call:

currentWorkingDirectoryPathWithBuffer: aByteString
<primitive: 'primitiveGetCurrentWorkingDirectory' module:
'UnixOSProcessPlugin' error: ec>
^self primitiveFailed

(if on windows you need to use WinOSProcessorPlugin).

> 4] Turns out ZnEasy does not work well in the simulator.

Can you say more on this?

> So I preloaded the result of this line, aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl., into a global variable. The rest of the script remains the same. I can finally run your script in the simulator! Usually we simulate a Squeak image and all these preliminary steps are not required. But! It is still easier to reproduce this bug than most bugs I have to deal with for Android at work; at least I don't need to buy an uncommon device from an obscure Chinese vendor to reproduce it :-).

I put the data in to a file and loaded it :-)

> 5] To shortcut simulation time, since the bug happened around the 60th save for me, I built a different script which snapshots the image to different image names.

We also updated the script to save to different files.

But did you actually get it to save the image in the simulator? I'm
just reproducing your work now but couldn't save an image due to a bug
in the FileAttributesPluginSimulator. I've got a fix and will commit
a bit later.

> With a crash at snapshot 59 (only change file written to disk), image 57 was the latest non corrupted image. I then started the simulator (The StackSimulator since we are debugging a GC bug, not the Cog simulator, simulation is faster and simpler). I used the standard script available in the workspace of the Cog dev image built from the guidelines. [2]
> | sis |
> sis := StackInterpreterSimulator newWithOptions: #(ObjectMemory Spur64BitMemoryManager).
> "sis desiredNumStackPages: 8." "Speeds up scavenging when simulating. Set to e.g. 64 for something like the real VM."
> "sis assertValidExecutionPointersAtEachStep: false." "Set this to true to turn on an expensive assert that checks for valid stack, frame pointers etc on each bytecode. Useful when you're adding new bytecodes or exotic execution primitives."
> sis openOn: 'Save57.image'.
> sis openAsMorph; run
> I then let the simulator simulate, went swimming for 1h, and came back 1h30 later (with commute time). The bug happened in the simulator at save 90, I don't know how long it took to reproduce, but < 1h30. Then I had an assertion failure in the compactor:
> self assert: (self validRelocationPlanInPass: finalPass) = 0.
> Good! From there I debugged using lemming debugging (technique described in [3], Section 3.2). When the assertion has failed, simulation is the clone. I went up in the debugger to the point where the clone was made, and restarted the same GC approximately 40 times during debugging because once the heap is corrupted you cannot know anymore what the problem is, but you need to trigger the problem to understand. 40 lemmings over that cliff :-) Good lemmings.
>
> Then I quickly figured out that the GC was performing two successive compactions, and that the second compaction is broken right at the start (tries to move objects upward). Then I looked at the glue code in-between the 2 compactions, and yeah, in the case where the first compaction has compacted everything, the variables are incorrectly set for the second compaction. I tried fixing the variables but it's not that easy, so instead I just aborted compaction in that case (See VMMaker.oscog-cb.2595).
>
> 6] I then compiled a VM from the sources to check that the Slang translator would not complain; it did not. I then built a Stack VM (the Cog VM seems to be broken on tip of tree due to ongoing work on ARMv8 support) and ran your script again. I was able to run the 400 iterations without a crash. The bug seems to be fixed!
>
> @Eliot now needs to fix tip of tree, generate the code and produce new VMs. ARMv8 support is quite exciting though, given that MacBooks do not support 32 bits any more and that the next MacBooks are rumoured to be on ARMv8. One wouldn't want to run the VM in a VirtualBox Intel image :-).
>
> Alistair, let me know if you have questions. I hope you can work with the simulator as efficiently as we can. If you've not seen it, there's this screencast where I showed how I used the simulator to debug JIT bugs [4]. Audio is not very good because my spoken English sucks, but it shows the main ideas.
>
> [1] https://www.researchgate.net/publication/336422106_Lazy_pointer_update_for_low_heap_compaction_pause_times
> [2] http://www.mirandabanda.org/cogblog/build-image/
> [3] https://www.researchgate.net/publication/328509577_Two_Decades_of_Smalltalk_VM_Development_Live_VM_Development_through_Simulation_Tools
> [4] https://clementbera.wordpress.com/2018/03/07/sista-vm-screencast/

You wrote in [3]:

"the slightest change in the heap
might change the bug; any variability in timing or user input
can result in a different heap and hence in the bug morphing
or going into hiding."

This was evident in this issue. While the script (fortunately) would
always produce a crash, small changes, such as how the initial JSON is
loaded, or the name of the image that it is saved to, caused fairly
large changes in the number of loops to trigger the crash.

Also, while trying to reproduce your debug steps above, the image I
have already has memory leaks, so it isn't hitting the "self assert:
(self validRelocationPlanInPass: finalPass) = 0" assertion.

Thanks for the links, I'll keep reading.

Thanks again!
Alistair

> --
> Clément Béra
> https://clementbera.github.io/
> https://clementbera.wordpress.com/

_,,,^..^,,,_

best, Eliot

Clément Béra

Re: [OpenSmalltalk/opensmalltalk-vm] Reproduceable Segmentation fault while saving images (#444)

On Fri, Nov 29, 2019 at 7:28 PM Eliot Miranda <[hidden email]> wrote:

Hi Alistair,

On Fri, Nov 29, 2019 at 1:21 AM Alistair Grant <[hidden email]> wrote:
Hi Clément,

On Thu, 28 Nov 2019 at 22:36, Clément Béra <[hidden email]> wrote:
>
> Hi Alistair,
>
> I've just investigated the bug tonight and fixed it in VMMaker.oscog-cb.2595. I compiled a new VM from 2595 and I was able to run the 400 iterations of your script without any crashes.
> Thanks for the easy reproduction! Last year when I used the GC benchmarks provided by Feenk, with ~10Gb workloads, for the DLS paper [1], I initially had an image crashing 9 times out of 10
> when going to 10Gb. I fixed a few bugs on the production GC back then (mainly on segment management) which led the benchmarks to run successfully 99% of the times. But it was still crashing
> on 1%, since I was benchmarking on experimental GCs with various changes I thought the bug did not happen in the production GC, but it turns out I was wrong. And you found a reliable way to
> reproduce :-). So I could investigate. It's so fun to do lemming debugging in the simulator.

We need to thank Juraj here, he was the one who produced the initial
version of the script which made all of this possible.

> The GC bug was basically that when Planning Compactor (Production Full GC compactor) decided to do a multiple pass compaction, if it managed to compact everything in one go then it would
> get confused and attempt to compact objects upward instead of downward (address wise) on the second attempt, and that's broken and corrupts memory.
>
> I started from this script:
>
> | aJson anArray |
> aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl.
> Array streamContents: [ :aStream |
> 400 timesRepeat: [
> aStream nextPutAll: (STON fromString: aJson contents).
> Smalltalk saveSession ] ].
>
>
> It makes me however very sad that you were not able to use the simulator to debug this issue, I used it and that's how I tracked down the bug in only a few hours. Tracking things down in lldb would have taken me weeks, and I would not have been able to do it since I work during the week :-).
>
> Therefore I'm going to explain my process for reproducing the bug in the simulator and understanding where the issue comes from. The mail is quite long, but it would be nice if you could track down such a bug quickly on your own next time using the simulator. Of course you can skip it if you're not interested. @Eliot you may want to read it, since I explain how I set up a Pharo 7 image for simulator debugging; that might come in handy for you at some point.
>
> 1] The first thing I did was to reproduce your bug, based on the script, on both the Cog and Stack VMs compiled from the OpenSmalltalk-VM repository. I initially started with Pharo 8, but for some reason that image is quite broken (formatter issue? Integrator gone wild?).

That was unlucky timing, there was a bad commit made. I think it's
largely tidied up now, still, using the current stable version isn't
necessarily bad :-)

Just for future reference: the first thing I tried was reproducing it
on the Pharo 8 minimal image (I did this before the formatter bug
appeared and kept the same image). The minimal image has a few
advantages:

- It's smaller, 14M vs. 54M, so less memory to keep track of (and the
simulator will be a bit faster)
- It doesn't have FreeType loaded, so that quickly ruled it out as an issue.
- I wasn't sure if there would be other FFI calls, so this just
reduced the chances.

What we should have done is run the test case using an assert VM with the leak checker turned on, running in gdb/lldb. This would have proved the bug was in GC on snapshot because:
- the leak check before GC on snapshot would have succeeded
- the leak check immediately after GC for snapshot, but before snapshot, would have failed

It may be that the leak check would not have failed, because in investigating this bug I added a bounds check before probing the leak map, so the leak map is only probed for pointers that are within the full extent of the heap (which, because the heap is segmented, may be a much larger range than the size of the heap). But know that the leak checker is a useful tool for pinpointing heap corruption and GC bugs. The leak checker is enabled by bitwise flags to apply to various GC activities (scavenge, full GC, become, and can be extended to be run on FFI call), and when enabled runs before and after each phase.

When running an assert VM under gdb/lldb one puts a breakpoint in warning, the routine that outputs assert failure messages, and then runs an image.

When running in the simulator asserts are always run, and the leak checker can be enabled by sending a message to the interpreter's objectMemory.

> So I switched to Pharo 7 stable. It crashes on both VMs, so I knew the bug was unrelated to the JIT. Most bugs in the core VM (besides people misusing FFI, which is by far the most common VM bug reported) are either JIT or GC bugs. So we're tracking a GC bug.
> I then built an image which runs your script at start-up (Smalltalk snapshot: true andQuit: true followed by your script, I select all and run do-it).
>
> 2] Then I started the image in the simulator. The first thing I noticed is that Pharo 7 uses FFI calls in FreeType from start-up, and even if you're not using text or you disable FreeType in the settings browser, Pharo performs FFI calls for FreeType in the background. The FreeType FFI calls are incorrectly implemented (the C stack references heap objects which are not pinned), therefore these calls corrupt the heap. Running the VM on a corrupted heap is undefined behavior, therefore any usage of Pharo 7 right now, whether you actually use text or not, whether FreeType is enabled in the settings or not, is undefined behavior. I saw in the thread Nicolas/Eliot complaining that this is not a VM bug; indeed, pinning objects is an image-side responsibility and it's not a VM bug. In addition, most reported bugs come from people misusing FFI, so I understand their answer. There was however another bug in the GC, but it's very hard for us to debug it if it's hidden behind image-corrupting bugs like the FreeType one here.
> So for that I made that change:
> FreeTypeSettings>>startUp: resuming
> "resuming ifTrue:[ self updateFreeType ]"
> saved, restarted the image, and ensured it was not corrupted (leak checker + swizzling in simulation).
>
> 3] Then I started the image in the simulator. It turns out the image start-up raises an error if libgit cannot be loaded, and then the start-up script is not executed due to the exception. So I made that change:
> LibGitLibrary>>startUp: isImageStarting
> "isImageStarting ifTrue: [ self uniqueInstance initializeLibGit2 ]"

Also for future reference, I'm surprised you didn't hit an FFI call
trying to get the current working directory. Making the following
change in OSPlatform removes the FFI call:

currentWorkingDirectoryPathWithBuffer: aByteString
    <primitive: 'primitiveGetCurrentWorkingDirectory' module: 'UnixOSProcessPlugin' error: ec>
    ^self primitiveFailed

(On Windows you need to use WinOSProcessorPlugin.)

> 4] Turns out ZnEasy does not work well in the simulator.

Can you say more on this?

An FFI call in Zodiac fails in the simulator, so avoid it if possible.

> So I preloaded this line aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl. into a Global variable. The rest of the script remains the same. I can finally run your script in the simulator! Usually we simulate a Squeak image and all these preliminary steps are not required. But! It is still easier to reproduce this bug than most bugs I have to deal with for Android at work, at least I don't need to buy an uncommon device from an obscure Chinese vendor to reproduce :-).

I put the data into a file and loaded it :-)

> 5] To shortcut simulation time, since the bug happened around the 60th save for me, I built a different script which snapshots the image to different image names.

We also updated the script to save to different files.

But did you actually get it to save the image in the simulator? I'm
just reproducing your work now but couldn't save an image due to a bug
in the FileAttributesPluginSimulator. I've got a fix and will commit
a bit later.

> With a crash at snapshot 59 (only change file written to disk), image 57 was the latest non corrupted image. I then started the simulator (The StackSimulator since we are debugging a GC bug, not the Cog simulator, simulation is faster and simpler). I used the standard script available in the workspace of the Cog dev image built from the guidelines. [2]
> | sis |
> sis := StackInterpreterSimulator newWithOptions: #(ObjectMemory Spur64BitMemoryManager).
> "sis desiredNumStackPages: 8." "Speeds up scavenging when simulating. Set to e.g. 64 for something like the real VM."
> "sis assertValidExecutionPointersAtEachStep: false." "Set this to true to turn on an expensive assert that checks for valid stack, frame pointers etc on each bytecode. Useful when you're adding new bytecodes or exotic execution primitives."
> sis openOn: 'Save57.image'.
> sis openAsMorph; run
> I then let the simulator simulate, went swimming for 1h, and came back 1h30 later (with commute time). The bug happened in the simulator at save 90, I don't know how long it took to reproduce, but < 1h30. Then I had an assertion failure in the compactor:
> self assert: (self validRelocationPlanInPass: finalPass) = 0.
> Good! From there I debugged using lemming debugging (technique described in [3], Section 3.2). When the assertion has failed, simulation is the clone. I went up in the debugger to the point where the clone was made, and restarted the same GC approximately 40 times during debugging because once the heap is corrupted you cannot know anymore what the problem is, but you need to trigger the problem to understand. 40 lemmings over that cliff :-) Good lemmings.
>
> Then I quickly figured out that the GC was performing two successive compactions, and that the second compaction is broken right at the start (tries to move objects upward). Then I looked at the glue code in-between the 2 compactions, and yeah, in the case where the first compaction has compacted everything, the variables are incorrectly set for the second compaction. I tried fixing the variables but it's not that easy, so instead I just aborted compaction in that case (See VMMaker.oscog-cb.2595).
>
> 6] I then compiled a VM from the sources to check that the Slang translator would not complain; it did not. I then built a stack VM (the Cog VM seems to be broken on tip of tree due to ongoing work for ARMv8 support) and ran your script again. I was able to run the 400 iterations without a crash. The bug seems to be fixed!
>
> @Eliot now needs to fix tip of tree, generate the code and produce new VMs. ARMv8 support is quite exciting though, given that MacBooks do not support 32 bits any more and that the next MacBooks are rumoured to be on ARMv8. One wouldn't want to run the VM in a VirtualBox Intel image :-).
>
> Alistair, let me know if you have questions. I hope you can work with the simulator as efficiently as we can. If you've not seen it, there's this screencast where I showed how I used the simulator to debug JIT bugs [4]. Audio is not very good because my spoken English sucks, but it shows the main ideas.
>
> [1] https://www.researchgate.net/publication/336422106_Lazy_pointer_update_for_low_heap_compaction_pause_times
> [2] http://www.mirandabanda.org/cogblog/build-image/
> [3] https://www.researchgate.net/publication/328509577_Two_Decades_of_Smalltalk_VM_Development_Live_VM_Development_through_Simulation_Tools
> [4] https://clementbera.wordpress.com/2018/03/07/sista-vm-screencast/
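
The compaction failure described in step 5 comes down to a simple invariant: a sliding compactor only ever moves objects toward lower addresses. The Python sketch below is a toy model under stated assumptions, not the Planning Compactor itself; the tuple layout, the function names, and the sample heap are all hypothetical. It plans a downward slide and counts upward moves, in the spirit of (but not equivalent to) the assertion quoted above.

```python
def plan_sliding_compaction(objects, first_free):
    """Plan a sliding compaction that packs live objects downward.

    objects:    list of (address, size, is_live) tuples sorted by address
    first_free: address of the first free chunk; nothing below it moves
    Returns a list of (source, destination) pairs for the objects that move.
    """
    plan = []
    destination = first_free
    for address, size, is_live in objects:
        if address < first_free or not is_live:
            continue
        if address != destination:
            plan.append((address, destination))
        destination += size
    return plan


def count_upward_moves(plan):
    """Number of planned relocations that would move an object to a higher address."""
    return sum(1 for source, destination in plan if destination > source)


heap = [(0, 16, True), (16, 16, False), (32, 16, True),
        (48, 32, False), (80, 16, True)]
good_plan = plan_sliding_compaction(heap, first_free=16)

# A hand-written plan standing in for what a second pass might produce
# when it starts from incorrectly set variables.
stale_plan = [(16, 64), (32, 80)]

print(good_plan, count_upward_moves(good_plan))
print(stale_plan, count_upward_moves(stale_plan))
```

A plan built from consistent state never moves anything upward, because the destination pointer starts at or below the first candidate object and advances only by the sizes of live objects. A non-zero count therefore signals that the pass started from bad state, which is the situation the fix avoids by aborting the second compaction.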

You wrote in [3]:

"the slightest change in the heap
might change the bug; any variability in timing or user input
can result in a different heap and hence in the bug morphing
or going into hiding."

This was evident in this issue. While the script (fortunately) would
always produce a crash, small changes, such as how the initial JSON is
loaded, or the name of the image that it is saved to, caused fairly
large changes in the number of loops to trigger the crash.

Also, while trying to reproduce your debug steps above, the image I
have already has memory leaks, so it isn't hitting the "self assert:
(self validRelocationPlanInPass: finalPass) = 0" assertion.

Thanks for the links, I'll keep reading.

Thanks again!
Alistair

> --
> Clément Béra
> https://clementbera.github.io/
> https://clementbera.wordpress.com/

--
_,,,^..^,,,_
best, Eliot


Tudor Girba-2

Re: [OpenSmalltalk/opensmalltalk-vm] Reproduceable Segmentation fault while saving images (#444)

In reply to this post by David T Lewis

This was some amazing work to observe. Thank you everyone involved!

Cheers,
Doru

> On Nov 29, 2019, at 9:01 AM, Clement Bera <[hidden email]> wrote:
>
> Closed #444.
>
> —
> You are receiving this because you commented.
> Reply to this email directly, view it on GitHub, or unsubscribe.
>

--
feenk.com

"We are all great at making mistakes."

David T Lewis

Re: [OpenSmalltalk/opensmalltalk-vm] Reproduceable Segmentation fault while saving images (#444)

In reply to this post by David T Lewis

Very good job! Thanks Alistair for your tenacity and Clement for letting us learn the advanced techniques.
A pity that the snippet did not fail in Squeak...


David T Lewis

Re: [OpenSmalltalk/opensmalltalk-vm] Reproduceable Segmentation fault while saving images (#444)

In reply to this post by David T Lewis

On Sun, Dec 1, 2019 at 9:10 PM Nicolas Cellier <[hidden email]>
wrote:

> Very good job! Thanks Alistair for your tenacity and Clement for letting
> us learn the advanced techniques.
> A pity that the snippet did not fail in Squeak...

Thanks Nicolas. I have to say I explained it because a while ago you
resolved a bug that happened when compiling with O2, but not with O0, and
then explained how you solved it on the mailing list and I really enjoyed
reading that mail. It's really nice to share knowledge here.

I would speculate the snippet did not fail in Squeak either because the
Stream implementation is slightly different, or because the default
settings for Eden size and other settings like that are different on Pharo.
I don't know.


--
Clément Béra
https://clementbera.github.io/
https://clementbera.wordpress.com/


alistairgrant

Re: [OpenSmalltalk/opensmalltalk-vm] Reproduceable Segmentation fault while saving images (#444)

In reply to this post by David T Lewis

Hi Nicolas,

On Sun, 1 Dec 2019 at 21:10, Nicolas Cellier <[hidden email]> wrote:
>
>
>
> Very good job! Thanks Alistair for your tenacity and Clement for letting us learn the advanced techniques.
> A pity that the snippet did not fail in Squeak...

This was interesting in that before this year I can't remember the
last time the VM crashed for me (other than my own mistakes while
working on plugins). Then this year it has got to the point where it
is significantly impacting our productivity. I guess it is something
about the size of the images we're using that just happens to line up
with the trigger conditions.

Anyway, once Juraj had made it reproducible it was worthwhile
committing 100% to fixing it.

Thanks for testing it on Squeak, that's something I had on my ToDo
list, so it saved me some time.

Cheers,
Alistair

Clément Béra

Re: [OpenSmalltalk/opensmalltalk-vm] Reproduceable Segmentation fault while saving images (#444)

On Mon, Dec 2, 2019 at 11:44 AM Alistair Grant <[hidden email]> wrote:

> Hi Nicolas,
>
> On Sun, 1 Dec 2019 at 21:10, Nicolas Cellier <[hidden email]> wrote:
> >
> > Very good job! Thanks Alistair for your tenacity and Clement for letting us learn the advanced techniques.
> > A pity that the snippet did not fail in Squeak...
>
> This was interesting in that before this year I can't remember the
> last time the VM crashed for me (other than my own mistakes while
> working on plugins). Then this year it has got to the point where it
> is significantly impacting our productivity. I guess it is something
> about the size of the images we're using that just happens to line up
> with the trigger conditions.

Yeah that bug would happen only when the compactor decided to go for a multiple-pass compaction, which typically means Eden size is far smaller than old space size (Eden memory is usually used to hold the first field of moved objects in the compactor).

My personal desktop has more RAM now than I used to have, so I'll try later this week to run the VM on a 100+Gb heap (maybe loading the NetBeans bench multiple times) and see how it holds the workload. There's a lot of GC tuning to do for such heaps. Right now the VM is tuned by default for heaps < 100Mb. See [1] for tips on GC tuning.

I need to set up the compilation environment for my desktop machine anyway, hoping the C compiler will benefit from having more threads and more RAM to work with to lower compilation time.

Saving and reloading multi-Gb heaps is also not very good because the VM at start-up merges old space into a single memory segment. It would be better to reshape the heaps in different segments. We tried with Sophie a few years ago but it's not that easy. If you want to give it a try...

[1] https://clementbera.wordpress.com/2017/03/12/tuning-the-pharo-garbage-collector/

> Anyway, once Juraj had made it reproducible it was worthwhile
> committing 100% to fixing it.
>
> Thanks for testing it on Squeak, that's something I had on my ToDo
> list, so it saved me some time.
>
> Cheers,
> Alistair

Clément Béra
https://clementbera.github.io/

https://clementbera.wordpress.com/

David T Lewis

Re: [OpenSmalltalk/opensmalltalk-vm] Reproduceable Segmentation fault while saving images (#444)

In reply to this post by David T Lewis

Hi Clément,

> Yeah that bug would happen only when the compactor decided to go for a multiple-pass compaction, which typically means Eden size is far smaller than old space size (Eden memory is usually used to hold the first field of moved objects in the compactor).

Our base image is around 220MB at the moment, which seems to pretty well match what you're saying.

> There's a lot of GC tuning to do for such heaps.

Is there something I can read on tuning for these scenarios?

Thanks!
Alistair


David T Lewis

Re: [OpenSmalltalk/opensmalltalk-vm] Reproduceable Segmentation fault while saving images (#444)

In reply to this post by David T Lewis

Doh!

Reading https://clementbera.wordpress.com/2017/03/12/tuning-the-pharo-garbage-collector/ now.


David T Lewis

Re: [OpenSmalltalk/opensmalltalk-vm] Reproduceable Segmentation fault while saving images (#444)

In reply to this post by David T Lewis

Assuming you're using Moose, the paragraph *Experience report >> loading a
200Mb Moose model* looks relevant :-)

If you're using a machine with much more RAM than the heap size, you can
tune things so that it uses an extra fixed amount of memory
(for example 50Mb extra through growth headroom and eden size) but runs
faster. You can also tune the full GC ratio but you have to be
more careful since the extra memory used is proportional to heap size.
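
The difference between the two kinds of tuning is easy to see with a little arithmetic. The Python sketch below is illustrative only: the 30Mb/20Mb split of the 50Mb figure, the percentages, and the model of the full GC ratio (a full GC is assumed to trigger once the heap has grown by a given percentage since the last one) are assumptions, not measured values or VM defaults.

```python
MB = 1024 * 1024


def fixed_extra(extra_eden, extra_headroom):
    """Extra memory from a larger eden and growth headroom: a constant."""
    return extra_eden + extra_headroom


def proportional_extra(heap_size, default_percent, tuned_percent):
    """Extra memory from letting the heap grow further before a full GC.

    Assumes a full GC is triggered once the heap has grown by the given
    percentage since the last one, so the cost scales with heap size.
    """
    return heap_size * (tuned_percent - default_percent) // 100


for heap_mb in (200, 2000, 20000):
    heap = heap_mb * MB
    print(heap_mb,
          fixed_extra(30 * MB, 20 * MB) // MB,
          proportional_extra(heap, 33, 100) // MB)
```

Under these assumptions the eden and headroom cost stays at 50Mb whatever the heap size, while the ratio change costs 134Mb on a 200Mb heap and ten times that on a 2Gb heap, which is why the ratio needs more care.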

The problem is that the GC cannot be tuned by default for strong machines
or the VM won't start on small devices such as the Pie Nano,
and the VM has to work for everyone. In the context of an IDE or software
analysis, you may want to assume that the machine has at least
2Gb of RAM and don't care about saving 50Mb.

I'm sure you can set up something so that the base settings in the Feenk
image are tuned up. It's not clear you'll gain more than 10-20%
though at 220Mb.

On Mon, Dec 2, 2019 at 2:03 PM Alistair Grant <[hidden email]>
wrote:

> Doh!
>
> Reading
> https://clementbera.wordpress.com/2017/03/12/tuning-the-pharo-garbage-collector/
> now.
>

--
Clément Béra
https://clementbera.github.io/
https://clementbera.wordpress.com/


David T Lewis

Re: [OpenSmalltalk/opensmalltalk-vm] Reproduceable Segmentation fault while saving images (#444)

In reply to this post by David T Lewis

Hi Clément,

This isn't a Moose image but the latest Gtoolkit image (there's plenty of room for reducing the size of the image, but that will be after functionality has stabilised).

> The problem is that the GC cannot be tuned by default for strong machines or the VM won't start on small devices such as the Pie Nano, and the VM has to work for everyone.

Understood. :-)

> If you're using a machine with much more RAM than the heap size, you can tune things so that it uses an extra fixed amount of memory (for example 50Mb extra through growth headroom and eden size) but runs faster. You can also tune the full GC ratio but you have to be more careful since the extra memory used is proportional to heap size.

Our development machines all have much more RAM than the heap size, so I'll play with tuning the headroom and eden sizes.

> I'm sure you can set up something so that the base settings in the Feenk image are tuned up. It's not clear you'll gain more than 10-20% though at 220Mb.

The image starts at 220Mb, but while running the virtual memory usage of the process can increase by a Gb or more, so I think the tuning will still be worthwhile.

Thanks!
Alistair

