The way to use the assert failures to track down bugs is to run under the debugger, place a breakpoint in warning (the print function for assert failures), and then run. When you reach an assert fail you can then print the stack trace, etc. So the assert message on its own tells you something is wrong, but one needs to use the debugging facilities to get more information. — |
In reply to this post by David T Lewis
You can run the leak checker. You have to be careful because it scans stack frames, etc, and so needs to be called at a sensible point. Expsriencer shows what these points are, but generally around the invocation of a primitive is a good point, when the VM is polling for events, etc. Btu half way through building a frame is probably not a good idea, etc. In any case here are the functions: And the GC modes are in VMMaker, see GCModeFull, etc. Aeolus.image$ finddefine GCMode ../spur64src/vm/gcc3x-cointerp.c — |
In reply to this post by David T Lewis
To this we should of course add GCModePreFFICall & GCModePostFFICall. I may implement this soon but am concentrating on getting the ARMv8 processor simulator working. — |
In reply to this post by David T Lewis
Hi Eliot, Tracing back from the first time a warning is printed in the assert VM, it looks like what is happening is that a I have a script which walks gdb to the point where the FP corruption occurs, but ignoring the multiple steps it takes to get to the correct iteration: Setting a breakpoint in
Setting a breakpoint at line 46909:
The VM actually crashes some time later, however the call stack shows the an invalid receiver. Shouldn't the FP be updated to point to the moved — |
I tried to investigate the sensitivity of crashes probability on the eden size and there is definitely some. In my testing setup, if the eden size was set around 2 MB, it crashed. For values around 1MB or 10 MB, there were no crashes or were significantly less probable. However, there seems to be no strict limit. -- Pavel st 20. 11. 2019 v 12:59 odesílatel Alistair Grant <[hidden email]> napsal:
|
In reply to this post by David T Lewis
Incidentally, I played around with the VM on a big machine (>1.4TB RAM) and tried to set a large Eden Bytes count (around 100m or more); it crashed early-on. — |
st 20. 11. 2019 v 21:00 odesílatel Tobias Pape <[hidden email]> napsal:
yes, it is possible. I was talking about my testing setup for this crash where I have a small script that causes it very quickly so, to some extent, it generates very similar object memory every time. And in these conditions, there is clear Eden size dependency on the probability of the crash.
|
In reply to this post by David T Lewis
Hi alistair This is great especially because when I discussed with guille about the problem last week, he told me that they were suspected another cause than Freetype and FFI. So it looks that they hypothesis were right. (Guille is coming from fatherhood leave and he restarting to work a couple of days a week). Apparently what Pablo was looking for before leaving on vacation was if forwarder logic or something related with the write barrier and the GC was not nilling one stack element. But I may have misunderstood. — |
In reply to this post by David T Lewis
Hi Stef,
That aligns with one thing I'm seeing, the Frame Pointer not being updated after an object is moved and it's old copy converted to a forwarder. However there are other assertions being printed before the Frame Pointer corruption which I don't understand yet. It would be great if Pablo could pass on what he knows after he gets back from leave. Thanks, — |
In reply to this post by David T Lewis
Just for the record:
The call frame being on a free page doesn't look good. — |
In reply to this post by David T Lewis
Last week I've been trying to have a simpler way to reproduce it (still on the way BTW). 20 timesRepeat: [ClyFullBrowser openOnMethod: SourceSansProRegular class>>#fontContents ].
1 to: 20 do: [ :i | Smalltalk snapshot: true andQuit: false ]. I've been trying to reduce it into something not depending on Calypso nor Pharo's startup. I got the following script that is longer to run, and does not reproduce the problem as often as Pavel's one. 2000 timesRepeat: [
[ 1000 timesRepeat: [
[ (1 to: 100000) collect: [ :e | e ** 17 ] ] forkAt: 38 ] ] fork ].
1 to: 20 do: [ :i | Smalltalk snapshotPrimitive]. The good thing about this last script is that is shows also that the problem has nothing to do with the startup mechanism and whatever is running in it. Enabling the leak checker starts shows output when snapshotting, but once the leak is produced it's too late, the objects leaking are already corrupted. — |
In reply to this post by David T Lewis
Hi Guille, Can you try running this script from a minimal image:
I'm using a minimal image from 18 Nov 2019 and it normally crashes on the 90th iteration. On my machine it takes about 50 seconds to crash. The good bit though is the addresses are the same on each run, so breakpoints can be easily set when working backwards.
— |
In reply to this post by David T Lewis
I could reproduce it with that too, but I would like a case that I can run in a 100kb image in the simulator :) Otherwise this is super hard to debug... — |
In reply to this post by David T Lewis
If you get it reproducible in the simulator please let me know. I'm working in gdb. — |
In reply to this post by David T Lewis
This conversation is already quite long, and I need somewhere to keep more detailed information, so I've created a summary page at: https://akgrant43.github.io/Initial-Post/ and will be adding more details. — |
In reply to this post by David T Lewis
Hi Alistair, I've just investigated the bug tonight and fixed it in VMMaker.oscog-cb.2595. I compiled a new VM from 2595 and I was able to run the 400 iterations of your script without any crashes. Thanks for the easy reproduction! Last year when I used the GC benchmarks provided by Feenk, with ~10Gb workloads, for the DLS paper [1], I initially had an image crashing 9 times out of 10 when going to 10Gb. I fixed a few bugs on the production GC back then (mainly on segment management) which led the benchmarks to run successfully 99% of the times. But it was still crashing on 1%, since I was benchmarking on experimental GCs with various changes I thought the bug did not happen in the production GC, but it turns out I was wrong. And you found a reliable way to reproduce :-). So I could investigate. It's so fun to do lemming debugging in the simulator. The GC bug was basically that when Planning Compactor (Production Full GC compactor) decided to do a multiple pass compaction, if it managed to compact everything in one go then it would get confused and attempt to compact objects upward instead of downward (address wise) on the second attempt, and that's broken and corrupts memory. I started from this script: | aJson anArray | aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl. Array streamContents: [ :aStream | 400 timesRepeat: [ aStream nextPutAll: (STON fromString: aJson contents). Smalltalk saveSession ] ]. It makes me however very sad that you were not able to use the simulator to debug this issue, I used it and that's how I tracked down the bug in only a few hours. Tracking things down in lldb would have taken me weeks, and I would not have been able to do it since I work during the week :-). Therefore I'm going to explain you my process to reproduce the bug in the simulator and to understand where the issue comes from. The mail is quite long, but it would be nice if you could track the bug quickly on your own next time using the simulator. Of course you can skip if you're not interested. @Eliot you may read since I explain how I set-up a Pharo 7 image for simulator debugging, that might come handy for you at some point. 1] The first thing I did was to reproduce your bug, based on the script, both on Cog and and Stack vm compiled from OpenSmalltalk-VM repository. I initially started with Pharo 8, but for some reason that image is quite broken (formatter issue? Integrator gone wild?). So I switched to Pharo 7 stable. It crashes on both VMs, so I knew the bug was unrelated to the JIT. Most bugs on the core VM (besides people mis-using FFI, which is by far the most common VM bug reported) is either JIT or GC. So we're tracking a GC bug. I then built an image which runs your script at start-up (Smalltalk snapshot: true andQuit: true followed by your script, I select all and run do-it). 2] Then I started the image in the simulator. First thing I noticed is that Pharo 7 is using FFI calls in FreeType, from start-up, and even if you're not using text or if you disable FreeType from the setting browser, Pharo performs in the backgrounds FFI calls for freetype. FreeType FFI calls are incorrectly implemented (the C stack references heap object which are not pinned), therefore these calls corrupts the heap. Running a corrupted heap on the VM has undefined behavior, therefore any usage of Pharo 7 right now, wether you actually text or not, wether freetype is enabled or not in the settings, is undefined behavior. I saw in the thread Nicolas/Eliot complaining that this is not a VM bug, indeed, pinning objects is image-side responsibility and it's not a VM bug. In addition, most reported bug comes from people mis-using FFI, so I understand their answer. There was however another bug in the GC, but it's very hard for us to debug it if it's hidden after image corrupting bugs like the FreeType one here. So for that I made that change: FreeTypeSettings>>startUp: resuming "resuming ifTrue:[ self updateFreeType ]" saved, restarted the image, and ensured it was not corrupted (leak checker + swizzling in simulation). 3] Then I started the image in the simulator. Turns out the image start-up raises error if libgit cannot be loaded, and then the start-up script is not executed due to the exception. So I made that change: LibGitLibrary>>startUp: isImageStarting "isImageStarting ifTrue: [ self uniqueInstance initializeLibGit2 ]" 4] Turns out ZnEasy does not work well in the simulator. So I preloaded this line aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl. into a Global variable. The rest of the script remains the same. I can finally run your script in the simulator! Usually we simulate Squeak image and all these preliminary steps are not required. But! It is still easier to reproduce this bug that most bugs I have to deal with for Android at work, at least I don't need to buy an uncommon device from an obscure chinese vendor to reproduce :-). 5] To shortcut simulation time, since the bug happened around the 60th save for me, I build a different script which snapshots the image to different image names. With a crash at snapshot 59 (only change file written to disk), image 57 was the latest non corrupted image. I then started the simulator (The StackSimulator since we are debugging a GC bug, not the Cog simulator, simulation is faster and simpler). I used the standard script available in the workspace of the Cog dev image built from the guidelines. [2] | sis | sis := StackInterpreterSimulator newWithOptions: #(ObjectMemory Spur64BitMemoryManager). "sis desiredNumStackPages: 8." "Speeds up scavenging when simulating. Set to e.g. 64 for something like the real VM." "sis assertValidExecutionPointersAtEachStep: false." "Set this to true to turn on an expensive assert that checks for valid stack, frame pointers etc on each bytecode. Useful when you're adding new bytecodes or exotic execution primitives." sis openOn: 'Save57.image'. sis openAsMorph; run I then let the simulator simulate, went swimming for 1h, and came back 1h30 later (with commute time). The bug happened in the simulator at save 90, I don't know how long it took to reproduce, but < 1h30. Then I had an assertion failure in the compactor: self assert: (self validRelocationPlanInPass: finalPass) = 0. Good! From there I debugged using lemming debugging (technique described in [3], Section 3.2). When the assertion has failed, simulation is the clone. I went up in the debugger to the point where the clone was made, and restarted the same GC approximately 40 times during debugging because once the heap is corrupted you cannot know anymore what the problem is, but you need to trigger the problem to understand. 40 lemmings over that cliff :-) Good lemmings. Then I quickly figured out that the GC was performing two successive compactions, and that the second compaction is broken right at the start (tries to move objects upward). Then I looked at the glue code in-between the 2 compactions, and yeah, in the case where the first compaction has compacted everything, the variables are incorrectly set for the second compaction. I tried fixing the variables but it's not that easy, so instead I just aborted compaction in that case (See VMMaker.oscog-cb.2595). 6] I then compiled a VM from the sources to check Slang translator would not complain, it did not. I then built a stack VM (Cog VM seems to be broken on tip of tree due on-going work for ARMv8 support) and run your script again. I was able to run the 400 iterations without crash. Bug seems to be fixed! @Eliot now needs to fix tip of tree, generate the code and produce new VMs. ARMv8 support is quite exciting though, giving that MacBooks do not support 32 bits any more and that the next Macbooks are rumoured to be on ARMv8. One wouldn't want to run the VM in a virtual box intel image :-). Alistair, let me know if you have questions. I hope you can work with the simulator as efficiently as we can. If you've not seen it, there's this screencast where I showed how I used the simulator to debug JIT bugs [4]. Audio is not very good because my spoken English sucks, but it shows the main ideas. On Wed, Nov 27, 2019 at 12:24 PM Alistair Grant <[hidden email]> wrote:
|
In reply to this post by David T Lewis
Fixed in VMMaker.oscog.2595. I could run all the iterations of the script with the fix. Eliot now needs to fix tip of tree (Cog VM, stack works) and regenerate the source code here. I don't mark it as fixed since I'd like the reporter to confirm he cannot reproduce with a fixed VM once available. — |
Thanks a lot, Clement! Doru > On Nov 28, 2019, at 10:40 PM, Clement Bera <[hidden email]> wrote: > > Fixed in VMMaker.oscog.2595. I could run all the iterations of the script with the fix. > > Eliot now needs to fix tip of tree (Cog VM, stack works) and regenerate the source code here. > > I don't mark it as fixed since I'd like the reporter to confirm he cannot reproduce with a fixed VM once available. > > — > You are receiving this because you commented. > Reply to this email directly, view it on GitHub, or unsubscribe. > -- feenk.com "Innovation comes in the least expected form. That is, if it is expected, it already happened." |
In reply to this post by David T Lewis
Hi Clement, On Thu, 28 Nov 2019 at 22:40, Clement Bera <[hidden email]> wrote: > > Fixed in VMMaker.oscog.2595. I could run all the iterations of the script with the fix. Thanks! I can confirm that the script completes all 400 iterations without crashing, in fact without any assertion warnings at all. We'll do some additional testing in the coming days, but I think it's safe to assume that the root cause has been solved. > Eliot now needs to fix tip of tree (Cog VM, stack works) and regenerate the source code here. > > I don't mark it as fixed since I'd like the reporter to confirm he cannot reproduce with a fixed VM once available. Thanks also for your detailed write-up (sent separately). I'll go through it step by step and respond tomorrow (a bit late now :-)). Thanks again, Alistair |
In reply to this post by Clément Béra
Hi Clément, Hi All, On Thu, Nov 28, 2019 at 1:36 PM Clément Béra <[hidden email]> wrote:
Yeah :-) For those unaware of what this is, it is simply the ability to repeat GC crashes in the simulator. What we do, bu default, is every time we run a GC in the simulator we first create a complete deep copy of the VM interpreter, memory manager and heap, and then run the GC in the deep copy. If that crashes because of a bug in GC we can loop from then on, taking fresh copies of the VM state and running the GC until we understand and then fix the bug. The problem with normal GHC debugging is that the final state of the heap is so corrupted one cannot make sense of it and cannot figure out where the bug is. Here we alway preserve the initial, valid, state of the heap.
Fantastic, Clément. Thank you.
This is a critical bug to fix. It also blind sided me form m y own bug in the compactor. Apologies to all for my stubbornness in pointing the finger at FreeType. But the bug here is still serious. Further, it is at least conceptually possible to simulate FFI calls, or rather simulate the marshaling of FFI calls so that FFI calls can be made within simulation. We could therefore check in the simulator for unpinned objects, etc. but we would still be vulnerable to usafe foreign code corrupting the system. Any FFI call is potentially unsafe, because it is infeasible (performance wise) to e.g. page-protect the Smalltalk heap and VM state during an FFI call, and completely impossible to do this for a threaded (non-blocking) FFI call. .
:-)
Right. I'm committing improvements to the ImageLeakChecker first so that it detects out-of-bounds oops as created by the compactor bug. I have this working. I am just committing/generating first. Then I'll fix source generation, which is temporarily damaged by the ARMv8 work. It should be easy to fix. Then I can generate fresh VMs.
_,,,^..^,,,_ best, Eliot |
Free forum by Nabble | Edit this page |