Executing the following script produces a segmentation fault:

    | aJson anArray |
    aJson := ZnEasy get: 'https://data.nasa.gov/resource/y77d-th95.json' asZnUrl.
    Array streamContents: [ :aStream |
        400 timesRepeat: [
            aStream nextPutAll: (STON fromString: aJson contents).
            Smalltalk saveSession ] ].

Reproduced on Pharo 8 Mac OS:
Also reproduced on Pharo 8 Mac OS:
Also reproduced on Pharo 8 Linux Ubuntu 18.04:
— |
The issue comment #391 (comment) reports a similar reproducible script that produces a different stack dump on the segmentation fault.
In reply to this post by David T Lewis
Also crashes with the latest Pharo Stable VM on Mac OS Mojave
— |
In reply to this post by David T Lewis
This corroborates what Pablo told me and Pavel found.
In reply to this post by David T Lewis
If one starts up the latest Pharo 8 64-bit image using an assert VM with leak checking turned on, one finds that the image is already corrupted. Neither the script nor the VM is at fault; the initial image itself is already corrupt. Here's what I get when I launch the latest 64-bit 8.0 image using an assert VM with leak checking turned on (commentary after the run):
My script pharo64cavm runs an assert VM, specifically /Users/eliot/oscogvm/build.macos64x64/pharo.cog.spur/PharoAssert.app/Contents/MacOS/Pharo. Supplying the --lldb argument has the script launch lldb (a low-level debugger for native executables) on the VM. This command places a breakpoint in the error/warning output routine called when the VM wants to report that asserts have failed, leaks in the heap have been found, etc.
This command then launches the VM under the control of the debugger.
The arguments to leak check are a combination of the following flags:
If GCModeFull is set then the VM performs a leak check on loading the initial image. From the backtrace you can see that the VM has not yet started running, loadInitialContext being the routine that sets up the VM to run from the context that performed the snapshot:
The following leak report shows that there are many leaks in this image:
Let's take a look at some of these objects. In lldb we can call the VM's debug printing routines, just as we can in the simulator:
So the first suspect (to me) looks like external C memory management in FreeType font management.

Let me suggest you add a step in the release process which involves checking the validity of images before they're released. Let me also suggest that you appoint a team to look at FreeType font management using the leak checker, et al, to find and fix these issues, which I think have been around for quite a while.
— |
An interesting question to ask here is whether you can tag the image memory as read-only during an FFI call-out for debugging purposes. If writes to image memory are required, can they be sandboxed? If writes to a display area are required, can that be protected by no-read/write pages before and after the screen buffer to trap overwrites or reads? ....
In reply to this post by David T Lewis
John, one could easily add that facility, but I believe that the problem is more likely to do with dangling pointers than with FreeType writing into the heap. I suspect that what happens is that on a previous save or restart, pointers to C memory that was allocated in the run before the current one are not invalidated and are still used. I believe the problem is that the FFI is not being used properly, not that the FFI itself is at fault. Stale pointers are being followed and memory corruption is occurring. As I said above, what is necessary is checking that a valid image is created and that stale pointers are invalidated. This is an age-old problem with Smalltalk programs that use external memory, external handles, descriptors, etc. There is a style which deals well with this and it should be followed.
Using this style we do not have to close and reopen around a snapshot, but we do have to perform the invalidation early enough that there is no chance of accessing anything external before all invalidations are complete. Further, using a registry of objects is much better than using, for example, allInstances. Typically there are few objects that reference external resources (tens, hundreds at most, not thousands), and they may be of various classes, so the registry can reach them in more or less linear time in the size of the registry, independent of image size, whereas allInstances accesses objects in time proportional to the product of the number of classes and the image size. Clearly the latter does not scale as the system gets more complex and the image size grows. Startup time is very important. I led the VisualWorks team through this exercise and we were able to reduce startup times from hundreds of milliseconds to forty milliseconds (IIRC) in the VW 3.0 timeframe.
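A minimal sketch of the registry approach described above, assuming a hypothetical class ExternalResource (the class, its package, and the selectors invalidate and acquireHandle are invented for illustration and are not taken from Pharo's FreeType code). Methods are shown in the usual Class >> selector notation. It uses Pharo's SessionManager to receive startUp: on image start, and a WeakSet so that the registry does not keep otherwise unreferenced objects alive.

```smalltalk
"Hypothetical sketch: ExternalResource and its selectors are illustrative only."
Object subclass: #ExternalResource
	instanceVariableNames: 'handle'
	classVariableNames: 'Registry'
	package: 'ExternalResourceSketch'

ExternalResource class >> initialize
	"Evaluate once (on class load, or by hand) before creating instances:
	 create the registry and ask to be notified at image startup."
	Registry := WeakSet new.
	SessionManager default registerUserClassNamed: self name

ExternalResource class >> new
	"Every instance enters the registry at creation, so startup never
	 needs to scan the heap with allInstances."
	^ Registry add: super new

ExternalResource class >> startUp: resuming
	"resuming is true when the image is starting from a saved snapshot.
	 Any handle stored in the snapshot belongs to a previous process and
	 is stale, so forget all of them before anything can follow one."
	resuming ifTrue: [ Registry do: [ :each | each invalidate ] ]

ExternalResource >> invalidate
	"Drop the stale reference; do not free it, the memory is already gone."
	handle := nil

ExternalResource >> handle
	"Lazily re-acquire the external resource in the current session."
	^ handle ifNil: [ handle := self acquireHandle ]

ExternalResource >> acquireHandle
	"Placeholder for the real FFI allocation; subclasses would supply it."
	^ self subclassResponsibility
```

The cost of the startup pass is proportional to the number of registered objects, not to the size of the image. Registering as a user class is an assumption made to keep the sketch short; in a real system the invalidation would have to be ordered so that it runs before any other startup code can touch an external resource.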
In reply to this post by David T Lewis
This issue also occurs in Pharo 7:
— |
In reply to this post by David T Lewis
This also happens in Pharo 6:
— |
In reply to this post by David T Lewis
Just to confirm that it's probably not a garbageCollect problem: I could not reproduce it in the latest Squeak trunk. I did not use Zinc because it is too difficult to install in Squeak, and replaced it with WebClient. STON is available (installed through the Squit/Squot git support):
The resulting image file is 540 MB.
In reply to this post by David T Lewis
Hi Eliot and Nicolas,

Nicolas, thanks for checking on Squeak, that is useful to know.

Juraj and I have both built assert VMs and have been trying to reproduce Eliot's findings. If I add a couple of print statements to the code and run with a normal VM, I get a number of:
messages. If I run gccrash.st with the assert VM I get:
before the process seg faults. Do these provide any additional information to help track down the issue? (I'll include more complete information below.)

I tried running a headless VM and printing instance counts of FreeType external objects in a clean image:
Once the image has been started normally, the pointer in FT2Library becomes non-zero, which suggests to me that, rather than the image being delivered in a corrupt state, the corruption happens early in session resumption.
In reply to this post by David T Lewis
More complete crash dump with normal VM:
— |
In reply to this post by David T Lewis
Running with the assert VM in lldb:
Thanks, Eliot!
In reply to this post by David T Lewis
Hi Alistair,
lldb> b Pharo`warning
In reply to this post by David T Lewis
Hi All,

to simplify checking there is now a generated image checker. This is a cut-down VM that only loads an image and runs the leak checker, answering 0 (unix's OK exit code) if the image is free of leaks, and non-zero if it is leaky. The program takes a -verbose/--verbose argument that will cause it to list the leaks or write a reassuring message if there are none.

This can be built for mac in build.macos64x64/squeak.stack.spur & build.macos32x86/squeak.stack.spur by saying make production (image leak checker) and it produces a program called validImage in squeak.stack.spur/build/vm. I saw that it took 2 seconds to load and check a 1Gb image so it should be fast enough to be used in a CI context.

See f83bde2

HTH
— |
Bravo, this is a really good idea :-)

Dave

On Sat, Nov 16, 2019 at 04:47:01PM -0800, Eliot Miranda wrote:
>
> Hi All, to simplify checking there is now a generated image checker. This is a cut-down VM that only loads an image and runs the leak checker, answering 0 (unix's OK exit code) if the image is free of leaks, and non-zero if it is leaky. The program takes a -verbose/--verbose argument that will cause it to list the leaks or write a reassuring message if there are none.
>
> This can be built for mac in build.macos64x64/squeak.stack.spur & build.macos32x86/squeak.stack.spur by saying make production (image leak checker) and it produces a program called validImage in squeak.stack.spur/build/vm. I saw that it took 2 seconds to load and check a 1Gb image so it should be fast enough to be used in a CI context.
>
> See https://github.com/OpenSmalltalk/opensmalltalk-vm/commit/f83bde2bf5c325ce26f3368bc221578a752a9631
>
> HTH
>
> --
> You are receiving this because you commented.
> Reply to this email directly or view it on GitHub:
> https://github.com/OpenSmalltalk/opensmalltalk-vm/issues/444#issuecomment-554689372
In reply to this post by David T Lewis
Hi Eliot,

I've spent some more time trying to track this down...

I've been working with a minimal Pharo image, which can be downloaded from: http://files.pharo.org/image/80/latest-minimal-64.zip

The minimal image doesn't yet have FreeType loaded, so we can rule out FreeType as the cause of this particular issue (not to say that it doesn't have problems).

Running the following script with the minimal image and a debug VM:
Shows two things:
So there appears to be no memory corruption up to this stage. Modifying the script once more to save the image instead of just garbage collecting:
Results in the segmentation fault. In this case it was while saving the 90th image:
I'll try attaching a file containing the terminal output (as much as was buffered).

Please let me know if you disagree with any of my reasoning. The only difference between the two scripts is that the second one writes the image to disk, which seems to suggest that it's the image saving that could be the cause of the issue. What do you think?

Thanks!
In reply to this post by David T Lewis
To paraphrase GitHub's error message: A short extract:
— |
In reply to this post by David T Lewis
Just a bit more:
— |
In reply to this post by David T Lewis
P.P.S. It would be great to be able to — |