Hi all, Thanks for all the hints, I finally got that thing flying! Building with MacPorts clang-6.0 and editing flags finally worked when I found all the places (OFLAGS and LDFLAGS took a while to figure out). Building with Apple clang gave me "leak detection not supported", so that one is probably too old for what I want to do. Now I still have to trigger the leak I am after. I will provide a bug report in the leaking component and I will report the diff on flags I had to make, they might be useful for others. It was pretty dirty, e.g. I had to symlink some objects in the MacPorts AddressSanitizer installation as some binaries were linked with wrong paths, ew. Again, thanks for encouraging me by telling me that I should be possible, I was already close to giving up. This Xcode env is a beast I try to avoid, too opaque to me. Below is an example report, with the symbolizer it even tells you the source location, pretty handy. I guess this is not really a leak, but expected behavior for FFI. Cheers, Manuel ================================================================= ==12156==ERROR: LeakSanitizer: detected memory leaks Direct leak of 9080 byte(s) in 15 object(s) allocated from: #0 0x10a2cb2cc in wrap_malloc (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x592cc) #1 0x10a091410 in primitiveFFIAllocate (Pharo:x86_64+0x100315410) #2 0x1137369b7 (<unknown module>) #3 0x109d7e56c in interpret gcc3x-cointerp.c:2754 #4 0x109f7dc29 in -[sqSqueakMainApplication runSqueak] sqSqueakMainApplication.m:201 #5 0x7fffc93786fc in __NSFirePerformWithOrder (Foundation:x86_64+0xd76fc) #6 0x7fffc78cfc56 in __CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__ (CoreFoundation:x86_64h+0xa6c56) #7 0x7fffc78cfbc6 in __CFRunLoopDoObservers (CoreFoundation:x86_64h+0xa6bc6) #8 0x7fffc78b05f8 in __CFRunLoopRun (CoreFoundation:x86_64h+0x875f8) #9 0x7fffc78b0033 in CFRunLoopRunSpecific (CoreFoundation:x86_64h+0x87033) #10 0x7fffc6e10ebb in RunCurrentEventLoopInMode (HIToolbox:x86_64+0x30ebb) #11 0x7fffc6e10bf8 in ReceiveNextEventCommon (HIToolbox:x86_64+0x30bf8) #12 0x7fffc6e10b25 in _BlockUntilNextEventMatchingListInModeWithFilter (HIToolbox:x86_64+0x30b25) #13 0x7fffc53a5a53 in _DPSNextEvent (AppKit:x86_64+0x46a53) #14 0x7fffc5b217ed in -[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] (AppKit:x86_64+0x7c27ed) #15 0x7fffc539a3da in -[NSApplication run] (AppKit:x86_64+0x3b3da) #16 0x7fffc5364e0d in NSApplicationMain (AppKit:x86_64+0x5e0d) #17 0x109f6e553 in main main.m:52 #18 0x7fffdd0a2234 in start (libdyld.dylib:x86_64+0x5234) ...
|
Hm, something is still not working. I compiled pharo.cog.spur and libMoz2D with clang-7.0 and '-fsanitize=address -fsanitize-recover=address -mllvm -asan-stack=0 -mllvm -asan-globals=0 -fno-omit-frame-pointer -g3' or '-fsanitize=leak -fno-omit-frame-pointer -g3', and then run it with 'ASAN_OPTIONS=detect_leaks=1 PharoDebug.app/Contents/MacOS/Pharo Pharo.image'. This gives me the leak reports on shutdown, but I cannot view any Bloc element, I always get a primitive failed error in BlExternalForm >> primCreateManualSurfaceWidth: width height: height rowPitch: rowPitch depth: depth isMSB: isMSB <primitive: 'primitiveCreateManualSurface' module: 'SqueakFFIPrims'> self primitiveFailed How can I debug this? I do not get stderr that would give me any hints what failed exactly. So, did the primitive fail during execution or was it not found at all? I looked into the TravisCI build and compiled with 'ARCH="macos64x64" FLAVOR="pharo.cog.spur" ./mvm -d', but that did not change anything. I noticed that the SqueakFFIPrims plugin defines primitiveCreateManualSurface on all architectures, but how do I ensure and check that the built VM actually includes this primitive? Am I missing another flag or env variable? Cheers, Manuel
|
Hi Manuel, On Mon, 28 Jan 2019 at 11:51, Manuel Leuenberger <[hidden email]> wrote: > > BlExternalForm >> primCreateManualSurfaceWidth: width height: height rowPitch: rowPitch depth: depth isMSB: isMSB > <primitive: 'primitiveCreateManualSurface' module: 'SqueakFFIPrims'> > self primitiveFailed > > How can I debug this? My first suggestion would be to see if the primitive is returning a descriptive error. The method would become (untested): BlExternalForm >> primCreateManualSurfaceWidth: width height: height rowPitch: rowPitch depth: depth isMSB: isMSB <primitive: 'primitiveCreateManualSurface' module: 'SqueakFFIPrims' error: error> self primitiveFailed When the primitive fails, open the debugger, inspect this method in the call stack and check the value of error. If the primitive hasn't been found, error will be something like #'not found' (from memory). If it's something else other than nil, report back with the value. If error is nil it will be a matter of diving in to the Slang / C code and figuring it out :-). If you've successfully built a debug version of the VM, maybe the easiest next step is to set a breakpoint in primitiveCreateManualSurface and step through? Cheers, Alistair |
Hi Alistair, Thanks for the hint, the primitive error is indeed nil. And it seems to my sanitizer flags that cause the trouble, because I cannot reproduce the error when I build the VM without them. I tried different version of clang (Apple clang 9, MacPorts clang 6, MacPorts clang 7, Chromium clang 9) with different sanitizer flags, but could either not even start up the VM properly due to a reported stack-buffer-underflow, or the primitive failed. Debugging revealed that createManualSurface returns -1 because registerSurface is not initialized. Halting in initSurfacePluginFunctionPointers then shows that findOrLoadModule("SurfacePlugin", 0) is 0, so there seems no plugin to be found. I need to continue digging. Cheers, Manuel > On 28 Jan 2019, at 19:29, Alistair Grant <[hidden email]> wrote: > > > Hi Manuel, > > On Mon, 28 Jan 2019 at 11:51, Manuel Leuenberger > <[hidden email]> wrote: >> >> BlExternalForm >> primCreateManualSurfaceWidth: width height: height rowPitch: rowPitch depth: depth isMSB: isMSB >> <primitive: 'primitiveCreateManualSurface' module: 'SqueakFFIPrims'> >> self primitiveFailed >> >> How can I debug this? > > My first suggestion would be to see if the primitive is returning a > descriptive error. The method would become (untested): > > BlExternalForm >> primCreateManualSurfaceWidth: width height: height > rowPitch: rowPitch depth: depth isMSB: isMSB > <primitive: 'primitiveCreateManualSurface' module: > 'SqueakFFIPrims' error: error> > self primitiveFailed > > When the primitive fails, open the debugger, inspect this method in > the call stack and check the value of error. If the primitive hasn't > been found, error will be something like #'not found' (from memory). > If it's something else other than nil, report back with the value. > > If error is nil it will be a matter of diving in to the Slang / C code > and figuring it out :-). If you've successfully built a debug version > of the VM, maybe the easiest next step is to set a breakpoint in > primitiveCreateManualSurface and step through? > > Cheers, > Alistair |
As it turns out, I am just a special kind of an idiot. After Alistair's hint at just stepping through the VM, I threw all my fears of gdb etc. over board an stepped around in lldb. The failed primitive was indeed due to the missing SurfacePlugin, which I had in the standard build, but not in my leak build. So I looked at my git diff again and found that I disabled the CroquetPlugin, because it caused some trouble a while ago. It did not seem too harmful, as most stuff worked, and I thought I do not need this as it seemed related to the Croquet project for multi-user applications. What could possibly go wrong? Well, apparently I couldn't really draw without it. It's funny how initially seemingly unrelated decisions lead to side-effects that seem unexplainable because I forget that I made this decision. Anyway, now I have those finger-licking good leak reports from ASan for GToolkit builds including resolved symbols and source line numbers for both the VM and Moz2D. Time for some leak hunting now, I'll keep you posted if I find something in the VM that needs attention. Sorry for the spam and that you had to observe an idiot trying to hammer the ball into the square hole. Maybe it was at least entertaining for you. Thanks for all the help, you managed to take away my fear of the VM, it no longer intimidates me. Cheers, Manuel Direct leak of 184 byte(s) in 1 object(s) allocated from: #0 0x10c58748c in wrap_malloc (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x5d48c) #1 0x12599889d in moz_xmalloc mozalloc.cpp:83 #2 0x123ce2895 in mozilla::gfx::Factory::CreateDrawTargetForData(mozilla::gfx::BackendType, unsigned char*, mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> const&, int, mozilla::gfx::SurfaceFormat, bool) mozalloc.h:194 #3 0x126282c4b in moz2d_draw_target_create_for_data_type draw_target.cpp:35 #4 0x10b76eca6 in primitiveCalloutWithArgs (Pharo:x86_64+0x10035cca6) #5 0x10b47f472 in primitiveExternalCall gcc3x-cointerp.c:76887 #6 0x10b4765ed in executeNewMethod gcc3x-cointerp.c:22341 #7 0x10b47b7fd in ceSendsupertonumArgs gcc3x-cointerp.c:16540 #8 0x1151a8134 (<unknown module>) #9 0x10b414ac9 in interpret gcc3x-cointerp.c:2754 #10 0x10b672afa in -[sqSqueakMainApplication runSqueak] sqSqueakMainApplication.m:201 #11 0x7fffc93786fc in __NSFirePerformWithOrder (Foundation:x86_64+0xd76fc) #12 0x7fffc78cfc56 in __CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__ (CoreFoundation:x86_64h+0xa6c56) #13 0x7fffc78cfbc6 in __CFRunLoopDoObservers (CoreFoundation:x86_64h+0xa6bc6) #14 0x7fffc78b05f8 in __CFRunLoopRun (CoreFoundation:x86_64h+0x875f8) #15 0x7fffc78b0033 in CFRunLoopRunSpecific (CoreFoundation:x86_64h+0x87033) #16 0x7fffc6e10ebb in RunCurrentEventLoopInMode (HIToolbox:x86_64+0x30ebb) #17 0x7fffc6e10bf8 in ReceiveNextEventCommon (HIToolbox:x86_64+0x30bf8) #18 0x7fffc6e10b25 in _BlockUntilNextEventMatchingListInModeWithFilter (HIToolbox:x86_64+0x30b25) #19 0x7fffc53a5a53 in _DPSNextEvent (AppKit:x86_64+0x46a53) #20 0x7fffc5b217ed in -[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] (AppKit:x86_64+0x7c27ed) #21 0x7fffc539a3da in -[NSApplication run] (AppKit:x86_64+0x3b3da) #22 0x7fffc5364e0d in NSApplicationMain (AppKit:x86_64+0x5e0d) #23 0x7fffdd0a2234 in start (libdyld.dylib:x86_64+0x5234)
|
Hi Manuel,
The advantage of the mailing list is that it won’t go away any time soon. So how about writing a new message in this thread which summarizes? It could be written as more of a guide “Storage leaks come in more than one variety... To diagnose a leak in the C heap... To link against a C leak checker...” and include links to the other messages in this thread on the mail list serve for details “Specifics on how to modify the Mac build to link in the leak checker are ...”. What do you think? What to others think? How do we best capture this for posterity?
_,,,^..^,,,_ (phone) |
On Tue, 29 Jan 2019 at 22:32, Eliot Miranda <[hidden email]> wrote:
Just brainstorming, a markdown page here... cheers -ben |
In reply to this post by Manuel Leuenberger
> On 2019-01-29, at 3:16 AM, Manuel Leuenberger <[hidden email]> wrote: > > > Thanks for all the help, you managed to take away my fear of the VM, it no longer intimidates me. It's just luring you into a very false sense of security. You remember 'Alien'? That was based on the life of a VM developer.. tim -- tim Rowledge; [hidden email]; http://www.rowledge.org/tim Oxymorons: Political science |
In reply to this post by Eliot Miranda-2
> On 2019-01-29, at 6:32 AM, Eliot Miranda <[hidden email]> wrote: > > > What do you think? What to others think? How do we best capture this for posterity? Mail list posts are one form of immortality but can easily get lost in the growing cruft as time passes. It's also difficult to establish the currency of the info a year (or ten!) later. Blog posts and forum posts have the same problems. Pages/books on github or similar have the virtue of being hosted in a reasonably secure manner and carry date /version metadata with them, so that could be helpful. Linking to the doc in the list, in blogs or tweets spreads the signal. Linking to the (hopefully) large and growing tree of well written and well cared for doc from the sources would be very useful. More important than any of the details is simply getting the doc produced and reviewed. Collectively we (software dweebs) are *useless* at doc, even though it constantly bites us. tim -- tim Rowledge; [hidden email]; http://www.rowledge.org/tim Strange OpCodes: PFM: Pray For Miracle |
In reply to this post by Eliot Miranda-2
On Tue, 29 Jan 2019 at 15:32, Eliot Miranda <[hidden email]> wrote: > > > ... > > So this is great. The final step is seeing which FFI method invokes moz2d_draw_target_create_for_data_type and then the storage leak can be fixed. > What would be wonderful is to capture your experience in diagnosing this storage leak as a guide to others. But in what form? One would be a blog post. I’m nervous though because blogs sometimes disappear. One would be some section in one of the READMEs in the source tree. One would be a carefully written email to this list recapitulating or summarizing the experience. > > The advantage of the mailing list is that it won’t go away any time soon. So how about writing a new message in this thread which summarizes? It could be written as more of a guide “Storage leaks come in more than one variety... To diagnose a leak in the C heap... To link against a C leak checker...” and include links to the other messages in this thread on the mail list serve for details “Specifics on how to modify the Mac build to link in the leak checker are ...”. > > What do you think? What to others think? How do we best capture this for posterity? I would split this in two: - The basic instructions on compiling with the leak checker should go in the HowToBuild. - The description you're suggesting above should go in http://opensmalltalk.org/ as Ben suggested, or a wiki page in the OpenSmalltalk-VM repository. Each should provide a link to the other. I'm not a big fan of github wiki pages, but they have two advantages: - They keep the content with the code. - They're relatively easy to download and back-up offline (in the event that we decide to move the code elsewhere). My preference would be http://opensmalltalk.org/, but I'm not sure what the access rights on it are. Actually, I didn't even know it existed, thanks Ben! As Tim said, email lists are hard to search, and can't be easily updated. Blog posts don't really lend themselves to being updated either. My 2c... Cheers, Alistair |
I thought about adding a flag for all ASan magic in the build scripts, then document the procedure in the readme, maybe also more extensive somewhere else. You probably have a better idea where a good place would be than I do. I dug into this today again and wrote a parser for the reports (https://github.com/maenu/leak-reporter). It has a not-so-good visualization of merged stacks showing heavy frames as big and solid, it also includes example reports (https://github.com/maenu/leak-reporter/tree/master/res). The reported leaks only amount for about 1% of leaked memory, not too many objects leaking, but I am not sure if ASan just counts object layout size or objects graphs (probably not). What I notice is that OS events are leaking, as they are copied. Might be an issue. But I am not entirely sure if this explains everything. Maybe the VM just allocates too many objects of some type, but frees them on shutdown, so they might not even appear in the reports. Can a C heap analyzer give me stats about struct instances per struct type? Might be useful to analyze core dumps then. Cheers, Manuel Direct leak of 219512 byte(s) in 1193 object(s) allocated from: #0 0x10bb1fffc in __sanitizer_mz_malloc (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x5dffc) #1 0x7fffdd224281 in malloc_zone_malloc (libsystem_malloc.dylib:x86_64+0x2281) #2 0x7fffc6de33d5 in AllocateAndInitializeEvent(__CFAllocator const*) (HIToolbox:x86_64+0x33d5) #3 0x7fffc6e3fc4d in CopyEventInternal(__CFAllocator const*, OpaqueEventRef*) (HIToolbox:x86_64+0x5fc4d) #4 0x7fffc6e49228 in CopyEventAs (HIToolbox:x86_64+0x69228) #5 0x7fffc6e1d349 in CreateEventWithCGEvent (HIToolbox:x86_64+0x3d349) #6 0x7fffc6e1b0e2 in CreateAndPostEventWithCGEvent(__CGEvent*, unsigned int, unsigned char, __CFMachPortBoost*) (HIToolbox:x86_64+0x3b0e2) #7 0x7fffc6e275cd in Convert1CGEvent(unsigned char) (HIToolbox:x86_64+0x475cd) #8 0x7fffc6e2745e in MainLoopObserver(unsigned int, OpaqueEventRef*, void*) (HIToolbox:x86_64+0x4745e) #9 0x7fffc6de8368 in _NotifyEventLoopObservers (HIToolbox:x86_64+0x8368) #10 0x7fffc6e10ea5 in RunCurrentEventLoopInMode (HIToolbox:x86_64+0x30ea5) #11 0x7fffc6e10bf8 in ReceiveNextEventCommon (HIToolbox:x86_64+0x30bf8) #12 0x7fffc6e10b25 in _BlockUntilNextEventMatchingListInModeWithFilter (HIToolbox:x86_64+0x30b25) #13 0x7fffc53a5a53 in _DPSNextEvent (AppKit:x86_64+0x46a53) #14 0x7fffc5b217ed in -[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] (AppKit:x86_64+0x7c27ed) #15 0x10b7ce07c in -[sqSqueakOSXApplication(events) pumpRunLoopEventSendAndSignal:] sqSqueakOSXApplication+events.m:78 #16 0x10b7ce7d9 in -[sqSqueakOSXApplication(events) pumpRunLoop] sqSqueakOSXApplication+events.m:105 #17 0x10b7e4898 in vmIOProcessEvents sqSqueakEventsAPI.m:80 #18 0x10b7e49c7 in ioProcessEvents sqSqueakEventsAPI.m:103 #19 0x10b5b5653 in checkForEventsMayContextSwitch gcc3x-cointerp.c:62680 #20 0x10b5c17ed in ceCheckForInterrupts gcc3x-cointerp.c:15188 #21 0x119a75794 (<unknown module>) #22 0x10b566712 in interpret gcc3x-cointerp.c:2754 #23 0x10b7ebada in -[sqSqueakMainApplication runSqueak] sqSqueakMainApplication.m:201 #24 0x7fffc93786fc in __NSFirePerformWithOrder (Foundation:x86_64+0xd76fc) #25 0x7fffc78cfc56 in __CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__ (CoreFoundation:x86_64h+0xa6c56) #26 0x7fffc78cfbc6 in __CFRunLoopDoObservers (CoreFoundation:x86_64h+0xa6bc6) #27 0x7fffc78b05f8 in __CFRunLoopRun (CoreFoundation:x86_64h+0x875f8) #28 0x7fffc78b0033 in CFRunLoopRunSpecific (CoreFoundation:x86_64h+0x87033) #29 0x7fffc6e10ebb in RunCurrentEventLoopInMode (HIToolbox:x86_64+0x30ebb)
|
Alright, back in the game, this time using gperftools. I finally managed to reproduce a leaking run, but it took me hours to get there, randomly rendering GtCoders and messing around. gperftools instruments malloc and friends, capturing heap profiles every 30 seconds. Now I have profiles from a freshly started, small image, before it went berserk on memory hogging, and when it allocates over 3GB. Now, I have visualizations of allocation sites at these instants, and diffs between them, as well as some thread dumps on the berserk state. I managed to rule out the in-image objects to contribute to memory by cleaning images with garbage collect and cache resets. Some handy resets are these: NECController reset. FreeTypeCache current removeAll. ASTCache reset. Smalltalk garbageCollect. These get rid of a leak in NECController, resets the font cache (creates a lot of bitmaps), and ASTs, created by the coders. With SpaceTally printSpaceAnalysis. I get objects stats from within the image, and I managed to keep them in the same ballpark for all instants I took profiles of, by inspecting over-populated instances, like this instances := WeakOrderedCollection withAll: RBMessageNode allInstances. ReferenceFinder findPathTo: instances atRandom. With FFIExternalResourceManager uniqueInstance instVarNamed: #registry I also made sure that external objects are clean. Now, I am pretty confident that I can clean images, so the memory-overload should be sourced in the VM. Still, I cannot reliably reproduce the leak. It just happens after running an image for hours, doing stuff, idling, going to sleep, wake up. Unfortunately, I did not run it on a debug VM, so nearly no symbols, but I prepared a summary of the analysis anyway, as it already gives some hints that some of you can maybe interpret better than me. Have look at the PDFs in particular, they show the allocation sites at the three interesting instants, (clean, before berserk, after berserk), as well as the diffs in between. I find that berserk mode is caused by a huge pile of Objective-C objects. The thread samples show that while idling, it seems to be busy processing OS events. But I do not know what kind of objects they actually are. My current hypothesis is: The leak is caused by processing a huge pile of OS events being processed (reprocessed?). For some reason, this only happens after a long run time. Now, how can I accept/reject this hypothesis? Is there a way to log events received by the VM? I also have a core dump of the 3GB process, but I have no idea how I could check it for Objective-C objects grouped by type. If anybody knows how to do that, I might be able to come up with a comprehensive list of objects that lead to the berserk mode. Maybe it is not even are real leak, but just accumulating objects that should be thrown away at some point. Also, I can still not fully reject that the image is involved, maybe it forgets external resources it should not forget at some point. As I now know again that I can trigger berserk mode, I will switch to the debug VM/Moz2D again with ASan. More symbols, leak reports, might reveal better information. Anyway, if anybody made it through the whole email and maybe even looked at the ZIP, I would appreciate ideas on how to proceed, what to test, what to inspect. I recently had a few incidents where I realized that one of my implicit assumptions/axioms does not hold, and I kept digging in the wrong direction for too long, so any feedback on this is welcome. Cheers, Manuel
|
Not sure if I found something, but looking at sqSqueakOSXApplication+events.m I find - (void) pumpRunLoopEventSendAndSignal:(BOOL)signal { NSEvent *event; NSMutableArray *alienEventQueue = [[NSMutableArray alloc] init]; while ((event = [NSApp nextEventMatchingMask:NSEventMaskAny untilDate:nil inMode:NSEventTrackingRunLoopMode dequeue:YES])) { // If the event is not a system event or an event of *this* window, queue the event // Otherwise treat the event normally and send it to the app if (event.window && event.window != gDelegateApp.window){ [alienEventQueue addObject: event]; }else{ [NSApp sendEvent: event]; if (signal) { interpreterProxy->signalSemaphoreWithIndex(gDelegateApp.squeakApplication.inputSemaphoreIndex); } } } // Put back in the queue all events that did not belong to this window // They will be managed by other windowing systems // (or by a subsequent invocation to this function) while ((event = [alienEventQueue firstObject])) { [NSApp postEvent: event atStart: NO]; [alienEventQueue removeObject: event]; } } I have no clue about Objective-C, but I think NSMutableArray *alienEventQueue = [[NSMutableArray alloc] init]; makes me own the queue on the heap, isn't there a [alienEventQueue release]; missing at the end? Cheers, Manuel
|
Hi Manuel, Disclaimer: I'm not an objectiveC expert either :) When I did that (and compiled, and tested), I remember having seen that it was using the automatic reference counting from objectiveC. Actually, putting the [release] was producing for me a segmentation fault. Did you try it yourself and did it fix the problem for you? On Wed, Feb 27, 2019 at 10:52 AM Manuel Leuenberger <[hidden email]> wrote:
|
Hi, so if you point your git tool to the squeak vm tree at commit fae3d7eb403158f177c306f668cb461c35c99419 The you can download the JMM/IOS64 branch which should let you build a OSX VM from xCode and do address sanitation, leak detection etc, from the comforts of xCode and Instruments. (check Instruments out) allocations & generations etc. if you turn on address sanitizer, detect use of stack after return, undefined behavior sanitizer, (pause on issues), Main Thread checker (Pause on issues) Likely there is 'stuff' there that could be fixed... As for the alloc/init the vm can be build as arc or non-arc. And or a mixture used at a file level, plus of course bridging from core foundation to swift and indicating who has ownership of the object. .... ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐ On Wednesday, February 27, 2019 3:05 AM, Guillermo Polito <[hidden email]> wrote:
|
Found it, reproduced it, fixed it. I think memory consumption should be a point in the review process, and there should be memory performance tests (are there any already?). Took me two months to write the necessary two lines of code. Cheers, Manuel
|
Hi, > On 28.02.2019, at 11:57, Manuel Leuenberger <[hidden email]> wrote: > > Found it, reproduced it, fixed it. > > https://twitter.com/maenuleu/status/1101068937750634496 > https://github.com/OpenSmalltalk/opensmalltalk-vm/pull/373 > > I think memory consumption should be a point in the review process, and there should be memory performance tests (are there any already?). > > Took me two months to write the necessary two lines of code. Read it. Makes sense. I first had the suspicion that the thing allocated ehre may be sent/processed somewhere else, too. But I read around and it looks good. Have added a few comments to the PR. Am fine with the PR. Best regards -tObias > > Cheers, > Manuel > >> On 28 Feb 2019, at 02:52, John M McIntosh <[hidden email]> wrote: >> >> Hi, so if you point your git tool to the squeak vm tree at commit >> fae3d7eb403158f177c306f668cb461c35c99419 >> The you can download the JMM/IOS64 branch which should let you build a OSX VM from xCode and do address sanitation, leak detection etc, from the comforts of xCode and Instruments. (check Instruments out) allocations & generations etc. >> >> if you turn on address sanitizer, detect use of stack after return, undefined behavior sanitizer, (pause on issues), Main Thread checker (Pause on issues) Likely there is 'stuff' there that could be fixed... >> >> >> As for the alloc/init the vm can be build as arc or non-arc. And or a mixture used at a file level, plus of course bridging from core foundation to swift and indicating who has ownership of the object. >> >> >> >> .... >> John M. McIntosh. Corporate Smalltalk Consulting Ltd https://www.linkedin.com/in/smalltalk >> >> >> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐ >> On Wednesday, February 27, 2019 3:05 AM, Guillermo Polito <[hidden email]> wrote: >> >>> Hi Manuel, >>> >>> Disclaimer: I'm not an objectiveC expert either :) >>> >>> When I did that (and compiled, and tested), I remember having seen that it was using the automatic reference counting from objectiveC. >>> Actually, putting the [release] was producing for me a segmentation fault. >>> Did you try it yourself and did it fix the problem for you? >>> >>> >>> On Wed, Feb 27, 2019 at 10:52 AM Manuel Leuenberger <[hidden email]> wrote: >>> >>> Not sure if I found something, but looking at sqSqueakOSXApplication+events.m I find >>> >>> - (void) pumpRunLoopEventSendAndSignal:(BOOL)signal { >>> NSEvent *event; >>> NSMutableArray *alienEventQueue = [[NSMutableArray alloc] init]; >>> while ((event = [NSApp nextEventMatchingMask:NSEventMaskAny >>> untilDate:nil >>> inMode:NSEventTrackingRunLoopMode >>> dequeue:YES])) { >>> // If the event is not a system event or an event of *this* window, queue the event >>> // Otherwise treat the event normally and send it to the app >>> if (event.window && event.window != gDelegateApp.window){ >>> [alienEventQueue addObject: event]; >>> }else{ >>> [NSApp sendEvent: event]; >>> if (signal) { >>> interpreterProxy->signalSemaphoreWithIndex(gDelegateApp.squeakApplication.inputSemaphoreIndex); >>> } >>> } >>> } >>> >>> // Put back in the queue all events that did not belong to this window >>> // They will be managed by other windowing systems >>> // (or by a subsequent invocation to this function) >>> while ((event = [alienEventQueue firstObject])) { >>> [NSApp postEvent: event atStart: NO]; >>> [alienEventQueue removeObject: event]; >>> } >>> } >>> >>> I have no clue about Objective-C, but I think >>> >>> NSMutableArray *alienEventQueue = [[NSMutableArray alloc] init]; >>> >>> makes me own the queue on the heap, isn't there a >>> >>> [alienEventQueue release]; >>> >>> missing at the end? >>> >>> Cheers, >>> Manuel >>> >>>> On 26 Feb 2019, at 23:31, Manuel Leuenberger <[hidden email]> wrote: >>>> >>>> Alright, back in the game, this time using gperftools. >>>> >>>> I finally managed to reproduce a leaking run, but it took me hours to get there, randomly rendering GtCoders and messing around. gperftools instruments malloc and friends, capturing heap profiles every 30 seconds. Now I have profiles from a freshly started, small image, before it went berserk on memory hogging, and when it allocates over 3GB. Now, I have visualizations of allocation sites at these instants, and diffs between them, as well as some thread dumps on the berserk state. I managed to rule out the in-image objects to contribute to memory by cleaning images with garbage collect and cache resets. Some handy resets are these: >>>> >>>> NECController reset. >>>> FreeTypeCache current removeAll. >>>> ASTCache reset. >>>> Smalltalk garbageCollect. >>>> >>>> These get rid of a leak in NECController, resets the font cache (creates a lot of bitmaps), and ASTs, created by the coders. With >>>> >>>> SpaceTally printSpaceAnalysis. >>>> >>>> I get objects stats from within the image, and I managed to keep them in the same ballpark for all instants I took profiles of, by inspecting over-populated instances, like this >>>> >>>> instances := WeakOrderedCollection withAll: RBMessageNode allInstances. >>>> ReferenceFinder findPathTo: instances atRandom. >>>> >>>> With >>>> >>>> FFIExternalResourceManager uniqueInstance instVarNamed: #registry >>>> >>>> I also made sure that external objects are clean. >>>> >>>> Now, I am pretty confident that I can clean images, so the memory-overload should be sourced in the VM. Still, I cannot reliably reproduce the leak. It just happens after running an image for hours, doing stuff, idling, going to sleep, wake up. >>>> >>>> Unfortunately, I did not run it on a debug VM, so nearly no symbols, but I prepared a summary of the analysis anyway, as it already gives some hints that some of you can maybe interpret better than me. Have look at the PDFs in particular, they show the allocation sites at the three interesting instants, (clean, before berserk, after berserk), as well as the diffs in between. >>>> >>>> https://www.dropbox.com/s/zusnfn5mhyw3rmp/memory-analysis.zip?dl=0 >>>> >>>> I find that berserk mode is caused by a huge pile of Objective-C objects. The thread samples show that while idling, it seems to be busy processing OS events. But I do not know what kind of objects they actually are. >>>> >>>> My current hypothesis is: >>>> >>>> The leak is caused by processing a huge pile of OS events being processed (reprocessed?). For some reason, this only happens after a long run time. >>>> >>>> Now, how can I accept/reject this hypothesis? Is there a way to log events received by the VM? I also have a core dump of the 3GB process, but I have no idea how I could check it for Objective-C objects grouped by type. If anybody knows how to do that, I might be able to come up with a comprehensive list of objects that lead to the berserk mode. Maybe it is not even are real leak, but just accumulating objects that should be thrown away at some point. Also, I can still not fully reject that the image is involved, maybe it forgets external resources it should not forget at some point. >>>> >>>> As I now know again that I can trigger berserk mode, I will switch to the debug VM/Moz2D again with ASan. More symbols, leak reports, might reveal better information. >>>> >>>> Anyway, if anybody made it through the whole email and maybe even looked at the ZIP, I would appreciate ideas on how to proceed, what to test, what to inspect. I recently had a few incidents where I realized that one of my implicit assumptions/axioms does not hold, and I kept digging in the wrong direction for too long, so any feedback on this is welcome. >>>> >>>> Cheers, >>>> Manuel >>>> >>>>> On 30 Jan 2019, at 18:40, Manuel Leuenberger <[hidden email]> wrote: >>>>> >>>>> I thought about adding a flag for all ASan magic in the build scripts, then document the procedure in the readme, maybe also more extensive somewhere else. You probably have a better idea where a good place would be than I do. >>>>> >>>>> I dug into this today again and wrote a parser for the reports (https://github.com/maenu/leak-reporter). It has a not-so-good visualization of merged stacks showing heavy frames as big and solid, it also includes example reports (https://github.com/maenu/leak-reporter/tree/master/res). The reported leaks only amount for about 1% of leaked memory, not too many objects leaking, but I am not sure if ASan just counts object layout size or objects graphs (probably not). What I notice is that OS events are leaking, as they are copied. Might be an issue. But I am not entirely sure if this explains everything. Maybe the VM just allocates too many objects of some type, but frees them on shutdown, so they might not even appear in the reports. Can a C heap analyzer give me stats about struct instances per struct type? Might be useful to analyze core dumps then. >>>>> >>>>> Cheers, >>>>> Manuel >>>>> >>>>> <Screen Shot 2019-01-30 at 18.27.22.png> >>>>> Direct leak of 219512 byte(s) in 1193 object(s) allocated from: >>>>> #0 0x10bb1fffc in __sanitizer_mz_malloc (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x5dffc) >>>>> #1 0x7fffdd224281 in malloc_zone_malloc (libsystem_malloc.dylib:x86_64+0x2281) >>>>> #2 0x7fffc6de33d5 in AllocateAndInitializeEvent(__CFAllocator const*) (HIToolbox:x86_64+0x33d5) >>>>> #3 0x7fffc6e3fc4d in CopyEventInternal(__CFAllocator const*, OpaqueEventRef*) (HIToolbox:x86_64+0x5fc4d) >>>>> #4 0x7fffc6e49228 in CopyEventAs (HIToolbox:x86_64+0x69228) >>>>> #5 0x7fffc6e1d349 in CreateEventWithCGEvent (HIToolbox:x86_64+0x3d349) >>>>> #6 0x7fffc6e1b0e2 in CreateAndPostEventWithCGEvent(__CGEvent*, unsigned int, unsigned char, __CFMachPortBoost*) (HIToolbox:x86_64+0x3b0e2) >>>>> #7 0x7fffc6e275cd in Convert1CGEvent(unsigned char) (HIToolbox:x86_64+0x475cd) >>>>> #8 0x7fffc6e2745e in MainLoopObserver(unsigned int, OpaqueEventRef*, void*) (HIToolbox:x86_64+0x4745e) >>>>> #9 0x7fffc6de8368 in _NotifyEventLoopObservers (HIToolbox:x86_64+0x8368) >>>>> #10 0x7fffc6e10ea5 in RunCurrentEventLoopInMode (HIToolbox:x86_64+0x30ea5) >>>>> #11 0x7fffc6e10bf8 in ReceiveNextEventCommon (HIToolbox:x86_64+0x30bf8) >>>>> #12 0x7fffc6e10b25 in _BlockUntilNextEventMatchingListInModeWithFilter (HIToolbox:x86_64+0x30b25) >>>>> #13 0x7fffc53a5a53 in _DPSNextEvent (AppKit:x86_64+0x46a53) >>>>> #14 0x7fffc5b217ed in -[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] (AppKit:x86_64+0x7c27ed) >>>>> #15 0x10b7ce07c in -[sqSqueakOSXApplication(events) pumpRunLoopEventSendAndSignal:] sqSqueakOSXApplication+events.m:78 >>>>> #16 0x10b7ce7d9 in -[sqSqueakOSXApplication(events) pumpRunLoop] sqSqueakOSXApplication+events.m:105 >>>>> #17 0x10b7e4898 in vmIOProcessEvents sqSqueakEventsAPI.m:80 >>>>> #18 0x10b7e49c7 in ioProcessEvents sqSqueakEventsAPI.m:103 >>>>> #19 0x10b5b5653 in checkForEventsMayContextSwitch gcc3x-cointerp.c:62680 >>>>> #20 0x10b5c17ed in ceCheckForInterrupts gcc3x-cointerp.c:15188 >>>>> #21 0x119a75794 (<unknown module>) >>>>> #22 0x10b566712 in interpret gcc3x-cointerp.c:2754 >>>>> #23 0x10b7ebada in -[sqSqueakMainApplication runSqueak] sqSqueakMainApplication.m:201 >>>>> #24 0x7fffc93786fc in __NSFirePerformWithOrder (Foundation:x86_64+0xd76fc) >>>>> #25 0x7fffc78cfc56 in __CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__ (CoreFoundation:x86_64h+0xa6c56) >>>>> #26 0x7fffc78cfbc6 in __CFRunLoopDoObservers (CoreFoundation:x86_64h+0xa6bc6) >>>>> #27 0x7fffc78b05f8 in __CFRunLoopRun (CoreFoundation:x86_64h+0x875f8) >>>>> #28 0x7fffc78b0033 in CFRunLoopRunSpecific (CoreFoundation:x86_64h+0x87033) >>>>> #29 0x7fffc6e10ebb in RunCurrentEventLoopInMode (HIToolbox:x86_64+0x30ebb) >>>>> >>>>> >>>>> >>>>> >>>>> >>>> >>> >>> >>> >>> -- >>> >>> Guille Polito >>> Research Engineer >>> >>> Centre de Recherche en Informatique, Signal et Automatique de Lille >>> CRIStAL - UMR 9189 >>> French National Center for Scientific Research - http://www.cnrs.fr >>> >>> Web: http://guillep.github.io >>> Phone: +33 06 52 70 66 13 >> > |
Yeah, I am not sure if my fix only has the effects wanted to release the alienEventQueue, we need expert judgment here.
|
In reply to this post by Manuel Leuenberger
Great work Manuel. Thanks for your perseverance. cheers -ben On Thu, 28 Feb 2019 at 18:57, Manuel Leuenberger <[hidden email]> wrote:
|
Ok, as @krono pointed out NSMutableArray *alienEventQueue = AUTORELEASEOBJ([[NSMutableArray alloc] init]); with the autoreleasepool. I'm not sure anything is build without ARC anymore, but it should be there to be correct for people who need to build on older platforms. AUTORELEASEOBJ with ARC does nothing, and the autoreleasepool wrapper then is freeing autorelease objects in the scope of this routine. Since it is pumped from the VM processing requests there isn't a wrapping autorelease pool higher up in the stack. Easy test is to do invoke the "About Menu" ----------------------- Years back I would fire up Instruments and go looking, but the transition away from xCode made that harder, but as pointed out it is possible to do all this checking via the cmd line if someone (helpfully) puts in the time to make it happen. Thanks. .... ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐ On Thursday, February 28, 2019 5:48 AM, Ben Coman <[hidden email]> wrote:
|
Free forum by Nabble | Edit this page |