Ephemerons and VM crash

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

Ephemerons and VM crash

Guillermo Polito
 
Hi Eliot, list

I'm working here with Pablo (Tesone) on moving forward the Ephemeron
implementation. We first installed Eliot's changeset, added a #mourn
method and an EphemeronDictionary collection, and then started testing
something like this:

f := ObjectFinalizer receiver: 'Hello' selector: #logCr.
d := EphemeronDictionary new.

d at: f put: f.

f := nil.
Smalltalk garbageCollect.


However, as soon as we garbage collect twice, we have a VM crash. We
started debugging the VM to see if we could have some more clues.

The first thing we noticed is that the first time the GC runs, the
mournQueue is nil. This is of course expected because the new
finalization mechanism was not active and then there was no need to
create the mournQueue. We saw that the mournQueue is actually created in
a lazy fashion when putting queuing a mourned object (I refer myself to
#queueMourner: and #ensureRoomOnObjStackAt:). So the second time the GC
passes, the mournQueue is there. So far ok, but still crashing.

The crash happens in the call to

markAndTraceObjStackandContents(GIV(mournQueue), 1);

after the

     if (!markAndTraceContents) {
        return;
     }

But when understanding why, it starts being less clear to us :). We used
the printObjStack() function and we saw that:

call printObjStack(markStack)
call printObjStack(weaklingStack)

and we saw in the console some output that makes sense. However,
printing the mournQueue in the same manner produces some strange output

call printObjStack(mournQueue)

head  0xb06e980 cx 18 (18) fmt 10 (10) sz 4092 (4092) myx: 4098 (4098) unmkd
     topx: 14 next:        0x0 free:        0x0

We noticed that free and next are 0x0 while the others are not...

Finally we saw there is isValidObjStack(), that gave us the following
results:

call isValidObjStack(markStack) => 1

call isValidObjStack(weaklingStack) => 0
p objStackInvalidBecause = "marking but page is unmarked"

call isValidObjStack(mournQueue) => 0
p objStackInvalidBecause = "marking but page is unmarked"


So we assume that the stack creation is wrong? We are a bit lost in here.

Guille and Pablo
Reply | Threaded
Open this post in threaded view
|

Re: Ephemerons and VM crash

Guillermo Polito
 
Now we found that the stackObject becomes invalid within compact().

Before arriving to eliminateAndFreeForwardersForPigCompact() the stack
is already invalid.

-------- Original Message --------

> Hi Eliot, list
>
> I'm working here with Pablo (Tesone) on moving forward the Ephemeron
> implementation. We first installed Eliot's changeset, added a #mourn
> method and an EphemeronDictionary collection, and then started testing
> something like this:
>
> f := ObjectFinalizer receiver: 'Hello' selector: #logCr.
> d := EphemeronDictionary new.
>
> d at: f put: f.
>
> f := nil.
> Smalltalk garbageCollect.
>
>
> However, as soon as we garbage collect twice, we have a VM crash. We
> started debugging the VM to see if we could have some more clues.
>
> The first thing we noticed is that the first time the GC runs, the
> mournQueue is nil. This is of course expected because the new
> finalization mechanism was not active and then there was no need to
> create the mournQueue. We saw that the mournQueue is actually created
> in a lazy fashion when putting queuing a mourned object (I refer
> myself to #queueMourner: and #ensureRoomOnObjStackAt:). So the second
> time the GC passes, the mournQueue is there. So far ok, but still
> crashing.
>
> The crash happens in the call to
>
> markAndTraceObjStackandContents(GIV(mournQueue), 1);
>
> after the
>
>     if (!markAndTraceContents) {
>        return;
>     }
>
> But when understanding why, it starts being less clear to us :). We
> used the printObjStack() function and we saw that:
>
> call printObjStack(markStack)
> call printObjStack(weaklingStack)
>
> and we saw in the console some output that makes sense. However,
> printing the mournQueue in the same manner produces some strange output
>
> call printObjStack(mournQueue)
>
> head  0xb06e980 cx 18 (18) fmt 10 (10) sz 4092 (4092) myx: 4098 (4098)
> unmkd
>     topx: 14 next:        0x0 free:        0x0
>
> We noticed that free and next are 0x0 while the others are not...
>
> Finally we saw there is isValidObjStack(), that gave us the following
> results:
>
> call isValidObjStack(markStack) => 1
>
> call isValidObjStack(weaklingStack) => 0
> p objStackInvalidBecause = "marking but page is unmarked"
>
> call isValidObjStack(mournQueue) => 0
> p objStackInvalidBecause = "marking but page is unmarked"
>
>
> So we assume that the stack creation is wrong? We are a bit lost in here.
>
> Guille and Pablo

Reply | Threaded
Open this post in threaded view
|

Re: Ephemerons and VM crash

Eliot Miranda-2
In reply to this post by Guillermo Polito
 
Hi Guille, Hi Pablo (and welcome),

On Tue, May 17, 2016 at 8:37 AM, Guille Polito <[hidden email]> wrote:

Hi Eliot, list

I'm working here with Pablo (Tesone) on moving forward the Ephemeron implementation. We first installed Eliot's changeset, added a #mourn method and an EphemeronDictionary collection, and then started testing something like this:

f := ObjectFinalizer receiver: 'Hello' selector: #logCr.
d := EphemeronDictionary new.

d at: f put: f.

f := nil.
Smalltalk garbageCollect.

So this looks like something simulate able.  Are you able to use the simulator?  If not, why not?  

When debugging the VM there are two main levels of support, one is the simulator where there is maximum support for debugging:
- asserts on all the time
- arbitrary breakpoints
- attempting every GC in a copy of the heap before doing the real GC so that bugs in the GC can be investigated without needing to construct a reproducible case after a crash
- the Smalltalk environment to inspect and browse

The next level is the assert and debug VMs.  If you look in the build directories on the Cog svn branch you'll see that all of them build three VMs, a production VM with maximum optimisation and asserts excluded, an assert VM with -O1 and asserts enabled, and a debug VM with -O0 and asserts enabled.  So if you either don't see the bug in the simulator, or the simulator is too slow for the case being examined, or if the bug doesn't show up in the simulator (the worst of all worlds), build both assert and debug VMs and run with the assert VM first.

Note that there is a heap leak checker which can be enabled both in the simulator and the assert and debug VMs.  See the checkForLeaks method and the -leakcheck argument.

Without the simulator or the assert and debug VMs you are flying blind.  It is /really/ productive to use the simulator for debugging, provided the bug is reproducible within a short amount of time, as for example your case is above.




However, as soon as we garbage collect twice, we have a VM crash. We started debugging the VM to see if we could have some more clues.

The first thing we noticed is that the first time the GC runs, the mournQueue is nil. This is of course expected because the new finalization mechanism was not active and then there was no need to create the mournQueue. We saw that the mournQueue is actually created in a lazy fashion when putting queuing a mourned object (I refer myself to #queueMourner: and #ensureRoomOnObjStackAt:). So the second time the GC passes, the mournQueue is there. So far ok, but still crashing.

The crash happens in the call to

markAndTraceObjStackandContents(GIV(mournQueue), 1);

after the

    if (!markAndTraceContents) {
       return;
    }

But when understanding why, it starts being less clear to us :). We used the printObjStack() function and we saw that:

call printObjStack(markStack)
call printObjStack(weaklingStack)

and we saw in the console some output that makes sense. However, printing the mournQueue in the same manner produces some strange output

call printObjStack(mournQueue)

head  0xb06e980 cx 18 (18) fmt 10 (10) sz 4092 (4092) myx: 4098 (4098) unmkd
    topx: 14 next:        0x0 free:        0x0

We noticed that free and next are 0x0 while the others are not...

Finally we saw there is isValidObjStack(), that gave us the following results:

call isValidObjStack(markStack) => 1

call isValidObjStack(weaklingStack) => 0
p objStackInvalidBecause = "marking but page is unmarked"

call isValidObjStack(mournQueue) => 0
p objStackInvalidBecause = "marking but page is unmarked"


So we assume that the stack creation is wrong? We are a bit lost in here.

Guille and Pablo



--
_,,,^..^,,,_
best, Eliot
Reply | Threaded
Open this post in threaded view
|

Re: Ephemerons and VM crash

Guillermo Polito
 
Hello,

-------- Original Message --------
 


Hi Guille, Hi Pablo (and welcome),

On Tue, May 17, 2016 at 8:37 AM, Guille Polito <[hidden email]> wrote:

Hi Eliot, list

I'm working here with Pablo (Tesone) on moving forward the Ephemeron implementation. We first installed Eliot's changeset, added a #mourn method and an EphemeronDictionary collection, and then started testing something like this:

f := ObjectFinalizer receiver: 'Hello' selector: #logCr.
d := EphemeronDictionary new.

d at: f put: f.

f := nil.
Smalltalk garbageCollect.

So this looks like something simulate able.  Are you able to use the simulator?  If not, why not? 

For some reason I have that bytesToShift when opening the image is negative.

    bytesToShift := objectMemory memoryBaseForImageRead - oldBaseAddr.  "adjust pointers for zero base address"

So I cannot continue loading because addresses become negative and I have "Improper Store into indexable object kind of errors".

When debugging the VM there are two main levels of support, one is the simulator where there is maximum support for debugging:
- asserts on all the time
- arbitrary breakpoints
- attempting every GC in a copy of the heap before doing the real GC so that bugs in the GC can be investigated without needing to construct a reproducible case after a crash
- the Smalltalk environment to inspect and browse

The next level is the assert and debug VMs.  If you look in the build directories on the Cog svn branch you'll see that all of them build three VMs, a production VM with maximum optimisation and asserts excluded, an assert VM with -O1 and asserts enabled, and a debug VM with -O0 and asserts enabled.  So if you either don't see the bug in the simulator, or the simulator is too slow for the case being examined, or if the bug doesn't show up in the simulator (the worst of all worlds), build both assert and debug VMs and run with the assert VM first.

Well so far we were using a VM compiled for debug with a graphical C debugger. It was not so bad. However, I cannot say I'm missing a better debugger.

Note that there is a heap leak checker which can be enabled both in the simulator and the assert and debug VMs.  See the checkForLeaks method and the -leakcheck argument.

ok!

Without the simulator or the assert and debug VMs you are flying blind.  It is /really/ productive to use the simulator for debugging, provided the bug is reproducible within a short amount of time, as for example your case is above.

Ok, gotcha! By this afternoon I'll have some news probably.

Thanks a lot!




However, as soon as we garbage collect twice, we have a VM crash. We started debugging the VM to see if we could have some more clues.

The first thing we noticed is that the first time the GC runs, the mournQueue is nil. This is of course expected because the new finalization mechanism was not active and then there was no need to create the mournQueue. We saw that the mournQueue is actually created in a lazy fashion when putting queuing a mourned object (I refer myself to #queueMourner: and #ensureRoomOnObjStackAt:). So the second time the GC passes, the mournQueue is there. So far ok, but still crashing.

The crash happens in the call to

markAndTraceObjStackandContents(GIV(mournQueue), 1);

after the

    if (!markAndTraceContents) {
       return;
    }

But when understanding why, it starts being less clear to us :). We used the printObjStack() function and we saw that:

call printObjStack(markStack)
call printObjStack(weaklingStack)

and we saw in the console some output that makes sense. However, printing the mournQueue in the same manner produces some strange output

call printObjStack(mournQueue)

head  0xb06e980 cx 18 (18) fmt 10 (10) sz 4092 (4092) myx: 4098 (4098) unmkd
    topx: 14 next:        0x0 free:        0x0

We noticed that free and next are 0x0 while the others are not...

Finally we saw there is isValidObjStack(), that gave us the following results:

call isValidObjStack(markStack) => 1

call isValidObjStack(weaklingStack) => 0
p objStackInvalidBecause = "marking but page is unmarked"

call isValidObjStack(mournQueue) => 0
p objStackInvalidBecause = "marking but page is unmarked"


So we assume that the stack creation is wrong? We are a bit lost in here.

Guille and Pablo



--
_,,,^..^,,,_
best, Eliot

Reply | Threaded
Open this post in threaded view
|

Re: Ephemerons and VM crash

Eliot Miranda-2
 
Hi Guille,

On Wed, May 18, 2016 at 12:46 AM, Guille Polito <[hidden email]> wrote:
Hello,

-------- Original Message --------
 


Hi Guille, Hi Pablo (and welcome),

On Tue, May 17, 2016 at 8:37 AM, Guille Polito <[hidden email]> wrote:

Hi Eliot, list

I'm working here with Pablo (Tesone) on moving forward the Ephemeron implementation.


Where's "here"?   Are you in Lille?


We first installed Eliot's changeset, added a #mourn method and an EphemeronDictionary collection, and then started testing something like this:

f := ObjectFinalizer receiver: 'Hello' selector: #logCr.
d := EphemeronDictionary new.

d at: f put: f.

f := nil.
Smalltalk garbageCollect.

So this looks like something simulate able.  Are you able to use the simulator?  If not, why not? 

For some reason I have that bytesToShift when opening the image is negative.

That is to be expected.  In the real VM the heap is located somewhere well above the bottom of the address space, typically above the program code.  In the simulator the heap is located either at 0 (an interpreter or stack VM) or immediately above the code zone (in a Cogit VM).  So when an image that has been saved on the real VM is loaded into the simulator all oops have to be adjusted down and hence bytesToShift is negative.


    bytesToShift := objectMemory memoryBaseForImageRead - oldBaseAddr.  "adjust pointers for zero base address"

So I cannot continue loading because addresses become negative and I have "Improper Store into indexable object kind of errors".

Can you post a back trace?  Where is this happening?  What version of VMMaker.oscog are you using?  Are you running in Pharo or Squeak?  If you're in Lille you could perhaps visit Clément's office and get him to take a look.  Clément, would that be ok?
When debugging the VM there are two main levels of support, one is the simulator where there is maximum support for debugging:
- asserts on all the time
- arbitrary breakpoints
- attempting every GC in a copy of the heap before doing the real GC so that bugs in the GC can be investigated without needing to construct a reproducible case after a crash
- the Smalltalk environment to inspect and browse

The next level is the assert and debug VMs.  If you look in the build directories on the Cog svn branch you'll see that all of them build three VMs, a production VM with maximum optimisation and asserts excluded, an assert VM with -O1 and asserts enabled, and a debug VM with -O0 and asserts enabled.  So if you either don't see the bug in the simulator, or the simulator is too slow for the case being examined, or if the bug doesn't show up in the simulator (the worst of all worlds), build both assert and debug VMs and run with the assert VM first.

Well so far we were using a VM compiled for debug with a graphical C debugger. It was not so bad. However, I cannot say I'm missing a better debugger.

"Compiled for debug" is vague.  Do you mean it is compiled with -g -O0, or in addition is compiled with -g -O0 -DDEBUGVM=0 -DNDEBUG=1?
 
Note that there is a heap leak checker which can be enabled both in the simulator and the assert and debug VMs.  See the checkForLeaks method and the -leakcheck argument.

ok!

Without the simulator or the assert and debug VMs you are flying blind.  It is /really/ productive to use the simulator for debugging, provided the bug is reproducible within a short amount of time, as for example your case is above.

Ok, gotcha! By this afternoon I'll have some news probably.

Thanks a lot!





However, as soon as we garbage collect twice, we have a VM crash. We started debugging the VM to see if we could have some more clues.

The first thing we noticed is that the first time the GC runs, the mournQueue is nil. This is of course expected because the new finalization mechanism was not active and then there was no need to create the mournQueue. We saw that the mournQueue is actually created in a lazy fashion when putting queuing a mourned object (I refer myself to #queueMourner: and #ensureRoomOnObjStackAt:). So the second time the GC passes, the mournQueue is there. So far ok, but still crashing.

The crash happens in the call to

markAndTraceObjStackandContents(GIV(mournQueue), 1);

after the

    if (!markAndTraceContents) {
       return;
    }

But when understanding why, it starts being less clear to us :). We used the printObjStack() function and we saw that:

call printObjStack(markStack)
call printObjStack(weaklingStack)

and we saw in the console some output that makes sense. However, printing the mournQueue in the same manner produces some strange output

call printObjStack(mournQueue)

head  0xb06e980 cx 18 (18) fmt 10 (10) sz 4092 (4092) myx: 4098 (4098) unmkd
    topx: 14 next:        0x0 free:        0x0

We noticed that free and next are 0x0 while the others are not...

Finally we saw there is isValidObjStack(), that gave us the following results:

call isValidObjStack(markStack) => 1

call isValidObjStack(weaklingStack) => 0
p objStackInvalidBecause = "marking but page is unmarked"

call isValidObjStack(mournQueue) => 0
p objStackInvalidBecause = "marking but page is unmarked"


So we assume that the stack creation is wrong? We are a bit lost in here.

Guille and Pablo



--
_,,,^..^,,,_
best, Eliot




--
_,,,^..^,,,_
best, Eliot
Reply | Threaded
Open this post in threaded view
|

Re: Ephemerons and VM crash

Guillermo Polito
 
Hi again!

I snip most of the old thread to keep the relevant.

-------- Original Message --------
1 - The mourn table is being collected because it is not marked upon creation
===============

I decided to move the assignment of marking out into markObjects:

Ok, it looks good for me, I did not know what would be a good place to put them without breaking something :)


2 - Accesses to unscannedEphemerons are unaligned
===============

Looks good.  I've changed all the ephemeron methods to use bytesPerOop, cuz the ephemeron queue contains oops, but your fix is just as good.

cool! I thought so that would be the best.
 


3 - While debugging I also had a DNU #intAt: in #primitiveIntegerAt
==========================

Question: besides reviewing, how should I commit these fixes? one issue per commit? all at once?

Typically all at once.  What I try and do is commit a large set of related changes in a single commit, so I wouldn't mix the fixes to ephemerons with, say, fixes to time.  But equally I'd commit any changes I made to help debugging in the same commit.  And all rules are made to be broken ;-).
ok!

Hey,  _thank you_!  This is really great and ephemerons won't be usable without testing like this.  I'm in your debt.

Ha, there is no debt (or at least it would be in the opposite direction).

Now I'm working on another crash that I can easily reproduce using this script:

Smalltalk supportsQueueingFinalization: true.
e := (1 to: 200000) collect: [ :i | Ephemeron key: (ObjectFinalizer receiver: 'test', 'asd' selector: #logCr) value: Object new ].
Smalltalk garbageCollect.


While debugging that in the simulator, we saw with clement that during #fireEphemeronsInRememberedSet, #fireEphemeronsOnEphemeronList we are always firing, tenuring and scavenging ephemerons regardless they are marked or not.

fireEphemeronsInRememberedSet
[SNIP]
         coInterpreter fireEphemeron: ephemeron.
         manager
            storePointerUnchecked: 0
            ofObject: ephemeron
            withValue: (self copyAndForward: (manager keyOfEphemeron: ephemeron)).
        (self scavengeReferentsOf: ephemeron)
[SNIP]

So I put an assertion to check that the key of the ephemeron is marked when scavenging and it is not always the case...
         self assert: (manager isMarked: (manager keyOfEphemeron: ephemeron)).

Is the code wrong? or I'm missing something?

Guille

 


Guille

-------- Original Message --------
I see, I probably I need to leave the simulator run for a longer time... Because running the image using the assertion VM took some time to crash, it may be the same case.

-------- Original Message --------
Some advance on my side,

- Yes, apparently the image was already broken. I started a new one and I can open it with the simulator. It seems that as soon as I activate Ephemerons and I have alive ephemerons the heap corrupts (and since saving the image fires a gc...).

- Then, to catch the moment where the image breaks, I prepared my image for debug with the following script:

Smalltalk snapshotPrimitive.
f := ObjectFinalizer receiver: 'Hello' selector: #logCr.
d := EphemeronDictionary new.
d at: f put: f.
f := nil.
Smalltalk supportsQueueingFinalization: true.
1 to: 2 do: [ :i | Smalltalk garbageCollect].

Then I opened it with the simulator (latest version: VMMaker.oscog-tfel.1864) and I had several assertion failures in

keyOfEphemeron: objOop
    "Answer the object the ephemeron guards.  This is its first element."
    self assert: ((self isNonImmediate: objOop) and: [self isEphemeron: objOop]).
    ^self fetchPointer: 0 ofObject: objOop

followedKeyOfEphemeron: objOop
    "Answer the object the ephemeron guards.  This is its first element."
    self assert: ((self isNonImmediate: objOop) and: [self isEphemeron: objOop]).
    ^self followOopField: 0 ofObject: objOop

So, I noticed the following:

- Ephemerons in the unscanned list are marked, fired, and the format of ephemerons is changed to a non-ephemeron format
- But they are not removed from the unscanned list
- then the mark #markAllUnscannedEphemerons fails tons of assertions because it expects ephemeron objects and the (not-anymore-)ephemeron is still there in the unscanned list

     self markInactiveEphemerons ifFalse:
        [self fireAllUnscannedEphemerons].
     self markAllUnscannedEphemerons

The strange thing is that if I ignore the assertions by proceeding, the thing continues normally and it the image does not break...

So I then compiled the VM, with the latest version, and in trace mode to compare. I attach the log of the stdout. But here there is a résumé

* First I see the same assertions I saw in the simulator

(isNonImmediate(objOop)) && (isEphemeron(objOop)) 41155

(isNonImmediate(objOop)) && (isEphemeron(objOop)) 55477

* Then there are tons of repetitions of the same assertion:

!(((GIV(mournQueue) != GIV(nilObj)) && (isonObjStack(anEphemeronOrWeakArray, GIV(mournQueue))))) 60030

Which I understand it means that we are adding to the mourn queue all the time the same object. Which is strange because it is a WeakArray with a single slot which turns out to be nil (at least at the moment I inspected it).

* Then something breaks (!!) Looks like we are trying to mark a forwarder. And then I have a follow up of assertions which are probably caused by the same thing.

!(isForwarded(field)) 49489

(classIndexOf(objToScan)) > (isForwardedObjectClassIndexPun()) 49537

(classIndex <= (tagMask())) || (classIndex >= (arrayClassIndexPun())) 49403

objCouldBeClassObj(classObj) 49415

addressCouldBeOop(objOop) 59997

addressCouldBeObj(topOfObjStack((0 == (fetchPointerofObject(ObjStackTopx, objStack)) ? fetchPointerofObject(ObjStackNextx, objStack) : objStack))) 60002

addressCouldBeOop(objOop) 59997

* Finally there are some more

!(isForwarded(field)) 49489

until a segmentation fault arrives while executing a fullGC

../results/pharo[0x80ceaa4]
../results/pharo[0x80cfde4]
../results/pharo(fullGC+0x9b)[0x80c95ea]
../results/pharo[0x80e2d49]
../results/pharo[0x80e5f4a]
../results/pharo[0x80ebca3]
../results/pharo(ceStackOverflow+0xd7)[0x809356a]
[0x91002c6]
../results/pharo(interpret+0x128)[0x8082b38]
../results/pharo(main+0x23a)[0x810a7db]
/lib/i386-linux-gnu/i686/cmov/libc.so.6(__libc_start_main+0xf3)[0xf73b5a63]
[0x1000]

I'll continue digging a bit more,
Guille

-------- Original Message --------
Hi Guille,

On Thu, May 19, 2016 at 12:53 AM, Guille Polito <[hidden email]> wrote:


-------- Original Message --------
Hi Guille,

On Wed, May 18, 2016 at 12:46 AM, Guille Polito <[hidden email]> wrote:
Hello,

-------- Original Message --------
 


Hi Guille, Hi Pablo (and welcome),

On Tue, May 17, 2016 at 8:37 AM, Guille Polito <[hidden email]> wrote:

Hi Eliot, list

I'm working here with Pablo (Tesone) on moving forward the Ephemeron implementation.


Where's "here"?   Are you in Lille?
Yup. Pablo is a new Phd student here :)


We first installed Eliot's changeset, added a #mourn method and an EphemeronDictionary collection, and then started testing something like this:

f := ObjectFinalizer receiver: 'Hello' selector: #logCr.
d := EphemeronDictionary new.

d at: f put: f.

f := nil.
Smalltalk garbageCollect.

So this looks like something simulate able.  Are you able to use the simulator?  If not, why not? 

For some reason I have that bytesToShift when opening the image is negative.

That is to be expected.  In the real VM the heap is located somewhere well above the bottom of the address space, typically above the program code.  In the simulator the heap is located either at 0 (an interpreter or stack VM) or immediately above the code zone (in a Cogit VM).  So when an image that has been saved on the real VM is loaded into the simulator all oops have to be adjusted down and hence bytesToShift is negative.


    bytesToShift := objectMemory memoryBaseForImageRead - oldBaseAddr.  "adjust pointers for zero base address"

So I cannot continue loading because addresses become negative and I have "Improper Store into indexable object kind of errors".

Can you post a back trace?  Where is this happening? 
I'm working in latest pharo (5.0), using the same source code that is used in the Pharo CI jobs to build the VM (pharo branded). I would expect this branch to be stable, maybe it could be a bit behind your branch of development, but I think Esteban merges regularly.

Esteban is very busy trying to get UFFI working, and having just got Pharo 5 released and working on Pharo 6 to be released.  I'd prefer it if you used the tip of VMMaker.oscog.  Further, that's the package that Clément is using and he's your expert source of local help.

 

I'm executing this,

vm := StackInterpreterSimulator newWithOptions: #(#ObjectMemory #Spur32BitMemoryManager ).
vm desiredNumStackPages: 8. "Makes simulation faster by creating fewer stack pages."
(vm openOn: '/home/guille/Pharo/temp/ephemerons/broken.image')
    openAsMorph; run.

And here is the trace. The error happens while swizzling.

LittleEndianBitmap(Object)>>error:
LittleEndianBitmap(Object)>>errorImproperStore
LittleEndianBitmap(Object)>>at:put:
Spur32BitMMLESimulator>>longAt:put:
Spur32BitMMLESimulator(SpurMemoryManager)>>swizzleFieldsOfObject:
Spur32BitMMLESimulator(SpurMemoryManager)>>adjustAllOopsBy:
Spur32BitMMLESimulator(SpurMemoryManager)>>initializeObjectMemory:
StackInterpreterSimulatorLSB(StackInterpreter)>>initializeInterpreter:
[ self initializeInterpreter: bytesToShift ] in StackInterpreterSimulatorLSB(StackInterpreterSimulator)>>openOn:extraMemory: in Block: [ self initializeInterpreter: bytesToShift ]

 So the image is presumably already broken?  Looks to me like it was corrupted when saved.   Can you reproduce the problem that led to this image being broken from a working image?  
What version of VMMaker.oscog are you using?
Monticello reports VMMaker.oscog-eem.1855.
However, I'm not sure I have the right number, because source code is now in git.

It would be so great if you could use Monticello, not git, and hence be able to contribute back.
 
  Are you running in Pharo or Squeak?  If you're in Lille you could perhaps visit Clément's office and get him to take a look.  Clément, would that be ok?
I'll maybe see it with clement, but today I have to work on some other project...
When debugging the VM there are two main levels of support, one is the simulator where there is maximum support for debugging:
- asserts on all the time
- arbitrary breakpoints
- attempting every GC in a copy of the heap before doing the real GC so that bugs in the GC can be investigated without needing to construct a reproducible case after a crash
- the Smalltalk environment to inspect and browse

The next level is the assert and debug VMs.  If you look in the build directories on the Cog svn branch you'll see that all of them build three VMs, a production VM with maximum optimisation and asserts excluded, an assert VM with -O1 and asserts enabled, and a debug VM with -O0 and asserts enabled.  So if you either don't see the bug in the simulator, or the simulator is too slow for the case being examined, or if the bug doesn't show up in the simulator (the worst of all worlds), build both assert and debug VMs and run with the assert VM first.

Well so far we were using a VM compiled for debug with a graphical C debugger. It was not so bad. However, I cannot say I'm missing a better debugger.

"Compiled for debug" is vague.  Do you mean it is compiled with -g -O0, or in addition is compiled with -g -O0 -DDEBUGVM=0 -DNDEBUG=1?
 
That would exactly be:

{ '-g3'.
        '-O0'.
        '-msse2'.
        '-D_GNU_SOURCE'.
        '-DITIMER_HEARTBEAT=1'.
        '-DNO_VM_PROFILE=1'.
        '-DDEBUGVM=1'.
        '-DNDEBUG' }

And Esteban told me to compile it with '-UNDEBUG' to get more information, but I did not try this yet.

Yes.  NDEBUG is what turns off asserts.  You don't have a proper assert or debug VM unless you do not use -NDEBUG.

Thanks!

Note that there is a heap leak checker which can be enabled both in the simulator and the assert and debug VMs.  See the checkForLeaks method and the -leakcheck argument.

ok!

Without the simulator or the assert and debug VMs you are flying blind.  It is /really/ productive to use the simulator for debugging, provided the bug is reproducible within a short amount of time, as for example your case is above.

Ok, gotcha! By this afternoon I'll have some news probably.

Thanks a lot!





However, as soon as we garbage collect twice, we have a VM crash. We started debugging the VM to see if we could have some more clues.

The first thing we noticed is that the first time the GC runs, the mournQueue is nil. This is of course expected because the new finalization mechanism was not active and then there was no need to create the mournQueue. We saw that the mournQueue is actually created in a lazy fashion when putting queuing a mourned object (I refer myself to #queueMourner: and #ensureRoomOnObjStackAt:). So the second time the GC passes, the mournQueue is there. So far ok, but still crashing.

The crash happens in the call to

markAndTraceObjStackandContents(GIV(mournQueue), 1);

after the

    if (!markAndTraceContents) {
       return;
    }

But when understanding why, it starts being less clear to us :). We used the printObjStack() function and we saw that:

call printObjStack(markStack)
call printObjStack(weaklingStack)

and we saw in the console some output that makes sense. However, printing the mournQueue in the same manner produces some strange output

call printObjStack(mournQueue)

head  0xb06e980 cx 18 (18) fmt 10 (10) sz 4092 (4092) myx: 4098 (4098) unmkd
    topx: 14 next:        0x0 free:        0x0

We noticed that free and next are 0x0 while the others are not...

Finally we saw there is isValidObjStack(), that gave us the following results:

call isValidObjStack(markStack) => 1

call isValidObjStack(weaklingStack) => 0
p objStackInvalidBecause = "marking but page is unmarked"

call isValidObjStack(mournQueue) => 0
p objStackInvalidBecause = "marking but page is unmarked"


So we assume that the stack creation is wrong? We are a bit lost in here.

Guille and Pablo



--
_,,,^..^,,,_
best, Eliot




--
_,,,^..^,,,_
best, Eliot




--
_,,,^..^,,,_
best, Eliot






--
_,,,^..^,,,_
best, Eliot

Reply | Threaded
Open this post in threaded view
|

Re: Ephemerons and VM crash

Eliot Miranda-2
 
Hi Guille,

    first, v important. can you put the image(s) you're working with somewhere I can download, and send me a URL?  I can be more helpful if I can run the tests too.


On Wed, May 25, 2016 at 6:24 AM, Guille Polito <[hidden email]> wrote:
 
Hi again!

I snip most of the old thread to keep the relevant.

-------- Original Message --------
1 - The mourn table is being collected because it is not marked upon creation
===============

I decided to move the assignment of marking out into markObjects:

Ok, it looks good for me, I did not know what would be a good place to put them without breaking something :)


2 - Accesses to unscannedEphemerons are unaligned
===============

Looks good.  I've changed all the ephemeron methods to use bytesPerOop, cuz the ephemeron queue contains oops, but your fix is just as good.

cool! I thought so that would be the best.
 


3 - While debugging I also had a DNU #intAt: in #primitiveIntegerAt
==========================

Question: besides reviewing, how should I commit these fixes? one issue per commit? all at once?

Typically all at once.  What I try and do is commit a large set of related changes in a single commit, so I wouldn't mix the fixes to ephemerons with, say, fixes to time.  But equally I'd commit any changes I made to help debugging in the same commit.  And all rules are made to be broken ;-).
ok!

Hey,  _thank you_!  This is really great and ephemerons won't be usable without testing like this.  I'm in your debt.

Ha, there is no debt (or at least it would be in the opposite direction).

Now I'm working on another crash that I can easily reproduce using this script:

Smalltalk supportsQueueingFinalization: true.
e := (1 to: 200000) collect: [ :i | Ephemeron key: (ObjectFinalizer receiver: 'test', 'asd' selector: #logCr) value: Object new ].
Smalltalk garbageCollect.


While debugging that in the simulator, we saw with clement that during #fireEphemeronsInRememberedSet, #fireEphemeronsOnEphemeronList we are always firing, tenuring and scavenging ephemerons regardless they are marked or not.

fireEphemeronsInRememberedSet
[SNIP]
         coInterpreter fireEphemeron: ephemeron.
         manager
            storePointerUnchecked: 0
            ofObject: ephemeron
            withValue: (self copyAndForward: (manager keyOfEphemeron: ephemeron)).
        (self scavengeReferentsOf: ephemeron)
[SNIP]

So I put an assertion to check that the key of the ephemeron is marked when scavenging and it is not always the case...
         self assert: (manager isMarked: (manager keyOfEphemeron: ephemeron)).

Is the code wrong? or I'm missing something?

The implementation of ephemerons in the scavenger and the implementation in mark-sweep are quite different.

Remember that what we're trying to do is find out if an object is only referenced from the transitive closure of ephemerons or not.  And the way that we do that in both cases is avoid processing ephemerons whose keys are not yet reachable from the roots until all objects reachable from the roots have been reached.  We do this by putting "unscanned" ephemerons in a queue, saving them until later.

In the scavenger the roots are the objects in the remembered set, the interpreter state (newMethod etc) and the stack zone, and GC is performed by copying all objects reachable from these roots in past and new spaces into future space, possibly tenuring overflowed objects into old space.  In the mark-sweep the roots are the specialObjectsArray, the interpreter state (newMethod etc) and the stack zone.

In the scavenger (a copying collector) this means that when we process the unscanned ephemerons their keys will either have been copied into future space or tenured to old space, in which case they were reachable from the roots, or they will not have been copied yet, in which case they are reachable only from ephemerons.  So in the scavenger marking is irrelevant; in fact /no/ objects should be marked when scavenging.  The important thing is whether a key is in past and new spaces or is in future and old spaces.

In the mark-sweep this means that when we process the unscanned ephemerons their keys will either be marked, in which case they were reachable from the roots, or unmarked, in which case they are reachable only from ephemerons. So the assert for marked-ness only makes sense in ephemeron processing in the mark-sweep collector.

HTH

 
Guille

 


Guille

-------- Original Message --------
I see, I probably I need to leave the simulator run for a longer time... Because running the image using the assertion VM took some time to crash, it may be the same case.

-------- Original Message --------
Some advance on my side,

- Yes, apparently the image was already broken. I started a new one and I can open it with the simulator. It seems that as soon as I activate Ephemerons and I have alive ephemerons the heap corrupts (and since saving the image fires a gc...).

- Then, to catch the moment where the image breaks, I prepared my image for debug with the following script:

Smalltalk snapshotPrimitive.
f := ObjectFinalizer receiver: 'Hello' selector: #logCr.
d := EphemeronDictionary new.
d at: f put: f.
f := nil.
Smalltalk supportsQueueingFinalization: true.
1 to: 2 do: [ :i | Smalltalk garbageCollect].

Then I opened it with the simulator (latest version: VMMaker.oscog-tfel.1864) and I had several assertion failures in

keyOfEphemeron: objOop
    "Answer the object the ephemeron guards.  This is its first element."
    self assert: ((self isNonImmediate: objOop) and: [self isEphemeron: objOop]).
    ^self fetchPointer: 0 ofObject: objOop

followedKeyOfEphemeron: objOop
    "Answer the object the ephemeron guards.  This is its first element."
    self assert: ((self isNonImmediate: objOop) and: [self isEphemeron: objOop]).
    ^self followOopField: 0 ofObject: objOop

So, I noticed the following:

- Ephemerons in the unscanned list are marked, fired, and the format of ephemerons is changed to a non-ephemeron format
- But they are not removed from the unscanned list
- then the mark #markAllUnscannedEphemerons fails tons of assertions because it expects ephemeron objects and the (not-anymore-)ephemeron is still there in the unscanned list

     self markInactiveEphemerons ifFalse:
        [self fireAllUnscannedEphemerons].
     self markAllUnscannedEphemerons

The strange thing is that if I ignore the assertions by proceeding, the thing continues normally and it the image does not break...

So I then compiled the VM, with the latest version, and in trace mode to compare. I attach the log of the stdout. But here there is a résumé

* First I see the same assertions I saw in the simulator

(isNonImmediate(objOop)) && (isEphemeron(objOop)) 41155

(isNonImmediate(objOop)) && (isEphemeron(objOop)) 55477

* Then there are tons of repetitions of the same assertion:

!(((GIV(mournQueue) != GIV(nilObj)) && (isonObjStack(anEphemeronOrWeakArray, GIV(mournQueue))))) 60030

Which I understand it means that we are adding to the mourn queue all the time the same object. Which is strange because it is a WeakArray with a single slot which turns out to be nil (at least at the moment I inspected it).

* Then something breaks (!!) Looks like we are trying to mark a forwarder. And then I have a follow up of assertions which are probably caused by the same thing.

!(isForwarded(field)) 49489

(classIndexOf(objToScan)) > (isForwardedObjectClassIndexPun()) 49537

(classIndex <= (tagMask())) || (classIndex >= (arrayClassIndexPun())) 49403

objCouldBeClassObj(classObj) 49415

addressCouldBeOop(objOop) 59997

addressCouldBeObj(topOfObjStack((0 == (fetchPointerofObject(ObjStackTopx, objStack)) ? fetchPointerofObject(ObjStackNextx, objStack) : objStack))) 60002

addressCouldBeOop(objOop) 59997

* Finally there are some more

!(isForwarded(field)) 49489

until a segmentation fault arrives while executing a fullGC

../results/pharo[0x80ceaa4]
../results/pharo[0x80cfde4]
../results/pharo(fullGC+0x9b)[0x80c95ea]
../results/pharo[0x80e2d49]
../results/pharo[0x80e5f4a]
../results/pharo[0x80ebca3]
../results/pharo(ceStackOverflow+0xd7)[0x809356a]
[0x91002c6]
../results/pharo(interpret+0x128)[0x8082b38]
../results/pharo(main+0x23a)[0x810a7db]
/lib/i386-linux-gnu/i686/cmov/libc.so.6(__libc_start_main+0xf3)[0xf73b5a63]
[0x1000]

I'll continue digging a bit more,
Guille

-------- Original Message --------
Hi Guille,

On Thu, May 19, 2016 at 12:53 AM, Guille Polito <[hidden email][hidden email]> wrote:


-------- Original Message --------
Hi Guille,

On Wed, May 18, 2016 at 12:46 AM, Guille Polito <[hidden email][hidden email]> wrote:
Hello,

-------- Original Message --------
 


Hi Guille, Hi Pablo (and welcome),

On Tue, May 17, 2016 at 8:37 AM, Guille Polito <[hidden email][hidden email]> wrote:

Hi Eliot, list

I'm working here with Pablo (Tesone) on moving forward the Ephemeron implementation.


Where's "here"?   Are you in Lille?
Yup. Pablo is a new Phd student here :)


We first installed Eliot's changeset, added a #mourn method and an EphemeronDictionary collection, and then started testing something like this:

f := ObjectFinalizer receiver: 'Hello' selector: #logCr.
d := EphemeronDictionary new.

d at: f put: f.

f := nil.
Smalltalk garbageCollect.

So this looks like something simulate able.  Are you able to use the simulator?  If not, why not? 

For some reason I have that bytesToShift when opening the image is negative.

That is to be expected.  In the real VM the heap is located somewhere well above the bottom of the address space, typically above the program code.  In the simulator the heap is located either at 0 (an interpreter or stack VM) or immediately above the code zone (in a Cogit VM).  So when an image that has been saved on the real VM is loaded into the simulator all oops have to be adjusted down and hence bytesToShift is negative.


    bytesToShift := objectMemory memoryBaseForImageRead - oldBaseAddr.  "adjust pointers for zero base address"

So I cannot continue loading because addresses become negative and I have "Improper Store into indexable object kind of errors".

Can you post a back trace?  Where is this happening? 
I'm working in latest pharo (5.0), using the same source code that is used in the Pharo CI jobs to build the VM (pharo branded). I would expect this branch to be stable, maybe it could be a bit behind your branch of development, but I think Esteban merges regularly.

Esteban is very busy trying to get UFFI working, and having just got Pharo 5 released and working on Pharo 6 to be released.  I'd prefer it if you used the tip of VMMaker.oscog.  Further, that's the package that Clément is using and he's your expert source of local help.

 

I'm executing this,

vm := StackInterpreterSimulator newWithOptions: #(#ObjectMemory #Spur32BitMemoryManager ).
vm desiredNumStackPages: 8. "Makes simulation faster by creating fewer stack pages."
(vm openOn: '/home/guille/Pharo/temp/ephemerons/broken.image')
    openAsMorph; run.

And here is the trace. The error happens while swizzling.

LittleEndianBitmap(Object)>>error:
LittleEndianBitmap(Object)>>errorImproperStore
LittleEndianBitmap(Object)>>at:put:
Spur32BitMMLESimulator>>longAt:put:
Spur32BitMMLESimulator(SpurMemoryManager)>>swizzleFieldsOfObject:
Spur32BitMMLESimulator(SpurMemoryManager)>>adjustAllOopsBy:
Spur32BitMMLESimulator(SpurMemoryManager)>>initializeObjectMemory:
StackInterpreterSimulatorLSB(StackInterpreter)>>initializeInterpreter:
[ self initializeInterpreter: bytesToShift ] in StackInterpreterSimulatorLSB(StackInterpreterSimulator)>>openOn:extraMemory: in Block: [ self initializeInterpreter: bytesToShift ]

 So the image is presumably already broken?  Looks to me like it was corrupted when saved.   Can you reproduce the problem that led to this image being broken from a working image?  
What version of VMMaker.oscog are you using?
Monticello reports VMMaker.oscog-eem.1855.
However, I'm not sure I have the right number, because source code is now in git.

It would be so great if you could use Monticello, not git, and hence be able to contribute back.
 
  Are you running in Pharo or Squeak?  If you're in Lille you could perhaps visit Clément's office and get him to take a look.  Clément, would that be ok?
I'll maybe see it with clement, but today I have to work on some other project...
When debugging the VM there are two main levels of support, one is the simulator where there is maximum support for debugging:
- asserts on all the time
- arbitrary breakpoints
- attempting every GC in a copy of the heap before doing the real GC so that bugs in the GC can be investigated without needing to construct a reproducible case after a crash
- the Smalltalk environment to inspect and browse

The next level is the assert and debug VMs.  If you look in the build directories on the Cog svn branch you'll see that all of them build three VMs, a production VM with maximum optimisation and asserts excluded, an assert VM with -O1 and asserts enabled, and a debug VM with -O0 and asserts enabled.  So if you either don't see the bug in the simulator, or the simulator is too slow for the case being examined, or if the bug doesn't show up in the simulator (the worst of all worlds), build both assert and debug VMs and run with the assert VM first.

Well so far we were using a VM compiled for debug with a graphical C debugger. It was not so bad. However, I cannot say I'm missing a better debugger.

"Compiled for debug" is vague.  Do you mean it is compiled with -g -O0, or in addition is compiled with -g -O0 -DDEBUGVM=0 -DNDEBUG=1?
 
That would exactly be:

{ '-g3'.
        '-O0'.
        '-msse2'.
        '-D_GNU_SOURCE'.
        '-DITIMER_HEARTBEAT=1'.
        '-DNO_VM_PROFILE=1'.
        '-DDEBUGVM=1'.
        '-DNDEBUG' }

And Esteban told me to compile it with '-UNDEBUG' to get more information, but I did not try this yet.

Yes.  NDEBUG is what turns off asserts.  You don't have a proper assert or debug VM unless you do not use -NDEBUG.

Thanks!

Note that there is a heap leak checker which can be enabled both in the simulator and the assert and debug VMs.  See the checkForLeaks method and the -leakcheck argument.

ok!

Without the simulator or the assert and debug VMs you are flying blind.  It is /really/ productive to use the simulator for debugging, provided the bug is reproducible within a short amount of time, as for example your case is above.

Ok, gotcha! By this afternoon I'll have some news probably.

Thanks a lot!





However, as soon as we garbage collect twice, we have a VM crash. We started debugging the VM to see if we could have some more clues.

The first thing we noticed is that the first time the GC runs, the mournQueue is nil. This is of course expected because the new finalization mechanism was not active and then there was no need to create the mournQueue. We saw that the mournQueue is actually created in a lazy fashion when putting queuing a mourned object (I refer myself to #queueMourner: and #ensureRoomOnObjStackAt:). So the second time the GC passes, the mournQueue is there. So far ok, but still crashing.

The crash happens in the call to

markAndTraceObjStackandContents(GIV(mournQueue), 1);

after the

    if (!markAndTraceContents) {
       return;
    }

But when understanding why, it starts being less clear to us :). We used the printObjStack() function and we saw that:

call printObjStack(markStack)
call printObjStack(weaklingStack)

and we saw in the console some output that makes sense. However, printing the mournQueue in the same manner produces some strange output

call printObjStack(mournQueue)

head  0xb06e980 cx 18 (18) fmt 10 (10) sz 4092 (4092) myx: 4098 (4098) unmkd
    topx: 14 next:        0x0 free:        0x0

We noticed that free and next are 0x0 while the others are not...

Finally we saw there is isValidObjStack(), that gave us the following results:

call isValidObjStack(markStack) => 1

call isValidObjStack(weaklingStack) => 0
p objStackInvalidBecause = "marking but page is unmarked"

call isValidObjStack(mournQueue) => 0
p objStackInvalidBecause = "marking but page is unmarked"


So we assume that the stack creation is wrong? We are a bit lost in here.

Guille and Pablo



--
_,,,^..^,,,_
best, Eliot




--
_,,,^..^,,,_
best, Eliot




--
_,,,^..^,,,_
best, Eliot






--
_,,,^..^,,,_
best, Eliot





--
_,,,^..^,,,_
best, Eliot
Reply | Threaded
Open this post in threaded view
|

Re: Ephemerons and VM crash

Eliot Miranda-2
 
Hi Guille,

On Wed, May 25, 2016 at 10:43 AM, Eliot Miranda <[hidden email]> wrote:
Hi Guille,

    first, v important. can you put the image(s) you're working with somewhere I can download, and send me a URL?  I can be more helpful if I can run the tests too.


On Wed, May 25, 2016 at 6:24 AM, Guille Polito <[hidden email]> wrote:
 
Hi again!

I snip most of the old thread to keep the relevant.

-------- Original Message --------
1 - The mourn table is being collected because it is not marked upon creation
===============

I decided to move the assignment of marking out into markObjects:

Ok, it looks good for me, I did not know what would be a good place to put them without breaking something :)


2 - Accesses to unscannedEphemerons are unaligned
===============

Looks good.  I've changed all the ephemeron methods to use bytesPerOop, cuz the ephemeron queue contains oops, but your fix is just as good.

cool! I thought so that would be the best.
 


3 - While debugging I also had a DNU #intAt: in #primitiveIntegerAt
==========================

Question: besides reviewing, how should I commit these fixes? one issue per commit? all at once?

Typically all at once.  What I try and do is commit a large set of related changes in a single commit, so I wouldn't mix the fixes to ephemerons with, say, fixes to time.  But equally I'd commit any changes I made to help debugging in the same commit.  And all rules are made to be broken ;-).
ok!

Hey,  _thank you_!  This is really great and ephemerons won't be usable without testing like this.  I'm in your debt.

Ha, there is no debt (or at least it would be in the opposite direction).

Now I'm working on another crash that I can easily reproduce using this script:

Smalltalk supportsQueueingFinalization: true.
e := (1 to: 200000) collect: [ :i | Ephemeron key: (ObjectFinalizer receiver: 'test', 'asd' selector: #logCr) value: Object new ].
Smalltalk garbageCollect.

Why the extra level of indirection?  The Ephemeron /is/ the object finaliser.  It can send finalise to its key.  Why bother with the extra level of wrapper?  It's wasteful; especially if we're attaching lots of ephemerons to things cuz we want to finalise something that has lots of instances.
 
While debugging that in the simulator, we saw with clement that during #fireEphemeronsInRememberedSet, #fireEphemeronsOnEphemeronList we are always firing, tenuring and scavenging ephemerons regardless they are marked or not.

fireEphemeronsInRememberedSet
[SNIP]
         coInterpreter fireEphemeron: ephemeron.
         manager
            storePointerUnchecked: 0
            ofObject: ephemeron
            withValue: (self copyAndForward: (manager keyOfEphemeron: ephemeron)).
        (self scavengeReferentsOf: ephemeron)
[SNIP]

So I put an assertion to check that the key of the ephemeron is marked when scavenging and it is not always the case...
         self assert: (manager isMarked: (manager keyOfEphemeron: ephemeron)).

Is the code wrong? or I'm missing something?

The implementation of ephemerons in the scavenger and the implementation in mark-sweep are quite different.

Remember that what we're trying to do is find out if an object is only referenced from the transitive closure of ephemerons or not.  And the way that we do that in both cases is avoid processing ephemerons whose keys are not yet reachable from the roots until all objects reachable from the roots have been reached.  We do this by putting "unscanned" ephemerons in a queue, saving them until later.

In the scavenger the roots are the objects in the remembered set, the interpreter state (newMethod etc) and the stack zone, and GC is performed by copying all objects reachable from these roots in past and new spaces into future space, possibly tenuring overflowed objects into old space.  In the mark-sweep the roots are the specialObjectsArray, the interpreter state (newMethod etc) and the stack zone.

In the scavenger (a copying collector) this means that when we process the unscanned ephemerons their keys will either have been copied into future space or tenured to old space, in which case they were reachable from the roots, or they will not have been copied yet, in which case they are reachable only from ephemerons.  So in the scavenger marking is irrelevant; in fact /no/ objects should be marked when scavenging.  The important thing is whether a key is in past and new spaces or is in future and old spaces.

In the mark-sweep this means that when we process the unscanned ephemerons their keys will either be marked, in which case they were reachable from the roots, or unmarked, in which case they are reachable only from ephemerons. So the assert for marked-ness only makes sense in ephemeron processing in the mark-sweep collector.

HTH

 
Guille


 


Guille

-------- Original Message --------
I see, I probably I need to leave the simulator run for a longer time... Because running the image using the assertion VM took some time to crash, it may be the same case.

-------- Original Message --------
Some advance on my side,

- Yes, apparently the image was already broken. I started a new one and I can open it with the simulator. It seems that as soon as I activate Ephemerons and I have alive ephemerons the heap corrupts (and since saving the image fires a gc...).

- Then, to catch the moment where the image breaks, I prepared my image for debug with the following script:

Smalltalk snapshotPrimitive.
f := ObjectFinalizer receiver: 'Hello' selector: #logCr.
d := EphemeronDictionary new.
d at: f put: f.
f := nil.
Smalltalk supportsQueueingFinalization: true.
1 to: 2 do: [ :i | Smalltalk garbageCollect].

Then I opened it with the simulator (latest version: VMMaker.oscog-tfel.1864) and I had several assertion failures in

keyOfEphemeron: objOop
    "Answer the object the ephemeron guards.  This is its first element."
    self assert: ((self isNonImmediate: objOop) and: [self isEphemeron: objOop]).
    ^self fetchPointer: 0 ofObject: objOop

followedKeyOfEphemeron: objOop
    "Answer the object the ephemeron guards.  This is its first element."
    self assert: ((self isNonImmediate: objOop) and: [self isEphemeron: objOop]).
    ^self followOopField: 0 ofObject: objOop

So, I noticed the following:

- Ephemerons in the unscanned list are marked, fired, and the format of ephemerons is changed to a non-ephemeron format
- But they are not removed from the unscanned list
- then the mark #markAllUnscannedEphemerons fails tons of assertions because it expects ephemeron objects and the (not-anymore-)ephemeron is still there in the unscanned list

     self markInactiveEphemerons ifFalse:
        [self fireAllUnscannedEphemerons].
     self markAllUnscannedEphemerons

The strange thing is that if I ignore the assertions by proceeding, the thing continues normally and it the image does not break...

So I then compiled the VM, with the latest version, and in trace mode to compare. I attach the log of the stdout. But here there is a résumé

* First I see the same assertions I saw in the simulator

(isNonImmediate(objOop)) && (isEphemeron(objOop)) 41155

(isNonImmediate(objOop)) && (isEphemeron(objOop)) 55477

* Then there are tons of repetitions of the same assertion:

!(((GIV(mournQueue) != GIV(nilObj)) && (isonObjStack(anEphemeronOrWeakArray, GIV(mournQueue))))) 60030

Which I understand it means that we are adding to the mourn queue all the time the same object. Which is strange because it is a WeakArray with a single slot which turns out to be nil (at least at the moment I inspected it).

* Then something breaks (!!) Looks like we are trying to mark a forwarder. And then I have a follow up of assertions which are probably caused by the same thing.

!(isForwarded(field)) 49489

(classIndexOf(objToScan)) > (isForwardedObjectClassIndexPun()) 49537

(classIndex <= (tagMask())) || (classIndex >= (arrayClassIndexPun())) 49403

objCouldBeClassObj(classObj) 49415

addressCouldBeOop(objOop) 59997

addressCouldBeObj(topOfObjStack((0 == (fetchPointerofObject(ObjStackTopx, objStack)) ? fetchPointerofObject(ObjStackNextx, objStack) : objStack))) 60002

addressCouldBeOop(objOop) 59997

* Finally there are some more

!(isForwarded(field)) 49489

until a segmentation fault arrives while executing a fullGC

../results/pharo[0x80ceaa4]
../results/pharo[0x80cfde4]
../results/pharo(fullGC+0x9b)[0x80c95ea]
../results/pharo[0x80e2d49]
../results/pharo[0x80e5f4a]
../results/pharo[0x80ebca3]
../results/pharo(ceStackOverflow+0xd7)[0x809356a]
[0x91002c6]
../results/pharo(interpret+0x128)[0x8082b38]
../results/pharo(main+0x23a)[0x810a7db]
/lib/i386-linux-gnu/i686/cmov/libc.so.6(__libc_start_main+0xf3)[0xf73b5a63]
[0x1000]

I'll continue digging a bit more,
Guille

-------- Original Message --------
Hi Guille,

On Thu, May 19, 2016 at 12:53 AM, Guille Polito <[hidden email][hidden email]> wrote:


-------- Original Message --------
Hi Guille,

On Wed, May 18, 2016 at 12:46 AM, Guille Polito <[hidden email][hidden email]> wrote:
Hello,

-------- Original Message --------
 


Hi Guille, Hi Pablo (and welcome),

On Tue, May 17, 2016 at 8:37 AM, Guille Polito <[hidden email][hidden email]> wrote:

Hi Eliot, list

I'm working here with Pablo (Tesone) on moving forward the Ephemeron implementation.


Where's "here"?   Are you in Lille?
Yup. Pablo is a new Phd student here :)


We first installed Eliot's changeset, added a #mourn method and an EphemeronDictionary collection, and then started testing something like this:

f := ObjectFinalizer receiver: 'Hello' selector: #logCr.
d := EphemeronDictionary new.

d at: f put: f.

f := nil.
Smalltalk garbageCollect.

So this looks like something simulate able.  Are you able to use the simulator?  If not, why not? 

For some reason I have that bytesToShift when opening the image is negative.

That is to be expected.  In the real VM the heap is located somewhere well above the bottom of the address space, typically above the program code.  In the simulator the heap is located either at 0 (an interpreter or stack VM) or immediately above the code zone (in a Cogit VM).  So when an image that has been saved on the real VM is loaded into the simulator all oops have to be adjusted down and hence bytesToShift is negative.


    bytesToShift := objectMemory memoryBaseForImageRead - oldBaseAddr.  "adjust pointers for zero base address"

So I cannot continue loading because addresses become negative and I have "Improper Store into indexable object kind of errors".

Can you post a back trace?  Where is this happening? 
I'm working in latest pharo (5.0), using the same source code that is used in the Pharo CI jobs to build the VM (pharo branded). I would expect this branch to be stable, maybe it could be a bit behind your branch of development, but I think Esteban merges regularly.

Esteban is very busy trying to get UFFI working, and having just got Pharo 5 released and working on Pharo 6 to be released.  I'd prefer it if you used the tip of VMMaker.oscog.  Further, that's the package that Clément is using and he's your expert source of local help.

 

I'm executing this,

vm := StackInterpreterSimulator newWithOptions: #(#ObjectMemory #Spur32BitMemoryManager ).
vm desiredNumStackPages: 8. "Makes simulation faster by creating fewer stack pages."
(vm openOn: '/home/guille/Pharo/temp/ephemerons/broken.image')
    openAsMorph; run.

And here is the trace. The error happens while swizzling.

LittleEndianBitmap(Object)>>error:
LittleEndianBitmap(Object)>>errorImproperStore
LittleEndianBitmap(Object)>>at:put:
Spur32BitMMLESimulator>>longAt:put:
Spur32BitMMLESimulator(SpurMemoryManager)>>swizzleFieldsOfObject:
Spur32BitMMLESimulator(SpurMemoryManager)>>adjustAllOopsBy:
Spur32BitMMLESimulator(SpurMemoryManager)>>initializeObjectMemory:
StackInterpreterSimulatorLSB(StackInterpreter)>>initializeInterpreter:
[ self initializeInterpreter: bytesToShift ] in StackInterpreterSimulatorLSB(StackInterpreterSimulator)>>openOn:extraMemory: in Block: [ self initializeInterpreter: bytesToShift ]

 So the image is presumably already broken?  Looks to me like it was corrupted when saved.   Can you reproduce the problem that led to this image being broken from a working image?  
What version of VMMaker.oscog are you using?
Monticello reports VMMaker.oscog-eem.1855.
However, I'm not sure I have the right number, because source code is now in git.

It would be so great if you could use Monticello, not git, and hence be able to contribute back.
 
  Are you running in Pharo or Squeak?  If you're in Lille you could perhaps visit Clément's office and get him to take a look.  Clément, would that be ok?
I'll maybe see it with clement, but today I have to work on some other project...
When debugging the VM there are two main levels of support, one is the simulator where there is maximum support for debugging:
- asserts on all the time
- arbitrary breakpoints
- attempting every GC in a copy of the heap before doing the real GC so that bugs in the GC can be investigated without needing to construct a reproducible case after a crash
- the Smalltalk environment to inspect and browse

The next level is the assert and debug VMs.  If you look in the build directories on the Cog svn branch you'll see that all of them build three VMs, a production VM with maximum optimisation and asserts excluded, an assert VM with -O1 and asserts enabled, and a debug VM with -O0 and asserts enabled.  So if you either don't see the bug in the simulator, or the simulator is too slow for the case being examined, or if the bug doesn't show up in the simulator (the worst of all worlds), build both assert and debug VMs and run with the assert VM first.

Well so far we were using a VM compiled for debug with a graphical C debugger. It was not so bad. However, I cannot say I'm missing a better debugger.

"Compiled for debug" is vague.  Do you mean it is compiled with -g -O0, or in addition is compiled with -g -O0 -DDEBUGVM=0 -DNDEBUG=1?
 
That would exactly be:

{ '-g3'.
        '-O0'.
        '-msse2'.
        '-D_GNU_SOURCE'.
        '-DITIMER_HEARTBEAT=1'.
        '-DNO_VM_PROFILE=1'.
        '-DDEBUGVM=1'.
        '-DNDEBUG' }

And Esteban told me to compile it with '-UNDEBUG' to get more information, but I did not try this yet.

Yes.  NDEBUG is what turns off asserts.  You don't have a proper assert or debug VM unless you do not use -NDEBUG.

Thanks!

Note that there is a heap leak checker which can be enabled both in the simulator and the assert and debug VMs.  See the checkForLeaks method and the -leakcheck argument.

ok!

Without the simulator or the assert and debug VMs you are flying blind.  It is /really/ productive to use the simulator for debugging, provided the bug is reproducible within a short amount of time, as for example your case is above.

Ok, gotcha! By this afternoon I'll have some news probably.

Thanks a lot!





However, as soon as we garbage collect twice, we have a VM crash. We started debugging the VM to see if we could have some more clues.

The first thing we noticed is that the first time the GC runs, the mournQueue is nil. This is of course expected because the new finalization mechanism was not active and then there was no need to create the mournQueue. We saw that the mournQueue is actually created in a lazy fashion when putting queuing a mourned object (I refer myself to #queueMourner: and #ensureRoomOnObjStackAt:). So the second time the GC passes, the mournQueue is there. So far ok, but still crashing.

The crash happens in the call to

markAndTraceObjStackandContents(GIV(mournQueue), 1);

after the

    if (!markAndTraceContents) {
       return;
    }

But when understanding why, it starts being less clear to us :). We used the printObjStack() function and we saw that:

call printObjStack(markStack)
call printObjStack(weaklingStack)

and we saw in the console some output that makes sense. However, printing the mournQueue in the same manner produces some strange output

call printObjStack(mournQueue)

head  0xb06e980 cx 18 (18) fmt 10 (10) sz 4092 (4092) myx: 4098 (4098) unmkd
    topx: 14 next:        0x0 free:        0x0

We noticed that free and next are 0x0 while the others are not...

Finally we saw there is isValidObjStack(), that gave us the following results:

call isValidObjStack(markStack) => 1

call isValidObjStack(weaklingStack) => 0
p objStackInvalidBecause = "marking but page is unmarked"

call isValidObjStack(mournQueue) => 0
p objStackInvalidBecause = "marking but page is unmarked"


So we assume that the stack creation is wrong? We are a bit lost in here.

Guille and Pablo



--
_,,,^..^,,,_
best, Eliot




--
_,,,^..^,,,_
best, Eliot




--
_,,,^..^,,,_
best, Eliot






--
_,,,^..^,,,_
best, Eliot





--
_,,,^..^,,,_
best, Eliot



--
_,,,^..^,,,_
best, Eliot
Reply | Threaded
Open this post in threaded view
|

Re: Ephemerons and VM crash

Eliot Miranda-2
 


On Wed, May 25, 2016 at 10:46 AM, Eliot Miranda <[hidden email]> wrote:
Hi Guille,

On Wed, May 25, 2016 at 10:43 AM, Eliot Miranda <[hidden email]> wrote:
Hi Guille,

    first, v important. can you put the image(s) you're working with somewhere I can download, and send me a URL?  I can be more helpful if I can run the tests too.


On Wed, May 25, 2016 at 6:24 AM, Guille Polito <[hidden email]> wrote:
 
Hi again!

I snip most of the old thread to keep the relevant.

-------- Original Message --------
1 - The mourn table is being collected because it is not marked upon creation
===============

I decided to move the assignment of marking out into markObjects:

Ok, it looks good for me, I did not know what would be a good place to put them without breaking something :)


2 - Accesses to unscannedEphemerons are unaligned
===============

Looks good.  I've changed all the ephemeron methods to use bytesPerOop, cuz the ephemeron queue contains oops, but your fix is just as good.

cool! I thought so that would be the best.
 


3 - While debugging I also had a DNU #intAt: in #primitiveIntegerAt
==========================

Question: besides reviewing, how should I commit these fixes? one issue per commit? all at once?

Typically all at once.  What I try and do is commit a large set of related changes in a single commit, so I wouldn't mix the fixes to ephemerons with, say, fixes to time.  But equally I'd commit any changes I made to help debugging in the same commit.  And all rules are made to be broken ;-).
ok!

Hey,  _thank you_!  This is really great and ephemerons won't be usable without testing like this.  I'm in your debt.

Ha, there is no debt (or at least it would be in the opposite direction).

Now I'm working on another crash that I can easily reproduce using this script:

Smalltalk supportsQueueingFinalization: true.
e := (1 to: 200000) collect: [ :i | Ephemeron key: (ObjectFinalizer receiver: 'test', 'asd' selector: #logCr) value: Object new ].
Smalltalk garbageCollect.

Why the extra level of indirection?  The Ephemeron /is/ the object finaliser.  It can send finalise to its key.  Why bother with the extra level of wrapper?  It's wasteful; especially if we're attaching lots of ephemerons to things cuz we want to finalise something that has lots of instances.

Ignore this stupid question.  I see it's an old facility.  And it is a helpful example :-).  Sorry for the noise!
 
 
While debugging that in the simulator, we saw with clement that during #fireEphemeronsInRememberedSet, #fireEphemeronsOnEphemeronList we are always firing, tenuring and scavenging ephemerons regardless they are marked or not.

fireEphemeronsInRememberedSet
[SNIP]
         coInterpreter fireEphemeron: ephemeron.
         manager
            storePointerUnchecked: 0
            ofObject: ephemeron
            withValue: (self copyAndForward: (manager keyOfEphemeron: ephemeron)).
        (self scavengeReferentsOf: ephemeron)
[SNIP]

So I put an assertion to check that the key of the ephemeron is marked when scavenging and it is not always the case...
         self assert: (manager isMarked: (manager keyOfEphemeron: ephemeron)).

Is the code wrong? or I'm missing something?

The implementation of ephemerons in the scavenger and the implementation in mark-sweep are quite different.

Remember that what we're trying to do is find out if an object is only referenced from the transitive closure of ephemerons or not.  And the way that we do that in both cases is avoid processing ephemerons whose keys are not yet reachable from the roots until all objects reachable from the roots have been reached.  We do this by putting "unscanned" ephemerons in a queue, saving them until later.

In the scavenger the roots are the objects in the remembered set, the interpreter state (newMethod etc) and the stack zone, and GC is performed by copying all objects reachable from these roots in past and new spaces into future space, possibly tenuring overflowed objects into old space.  In the mark-sweep the roots are the specialObjectsArray, the interpreter state (newMethod etc) and the stack zone.

In the scavenger (a copying collector) this means that when we process the unscanned ephemerons their keys will either have been copied into future space or tenured to old space, in which case they were reachable from the roots, or they will not have been copied yet, in which case they are reachable only from ephemerons.  So in the scavenger marking is irrelevant; in fact /no/ objects should be marked when scavenging.  The important thing is whether a key is in past and new spaces or is in future and old spaces.

In the mark-sweep this means that when we process the unscanned ephemerons their keys will either be marked, in which case they were reachable from the roots, or unmarked, in which case they are reachable only from ephemerons. So the assert for marked-ness only makes sense in ephemeron processing in the mark-sweep collector.

HTH

 
Guille


 


Guille

-------- Original Message --------
I see, I probably I need to leave the simulator run for a longer time... Because running the image using the assertion VM took some time to crash, it may be the same case.

-------- Original Message --------
Some advance on my side,

- Yes, apparently the image was already broken. I started a new one and I can open it with the simulator. It seems that as soon as I activate Ephemerons and I have alive ephemerons the heap corrupts (and since saving the image fires a gc...).

- Then, to catch the moment where the image breaks, I prepared my image for debug with the following script:

Smalltalk snapshotPrimitive.
f := ObjectFinalizer receiver: 'Hello' selector: #logCr.
d := EphemeronDictionary new.
d at: f put: f.
f := nil.
Smalltalk supportsQueueingFinalization: true.
1 to: 2 do: [ :i | Smalltalk garbageCollect].

Then I opened it with the simulator (latest version: VMMaker.oscog-tfel.1864) and I had several assertion failures in

keyOfEphemeron: objOop
    "Answer the object the ephemeron guards.  This is its first element."
    self assert: ((self isNonImmediate: objOop) and: [self isEphemeron: objOop]).
    ^self fetchPointer: 0 ofObject: objOop

followedKeyOfEphemeron: objOop
    "Answer the object the ephemeron guards.  This is its first element."
    self assert: ((self isNonImmediate: objOop) and: [self isEphemeron: objOop]).
    ^self followOopField: 0 ofObject: objOop

So, I noticed the following:

- Ephemerons in the unscanned list are marked, fired, and the format of ephemerons is changed to a non-ephemeron format
- But they are not removed from the unscanned list
- then the mark #markAllUnscannedEphemerons fails tons of assertions because it expects ephemeron objects and the (not-anymore-)ephemeron is still there in the unscanned list

     self markInactiveEphemerons ifFalse:
        [self fireAllUnscannedEphemerons].
     self markAllUnscannedEphemerons

The strange thing is that if I ignore the assertions by proceeding, the thing continues normally and it the image does not break...

So I then compiled the VM, with the latest version, and in trace mode to compare. I attach the log of the stdout. But here there is a résumé

* First I see the same assertions I saw in the simulator

(isNonImmediate(objOop)) && (isEphemeron(objOop)) 41155

(isNonImmediate(objOop)) && (isEphemeron(objOop)) 55477

* Then there are tons of repetitions of the same assertion:

!(((GIV(mournQueue) != GIV(nilObj)) && (isonObjStack(anEphemeronOrWeakArray, GIV(mournQueue))))) 60030

Which I understand it means that we are adding to the mourn queue all the time the same object. Which is strange because it is a WeakArray with a single slot which turns out to be nil (at least at the moment I inspected it).

* Then something breaks (!!) Looks like we are trying to mark a forwarder. And then I have a follow up of assertions which are probably caused by the same thing.

!(isForwarded(field)) 49489

(classIndexOf(objToScan)) > (isForwardedObjectClassIndexPun()) 49537

(classIndex <= (tagMask())) || (classIndex >= (arrayClassIndexPun())) 49403

objCouldBeClassObj(classObj) 49415

addressCouldBeOop(objOop) 59997

addressCouldBeObj(topOfObjStack((0 == (fetchPointerofObject(ObjStackTopx, objStack)) ? fetchPointerofObject(ObjStackNextx, objStack) : objStack))) 60002

addressCouldBeOop(objOop) 59997

* Finally there are some more

!(isForwarded(field)) 49489

until a segmentation fault arrives while executing a fullGC

../results/pharo[0x80ceaa4]
../results/pharo[0x80cfde4]
../results/pharo(fullGC+0x9b)[0x80c95ea]
../results/pharo[0x80e2d49]
../results/pharo[0x80e5f4a]
../results/pharo[0x80ebca3]
../results/pharo(ceStackOverflow+0xd7)[0x809356a]
[0x91002c6]
../results/pharo(interpret+0x128)[0x8082b38]
../results/pharo(main+0x23a)[0x810a7db]
/lib/i386-linux-gnu/i686/cmov/libc.so.6(__libc_start_main+0xf3)[0xf73b5a63]
[0x1000]

I'll continue digging a bit more,
Guille

-------- Original Message --------
Hi Guille,

On Thu, May 19, 2016 at 12:53 AM, Guille Polito <[hidden email][hidden email]> wrote:


-------- Original Message --------
Hi Guille,

On Wed, May 18, 2016 at 12:46 AM, Guille Polito <[hidden email][hidden email]> wrote:
Hello,

-------- Original Message --------
 


Hi Guille, Hi Pablo (and welcome),

On Tue, May 17, 2016 at 8:37 AM, Guille Polito <[hidden email][hidden email]> wrote:

Hi Eliot, list

I'm working here with Pablo (Tesone) on moving forward the Ephemeron implementation.


Where's "here"?   Are you in Lille?
Yup. Pablo is a new Phd student here :)


We first installed Eliot's changeset, added a #mourn method and an EphemeronDictionary collection, and then started testing something like this:

f := ObjectFinalizer receiver: 'Hello' selector: #logCr.
d := EphemeronDictionary new.

d at: f put: f.

f := nil.
Smalltalk garbageCollect.

So this looks like something simulate able.  Are you able to use the simulator?  If not, why not? 

For some reason I have that bytesToShift when opening the image is negative.

That is to be expected.  In the real VM the heap is located somewhere well above the bottom of the address space, typically above the program code.  In the simulator the heap is located either at 0 (an interpreter or stack VM) or immediately above the code zone (in a Cogit VM).  So when an image that has been saved on the real VM is loaded into the simulator all oops have to be adjusted down and hence bytesToShift is negative.


    bytesToShift := objectMemory memoryBaseForImageRead - oldBaseAddr.  "adjust pointers for zero base address"

So I cannot continue loading because addresses become negative and I have "Improper Store into indexable object kind of errors".

Can you post a back trace?  Where is this happening? 
I'm working in latest pharo (5.0), using the same source code that is used in the Pharo CI jobs to build the VM (pharo branded). I would expect this branch to be stable, maybe it could be a bit behind your branch of development, but I think Esteban merges regularly.

Esteban is very busy trying to get UFFI working, and having just got Pharo 5 released and working on Pharo 6 to be released.  I'd prefer it if you used the tip of VMMaker.oscog.  Further, that's the package that Clément is using and he's your expert source of local help.

 

I'm executing this,

vm := StackInterpreterSimulator newWithOptions: #(#ObjectMemory #Spur32BitMemoryManager ).
vm desiredNumStackPages: 8. "Makes simulation faster by creating fewer stack pages."
(vm openOn: '/home/guille/Pharo/temp/ephemerons/broken.image')
    openAsMorph; run.

And here is the trace. The error happens while swizzling.

LittleEndianBitmap(Object)>>error:
LittleEndianBitmap(Object)>>errorImproperStore
LittleEndianBitmap(Object)>>at:put:
Spur32BitMMLESimulator>>longAt:put:
Spur32BitMMLESimulator(SpurMemoryManager)>>swizzleFieldsOfObject:
Spur32BitMMLESimulator(SpurMemoryManager)>>adjustAllOopsBy:
Spur32BitMMLESimulator(SpurMemoryManager)>>initializeObjectMemory:
StackInterpreterSimulatorLSB(StackInterpreter)>>initializeInterpreter:
[ self initializeInterpreter: bytesToShift ] in StackInterpreterSimulatorLSB(StackInterpreterSimulator)>>openOn:extraMemory: in Block: [ self initializeInterpreter: bytesToShift ]

 So the image is presumably already broken?  Looks to me like it was corrupted when saved.   Can you reproduce the problem that led to this image being broken from a working image?  
What version of VMMaker.oscog are you using?
Monticello reports VMMaker.oscog-eem.1855.
However, I'm not sure I have the right number, because source code is now in git.

It would be so great if you could use Monticello, not git, and hence be able to contribute back.
 
  Are you running in Pharo or Squeak?  If you're in Lille you could perhaps visit Clément's office and get him to take a look.  Clément, would that be ok?
I'll maybe see it with clement, but today I have to work on some other project...
When debugging the VM there are two main levels of support, one is the simulator where there is maximum support for debugging:
- asserts on all the time
- arbitrary breakpoints
- attempting every GC in a copy of the heap before doing the real GC so that bugs in the GC can be investigated without needing to construct a reproducible case after a crash
- the Smalltalk environment to inspect and browse

The next level is the assert and debug VMs.  If you look in the build directories on the Cog svn branch you'll see that all of them build three VMs, a production VM with maximum optimisation and asserts excluded, an assert VM with -O1 and asserts enabled, and a debug VM with -O0 and asserts enabled.  So if you either don't see the bug in the simulator, or the simulator is too slow for the case being examined, or if the bug doesn't show up in the simulator (the worst of all worlds), build both assert and debug VMs and run with the assert VM first.

Well so far we were using a VM compiled for debug with a graphical C debugger. It was not so bad. However, I cannot say I'm missing a better debugger.

"Compiled for debug" is vague.  Do you mean it is compiled with -g -O0, or in addition is compiled with -g -O0 -DDEBUGVM=0 -DNDEBUG=1?
 
That would exactly be:

{ '-g3'.
        '-O0'.
        '-msse2'.
        '-D_GNU_SOURCE'.
        '-DITIMER_HEARTBEAT=1'.
        '-DNO_VM_PROFILE=1'.
        '-DDEBUGVM=1'.
        '-DNDEBUG' }

And Esteban told me to compile it with '-UNDEBUG' to get more information, but I did not try this yet.

Yes.  NDEBUG is what turns off asserts.  You don't have a proper assert or debug VM unless you do not use -NDEBUG.

Thanks!

Note that there is a heap leak checker which can be enabled both in the simulator and the assert and debug VMs.  See the checkForLeaks method and the -leakcheck argument.

ok!

Without the simulator or the assert and debug VMs you are flying blind.  It is /really/ productive to use the simulator for debugging, provided the bug is reproducible within a short amount of time, as for example your case is above.

Ok, gotcha! By this afternoon I'll have some news probably.

Thanks a lot!





However, as soon as we garbage collect twice, we have a VM crash. We started debugging the VM to see if we could have some more clues.

The first thing we noticed is that the first time the GC runs, the mournQueue is nil. This is of course expected because the new finalization mechanism was not active and then there was no need to create the mournQueue. We saw that the mournQueue is actually created in a lazy fashion when putting queuing a mourned object (I refer myself to #queueMourner: and #ensureRoomOnObjStackAt:). So the second time the GC passes, the mournQueue is there. So far ok, but still crashing.

The crash happens in the call to

markAndTraceObjStackandContents(GIV(mournQueue), 1);

after the

    if (!markAndTraceContents) {
       return;
    }

But when understanding why, it starts being less clear to us :). We used the printObjStack() function and we saw that:

call printObjStack(markStack)
call printObjStack(weaklingStack)

and we saw in the console some output that makes sense. However, printing the mournQueue in the same manner produces some strange output

call printObjStack(mournQueue)

head  0xb06e980 cx 18 (18) fmt 10 (10) sz 4092 (4092) myx: 4098 (4098) unmkd
    topx: 14 next:        0x0 free:        0x0

We noticed that free and next are 0x0 while the others are not...

Finally we saw there is isValidObjStack(), that gave us the following results:

call isValidObjStack(markStack) => 1

call isValidObjStack(weaklingStack) => 0
p objStackInvalidBecause = "marking but page is unmarked"

call isValidObjStack(mournQueue) => 0
p objStackInvalidBecause = "marking but page is unmarked"


So we assume that the stack creation is wrong? We are a bit lost in here.

Guille and Pablo



--
_,,,^..^,,,_
best, Eliot




--
_,,,^..^,,,_
best, Eliot




--
_,,,^..^,,,_
best, Eliot






--
_,,,^..^,,,_
best, Eliot





--
_,,,^..^,,,_
best, Eliot



--
_,,,^..^,,,_
best, Eliot



--
_,,,^..^,,,_
best, Eliot
Reply | Threaded
Open this post in threaded view
|

Re: Ephemerons and VM crash

Guillermo Polito
 
Hi Eliot,

- Regarding yesterday's email: yesterday I took the time with Pablo to read and understand the entire scavenging and I understood that my first email was nonsense :).

* We were thinking on enhancing the comment in #scavengeUnfiredEphemeronsInRememberedSet to explain how the remembered set is managed. It was a bit unclear at first for us that the ephemerons to fire were being swapped to occupy the first places, which was important afterwards for firing them.

* Also, there is something that we think is a bug: in the whole ephemeron processing during scavenging, we check always if the key has survived the scavenge. However, we never check if the key is old. If the key is old it looks like the algorithm is treating the ephemerons as ephemerons to fire, while it should not, isn't it?


- Regarding the images to test, I'm actually using a plain Pharo image with your changes (+ modifications in the class builder to create Ephemeric classes). The snippets that I share crash those images.

I'll try to push these changes into Pharo to make it even easier to reproduce.

- Then, about the finalizer is just a facility as you say. The important thing is that with those snippets I can reproduce the crashes with almost 100% probability and no manual intervention :).

Guille

-------- Original Message --------
 




On Wed, May 25, 2016 at 10:46 AM, Eliot Miranda <[hidden email]> wrote:
Hi Guille,

On Wed, May 25, 2016 at 10:43 AM, Eliot Miranda <[hidden email]> wrote:
Hi Guille,

    first, v important. can you put the image(s) you're working with somewhere I can download, and send me a URL?  I can be more helpful if I can run the tests too.


On Wed, May 25, 2016 at 6:24 AM, Guille Polito <[hidden email]> wrote:
 
Hi again!

I snip most of the old thread to keep the relevant.

-------- Original Message --------
1 - The mourn table is being collected because it is not marked upon creation
===============

I decided to move the assignment of marking out into markObjects:

Ok, it looks good for me, I did not know what would be a good place to put them without breaking something :)


2 - Accesses to unscannedEphemerons are unaligned
===============

Looks good.  I've changed all the ephemeron methods to use bytesPerOop, cuz the ephemeron queue contains oops, but your fix is just as good.

cool! I thought so that would be the best.
 


3 - While debugging I also had a DNU #intAt: in #primitiveIntegerAt
==========================

Question: besides reviewing, how should I commit these fixes? one issue per commit? all at once?

Typically all at once.  What I try and do is commit a large set of related changes in a single commit, so I wouldn't mix the fixes to ephemerons with, say, fixes to time.  But equally I'd commit any changes I made to help debugging in the same commit.  And all rules are made to be broken ;-).
ok!

Hey,  _thank you_!  This is really great and ephemerons won't be usable without testing like this.  I'm in your debt.

Ha, there is no debt (or at least it would be in the opposite direction).

Now I'm working on another crash that I can easily reproduce using this script:

Smalltalk supportsQueueingFinalization: true.
e := (1 to: 200000) collect: [ :i | Ephemeron key: (ObjectFinalizer receiver: 'test', 'asd' selector: #logCr) value: Object new ].
Smalltalk garbageCollect.

Why the extra level of indirection?  The Ephemeron /is/ the object finaliser.  It can send finalise to its key.  Why bother with the extra level of wrapper?  It's wasteful; especially if we're attaching lots of ephemerons to things cuz we want to finalise something that has lots of instances.

Ignore this stupid question.  I see it's an old facility.  And it is a helpful example :-).  Sorry for the noise!
 
 
While debugging that in the simulator, we saw with clement that during #fireEphemeronsInRememberedSet, #fireEphemeronsOnEphemeronList we are always firing, tenuring and scavenging ephemerons regardless they are marked or not.

fireEphemeronsInRememberedSet
[SNIP]
         coInterpreter fireEphemeron: ephemeron.
         manager
            storePointerUnchecked: 0
            ofObject: ephemeron
            withValue: (self copyAndForward: (manager keyOfEphemeron: ephemeron)).
        (self scavengeReferentsOf: ephemeron)
[SNIP]

So I put an assertion to check that the key of the ephemeron is marked when scavenging and it is not always the case...
         self assert: (manager isMarked: (manager keyOfEphemeron: ephemeron)).

Is the code wrong? or I'm missing something?

The implementation of ephemerons in the scavenger and the implementation in mark-sweep are quite different.

Remember that what we're trying to do is find out if an object is only referenced from the transitive closure of ephemerons or not.  And the way that we do that in both cases is avoid processing ephemerons whose keys are not yet reachable from the roots until all objects reachable from the roots have been reached.  We do this by putting "unscanned" ephemerons in a queue, saving them until later.

In the scavenger the roots are the objects in the remembered set, the interpreter state (newMethod etc) and the stack zone, and GC is performed by copying all objects reachable from these roots in past and new spaces into future space, possibly tenuring overflowed objects into old space.  In the mark-sweep the roots are the specialObjectsArray, the interpreter state (newMethod etc) and the stack zone.

In the scavenger (a copying collector) this means that when we process the unscanned ephemerons their keys will either have been copied into future space or tenured to old space, in which case they were reachable from the roots, or they will not have been copied yet, in which case they are reachable only from ephemerons.  So in the scavenger marking is irrelevant; in fact /no/ objects should be marked when scavenging.  The important thing is whether a key is in past and new spaces or is in future and old spaces.

In the mark-sweep this means that when we process the unscanned ephemerons their keys will either be marked, in which case they were reachable from the roots, or unmarked, in which case they are reachable only from ephemerons. So the assert for marked-ness only makes sense in ephemeron processing in the mark-sweep collector.

HTH

 
Guille


 


Guille

-------- Original Message --------
I see, I probably I need to leave the simulator run for a longer time... Because running the image using the assertion VM took some time to crash, it may be the same case.

-------- Original Message --------
Some advance on my side,

- Yes, apparently the image was already broken. I started a new one and I can open it with the simulator. It seems that as soon as I activate Ephemerons and I have alive ephemerons the heap corrupts (and since saving the image fires a gc...).

- Then, to catch the moment where the image breaks, I prepared my image for debug with the following script:

Smalltalk snapshotPrimitive.
f := ObjectFinalizer receiver: 'Hello' selector: #logCr.
d := EphemeronDictionary new.
d at: f put: f.
f := nil.
Smalltalk supportsQueueingFinalization: true.
1 to: 2 do: [ :i | Smalltalk garbageCollect].

Then I opened it with the simulator (latest version: VMMaker.oscog-tfel.1864) and I had several assertion failures in

keyOfEphemeron: objOop
    "Answer the object the ephemeron guards.  This is its first element."
    self assert: ((self isNonImmediate: objOop) and: [self isEphemeron: objOop]).
    ^self fetchPointer: 0 ofObject: objOop

followedKeyOfEphemeron: objOop
    "Answer the object the ephemeron guards.  This is its first element."
    self assert: ((self isNonImmediate: objOop) and: [self isEphemeron: objOop]).
    ^self followOopField: 0 ofObject: objOop

So, I noticed the following:

- Ephemerons in the unscanned list are marked, fired, and the format of ephemerons is changed to a non-ephemeron format
- But they are not removed from the unscanned list
- then the mark #markAllUnscannedEphemerons fails tons of assertions because it expects ephemeron objects and the (not-anymore-)ephemeron is still there in the unscanned list

     self markInactiveEphemerons ifFalse:
        [self fireAllUnscannedEphemerons].
     self markAllUnscannedEphemerons

The strange thing is that if I ignore the assertions by proceeding, the thing continues normally and it the image does not break...

So I then compiled the VM, with the latest version, and in trace mode to compare. I attach the log of the stdout. But here there is a résumé

* First I see the same assertions I saw in the simulator

(isNonImmediate(objOop)) && (isEphemeron(objOop)) 41155

(isNonImmediate(objOop)) && (isEphemeron(objOop)) 55477

* Then there are tons of repetitions of the same assertion:

!(((GIV(mournQueue) != GIV(nilObj)) && (isonObjStack(anEphemeronOrWeakArray, GIV(mournQueue))))) 60030

Which I understand it means that we are adding to the mourn queue all the time the same object. Which is strange because it is a WeakArray with a single slot which turns out to be nil (at least at the moment I inspected it).

* Then something breaks (!!) Looks like we are trying to mark a forwarder. And then I have a follow up of assertions which are probably caused by the same thing.

!(isForwarded(field)) 49489

(classIndexOf(objToScan)) > (isForwardedObjectClassIndexPun()) 49537

(classIndex <= (tagMask())) || (classIndex >= (arrayClassIndexPun())) 49403

objCouldBeClassObj(classObj) 49415

addressCouldBeOop(objOop) 59997

addressCouldBeObj(topOfObjStack((0 == (fetchPointerofObject(ObjStackTopx, objStack)) ? fetchPointerofObject(ObjStackNextx, objStack) : objStack))) 60002

addressCouldBeOop(objOop) 59997

* Finally there are some more

!(isForwarded(field)) 49489

until a segmentation fault arrives while executing a fullGC

../results/pharo[0x80ceaa4]
../results/pharo[0x80cfde4]
../results/pharo(fullGC+0x9b)[0x80c95ea]
../results/pharo[0x80e2d49]
../results/pharo[0x80e5f4a]
../results/pharo[0x80ebca3]
../results/pharo(ceStackOverflow+0xd7)[0x809356a]
[0x91002c6]
../results/pharo(interpret+0x128)[0x8082b38]
../results/pharo(main+0x23a)[0x810a7db]
/lib/i386-linux-gnu/i686/cmov/libc.so.6(__libc_start_main+0xf3)[0xf73b5a63]
[0x1000]

I'll continue digging a bit more,
Guille

-------- Original Message --------
Hi Guille,

On Thu, May 19, 2016 at 12:53 AM, Guille Polito <[hidden email]> wrote:


-------- Original Message --------
Hi Guille,

On Wed, May 18, 2016 at 12:46 AM, Guille Polito <[hidden email]> wrote:
Hello,

-------- Original Message --------
 


Hi Guille, Hi Pablo (and welcome),

On Tue, May 17, 2016 at 8:37 AM, Guille Polito <[hidden email]> wrote:

Hi Eliot, list

I'm working here with Pablo (Tesone) on moving forward the Ephemeron implementation.


Where's "here"?   Are you in Lille?
Yup. Pablo is a new Phd student here :)


We first installed Eliot's changeset, added a #mourn method and an EphemeronDictionary collection, and then started testing something like this:

f := ObjectFinalizer receiver: 'Hello' selector: #logCr.
d := EphemeronDictionary new.

d at: f put: f.

f := nil.
Smalltalk garbageCollect.

So this looks like something simulate able.  Are you able to use the simulator?  If not, why not? 

For some reason I have that bytesToShift when opening the image is negative.

That is to be expected.  In the real VM the heap is located somewhere well above the bottom of the address space, typically above the program code.  In the simulator the heap is located either at 0 (an interpreter or stack VM) or immediately above the code zone (in a Cogit VM).  So when an image that has been saved on the real VM is loaded into the simulator all oops have to be adjusted down and hence bytesToShift is negative.


    bytesToShift := objectMemory memoryBaseForImageRead - oldBaseAddr.  "adjust pointers for zero base address"

So I cannot continue loading because addresses become negative and I have "Improper Store into indexable object kind of errors".

Can you post a back trace?  Where is this happening? 
I'm working in latest pharo (5.0), using the same source code that is used in the Pharo CI jobs to build the VM (pharo branded). I would expect this branch to be stable, maybe it could be a bit behind your branch of development, but I think Esteban merges regularly.

Esteban is very busy trying to get UFFI working, and having just got Pharo 5 released and working on Pharo 6 to be released.  I'd prefer it if you used the tip of VMMaker.oscog.  Further, that's the package that Clément is using and he's your expert source of local help.

 

I'm executing this,

vm := StackInterpreterSimulator newWithOptions: #(#ObjectMemory #Spur32BitMemoryManager ).
vm desiredNumStackPages: 8. "Makes simulation faster by creating fewer stack pages."
(vm openOn: '/home/guille/Pharo/temp/ephemerons/broken.image')
    openAsMorph; run.

And here is the trace. The error happens while swizzling.

LittleEndianBitmap(Object)>>error:
LittleEndianBitmap(Object)>>errorImproperStore
LittleEndianBitmap(Object)>>at:put:
Spur32BitMMLESimulator>>longAt:put:
Spur32BitMMLESimulator(SpurMemoryManager)>>swizzleFieldsOfObject:
Spur32BitMMLESimulator(SpurMemoryManager)>>adjustAllOopsBy:
Spur32BitMMLESimulator(SpurMemoryManager)>>initializeObjectMemory:
StackInterpreterSimulatorLSB(StackInterpreter)>>initializeInterpreter:
[ self initializeInterpreter: bytesToShift ] in StackInterpreterSimulatorLSB(StackInterpreterSimulator)>>openOn:extraMemory: in Block: [ self initializeInterpreter: bytesToShift ]

 So the image is presumably already broken?  Looks to me like it was corrupted when saved.   Can you reproduce the problem that led to this image being broken from a working image?  
What version of VMMaker.oscog are you using?
Monticello reports VMMaker.oscog-eem.1855.
However, I'm not sure I have the right number, because source code is now in git.

It would be so great if you could use Monticello, not git, and hence be able to contribute back.
 
  Are you running in Pharo or Squeak?  If you're in Lille you could perhaps visit Clément's office and get him to take a look.  Clément, would that be ok?
I'll maybe see it with clement, but today I have to work on some other project...
When debugging the VM there are two main levels of support, one is the simulator where there is maximum support for debugging:
- asserts on all the time
- arbitrary breakpoints
- attempting every GC in a copy of the heap before doing the real GC so that bugs in the GC can be investigated without needing to construct a reproducible case after a crash
- the Smalltalk environment to inspect and browse

The next level is the assert and debug VMs.  If you look in the build directories on the Cog svn branch you'll see that all of them build three VMs, a production VM with maximum optimisation and asserts excluded, an assert VM with -O1 and asserts enabled, and a debug VM with -O0 and asserts enabled.  So if you either don't see the bug in the simulator, or the simulator is too slow for the case being examined, or if the bug doesn't show up in the simulator (the worst of all worlds), build both assert and debug VMs and run with the assert VM first.

Well so far we were using a VM compiled for debug with a graphical C debugger. It was not so bad. However, I cannot say I'm missing a better debugger.

"Compiled for debug" is vague.  Do you mean it is compiled with -g -O0, or in addition is compiled with -g -O0 -DDEBUGVM=0 -DNDEBUG=1?
 
That would exactly be:

{ '-g3'.
        '-O0'.
        '-msse2'.
        '-D_GNU_SOURCE'.
        '-DITIMER_HEARTBEAT=1'.
        '-DNO_VM_PROFILE=1'.
        '-DDEBUGVM=1'.
        '-DNDEBUG' }

And Esteban told me to compile it with '-UNDEBUG' to get more information, but I did not try this yet.

Yes.  NDEBUG is what turns off asserts.  You don't have a proper assert or debug VM unless you do not use -NDEBUG.

Thanks!

Note that there is a heap leak checker which can be enabled both in the simulator and the assert and debug VMs.  See the checkForLeaks method and the -leakcheck argument.

ok!

Without the simulator or the assert and debug VMs you are flying blind.  It is /really/ productive to use the simulator for debugging, provided the bug is reproducible within a short amount of time, as for example your case is above.

Ok, gotcha! By this afternoon I'll have some news probably.

Thanks a lot!





However, as soon as we garbage collect twice, we have a VM crash. We started debugging the VM to see if we could have some more clues.

The first thing we noticed is that the first time the GC runs, the mournQueue is nil. This is of course expected because the new finalization mechanism was not active and then there was no need to create the mournQueue. We saw that the mournQueue is actually created in a lazy fashion when putting queuing a mourned object (I refer myself to #queueMourner: and #ensureRoomOnObjStackAt:). So the second time the GC passes, the mournQueue is there. So far ok, but still crashing.

The crash happens in the call to

markAndTraceObjStackandContents(GIV(mournQueue), 1);

after the

    if (!markAndTraceContents) {
       return;
    }

But when understanding why, it starts being less clear to us :). We used the printObjStack() function and we saw that:

call printObjStack(markStack)
call printObjStack(weaklingStack)

and we saw in the console some output that makes sense. However, printing the mournQueue in the same manner produces some strange output

call printObjStack(mournQueue)

head  0xb06e980 cx 18 (18) fmt 10 (10) sz 4092 (4092) myx: 4098 (4098) unmkd
    topx: 14 next:        0x0 free:        0x0

We noticed that free and next are 0x0 while the others are not...

Finally we saw there is isValidObjStack(), that gave us the following results:

call isValidObjStack(markStack) => 1

call isValidObjStack(weaklingStack) => 0
p objStackInvalidBecause = "marking but page is unmarked"

call isValidObjStack(mournQueue) => 0
p objStackInvalidBecause = "marking but page is unmarked"


So we assume that the stack creation is wrong? We are a bit lost in here.

Guille and Pablo



--
_,,,^..^,,,_
best, Eliot




--
_,,,^..^,,,_
best, Eliot




--
_,,,^..^,,,_
best, Eliot






--
_,,,^..^,,,_
best, Eliot





--
_,,,^..^,,,_
best, Eliot



--
_,,,^..^,,,_
best, Eliot



--
_,,,^..^,,,_
best, Eliot

Reply | Threaded
Open this post in threaded view
|

Re: Ephemerons and VM crash

Eliot Miranda-2
 
Hi Guille,


On May 26, 2016, at 4:32 AM, Guille Polito <[hidden email]> wrote:

Hi Eliot,

- Regarding yesterday's email: yesterday I took the time with Pablo to read and understand the entire scavenging and I understood that my first email was nonsense :).

* We were thinking on enhancing the comment in #scavengeUnfiredEphemeronsInRememberedSet to explain how the remembered set is managed. It was a bit unclear at first for us that the ephemerons to fire were being swapped to occupy the first places, which was important afterwards for firing them.

Yes please.  Feel free to commit (or send me a patch).

* Also, there is something that we think is a bug: in the whole ephemeron processing during scavenging, we check always if the key has survived the scavenge. However, we never check if the key is old. If the key is old it looks like the algorithm is treating the ephemerons as ephemerons to fire, while it should not, isn't it?

Look at isScavengeSurvivor, it says a survivor is either not young (hence old) or in future space.  So it does the right thing.  But it's great that you're criticizing the code.  In needs proper review.  You've already found several bugs and we're making good progress.

I hope today to get your example with 200k ObjectFinalizers working.  It works with 50k. Of course the difference is really important as I suspect that with 200k there is tenuring but with only 50k there is none (Eden is large at 4mb).  Ah, I remember I found I could get the same crash with only 100k elements but not with 75k.  So the smallest case we should investigate is 100k.

I might try running with a smaller eden to see if I can provoke the bug faster.  The 100k case in the simulator is slow enough (~ 5 mins) to be annoying and tempt me into picking up my guitar or some such :-/.

- Regarding the images to test, I'm actually using a plain Pharo image with your changes (+ modifications in the class builder to create Ephemeric classes). The snippets that I share crash those images.

I'll try to push these changes into Pharo to make it even easier to reproduce.

- Then, about the finalizer is just a facility as you say. The important thing is that with those snippets I can reproduce the crashes with almost 100% probability and no manual intervention :).

Guille

-------- Original Message --------
 




On Wed, May 25, 2016 at 10:46 AM, Eliot Miranda <[hidden email]> wrote:
Hi Guille,

On Wed, May 25, 2016 at 10:43 AM, Eliot Miranda <[hidden email][hidden email]> wrote:
Hi Guille,

    first, v important. can you put the image(s) you're working with somewhere I can download, and send me a URL?  I can be more helpful if I can run the tests too.


On Wed, May 25, 2016 at 6:24 AM, Guille Polito <[hidden email][hidden email]> wrote:
 
Hi again!

I snip most of the old thread to keep the relevant.

-------- Original Message --------
1 - The mourn table is being collected because it is not marked upon creation
===============

I decided to move the assignment of marking out into markObjects:

Ok, it looks good for me, I did not know what would be a good place to put them without breaking something :)


2 - Accesses to unscannedEphemerons are unaligned
===============

Looks good.  I've changed all the ephemeron methods to use bytesPerOop, cuz the ephemeron queue contains oops, but your fix is just as good.

cool! I thought so that would be the best.
 


3 - While debugging I also had a DNU #intAt: in #primitiveIntegerAt
==========================

Question: besides reviewing, how should I commit these fixes? one issue per commit? all at once?

Typically all at once.  What I try and do is commit a large set of related changes in a single commit, so I wouldn't mix the fixes to ephemerons with, say, fixes to time.  But equally I'd commit any changes I made to help debugging in the same commit.  And all rules are made to be broken ;-).
ok!

Hey,  _thank you_!  This is really great and ephemerons won't be usable without testing like this.  I'm in your debt.

Ha, there is no debt (or at least it would be in the opposite direction).

Now I'm working on another crash that I can easily reproduce using this script:

Smalltalk supportsQueueingFinalization: true.
e := (1 to: 200000) collect: [ :i | Ephemeron key: (ObjectFinalizer receiver: 'test', 'asd' selector: #logCr) value: Object new ].
Smalltalk garbageCollect.

Why the extra level of indirection?  The Ephemeron /is/ the object finaliser.  It can send finalise to its key.  Why bother with the extra level of wrapper?  It's wasteful; especially if we're attaching lots of ephemerons to things cuz we want to finalise something that has lots of instances.

Ignore this stupid question.  I see it's an old facility.  And it is a helpful example :-).  Sorry for the noise!
 
 
While debugging that in the simulator, we saw with clement that during #fireEphemeronsInRememberedSet, #fireEphemeronsOnEphemeronList we are always firing, tenuring and scavenging ephemerons regardless they are marked or not.

fireEphemeronsInRememberedSet
[SNIP]
         coInterpreter fireEphemeron: ephemeron.
         manager
            storePointerUnchecked: 0
            ofObject: ephemeron
            withValue: (self copyAndForward: (manager keyOfEphemeron: ephemeron)).
        (self scavengeReferentsOf: ephemeron)
[SNIP]

So I put an assertion to check that the key of the ephemeron is marked when scavenging and it is not always the case...
         self assert: (manager isMarked: (manager keyOfEphemeron: ephemeron)).

Is the code wrong? or I'm missing something?

The implementation of ephemerons in the scavenger and the implementation in mark-sweep are quite different.

Remember that what we're trying to do is find out if an object is only referenced from the transitive closure of ephemerons or not.  And the way that we do that in both cases is avoid processing ephemerons whose keys are not yet reachable from the roots until all objects reachable from the roots have been reached.  We do this by putting "unscanned" ephemerons in a queue, saving them until later.

In the scavenger the roots are the objects in the remembered set, the interpreter state (newMethod etc) and the stack zone, and GC is performed by copying all objects reachable from these roots in past and new spaces into future space, possibly tenuring overflowed objects into old space.  In the mark-sweep the roots are the specialObjectsArray, the interpreter state (newMethod etc) and the stack zone.

In the scavenger (a copying collector) this means that when we process the unscanned ephemerons their keys will either have been copied into future space or tenured to old space, in which case they were reachable from the roots, or they will not have been copied yet, in which case they are reachable only from ephemerons.  So in the scavenger marking is irrelevant; in fact /no/ objects should be marked when scavenging.  The important thing is whether a key is in past and new spaces or is in future and old spaces.

In the mark-sweep this means that when we process the unscanned ephemerons their keys will either be marked, in which case they were reachable from the roots, or unmarked, in which case they are reachable only from ephemerons. So the assert for marked-ness only makes sense in ephemeron processing in the mark-sweep collector.

HTH

 
Guille


 


Guille

-------- Original Message --------
I see, I probably I need to leave the simulator run for a longer time... Because running the image using the assertion VM took some time to crash, it may be the same case.

-------- Original Message --------
Some advance on my side,

- Yes, apparently the image was already broken. I started a new one and I can open it with the simulator. It seems that as soon as I activate Ephemerons and I have alive ephemerons the heap corrupts (and since saving the image fires a gc...).

- Then, to catch the moment where the image breaks, I prepared my image for debug with the following script:

Smalltalk snapshotPrimitive.
f := ObjectFinalizer receiver: 'Hello' selector: #logCr.
d := EphemeronDictionary new.
d at: f put: f.
f := nil.
Smalltalk supportsQueueingFinalization: true.
1 to: 2 do: [ :i | Smalltalk garbageCollect].

Then I opened it with the simulator (latest version: VMMaker.oscog-tfel.1864) and I had several assertion failures in

keyOfEphemeron: objOop
    "Answer the object the ephemeron guards.  This is its first element."
    self assert: ((self isNonImmediate: objOop) and: [self isEphemeron: objOop]).
    ^self fetchPointer: 0 ofObject: objOop

followedKeyOfEphemeron: objOop
    "Answer the object the ephemeron guards.  This is its first element."
    self assert: ((self isNonImmediate: objOop) and: [self isEphemeron: objOop]).
    ^self followOopField: 0 ofObject: objOop

So, I noticed the following:

- Ephemerons in the unscanned list are marked, fired, and the format of ephemerons is changed to a non-ephemeron format
- But they are not removed from the unscanned list
- then the mark #markAllUnscannedEphemerons fails tons of assertions because it expects ephemeron objects and the (not-anymore-)ephemeron is still there in the unscanned list

     self markInactiveEphemerons ifFalse:
        [self fireAllUnscannedEphemerons].
     self markAllUnscannedEphemerons

The strange thing is that if I ignore the assertions by proceeding, the thing continues normally and it the image does not break...

So I then compiled the VM, with the latest version, and in trace mode to compare. I attach the log of the stdout. But here there is a résumé

* First I see the same assertions I saw in the simulator

(isNonImmediate(objOop)) && (isEphemeron(objOop)) 41155

(isNonImmediate(objOop)) && (isEphemeron(objOop)) 55477

* Then there are tons of repetitions of the same assertion:

!(((GIV(mournQueue) != GIV(nilObj)) && (isonObjStack(anEphemeronOrWeakArray, GIV(mournQueue))))) 60030

Which I understand it means that we are adding to the mourn queue all the time the same object. Which is strange because it is a WeakArray with a single slot which turns out to be nil (at least at the moment I inspected it).

* Then something breaks (!!) Looks like we are trying to mark a forwarder. And then I have a follow up of assertions which are probably caused by the same thing.

!(isForwarded(field)) 49489

(classIndexOf(objToScan)) > (isForwardedObjectClassIndexPun()) 49537

(classIndex <= (tagMask())) || (classIndex >= (arrayClassIndexPun())) 49403

objCouldBeClassObj(classObj) 49415

addressCouldBeOop(objOop) 59997

addressCouldBeObj(topOfObjStack((0 == (fetchPointerofObject(ObjStackTopx, objStack)) ? fetchPointerofObject(ObjStackNextx, objStack) : objStack))) 60002

addressCouldBeOop(objOop) 59997

* Finally there are some more

!(isForwarded(field)) 49489

until a segmentation fault arrives while executing a fullGC

../results/pharo[0x80ceaa4]
../results/pharo[0x80cfde4]
../results/pharo(fullGC+0x9b)[0x80c95ea]
../results/pharo[0x80e2d49]
../results/pharo[0x80e5f4a]
../results/pharo[0x80ebca3]
../results/pharo(ceStackOverflow+0xd7)[0x809356a]
[0x91002c6]
../results/pharo(interpret+0x128)[0x8082b38]
../results/pharo(main+0x23a)[0x810a7db]
/lib/i386-linux-gnu/i686/cmov/libc.so.6(__libc_start_main+0xf3)[0xf73b5a63]
[0x1000]

I'll continue digging a bit more,
Guille

-------- Original Message --------
Hi Guille,

On Thu, May 19, 2016 at 12:53 AM, Guille Polito <[hidden email][hidden email]> wrote:


-------- Original Message --------
Hi Guille,

On Wed, May 18, 2016 at 12:46 AM, Guille Polito <[hidden email][hidden email]> wrote:
Hello,

-------- Original Message --------
 


Hi Guille, Hi Pablo (and welcome),

On Tue, May 17, 2016 at 8:37 AM, Guille Polito <[hidden email][hidden email]> wrote:

Hi Eliot, list

I'm working here with Pablo (Tesone) on moving forward the Ephemeron implementation.


Where's "here"?   Are you in Lille?
Yup. Pablo is a new Phd student here :)


We first installed Eliot's changeset, added a #mourn method and an EphemeronDictionary collection, and then started testing something like this:

f := ObjectFinalizer receiver: 'Hello' selector: #logCr.
d := EphemeronDictionary new.

d at: f put: f.

f := nil.
Smalltalk garbageCollect.

So this looks like something simulate able.  Are you able to use the simulator?  If not, why not? 

For some reason I have that bytesToShift when opening the image is negative.

That is to be expected.  In the real VM the heap is located somewhere well above the bottom of the address space, typically above the program code.  In the simulator the heap is located either at 0 (an interpreter or stack VM) or immediately above the code zone (in a Cogit VM).  So when an image that has been saved on the real VM is loaded into the simulator all oops have to be adjusted down and hence bytesToShift is negative.


    bytesToShift := objectMemory memoryBaseForImageRead - oldBaseAddr.  "adjust pointers for zero base address"

So I cannot continue loading because addresses become negative and I have "Improper Store into indexable object kind of errors".

Can you post a back trace?  Where is this happening? 
I'm working in latest pharo (5.0), using the same source code that is used in the Pharo CI jobs to build the VM (pharo branded). I would expect this branch to be stable, maybe it could be a bit behind your branch of development, but I think Esteban merges regularly.

Esteban is very busy trying to get UFFI working, and having just got Pharo 5 released and working on Pharo 6 to be released.  I'd prefer it if you used the tip of VMMaker.oscog.  Further, that's the package that Clément is using and he's your expert source of local help.

 

I'm executing this,

vm := StackInterpreterSimulator newWithOptions: #(#ObjectMemory #Spur32BitMemoryManager ).
vm desiredNumStackPages: 8. "Makes simulation faster by creating fewer stack pages."
(vm openOn: '/home/guille/Pharo/temp/ephemerons/broken.image')
    openAsMorph; run.

And here is the trace. The error happens while swizzling.

LittleEndianBitmap(Object)>>error:
LittleEndianBitmap(Object)>>errorImproperStore
LittleEndianBitmap(Object)>>at:put:
Spur32BitMMLESimulator>>longAt:put:
Spur32BitMMLESimulator(SpurMemoryManager)>>swizzleFieldsOfObject:
Spur32BitMMLESimulator(SpurMemoryManager)>>adjustAllOopsBy:
Spur32BitMMLESimulator(SpurMemoryManager)>>initializeObjectMemory:
StackInterpreterSimulatorLSB(StackInterpreter)>>initializeInterpreter:
[ self initializeInterpreter: bytesToShift ] in StackInterpreterSimulatorLSB(StackInterpreterSimulator)>>openOn:extraMemory: in Block: [ self initializeInterpreter: bytesToShift ]

 So the image is presumably already broken?  Looks to me like it was corrupted when saved.   Can you reproduce the problem that led to this image being broken from a working image?  
What version of VMMaker.oscog are you using?
Monticello reports VMMaker.oscog-eem.1855.
However, I'm not sure I have the right number, because source code is now in git.

It would be so great if you could use Monticello, not git, and hence be able to contribute back.
 
  Are you running in Pharo or Squeak?  If you're in Lille you could perhaps visit Clément's office and get him to take a look.  Clément, would that be ok?
I'll maybe see it with clement, but today I have to work on some other project...
When debugging the VM there are two main levels of support, one is the simulator where there is maximum support for debugging:
- asserts on all the time
- arbitrary breakpoints
- attempting every GC in a copy of the heap before doing the real GC so that bugs in the GC can be investigated without needing to construct a reproducible case after a crash
- the Smalltalk environment to inspect and browse

The next level is the assert and debug VMs.  If you look in the build directories on the Cog svn branch you'll see that all of them build three VMs, a production VM with maximum optimisation and asserts excluded, an assert VM with -O1 and asserts enabled, and a debug VM with -O0 and asserts enabled.  So if you either don't see the bug in the simulator, or the simulator is too slow for the case being examined, or if the bug doesn't show up in the simulator (the worst of all worlds), build both assert and debug VMs and run with the assert VM first.

Well so far we were using a VM compiled for debug with a graphical C debugger. It was not so bad. However, I cannot say I'm missing a better debugger.

"Compiled for debug" is vague.  Do you mean it is compiled with -g -O0, or in addition is compiled with -g -O0 -DDEBUGVM=0 -DNDEBUG=1?
 
That would exactly be:

{ '-g3'.
        '-O0'.
        '-msse2'.
        '-D_GNU_SOURCE'.
        '-DITIMER_HEARTBEAT=1'.
        '-DNO_VM_PROFILE=1'.
        '-DDEBUGVM=1'.
        '-DNDEBUG' }

And Esteban told me to compile it with '-UNDEBUG' to get more information, but I did not try this yet.

Yes.  NDEBUG is what turns off asserts.  You don't have a proper assert or debug VM unless you do not use -NDEBUG.

Thanks!

Note that there is a heap leak checker which can be enabled both in the simulator and the assert and debug VMs.  See the checkForLeaks method and the -leakcheck argument.

ok!

Without the simulator or the assert and debug VMs you are flying blind.  It is /really/ productive to use the simulator for debugging, provided the bug is reproducible within a short amount of time, as for example your case is above.

Ok, gotcha! By this afternoon I'll have some news probably.

Thanks a lot!





However, as soon as we garbage collect twice, we have a VM crash. We started debugging the VM to see if we could have some more clues.

The first thing we noticed is that the first time the GC runs, the mournQueue is nil. This is of course expected because the new finalization mechanism was not active and then there was no need to create the mournQueue. We saw that the mournQueue is actually created in a lazy fashion when putting queuing a mourned object (I refer myself to #queueMourner: and #ensureRoomOnObjStackAt:). So the second time the GC passes, the mournQueue is there. So far ok, but still crashing.

The crash happens in the call to

markAndTraceObjStackandContents(GIV(mournQueue), 1);

after the

    if (!markAndTraceContents) {
       return;
    }

But when understanding why, it starts being less clear to us :). We used the printObjStack() function and we saw that:

call printObjStack(markStack)
call printObjStack(weaklingStack)

and we saw in the console some output that makes sense. However, printing the mournQueue in the same manner produces some strange output

call printObjStack(mournQueue)

head  0xb06e980 cx 18 (18) fmt 10 (10) sz 4092 (4092) myx: 4098 (4098) unmkd
    topx: 14 next:        0x0 free:        0x0

We noticed that free and next are 0x0 while the others are not...

Finally we saw there is isValidObjStack(), that gave us the following results:

call isValidObjStack(markStack) => 1

call isValidObjStack(weaklingStack) => 0
p objStackInvalidBecause = "marking but page is unmarked"

call isValidObjStack(mournQueue) => 0
p objStackInvalidBecause = "marking but page is unmarked"


So we assume that the stack creation is wrong? We are a bit lost in here.

Guille and Pablo



--
_,,,^..^,,,_
best, Eliot




--
_,,,^..^,,,_
best, Eliot




--
_,,,^..^,,,_
best, Eliot






--
_,,,^..^,,,_
best, Eliot





--
_,,,^..^,,,_
best, Eliot



--
_,,,^..^,,,_
best, Eliot



--
_,,,^..^,,,_
best, Eliot

Reply | Threaded
Open this post in threaded view
|

Re: Ephemerons and VM crash

johnmci
 
Morning 

I'm curious if anyone has studied the time taken to morn & finalize a weak object. I ask this because in my past work with VW and VA the most interesting GC issues I tackled related to how much time finalization took versus how long the engineer thought it might take.  

Thought on this should relate to normal processing, or situations where the engineer is trying to force the finalization via interaction with the GC logic. 




On Thursday, 26 May 2016, Eliot Miranda <[hidden email]> wrote:
 


--
===========================================================================
John M. McIntosh. Corporate Smalltalk Consulting Ltd https://www.linkedin.com/in/smalltalk
===========================================================================

Reply | Threaded
Open this post in threaded view
|

Re: Ephemerons and VM crash

Andres Valloud-4
 
Yes, see the 2015 IWST paper re: Linked Weak Reference Arrays.  You can
download the paper using the link here:

http://www.esug.org/wiki/pier/Conferences/2015/International-Workshop-IWST_15/IWST15

See also the short version:

http://blogten.blogspot.com/2015/07/linked-weak-reference-arrays-paper.html

Andres.

On 5/28/16 10:14 , John McIntosh wrote:

>
>
>
>
> Morning
>
> I'm curious if anyone has studied the time taken to morn & finalize a
> weak object. I ask this because in my past work with VW and VA the most
> interesting GC issues I tackled related to how much time finalization
> took versus how long the engineer thought it might take.
>
> Thought on this should relate to normal processing, or situations where
> the engineer is trying to force the finalization via interaction with
> the GC logic.
>
>
>
>
> On Thursday, 26 May 2016, Eliot Miranda <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>
>
>
>
> --
> ===========================================================================
> John M. McIntosh. Corporate Smalltalk Consulting
> Ltd https://www.linkedin.com/in/smalltalk
> ===========================================================================
>