Garbage Collection (was Re: [Pharo-dev] Discussing the roadmap)

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Garbage Collection (was Re: [Pharo-dev] Discussing the roadmap)

Ben Coman
 
Hi Eliot, Clement,

On 7 July 2017 at 00:41, Eliot Miranda <[hidden email]> wrote:

>
> > - Better support for large heaps (GC tuning API, incremental GC).
> > Pharo 64 bit is now able to manage large heap. However better
> > performance can be proposed by offering better settings for the
> > different GC zone.
>
> The most important thing here is the incremental GC.  Spur has a generation
> scavenger that collects garbage in newly created objects (new space),
> and a mark-compact collector that collects and compacts garbage in old space.
>
> Right now on my 2.3GHz MacMini doing normal development, the generation
> scavenger causes pauses of 1ms or less, and the mark-compact collector
> than causes pauses of around 200ms.  Both account for about 0.75% of
> entire execution time (1.5% total), so the scavenger is invoked frequently
> and the pauses that it creates are not noticeable.  But the occasional
> pauses created by the mark-compact collector /are/ noticeable,
> especially in games and music.
>
> The idea for the incremental collector is to implement a mark-sweep or
> mark-sweep-compact collector for old space that works incrementally,
> doing a little bit of work on each invocation, probably immediately after a scavenge.
> It is intended to avoid the long pauses caused by the non-incremental
> mark-compact collector and make the system more suitable for games, music, etc.

Reading http://www.mirandabanda.org/cogblog/2013/09/13/lazy-become-and-a-partial-read-barrier/
  "An alternative implementation, oft-used in Lisp systems, is to add a
  read barrier to all object access, and mark objects as forwarders.
  This can be used to implement a >>>lazy copying garbage collection<<<<
  where objects are copied from one semi-space to another in parallel to the
  main program (the “mutator”).  To become, or move an object one replaces the
  object’s header or first field with a forwarding pointer to the desired target
  or copy in a new location, marking the “corpse” as forwarded.  The program
  checks the forwarded flag on each access.  If there is hardware support,
  as in a Lisp machine, this can work well.  But without hardware support,
  and like the object table representation, it has costs, slowing down
  program execution due to the scattering of forwarding checks and
  forwarding pointer follows throughout program execution."

I'm curious... Given we now have forwarders with Spur, are we
already sufficiently paying the cost of forwarding checks that a lazy copying
garbage collector might be a feasible form of incremental garbage collection?

I presume "parallel to the main program" means garbage collection occuring
in a separate thread to the main vm thread, potentially resulting in
very low main program pause times for garbage collection.

I found this a useful summary of the terminology...
* https://www.dynatrace.com/resources/ebooks/javabook/reduce-garbage-collection-pause-time/
and I'm curious how our planned Incremental CG fits those categories.

That article got me contemplating our performance constraint of the VM only
operating in only in a single native thread. Even though GC is a only
a few percent of performance, I wondered what a concurrent GC might
look like for us.

I found this video describing concurrent GC in Go (ignore first 11:20
and the second half was not so interesting)
* https://pusher.com/sessions/meetup/the-realtime-guild/golangs-realtime-garbage-collector
where they present some interesting charts of their concern with latency pauses.
(btw they reference a multi-language GC latency benchmark
  https://github.com/WillSewell/gc-latency-experiment)

And for balance of that I found...
https://blog.plan99.net/modern-garbage-collection-911ef4f8bd8e


And then my mind wandered around implementation details of concurrent
garbage collection.
To organise and quiet my thoughts I needed to put pen to paper, so I
thought sharing that
might stir thoughts for others.  Probably naive and please excuse the
brain dump format...


Considering two threads...
* Main program thread "MP"
* Garbage collection thread "GC"
with object-space shared between them, consisting of objects split in
object-header & object-body...

1. Only "MP" mutates the object-body, updating slots creating new edges
   in the object-graph, and relocating objects in memory using forwarders.
   This rule avoids potential race conditions without needing to add
   synchronisation code affecting performance of "MP".


2. Concurrently "GC" performs a Marking Phase by following the object-graph
   tricolour tagging objects gray & black. it needs to mutate the object-header,
   in a synchronised way something like...
     a. Load object-header from shared object-space into local variable
          H <== object-header
     b. Modify header into another local variable
          H' <== H + updated GC color bits
     c. Atomic compare-and-swap H' back into object-space
          object-header <== if H then H'
     "MP" gets priority. Conflicts presumed rare.


Then considering object mutation by "MP" concurrent/overlapped with
"GC" marking...

3. TLDR; see 4.
   In old-space if 'mutated' object is linked to a 'target' object
     a. if 'mutated' isBlack, "MP" marks 'target' gray (to be later
processed by "GC").

     b. if 'mutated' isGray, overlapped 'mutated' gray->black by "GC"
while 'target'
        remains white would break tricolor invariant (is it a credible
case?), options...
          i. mark 'target' gray, the state anyway in case 'mutated'
gray -> black
          ii. re-mark 'mutated' gray, i.e. normally gray->gray, and
rarely black->gray
          iii. optimisticly just check after mutation if color changed
(cached reads
               cheaper than writes and conflicts presumed rare) then
set color to gray.

     c. if 'mutated' isWhite, options...
          i. do nothing and let "GC" reach 'target' normally, as long
as overlapped white->black cannot occur
          ii. if white->black possible, do same as (3.b) "MP" marks
'target' gray, to be later processed by "GC".


4. Overall simplification of (3.) might be...
     "MP" checks for any color change during mutation, and only then
marks 'mutated' object gray.
     How expensive would such a check be?  Presumed marking is infrequent and
     can be done safely like (2.)


5. Eden/survivor space is ignored by "GC" thread. No point in adding
work to the gray set
    until survivors filtered. Stop the world scavenging done as normal by "MP",
    only marking objects gray when they are moved to old-space.

    Post scavenging options...
       a. Resume the world and leave it for "GC" to process these
recently grayed objects.

       b. Keep world stopped for "MP" to complete marking, emptying gray set to
           transition to Sweep Phase, at which point "MP" resumes the world.
           It doesn't matter if subsequently objects are added to the gray set,
           since existing white objects can never again be referenced by "MP".


6. Sweeping can be safely done by "GC" since white-objects are
unreachable from "MP".
   "GC" can also take time to determine an optimum page P to compact and
   then (per 1.) passes to "MP" via a relocation-queue the objects to be
   relocated using forwarding pointers. "GC" could even spend extra time
   to determine the optimum relocation-destination without impacting the
   performance of "MP".  When "MP" empties the relocation-queue, "GC" starts
   on the next Marking Phase.


7. Now a question remains about multi-threaded flattening of
forwarding pointers.
    If two threads simultaneously perform an identical transform from...
       someObject-slot --> forwarder-b --> finalObject-c
    to...
       someObject-slot --> finalObject-c
    does it matter that these operation may be done twice overlapping?

   Options...

     a. One mitigation could be for "GC" to identify forwarders to be flattened
        and queue them for "MP" to process (reuse the compaction-queue).  This
        is work that "MP" would need to do anyway, but brings it forward to be
        dealt with at a convenient time.

     b. I guess "MP" and "GC" could play nice together if when encountering
        a slot containing with a forwarding pointer they both do
something similar to (2.) like...
          i. Load object-slot from shared object-space into local
variable S <== slot
         ii. Set local variable S' <== flattened/followed pointer
        iii. Atomic compare-and-swap S' back into object-space,
(object-slot <== if S then S')
             - "MP" gets priority but presume conflicts rare anyway.
             - While this violates rule (1.) the presumed frequency of
               encountering forwarding pointers is low for "MP", so performance
               should not be affected. Indeed by "GC" pre-emptively
flattening forwarders
               the frequency "MP" sees reduces.


8. The release of the compacted page back to the OS is held up by
forwarding pointers.
   Forwarders are part of the graph followed in the Marking Phase they get
   marked gray/black just like objects if they are referenced.  After
forwarders are
   fully flattened they are skipped by Marking and end up marked white
and released
   just like any other object. Once all forwarders are released, the
page is released
   to the OS.   If "GC" can effectively flatten pointers concurrently
with "MP" during its
   normal Marking Phase, then pages would be released back to the OS
in a timely manner.



Now one thing I am curious about how the GC tri-color marking is implemented.
At https://clementbera.wordpress.com/2014/01/16/spurs-new-object-format
it describes how Spur's object header has three bits for GC.
* isGray
* isRemembered
* isMarked  (I presume this means marked "black")
Do the bits imply the gray set is not stored in a separate data
structure on the heap, but rather distributed in-place, which I guess
would require multiple passes through memory to grow the gray set?


So this is not an area I'm set up to seriously work on, but I remain curious
and hopefully its a useful seed discussion for others.
cheers -ben


P.S. At https://clementbera.wordpress.com/2017/09/19/vm-learning-memory-management/
nice to hear that you have Sophie (I presume) continuing with the VM.
Reply | Threaded
Open this post in threaded view
|

Re: Garbage Collection (was Re: [Pharo-dev] Discussing the roadmap)

Clément Bera-4
 
hum...

The mail is very long so I did not read all of it.

Here are some ideas/things to say on the tip of my head:
- Changing an object to a forwarding object is non atomic (we need to maintain at least stack invariant)
- To decrease the pauses in full GC I have 2 plans:
-- incremental marking (split the mark pause in multiple small pauses): Not implemented right now.
-- selective compaction (compacts only part of the heap instead of all the heap and sweeps the rest, similar to G1, but uses forwarders instead of lots of card marking metadata): I implemented SpurSweeper which only sweeps but works very well.
- Currently the marking phase removes all forwarders and I would like incremental marking to maintain the same invariant (forwarders are always white).
- In general, Concurrent marking and sweeping have been implemented everywhere, but no concurrent compaction. For compaction you can make it selective (compact only part of the heap and the part which needs it the most) like I suggest and like in G1, which decreases considerably compaction pause time. Work on concurrent compaction is state of the art and not in production everywhere, see for example 
And I will watch at talk on this topic tomorrow for the Android GC.
- Some runtime, especially now with small servers being rent, are running on single core machines. So we need the low-pause GC to work incrementally aside from concurrently. So step 1 incremental GC. Step 2 concurrent marking and sweeping with low-pause for scavenge/compaction.

No more time right now. 

On Sun, Dec 3, 2017 at 6:33 AM, Ben Coman <[hidden email]> wrote:

Hi Eliot, Clement,

On 7 July 2017 at 00:41, Eliot Miranda <[hidden email]> wrote:
>
> > - Better support for large heaps (GC tuning API, incremental GC).
> > Pharo 64 bit is now able to manage large heap. However better
> > performance can be proposed by offering better settings for the
> > different GC zone.
>
> The most important thing here is the incremental GC.  Spur has a generation
> scavenger that collects garbage in newly created objects (new space),
> and a mark-compact collector that collects and compacts garbage in old space.
>
> Right now on my 2.3GHz MacMini doing normal development, the generation
> scavenger causes pauses of 1ms or less, and the mark-compact collector
> than causes pauses of around 200ms.  Both account for about 0.75% of
> entire execution time (1.5% total), so the scavenger is invoked frequently
> and the pauses that it creates are not noticeable.  But the occasional
> pauses created by the mark-compact collector /are/ noticeable,
> especially in games and music.
>
> The idea for the incremental collector is to implement a mark-sweep or
> mark-sweep-compact collector for old space that works incrementally,
> doing a little bit of work on each invocation, probably immediately after a scavenge.
> It is intended to avoid the long pauses caused by the non-incremental
> mark-compact collector and make the system more suitable for games, music, etc.

Reading http://www.mirandabanda.org/cogblog/2013/09/13/lazy-become-and-a-partial-read-barrier/
  "An alternative implementation, oft-used in Lisp systems, is to add a
  read barrier to all object access, and mark objects as forwarders.
  This can be used to implement a >>>lazy copying garbage collection<<<<
  where objects are copied from one semi-space to another in parallel to the
  main program (the “mutator”).  To become, or move an object one replaces the
  object’s header or first field with a forwarding pointer to the desired target
  or copy in a new location, marking the “corpse” as forwarded.  The program
  checks the forwarded flag on each access.  If there is hardware support,
  as in a Lisp machine, this can work well.  But without hardware support,
  and like the object table representation, it has costs, slowing down
  program execution due to the scattering of forwarding checks and
  forwarding pointer follows throughout program execution."

I'm curious... Given we now have forwarders with Spur, are we
already sufficiently paying the cost of forwarding checks that a lazy copying
garbage collector might be a feasible form of incremental garbage collection?

I presume "parallel to the main program" means garbage collection occuring
in a separate thread to the main vm thread, potentially resulting in
very low main program pause times for garbage collection.

I found this a useful summary of the terminology...
* https://www.dynatrace.com/resources/ebooks/javabook/reduce-garbage-collection-pause-time/
and I'm curious how our planned Incremental CG fits those categories.

That article got me contemplating our performance constraint of the VM only
operating in only in a single native thread. Even though GC is a only
a few percent of performance, I wondered what a concurrent GC might
look like for us.

I found this video describing concurrent GC in Go (ignore first 11:20
and the second half was not so interesting)
* https://pusher.com/sessions/meetup/the-realtime-guild/golangs-realtime-garbage-collector
where they present some interesting charts of their concern with latency pauses.
(btw they reference a multi-language GC latency benchmark
  https://github.com/WillSewell/gc-latency-experiment)

And for balance of that I found...
https://blog.plan99.net/modern-garbage-collection-911ef4f8bd8e


And then my mind wandered around implementation details of concurrent
garbage collection.
To organise and quiet my thoughts I needed to put pen to paper, so I
thought sharing that
might stir thoughts for others.  Probably naive and please excuse the
brain dump format...


Considering two threads...
* Main program thread "MP"
* Garbage collection thread "GC"
with object-space shared between them, consisting of objects split in
object-header & object-body...

1. Only "MP" mutates the object-body, updating slots creating new edges
   in the object-graph, and relocating objects in memory using forwarders.
   This rule avoids potential race conditions without needing to add
   synchronisation code affecting performance of "MP".


2. Concurrently "GC" performs a Marking Phase by following the object-graph
   tricolour tagging objects gray & black. it needs to mutate the object-header,
   in a synchronised way something like...
     a. Load object-header from shared object-space into local variable
          H <== object-header
     b. Modify header into another local variable
          H' <== H + updated GC color bits
     c. Atomic compare-and-swap H' back into object-space
          object-header <== if H then H'
     "MP" gets priority. Conflicts presumed rare.


Then considering object mutation by "MP" concurrent/overlapped with
"GC" marking...

3. TLDR; see 4.
   In old-space if 'mutated' object is linked to a 'target' object
     a. if 'mutated' isBlack, "MP" marks 'target' gray (to be later
processed by "GC").

     b. if 'mutated' isGray, overlapped 'mutated' gray->black by "GC"
while 'target'
        remains white would break tricolor invariant (is it a credible
case?), options...
          i. mark 'target' gray, the state anyway in case 'mutated'
gray -> black
          ii. re-mark 'mutated' gray, i.e. normally gray->gray, and
rarely black->gray
          iii. optimisticly just check after mutation if color changed
(cached reads
               cheaper than writes and conflicts presumed rare) then
set color to gray.

     c. if 'mutated' isWhite, options...
          i. do nothing and let "GC" reach 'target' normally, as long
as overlapped white->black cannot occur
          ii. if white->black possible, do same as (3.b) "MP" marks
'target' gray, to be later processed by "GC".


4. Overall simplification of (3.) might be...
     "MP" checks for any color change during mutation, and only then
marks 'mutated' object gray.
     How expensive would such a check be?  Presumed marking is infrequent and
     can be done safely like (2.)


5. Eden/survivor space is ignored by "GC" thread. No point in adding
work to the gray set
    until survivors filtered. Stop the world scavenging done as normal by "MP",
    only marking objects gray when they are moved to old-space.

    Post scavenging options...
       a. Resume the world and leave it for "GC" to process these
recently grayed objects.

       b. Keep world stopped for "MP" to complete marking, emptying gray set to
           transition to Sweep Phase, at which point "MP" resumes the world.
           It doesn't matter if subsequently objects are added to the gray set,
           since existing white objects can never again be referenced by "MP".


6. Sweeping can be safely done by "GC" since white-objects are
unreachable from "MP".
   "GC" can also take time to determine an optimum page P to compact and
   then (per 1.) passes to "MP" via a relocation-queue the objects to be
   relocated using forwarding pointers. "GC" could even spend extra time
   to determine the optimum relocation-destination without impacting the
   performance of "MP".  When "MP" empties the relocation-queue, "GC" starts
   on the next Marking Phase.


7. Now a question remains about multi-threaded flattening of
forwarding pointers.
    If two threads simultaneously perform an identical transform from...
       someObject-slot --> forwarder-b --> finalObject-c
    to...
       someObject-slot --> finalObject-c
    does it matter that these operation may be done twice overlapping?

   Options...

     a. One mitigation could be for "GC" to identify forwarders to be flattened
        and queue them for "MP" to process (reuse the compaction-queue).  This
        is work that "MP" would need to do anyway, but brings it forward to be
        dealt with at a convenient time.

     b. I guess "MP" and "GC" could play nice together if when encountering
        a slot containing with a forwarding pointer they both do
something similar to (2.) like...
          i. Load object-slot from shared object-space into local
variable S <== slot
         ii. Set local variable S' <== flattened/followed pointer
        iii. Atomic compare-and-swap S' back into object-space,
(object-slot <== if S then S')
             - "MP" gets priority but presume conflicts rare anyway.
             - While this violates rule (1.) the presumed frequency of
               encountering forwarding pointers is low for "MP", so performance
               should not be affected. Indeed by "GC" pre-emptively
flattening forwarders
               the frequency "MP" sees reduces.


8. The release of the compacted page back to the OS is held up by
forwarding pointers.
   Forwarders are part of the graph followed in the Marking Phase they get
   marked gray/black just like objects if they are referenced.  After
forwarders are
   fully flattened they are skipped by Marking and end up marked white
and released
   just like any other object. Once all forwarders are released, the
page is released
   to the OS.   If "GC" can effectively flatten pointers concurrently
with "MP" during its
   normal Marking Phase, then pages would be released back to the OS
in a timely manner.



Now one thing I am curious about how the GC tri-color marking is implemented.
At https://clementbera.wordpress.com/2014/01/16/spurs-new-object-format
it describes how Spur's object header has three bits for GC.
* isGray
* isRemembered
* isMarked  (I presume this means marked "black")
Do the bits imply the gray set is not stored in a separate data
structure on the heap, but rather distributed in-place, which I guess
would require multiple passes through memory to grow the gray set?


So this is not an area I'm set up to seriously work on, but I remain curious
and hopefully its a useful seed discussion for others.
cheers -ben


P.S. At https://clementbera.wordpress.com/2017/09/19/vm-learning-memory-management/
nice to hear that you have Sophie (I presume) continuing with the VM.



--
Clément Béra
Pharo consortium engineer
Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq
Reply | Threaded
Open this post in threaded view
|

Re: Garbage Collection (was Re: [Pharo-dev] Discussing the roadmap)

Tobias Pape
 

> On 04.12.2017, at 08:47, Clément Bera <[hidden email]> wrote:
>
> hum...
>
> The mail is very long so I did not read all of it.
>
> Here are some ideas/things to say on the tip of my head:
> - Changing an object to a forwarding object is non atomic (we need to maintain at least stack invariant)
> - To decrease the pauses in full GC I have 2 plans:
> -- incremental marking (split the mark pause in multiple small pauses): Not implemented right now.
> -- selective compaction (compacts only part of the heap instead of all the heap and sweeps the rest, similar to G1, but uses forwarders instead of lots of card marking metadata): I implemented SpurSweeper which only sweeps but works very well.
> - Currently the marking phase removes all forwarders and I would like incremental marking to maintain the same invariant (forwarders are always white).
> - In general, Concurrent marking and sweeping have been implemented everywhere, but no concurrent compaction. For compaction you can make it selective (compact only part of the heap and the part which needs it the most) like I suggest and like in G1, which decreases considerably compaction pause time. Work on concurrent compaction is state of the art and not in production everywhere, see for example
> Shenandoah Garbage Collector
> A. Shipilev

Note that Shenandoah is intended vor multi-hundereds-of-gigabyte Heaps so, while it is a really cool thing, may be not easy to apply.
Best regards
        -Tobias

> Pause-Less GC for Improving Java Responsiveness
> C. Gracie
> And I will watch at talk on this topic tomorrow for the Android GC.
> - Some runtime, especially now with small servers being rent, are running on single core machines. So we need the low-pause GC to work incrementally aside from concurrently. So step 1 incremental GC. Step 2 concurrent marking and sweeping with low-pause for scavenge/compaction.
>
> No more time right now.
>
> On Sun, Dec 3, 2017 at 6:33 AM, Ben Coman <[hidden email]> wrote:
>
> Hi Eliot, Clement,
>
> On 7 July 2017 at 00:41, Eliot Miranda <[hidden email]> wrote:
> >
> > > - Better support for large heaps (GC tuning API, incremental GC).
> > > Pharo 64 bit is now able to manage large heap. However better
> > > performance can be proposed by offering better settings for the
> > > different GC zone.
> >
> > The most important thing here is the incremental GC.  Spur has a generation
> > scavenger that collects garbage in newly created objects (new space),
> > and a mark-compact collector that collects and compacts garbage in old space.
> >
> > Right now on my 2.3GHz MacMini doing normal development, the generation
> > scavenger causes pauses of 1ms or less, and the mark-compact collector
> > than causes pauses of around 200ms.  Both account for about 0.75% of
> > entire execution time (1.5% total), so the scavenger is invoked frequently
> > and the pauses that it creates are not noticeable.  But the occasional
> > pauses created by the mark-compact collector /are/ noticeable,
> > especially in games and music.
> >
> > The idea for the incremental collector is to implement a mark-sweep or
> > mark-sweep-compact collector for old space that works incrementally,
> > doing a little bit of work on each invocation, probably immediately after a scavenge.
> > It is intended to avoid the long pauses caused by the non-incremental
> > mark-compact collector and make the system more suitable for games, music, etc.
>
> Reading http://www.mirandabanda.org/cogblog/2013/09/13/lazy-become-and-a-partial-read-barrier/
>   "An alternative implementation, oft-used in Lisp systems, is to add a
>   read barrier to all object access, and mark objects as forwarders.
>   This can be used to implement a >>>lazy copying garbage collection<<<<
>   where objects are copied from one semi-space to another in parallel to the
>   main program (the “mutator”).  To become, or move an object one replaces the
>   object’s header or first field with a forwarding pointer to the desired target
>   or copy in a new location, marking the “corpse” as forwarded.  The program
>   checks the forwarded flag on each access.  If there is hardware support,
>   as in a Lisp machine, this can work well.  But without hardware support,
>   and like the object table representation, it has costs, slowing down
>   program execution due to the scattering of forwarding checks and
>   forwarding pointer follows throughout program execution."
>
> I'm curious... Given we now have forwarders with Spur, are we
> already sufficiently paying the cost of forwarding checks that a lazy copying
> garbage collector might be a feasible form of incremental garbage collection?
>
> I presume "parallel to the main program" means garbage collection occuring
> in a separate thread to the main vm thread, potentially resulting in
> very low main program pause times for garbage collection.
>
> I found this a useful summary of the terminology...
> * https://www.dynatrace.com/resources/ebooks/javabook/reduce-garbage-collection-pause-time/
> and I'm curious how our planned Incremental CG fits those categories.
>
> That article got me contemplating our performance constraint of the VM only
> operating in only in a single native thread. Even though GC is a only
> a few percent of performance, I wondered what a concurrent GC might
> look like for us.
>
> I found this video describing concurrent GC in Go (ignore first 11:20
> and the second half was not so interesting)
> * https://pusher.com/sessions/meetup/the-realtime-guild/golangs-realtime-garbage-collector
> where they present some interesting charts of their concern with latency pauses.
> (btw they reference a multi-language GC latency benchmark
>   https://github.com/WillSewell/gc-latency-experiment)
>
> And for balance of that I found...
> https://blog.plan99.net/modern-garbage-collection-911ef4f8bd8e
>
>
> And then my mind wandered around implementation details of concurrent
> garbage collection.
> To organise and quiet my thoughts I needed to put pen to paper, so I
> thought sharing that
> might stir thoughts for others.  Probably naive and please excuse the
> brain dump format...
>
>
> Considering two threads...
> * Main program thread "MP"
> * Garbage collection thread "GC"
> with object-space shared between them, consisting of objects split in
> object-header & object-body...
>
> 1. Only "MP" mutates the object-body, updating slots creating new edges
>    in the object-graph, and relocating objects in memory using forwarders.
>    This rule avoids potential race conditions without needing to add
>    synchronisation code affecting performance of "MP".
>
>
> 2. Concurrently "GC" performs a Marking Phase by following the object-graph
>    tricolour tagging objects gray & black. it needs to mutate the object-header,
>    in a synchronised way something like...
>      a. Load object-header from shared object-space into local variable
>           H <== object-header
>      b. Modify header into another local variable
>           H' <== H + updated GC color bits
>      c. Atomic compare-and-swap H' back into object-space
>           object-header <== if H then H'
>      "MP" gets priority. Conflicts presumed rare.
>
>
> Then considering object mutation by "MP" concurrent/overlapped with
> "GC" marking...
>
> 3. TLDR; see 4.
>    In old-space if 'mutated' object is linked to a 'target' object
>      a. if 'mutated' isBlack, "MP" marks 'target' gray (to be later
> processed by "GC").
>
>      b. if 'mutated' isGray, overlapped 'mutated' gray->black by "GC"
> while 'target'
>         remains white would break tricolor invariant (is it a credible
> case?), options...
>           i. mark 'target' gray, the state anyway in case 'mutated'
> gray -> black
>           ii. re-mark 'mutated' gray, i.e. normally gray->gray, and
> rarely black->gray
>           iii. optimisticly just check after mutation if color changed
> (cached reads
>                cheaper than writes and conflicts presumed rare) then
> set color to gray.
>
>      c. if 'mutated' isWhite, options...
>           i. do nothing and let "GC" reach 'target' normally, as long
> as overlapped white->black cannot occur
>           ii. if white->black possible, do same as (3.b) "MP" marks
> 'target' gray, to be later processed by "GC".
>
>
> 4. Overall simplification of (3.) might be...
>      "MP" checks for any color change during mutation, and only then
> marks 'mutated' object gray.
>      How expensive would such a check be?  Presumed marking is infrequent and
>      can be done safely like (2.)
>
>
> 5. Eden/survivor space is ignored by "GC" thread. No point in adding
> work to the gray set
>     until survivors filtered. Stop the world scavenging done as normal by "MP",
>     only marking objects gray when they are moved to old-space.
>
>     Post scavenging options...
>        a. Resume the world and leave it for "GC" to process these
> recently grayed objects.
>
>        b. Keep world stopped for "MP" to complete marking, emptying gray set to
>            transition to Sweep Phase, at which point "MP" resumes the world.
>            It doesn't matter if subsequently objects are added to the gray set,
>            since existing white objects can never again be referenced by "MP".
>
>
> 6. Sweeping can be safely done by "GC" since white-objects are
> unreachable from "MP".
>    "GC" can also take time to determine an optimum page P to compact and
>    then (per 1.) passes to "MP" via a relocation-queue the objects to be
>    relocated using forwarding pointers. "GC" could even spend extra time
>    to determine the optimum relocation-destination without impacting the
>    performance of "MP".  When "MP" empties the relocation-queue, "GC" starts
>    on the next Marking Phase.
>
>
> 7. Now a question remains about multi-threaded flattening of
> forwarding pointers.
>     If two threads simultaneously perform an identical transform from...
>        someObject-slot --> forwarder-b --> finalObject-c
>     to...
>        someObject-slot --> finalObject-c
>     does it matter that these operation may be done twice overlapping?
>
>    Options...
>
>      a. One mitigation could be for "GC" to identify forwarders to be flattened
>         and queue them for "MP" to process (reuse the compaction-queue).  This
>         is work that "MP" would need to do anyway, but brings it forward to be
>         dealt with at a convenient time.
>
>      b. I guess "MP" and "GC" could play nice together if when encountering
>         a slot containing with a forwarding pointer they both do
> something similar to (2.) like...
>           i. Load object-slot from shared object-space into local
> variable S <== slot
>          ii. Set local variable S' <== flattened/followed pointer
>         iii. Atomic compare-and-swap S' back into object-space,
> (object-slot <== if S then S')
>              - "MP" gets priority but presume conflicts rare anyway.
>              - While this violates rule (1.) the presumed frequency of
>                encountering forwarding pointers is low for "MP", so performance
>                should not be affected. Indeed by "GC" pre-emptively
> flattening forwarders
>                the frequency "MP" sees reduces.
>
>
> 8. The release of the compacted page back to the OS is held up by
> forwarding pointers.
>    Forwarders are part of the graph followed in the Marking Phase they get
>    marked gray/black just like objects if they are referenced.  After
> forwarders are
>    fully flattened they are skipped by Marking and end up marked white
> and released
>    just like any other object. Once all forwarders are released, the
> page is released
>    to the OS.   If "GC" can effectively flatten pointers concurrently
> with "MP" during its
>    normal Marking Phase, then pages would be released back to the OS
> in a timely manner.
>
>
>
> Now one thing I am curious about how the GC tri-color marking is implemented.
> At https://clementbera.wordpress.com/2014/01/16/spurs-new-object-format
> it describes how Spur's object header has three bits for GC.
> * isGray
> * isRemembered
> * isMarked  (I presume this means marked "black")
> Do the bits imply the gray set is not stored in a separate data
> structure on the heap, but rather distributed in-place, which I guess
> would require multiple passes through memory to grow the gray set?
>
>
> So this is not an area I'm set up to seriously work on, but I remain curious
> and hopefully its a useful seed discussion for others.
> cheers -ben
>
>
> P.S. At https://clementbera.wordpress.com/2017/09/19/vm-learning-memory-management/
> nice to hear that you have Sophie (I presume) continuing with the VM.
>
>
>
> --
> Clément Béra
> Pharo consortium engineer
> https://clementbera.wordpress.com/
> Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq

Reply | Threaded
Open this post in threaded view
|

Re: Garbage Collection (was Re: [Pharo-dev] Discussing the roadmap)

Ben Coman
In reply to this post by Clément Bera-4
 
On 4 December 2017 at 15:47, Clément Bera <[hidden email]> wrote:
>
> Here are some ideas/things to say on the tip of my head:
> - Changing an object to a forwarding object is non atomic (we need to maintain at least stack invariant)

Thats because the whole multiword object has to be copied.
What about the reverse, flattening a forwarding-object back into the
real-object?
I presume only the one word of the object-pointer is updated.
btw, are we safe if competing threads write the *same* data to a slot,


Anyway my idea was the GC-thread would only *identify* objects to be
moved, and queue them
for the Main-thread to chip away at, so only one thread is converting
objects to forwarding objects.


> - To decrease the pauses in full GC I have 2 plans:
> -- incremental marking (split the mark pause in multiple small pauses): Not implemented right now.
> -- selective compaction (compacts only part of the heap instead of all the heap and sweeps the rest, similar to G1, but uses forwarders instead of lots of card marking metadata): I implemented SpurSweeper which only sweeps but works very well.
> - Currently the marking phase removes all forwarders and I would like incremental marking to maintain the same invariant (forwarders are always white).

A concurrent-marking thread could essentially do the same.
         i. From shared memory load forwarder F from object-slot
                    F <== object-slot

         ii. Follow forwarder to real-object, store into temporary R
                    R <== flattened/followed pointer

         iii. Atomic compare-and-swap R back into object-slot,
                    object-slot <== if F then R

When (iii.) fails
* If I'm the Main-thread, then an other-thread already did what I wanted,
   and since thats the *only* mutation other-threads can do to an object-slot.
   I am certain...    "object-slot == R",   so since I'm handling a
failed forwarding-check, just continue with the normal retry.
* If I'm an other-thread, no hurry.  Read the object-slot again and it
"should" be a real-object, otherwise just keep trying until I get one.

So the question is... When using forwarders for compaction,
how often would fail forwarder-check fail...
  * if there was one thread on its own; versus
  * in the Main-thread if an other-thread had already flattened many of them


> - In general, Concurrent marking and sweeping have been implemented everywhere, but no concurrent compaction. For compaction you can make it selective (compact only part of the heap and the part which needs it the most) like I suggest and like in G1, which decreases considerably compaction pause time. Work on concurrent compaction is state of the art and not in production everywhere, see for example

IIUC not many other languages use forwarding pointer like we do, and
these seem like a real advantage
to compact incrementally and concurrently.


> No more time right now.

I recognise it wasn't a great format.
I really appreciate the time you could spare.

cheers -ben