rough idea for per segment Lilliputian lists



Ben Coman
 
On Mon, 8 Oct 2018 at 02:08, <[hidden email]> wrote:
ClementBera uploaded a new version of VMMaker to project VM Maker:

  SpurSelectiveCompactor compacts memory by selecting the memory segments with the most free space and compacting only those, to limit fragmentation while being really quick to perform. The algorithm is fast mostly because it does not update pointers: they are updated lazily during the next marking phase, so there is no need to read the fields of objects in memory segments other than the one being compacted.

  The algorithm works as follows. First, a global sweep pass iterates over the memory linearly, changing unmarked objects to free chunks and concatenating free chunks. During this sweep, the segments of the heap are analysed to determine their percentage of occupation. Second, the least occupied segments are compacted by copying their remaining live objects into an entirely free segment, called segmentToFill (we detail later in the paragraph where segmentToFill comes from), changing the copied objects to forwarding objects and marking the free chunks as unavailable (removed from the free list and marked as data objects). Third, the next marking phase removes all forwarders. Fourth, at the beginning of the next compaction phase, the segments compacted during the previous GC can be marked entirely as free space (no need to check anything inside: there were only forwarders and trash data). One of these compacted segments is then selected as the segmentToFill; the others are just marked as free chunks.


  The compaction is effectively partial, compacting only the most critical segments of the heap to limit fragmentation. Compaction time is crazy low, since few objects are moved and pointer updating is done lazily during the next marking phase, while memory fragmentation is still prevented.

  Now this works well when biasForGC is true, but when performing a snapshot, the compactor is just total crap (we need to figure out a solution).
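A toy Python sketch of the selection step described above (the names, data layout, and the placement of the 70% threshold are my reading of the description, not the actual Slang code):

```python
# Toy sketch: pick the least-occupied segments whose live bytes fit
# into one entirely free segment. Names and threshold placement are
# illustrative only; the real implementation is Slang in VMMaker.

FULL_THRESHOLD = 0.70  # only segments below 70% occupancy are candidates

def select_segments_to_compact(segments, free_capacity):
    """Return the names of the least-occupied segments whose live
    bytes fit into a free segment of size `free_capacity`."""
    candidates = [s for s in segments if s["occupancy"] < FULL_THRESHOLD]
    candidates.sort(key=lambda s: s["occupancy"])  # least occupied first
    picked, used = [], 0
    for seg in candidates:
        live = int(seg["size"] * seg["occupancy"])
        if used + live > free_capacity:
            break
        picked.append(seg["name"])
        used += live
    return picked

segs = [
    {"name": "A", "size": 1000, "occupancy": 0.10},
    {"name": "B", "size": 1000, "occupancy": 0.95},
    {"name": "C", "size": 1000, "occupancy": 0.30},
]
print(select_segments_to_compact(segs, 1000))  # ['A', 'C']
```

Segment B is skipped because it is mostly full; compacting it would move many objects for little recovered space, which is exactly what selective compaction avoids.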

  Performance trick for lilliputian chunks:
+ Specific free chunks (called lilliputian, see isLilliputianSize:) are managed using a singly linked list instead of a doubly linked list, since there is not enough room in the free chunk for the back pointer. During the sweep phase this is not a problem, since we are rebuilding the free chunk structure, but during selective compaction we are detaching free chunks from the free chunk structure, and that can be horribly slow (sometimes 10 seconds at a 20Gb heap, due to many iterations over the singly linked list). To work around this problem, the sweep phase uses the lastLilliputianChunk variable to keep the lilliputian free chunk singly linked list sorted in ascending address order (see interceptAddFreeChunkWithBytes:at:). During the selective compaction phase, the same variable is re-used to iterate at most once over the singly linked list while detaching lilliputian chunks (see incrementalUnlinkSmallChunk:). In addition, each segment is annotated during the sweep phase with the last lilliputian chunk it holds. Hence, during the compaction phase, the linked list is iterated, but the iteration can jump to the last chunk of the segment preceding the one being compacted.
 
- Specific free chunks (called lilliputian, see isLilliputianSize:) are managed using a singly linked list instead of a doubly linked list, since there is not enough room in the free chunk for the back pointer. During the sweep phase this is not a problem, since we are rebuilding the free chunk structure, but during selective compaction we are detaching free chunks from the free chunk structure, and that can be horribly slow (sometimes 10 seconds at a 20Gb heap, due to many iterations over the singly linked list). To work around this problem, the sweep phase uses the lastLilliputianChunk variable to keep the lilliputian free chunk singly linked list sorted in ascending address order (see interceptAddFreeChunkWithBytes:at:). During the selective compaction phase, the same variable is re-used to iterate at most once over the singly linked list while detaching lilliputian chunks (see incrementalUnlinkSmallChunk:).
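The sorting described above is essentially free because the sweep visits memory in ascending address order, so appending each lilliputian at the tail of the list (via a lastLilliputianChunk-style cursor) yields an address-sorted list with no separate sort pass. A toy Python sketch (names are illustrative; the real code is Slang):

```python
# Toy sketch: O(1) tail append during an ascending-address sweep keeps
# the lilliputian singly linked list sorted by address for free.

class Chunk:
    def __init__(self, addr):
        self.addr = addr
        self.next = None

class LilliputianList:
    def __init__(self):
        self.head = None
        self.last = None   # plays the role of lastLilliputianChunk

    def intercept_add(self, chunk):
        # Sweep discovers chunks in ascending address order, so a
        # tail append preserves sortedness at no extra cost.
        if self.last is None:
            self.head = chunk
        else:
            self.last.next = chunk
        self.last = chunk

lst = LilliputianList()
for addr in [16, 48, 160, 400]:   # addresses in sweep (ascending) order
    lst.intercept_add(Chunk(addr))

addrs = []
node = lst.head
while node:
    addrs.append(node.addr)
    node = node.next
print(addrs)  # [16, 48, 160, 400]
```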

Please excuse me: I've only skimmed the implementation, so I don't have a good grasp of whether what I'm saying is practical,
but reading the above description, a random idea floating along the wind found me...

Have you considered a per-segment Lilliputian list?
Reserve the first word of each segment as the pointer to the segment's first free Lilliputian chunk. If it is null, there are no free Lilliputian chunks in that segment.
The global sweep easily(?) knows which segment a free Lilliputian chunk belongs to and prepends it to that per-segment list.

To combine all those separate per-segment lists into an overall list...
split that reserved word into two parts...
* the LSB part provides an "offset" to the first element of the per-segment list
* the MSB part provides a pointer to the next segment that has a per-segment list, i.e. the "next-per-segment-list".
Also btw, free Lilliputian chunks only need space to hold the "next-offset" within their segment.

To allocate from the overall-list...
  given a root pointer "X", follow to the first segment,
  allocate chunk at "offset",
  update "offset" with "next-offset",
  if "offset" is zero, update "X" with "next-per-segment-list"

Reserve a second word at the start of each segment and you have a back pointer to the previous segment.
So when a segment is removed, it's easy to remove all its free Lilliputian chunks from the overall list in one go.
The overall list of free Lilliputian chunks would be like a doubly linked list with offshoots.

Maybe(?) then the Lilliputian list never needs to be sorted,
and also doesn't need to be traversed when freeing a segment; just remove the segment from the overall list.
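The split reserved word could look something like this toy Python sketch (field widths, names, and the encoding are entirely made up to show the idea; the real segment header layout is different):

```python
# Toy sketch of the packed reserved word: low bits hold the offset of
# the segment's first lilliputian chunk, high bits the address of the
# next segment that has a per-segment list. Widths are invented.

OFFSET_BITS = 20
OFFSET_MASK = (1 << OFFSET_BITS) - 1

def pack(next_segment_addr, first_chunk_offset):
    assert 0 <= first_chunk_offset <= OFFSET_MASK
    return (next_segment_addr << OFFSET_BITS) | first_chunk_offset

def unpack(word):
    return word >> OFFSET_BITS, word & OFFSET_MASK

word = pack(0x5000, 24)   # next listed segment at 0x5000, chunk at +24
nxt, off = unpack(word)
print(hex(nxt), off)      # 0x5000 24

# Allocation from the overall list, as sketched in the steps above:
# follow X to the first segment, take the chunk at `off`, replace `off`
# with that chunk's next-offset; when `off` hits 0, advance X to `nxt`.
```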

cheers -ben


P.S. Please feel free to play "idea" whac-a-mole. I've not put a lot of critical thinking into it. It was easier to share it than to stop it bubbling about.

Re: rough idea for per segment Lilliputian lists

Clément Béra
 
Hi Ben,

What you are saying makes a lot of sense.

Some context:

Free chunks are organized, as my last blog post describes, in a 64-word array structure. Words 2 to 63 are either 0 or the first free chunk of size 2 to 63 respectively. Word 0 is used as a reference to the free tree, a binary tree holding in each node a reference to the first free chunk of a given size > 63. Large chunks are referenced through the free tree, small chunks directly through the 64-word array structure.
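A toy Python sketch of that directory (a dict stands in for the real binary tree to keep it short; names are made up):

```python
# Toy sketch of the free chunk directory: slots 2..63 head the lists of
# free chunks of that exact word size; slot 0 conceptually refers to a
# tree of chunks larger than 63 words (a dict here, for brevity).

NUM_SLOTS = 64
free_lists = [None] * NUM_SLOTS   # free_lists[n]: first free chunk of size n
free_tree = {}                    # size -> first free chunk, for size > 63

def add_free_chunk(size, chunk):
    if size < NUM_SLOTS:
        chunk["next"] = free_lists[size]   # push on the small-size list
        free_lists[size] = chunk
    else:
        chunk["next"] = free_tree.get(size)  # chain same-size large chunks
        free_tree[size] = chunk

add_free_chunk(4, {"addr": 0x100})
add_free_chunk(4, {"addr": 0x200})
add_free_chunk(128, {"addr": 0x900})
print(hex(free_lists[4]["addr"]))  # 0x200 (most recently added, size 4)
```

Allocation of a small chunk is then just popping the head of the right slot, which is why the singly linked organization was fine until chunks had to be *removed* from the middle of a list.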

When I started to work on SelectiveCompactor, all the free chunks were organized from there as singly linked lists. That worked well, since to allocate a chunk one can just take the first one of the list, and in the PlanningCompact algorithm, compacting resets the free lists and just adds back a few large free chunks at the end of the compaction. Now, when I built the sweep algo, I first tried to remove free chunks from the free list while iterating over memory to concatenate unmarked objects and existing free chunks. Obviously, at 2Gb the sweep was already horribly slow, since removing an element from a singly linked list in order to concatenate it is O(N). My first idea was to organize the free chunks in doubly linked lists instead of singly linked lists, to remove them from the list in O(1) instead of O(N). That introduced the concept of lilliputian, present only in 64 bits: free chunks of size 2 have room for just the header and one pointer, hence they are the only ones organized as a singly linked list while the others are organized as doubly linked lists. Removing a segment is now O(1) except for lilliputians.
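The O(1) versus O(N) difference can be sketched in a few lines of Python (dicts stand in for chunk headers; the real chunks are raw memory words):

```python
# Toy sketch: with a back pointer, detaching a chunk is O(1); a
# lilliputian (no room for a back pointer) forces an O(N) scan from
# the list head to find its predecessor.

def unlink_doubly(chunk):
    # O(1): the chunk knows both of its neighbours.
    if chunk["prev"] is not None:
        chunk["prev"]["next"] = chunk["next"]
    if chunk["next"] is not None:
        chunk["next"]["prev"] = chunk["prev"]

def unlink_singly(head, chunk):
    # O(N): scan from the head to find the predecessor.
    if head is chunk:
        return chunk["next"]
    node = head
    while node["next"] is not chunk:
        node = node["next"]
    node["next"] = chunk["next"]
    return head

# doubly linked a <-> b <-> c: detach b without touching the head
a = {"prev": None, "next": None}
b = {"prev": a, "next": None}
c = {"prev": b, "next": None}
a["next"], b["next"] = b, c
unlink_doubly(b)
print(a["next"] is c)  # True
```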

However, that was still too slow for the sweep algorithm (removing a chunk still requires 2 extra random memory writes, which hurt the CPU cache, and long lilliputian linked lists were hurting performance). The implementation that is there now just resets the free chunk linked lists before sweeping and rebuilds them while sweeping.

In SelectiveCompact, the sweep phase sweeps the heap, rebuilding the linked lists but also computing the occupation of each segment. If there is at least 1 segment < 70% full, a selective compaction phase starts, selecting the maximum number of segments, starting from the least occupied ones, and compacts them into an entirely empty segment. Typically it compacts 1-10 segments into 1. In this context, compacted segments hold free chunks which need to be cleared from the linked lists. Clearing free chunks organized as doubly linked lists is fast enough, since only a few segments are compacted. Clearing lilliputians is still O(N), so clearing many of them led to horrible compaction times (O(N^2): multiple seconds instead of multiple ms at a 20Gb workload). So I needed to find a solution for lilliputians.

Solution:

As Ben suggested, one solution would be to organize the lilliputian singly linked list as a per-segment linked list. I did not do that because it would significantly impact the free chunk organization; I changed it in the past, and it takes 2 days to make sure I did not break anything in the production VM.

So what I did first instead was to sort the lilliputians while sweeping (which is almost free: I just need an extra variable and execution time is not really impacted) and iterate over the lilliputian linked list while compacting segments. Since segments are compacted in ascending address order, that meant I could iterate at most once over the lilliputian linked list while compacting (O(N^2) becomes O(N)). That reduced the stupid pathological cases from multiple seconds to multiple hundred ms.

That was great, but still not enough (we target 16ms pauses for 60fps apps). So in addition I annotated, while sweeping, each segment with its last lilliputian, which means that now I can iterate over the lilliputian linked list per segment and remove the remaining overhead, leading to compaction pauses of several ms.
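The two tricks combined (address-sorted list plus the per-segment last-lilliputian annotation) amount to a forward-only cursor over the list. A toy Python sketch, with invented names and structure (the real logic lives in incrementalUnlinkSmallChunk:):

```python
# Toy sketch: detach all lilliputians of each compacted segment with a
# roving cursor over the address-sorted singly linked list. Segments
# are compacted in ascending order, so the cursor never rewinds and
# the whole compaction is at most one pass over the list.

class N:
    def __init__(self, addr, nxt=None):
        self.addr, self.next = addr, nxt

head = N(0)  # sentinel so unlinking the first real node is uniform
head.next = N(10, N(20, N(110, N(120, N(210)))))  # address-sorted

def detach_segment(prev, seg_start, seg_end):
    """Unlink every chunk in [seg_start, seg_end); return the cursor."""
    while prev.next and prev.next.addr < seg_end:
        if prev.next.addr >= seg_start:
            prev.next = prev.next.next      # detach this lilliputian
        else:
            prev = prev.next                # before the segment: skip
    return prev

cursor = detach_segment(head, 100, 200)     # clears 110 and 120
cursor = detach_segment(cursor, 200, 300)   # resumes forward, clears 210

remaining = []
n = head.next
while n:
    remaining.append(n.addr)
    n = n.next
print(remaining)  # [10, 20]
```

The per-segment last-lilliputian annotation in the real code plays the role of handing `detach_segment` a cursor that already sits just before the segment being compacted, instead of walking up to it.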

I still clear the lilliputians while iterating over the segment in the compaction phase. Clearing all the lilliputians of 1 segment at once is indeed now possible, but it has no real performance benefit IMO, since it does not avoid any cache miss (the previous lilliputian is in cache from the segment iteration and the current one has just been read from memory). The current impl works, so I've just kept it that way.

Now, what you suggest, Ben, is better: no sorting of lilliputians while sweeping, just keep a reference to the last lilliputian in the segment (the first one found in the segment, since they end up in reverse address order), and then clear all the segment's lilliputians at once when compacting it. It was just easier to make it work the way I did, starting from what was already implemented.

I thought about doing something similar to what you described when I finished, but I don't think it would impact performance that much (it does not remove reads/writes with cache misses), so I was too lazy.

Right now I have 2 development images, one matching the trunk and one with extra profiling changes for my paper, so it is difficult to change things. But I may do that later when I finish this paper or you can change it yourself.

Compiling with selective requires changing compactorClass and enabling tempvectorreadbarrier in the Slang compilation settings. The only thing is that you cannot really snapshot right now with selective (the idea is to have both selective and planning, so that snapshots use planning and the rest of the code uses selective; see hybridcompactor. Not sure that works so far, though, because without full heap compaction snapshots are quite a bit bigger).

Damn, that was a long mail. But you really nailed the problem. I've been dying to talk about this to someone; I'm glad you asked :-)









Re: rough idea for per segment Lilliputian lists

Ben Coman
 
On Tue, 9 Oct 2018 at 15:52, Clément Béra <[hidden email]> wrote:
Solution:

As Ben suggested, one solution would be to organized the lilliputian single linked list as a per segment linked list. I did not do that because it would impact a lot free chunk organization, I changed it in the past and it takes 2 days to make sure I did not break anything in the production VM.

So what I did instead first is to try to sort lilliputian while sweeping (which is almost free

Cool, that's good context. It's fair to balance simplicity (i.e. reliability) versus optimization.
 

I just need an extra variable and execution time is not really impacted), iterate over the lilliputian linked list while compacting segments, and since segments are in ascending addresses order, that meant I could iterate at most once over the lilliputian linked list while compacting (O(N^2) becomes O(N)). That reduced the stupid pathological cases from multiple seconds to multiple hundred ms. 

That was great, but still not enough (we target 16ms pause for 60fps apps). So in addition I annotated while sweeping each segment with their last lilliputian, which means that now I can iterate over the lilliputian linked list per segment and remove the remaining overhead, leading to compaction pauses of several ms. 

I missed that pauses were that small already.
 

I still clear the lilliputians while iterating over the segment in compaction phase, clearing all the lilliputian of 1 segment at once is indeed now possible, but it has no real performance benefit IMO since it does not avoid any cache miss (prev lilliputian is in cache for the segment iteration and current one has just been read from memory). Current impl works so I've just kept it that way. 

Now what you suggest Ben is better, no sorting of lilliputian while sweeping, just keep a reference to the last lilliputian in the segment (first one found in the segment since they end up in reverse address order), and then clear all the segment lilliputians at once when compacting it. It was just easier to make it work the way I did from what was implemented when I did it.

Cool. Nice to know my understanding was on the right track, and how other practical matters affect it.
 

I thought about doing something similar to what you described when I finished but I don't think it would impact that much performance (it does not remove read/writes with cache miss) so I was too lazy. 

Right now I have 2 development images, one matching the trunk and one with extra profiling changes for my paper, so it is difficult to change things. But I may do that later when I finish this paper or you can change it yourself.

I'm sure I'd find it interesting to do so, but I'm already behind on some projects.
 

Compiling with selective requires to change compactorClass and enable tempvectorreadbarrier in slang compilation setting. The only thing is that u cannot really snapshot right now with selective (the idea is to have both selective and planning so that snapshots use planning and the rest of the code use selective, see hybridcompactor, not sure that works so far though, because without full heap compaction snapshots are quite bigger).

Damned that was a long mail. But you really nailed the problem. I've been dying to talk about that to someone I'm glad you asked :-)

I'm glad I can prompt the creative process.

cheers -ben
 



Re: rough idea for per segment Lilliputian lists

Clément Béra
 
We're talking about compaction pauses, right?

The current full GC pause with Selective is: mark pause + scavenge pause + sweep pause + compaction pause + shrink pause.

At a 20Gb Moose workload, that's 17 s + 25 ms (avg, up to 68 ms) + 1.5 s + ~20 ms + 0 ms (up to 4 ms).

The thing is that, research-wise, the mark and sweep pauses can be made incremental or concurrent. So only scavenge and compaction are problematic (they can hardly be made concurrent without redesigning the full memory manager into uncharted territories). That means that, in theory, with this design we can have a max ~20ms pause.

On Tue, Oct 9, 2018 at 6:26 PM Ben Coman <[hidden email]> wrote:
 
On Tue, 9 Oct 2018 at 15:52, Clément Béra <[hidden email]> wrote:
Solution:

As Ben suggested, one solution would be to organized the lilliputian single linked list as a per segment linked list. I did not do that because it would impact a lot free chunk organization, I changed it in the past and it takes 2 days to make sure I did not break anything in the production VM.

So what I did instead first is to try to sort lilliputian while sweeping (which is almost free

cool, thats good context. if fair to balance simplicity (i.e. reliability) versus optimization. 
 

I just need an extra variable and execution time is not really impacted), iterate over the lilliputian linked list while compacting segments, and since segments are in ascending addresses order, that meant I could iterate at most once over the lilliputian linked list while compacting (O(N^2) becomes O(N)). That reduced the stupid pathological cases from multiple seconds to multiple hundred ms. 

That was great, but still not enough (we target 16ms pause for 60fps apps). So in addition I annotated while sweeping each segment with their last lilliputian, which means that now I can iterate over the lilliputian linked list per segment and remove the remaining overhead, leading to compaction pauses of several ms. 

I missed that pauses were that small already.
 

I still clear the lilliputians while iterating over the segment in compaction phase, clearing all the lilliputian of 1 segment at once is indeed now possible, but it has no real performance benefit IMO since it does not avoid any cache miss (prev lilliputian is in cache for the segment iteration and current one has just been read from memory). Current impl works so I've just kept it that way. 

Now what you suggest Ben is better, no sorting of lilliputian while sweeping, just keep a reference to the last lilliputian in the segment (first one found in the segment since they end up in reverse address order), and then clear all the segment lilliputians at once when compacting it. It was just easier to make it work the way I did from what was implemented when I did it.

cool. nice to know my understanding was on the right track and how other practical matters affect it
 

I thought about doing something similar to what you described when I finished but I don't think it would impact that much performance (it does not remove read/writes with cache miss) so I was too lazy. 

Right now I have 2 development images, one matching the trunk and one with extra profiling changes for my paper, so it is difficult to change things. But I may do that later when I finish this paper or you can change it yourself.

I'm sure I'd find it interesting to do so, but I'm already behind on some projects.
 

Compiling with selective requires to change compactorClass and enable tempvectorreadbarrier in slang compilation setting. The only thing is that u cannot really snapshot right now with selective (the idea is to have both selective and planning so that snapshots use planning and the rest of the code use selective, see hybridcompactor, not sure that works so far though, because without full heap compaction snapshots are quite bigger).

Damned that was a long mail. But you really nailed the problem. I've been dying to talk about that to someone I'm glad you asked :-)

I'm glad I can prompt the creative process.

cheers -ben
 
On Mon, Oct 8, 2018 at 3:17 PM Ben Coman <[hidden email]> wrote:
 
On Mon, 8 Oct 2018 at 02:08, <[hidden email]> wrote:
ClementBera uploaded a new version of VMMaker to project VM Maker:

  SpurSelectiveCompactor compacts memory by selecting the memory segments with the most free space and compacting only those, to limit fragmentation while being really quick to perform. The algorithm is fast mostly because it does not update pointers: they are updated lazily during the next marking phase, so there is no need to read the fields of objects in other memory segments that the one compacted.

  The algorithm works as follows. First, a global sweep pass iterates over memory linearly, changing unmarked objects to free chunks and concatenating adjacent free chunks. During this sweep phase, the segments of the heap are analysed to determine their percentage of occupation. Second, the least occupied segments are compacted by copying their remaining live objects into an entirely free segment, called the segmentToFill (we detail later in the paragraph where the segmentToFill comes from), changing the copied objects to forwarding objects and marking the free chunks as unavailable (removed from the free list and marked as data objects). Third, the next marking phase removes all forwarders. Fourth, at the beginning of the next compaction phase, the segments compacted during the previous GC can be marked entirely as free space (no need to check anything inside: they contain only forwarders and trash data). One of these compacted segments is then selected as the new segmentToFill; the others are just marked as free chunks.
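As a rough illustration of the selection step only (a Python sketch with invented names; the real policy lives in SpurSelectiveCompactor and may differ): the per-segment occupancy figures gathered during the sweep let the compactor greedily pick the emptiest segments whose live data fits into the reserved free segment.

```python
def pick_segments(occupancy, fill_capacity):
    """occupancy: {segment_id: live_bytes}. Greedily pick the emptiest
    segments whose combined live data fits in the free segment."""
    chosen, used = [], 0
    for seg, live in sorted(occupancy.items(), key=lambda kv: kv[1]):
        if used + live > fill_capacity:
            break
        chosen.append(seg)
        used += live
    return chosen

# e.g. three segments; live bytes per segment after the sweep:
segs = {0: 900_000, 1: 120_000, 2: 300_000}
print(pick_segments(segs, fill_capacity=500_000))  # -> [1, 2]
```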


  The compaction is effectively partial, compacting only the most critical segments of the heap to limit fragmentation. Compaction time is crazy low, since only a small number of objects are moved and pointer updating is done lazily during the next marking phase, while memory fragmentation is still prevented.

  Now this works well when biasForGC is true, but when performing a snapshot, the compactor is just total crap (we need to figure out a solution).

  Performance trick for lilliputian chunks:
+ Specific free chunks (called lilliputian, see isLilliputianSize:) are managed using a singly linked list instead of a doubly linked list, since there is not enough room in the free chunk for the back pointer. During the sweep phase this is not a problem, since we are rebuilding the free chunk structure, but during selective compaction we are detaching free chunks from the free chunk structure, and that can be horribly slow (10 seconds sometimes on a 20 GB heap, due to many iterations over the singly linked list). To work around this problem, the sweep phase uses the lastLilliputianChunk variable to sort the lilliputian free chunk singly linked list in ascending address order (see interceptAddFreeChunkWithBytes:at:). During the selective compaction phase, the same variable is reused to iterate at most once over the singly linked list while detaching lilliputian chunks (see incrementalUnlinkSmallChunk:). In addition, each segment is annotated during the sweep phase with the last lilliputian chunk it holds. Hence, during the compaction phase, the linked list is iterated, but the iteration can jump to the last chunk of the segment preceding the one to compact.!
 
- Specific free chunks (called lilliputian, see isLilliputianSize:) are managed using a single linked list instead of a double linked list since there's not enough room in the free chunk for the back pointer. During the sweep phase this is not a problem since we're rebuilding the free chunk structure, but during selective compaction we're detaching free chunks from the free chunk structure and that can be horribly slow (10 seconds sometimes at 20Gb heap due to many iteration over the single linked list). To work around this problem, the sweep phase use lastLilliputianChunk variable to sort the lilliputian free chunk single linked list in ascending address order (See interceptAddFreeChunkWithBytes:at:). During the selective compation phase, the same variable is re-used to iterate at most once over the single linked list while detaching lilliputian chunks (See incrementalUnlinkSmallChunk:).
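A toy model of that single-pass unlink, assuming only that the list is kept in ascending address order (a Python sketch; unlink_range is an invented stand-in for incrementalUnlinkSmallChunk:): a forward cursor advances monotonically, so compacting several segments in address order never re-walks earlier chunks.

```python
# Chunks of an address-sorted singly linked free list (simulation).
class Chunk:
    def __init__(self, addr):
        self.addr = addr
        self.next = None

def unlink_range(cursor, lo, hi):
    """Remove chunks with lo <= addr < hi; cursor is the last node known
    to precede the range. Returns the cursor's new position, so the next
    segment's unlink can resume from here instead of the list head."""
    while cursor.next and cursor.next.addr < lo:
        cursor = cursor.next
    while cursor.next and cursor.next.addr < hi:
        cursor.next = cursor.next.next   # unlink; no back pointer needed
    return cursor

head = Chunk(-1)                         # sentinel before the first chunk
node = head
for addr in (10, 20, 110, 120, 210):     # built in ascending address order
    node.next = Chunk(addr)
    node = node.next
cur = unlink_range(head, 100, 200)       # detach segment [100, 200)
addrs, n = [], head.next
while n:
    addrs.append(n.addr)
    n = n.next
assert addrs == [10, 20, 210]
```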

Please excuse me: I only skimmed the implementation, so I don't have a good grasp of whether what I'm saying is practical,
but anyway, reading the above description, a random idea floating along the wind found me...

Have you considered a per-segment Lilliputian list?
Reserve the first word of each segment as the pointer to the segment's first free Lilliputian chunk. If it is null, there are no free Lilliputian chunks in that segment.
The global sweep easily(?) knows which segment a free Lilliputian chunk belongs to and prepends it to that per-segment list.

To combine all those separate per-segment lists into an overall list...
split that reserved word into two parts...
* the LSB part provides an "offset" to the first element of the per-segment list
* the MSB part provides a pointer to the next segment that has a per-segment list, i.e. the "next-per-segment-list".
Also, btw, free Lilliputian chunks only need space to hold the "next-offset" within their segment.
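The split reserved word could be packed along these lines (a hedged sketch; the bit widths are arbitrary for illustration and not derived from Spur's actual word or segment layout):

```python
# Pack "offset to first chunk" (LSB) and "next segment with a list" (MSB)
# into one reserved word; widths chosen arbitrarily for the example.
OFFSET_BITS = 20
OFFSET_MASK = (1 << OFFSET_BITS) - 1

def pack(offset, next_seg):
    return (next_seg << OFFSET_BITS) | offset

def unpack(word):
    return word & OFFSET_MASK, word >> OFFSET_BITS

w = pack(0x1F0, 0x7A)
assert unpack(w) == (0x1F0, 0x7A)
```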

To allocate from the overall-list...
  given a root pointer "X", follow to the first segment,
  allocate chunk at "offset",
  update "offset" with "next-offset",
  if "offset" is zero, update "X" with "next-per-segment-list"

Reserve a second word at the start of each segment and you have a back pointer to the previous segment.
So when a segment is removed, it's easy to remove all of its free Lilliputian chunks from the overall list in one go.
The overall free-Lilliputian-chunk list would be like a doubly linked list with offshoots.

Maybe(?) then the Lilliputian list never needs to be sorted,
and also doesn't need to be traversed when freeing a segment; just remove the segment from the overall list.
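Putting the pieces together, a minimal simulation of the scheme (a Python stand-in; Segment, prepend_chunk, remove_segment and the dict-based chunk words are all invented for illustration, and a plain list stands in for the doubly linked overall list): freeing a segment drops all of its Lilliputian chunks without touching any of them.

```python
class Segment:
    def __init__(self):
        self.first_offset = 0     # reserved word: 0 = no Lilliputian chunks
        self.chunk_next = {}      # offset -> next offset; stands in for
                                  # the "next-offset" word inside each chunk

def prepend_chunk(seg, offset, overall):
    """Sweep phase: push a chunk onto its segment's list, linking the
    segment into the overall list the first time."""
    if seg.first_offset == 0:
        overall.append(seg)       # plain list stands in for the overall list
    seg.chunk_next[offset] = seg.first_offset
    seg.first_offset = offset

def remove_segment(seg, overall):
    """Freeing a segment: unlink it; its chunks go with it, untraversed."""
    overall.remove(seg)           # O(1) with real prev/next segment pointers

overall = []
s1, s2 = Segment(), Segment()
for off in (48, 32, 16):          # sweep finds chunks high-to-low, say
    prepend_chunk(s1, off, overall)
prepend_chunk(s2, 24, overall)

# s1's per-segment list, followed via offsets: 16 -> 32 -> 48
offs, o = [], s1.first_offset
while o:
    offs.append(o)
    o = s1.chunk_next[o]
assert offs == [16, 32, 48]

remove_segment(s1, overall)       # all of s1's chunks gone in one go
assert overall == [s2]
```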

cheers -ben


P.S. Please feel free to play "idea" whac-a-mole. I've not put a lot of critical thinking into it; it was easier to share than to stop it bubbling about.




--