Determining where I leak memory, Issue 207


Determining where I leak memory, Issue 207

Stephan Eggermont-3
As garbage collection doesn't seem to free much space and the
database is growing a bit too fast, I wanted to use ScanBackup to check
which objects are involved. The latest version (12) doesn't seem to be able
to open backup files, as the code that determines the file type is missing.
The earlier version (9), which does have that code, doesn't work either.

See Issue 207.

Growth:
23 nov. 224MB
24 nov. 320MB
30 nov. 352MB
1  dec. 496MB
3  dec. 512MB
5  dec. 528MB
6  dec. 672MB

Stephan Eggermont


Re: Determining where I leak memory, Issue 207

James Foster-8
Stephan,

Thanks very much for the bug report. I'll take a look. Until then you can try with version 8.

James



Re: Determining where I leak memory, Issue 207

Dale Henrichs
Stephan,

I see that James is taking a look at the ScanBackup issue.

In order to address the growth issue (pardon me if you have already answered
these questions):

   - make sure you are running the maintenance vm script (more
     specifically, the Seaside session expiration code)
   - try stopping/restarting all of your gems and then run an MFC; see the
     sketch below and Issue 136: http://code.google.com/p/glassdb/issues/detail?id=136
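
If it helps, a minimal sketch of the second item, run from a GemTools workspace once the gems are stopped (markForCollection is the standard Repository call):

"With the Seaside gems stopped, sweep the whole repository.
 This only computes the possible dead set; the actual reclaim
 happens afterwards, once the remaining logged-in gems have voted."
SystemRepository markForCollection.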

Dale



Re: Determining where I leak memory, Issue 207

Stephan Eggermont-3
Hello Dale,

I'm not sure I understand how this session expiration works (it's a Seaside 2.8 app).
We work with runSeasideGems start/stop, and I ran Norbert's script with the Seaside
gems stopped.

In GemTools, we printed the following:

| classes sizes result |
result := Dictionary new.
classes := (PackageInfo named: 'Kompas-Web') classes.
"countInstances: answers the counts in the same order as the class list"
sizes := SystemRepository countInstances: classes.
sizes doWithIndex: [:each :i |
    "keep only classes with more than 1000 instances"
    each > 1000 ifTrue: [result at: (classes at: i) put: each]].
result

It showed us we had 11070 KSSession instances. We are not yet that successful:
the number should currently be below 100.

Only one of them was expired:
 
(KSSession allInstances
        select: [:each | each expired])
size

We then expired them all at the Seaside level:

KSSession allInstances do: [:each | each expire].
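
To verify the expiry took effect, one can commit and re-check (System commitTransaction is the standard call; the rest just reuses the expressions above):

System commitTransaction.
"after expiring everything, this should now answer 0"
(KSSession allInstances reject: [:each | each expired]) size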

Then doing a SystemRepository markForCollection gave
impressive results: it counted 800,000 live objects and
5,000,000 possible dead.

That was not enough to garbage collect them.
What is the next step? Should I run/adapt the maintenance tasks of
StartMaintenance30?

Stephan

Re: Determining where I leak memory, Issue 207

NorbertHartl

On 07.12.2010, at 12:51, Stephan Eggermont wrote:

> Hello Dale,
>
> I'm not sure I understand how this session expiration works (it's a Seaside 2.8 app)
> We work with runSeasideGems start/stop, and I ran Norberts script with the Seaside
> Gems stopped.
>
Which script do you mean?

> In GemTools, we printed the following:
>
> |classes sizes result|
> result := Dictionary new.
> classes := (PackageInfo named: 'Kompas-Web') classes.
> sizes := SystemRepository countInstances: classes.
> sizes doWithIndex: [:each :i |
>    each > 1000 ifTrue: [result at: (classes at: i) put: each ]].
> result
>
> It showed us we had 11070 KSSession instances. We are not yet that succesful.
> The number should be below 100 currently.
>
> Only one of them was expired:
>
It seems there is something wrong with the maintenance script. The script adds entries to the object log, which is available under /tools/objectLog. If there is nothing there about the expiry of sessions, then the script is not working properly.

> (KSSession allInstances
> select: [:each | each expired])
> size
>
> We then expired them all at the Seaside level:
>
> (KSSession allInstances do: [:each | each expire]
>
> Then doing a SystemRepository markForCollection gave
> impressing results: it counted 800,000 live objects and
> 5,000,000 possible dead.
>
> That was not enough to garbage collect them.
> What is the next step? Should I run/adapt the maintenance tasks of
> StartMaintenance30?
>
If objects are "possible dead" then the other gems et al. have the chance to veto their collection. I would say it is unlikely that none of those 5,000,000 objects are really dead. Do you know that markForCollection does just that, a "mark for collection"? It does not garbage collect the objects itself. Log out all of your DataCurator sessions and wait a few minutes. If things are normal you should see the free space in the extent grow. If not, complain again.
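
A quick way to see which sessions are still logged in (System currentSessionNames is the standard GemStone call; run it from GemTools or topaz):

"Lists the session id, user and host process of every logged-in
 session, including GemTools logins and any stray topaz sessions."
System currentSessionNames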

Norbert


Re: Determining where I leak memory, Issue 207

Stephan Eggermont-3
Hello Norbert,


On 7 dec 2010, at 14:19, Norbert Hartl wrote:
> Which script do you mean?

The one from Issue 136.

> It seems there is something wrong with the maintenance script. The script adds some things to the object log which is available under /tools/objectLog. If there is nothing about the expiry of sessions then the script does not work properly.

Yes, this is what I find in the current admingcgem.log:

[Info]: Finished second GarRemoveNotDead at 12/07/2010 02:16:34 PM CET.  Removed 0 objs from possibleDead
[Info]: SweepWsUnion ending possibleDead size = 717
  Finished sweeping possible dead at 12/07/2010 02:16:34 PM CET.
  possible dead size=717  notDead size=5253458  
Starting doSweepWsUnion at 12/07/2010 03:03:10 PM CET
  Starting values: WSU size=46898  PD size=5262338
MarkSweepArgsSType::(): buffer size= 320 pages
GarRemoveNotDead: removing live from PD. live size: 5262338
  Finished first GarRemoveNotDead at 12/07/2010 03:03:15 PM CET. Removed 5262338 objects from possibleDead
[Info]: WsUnion during first sweep size = 21
[Info]: Finished second GarRemoveNotDead at 12/07/2010 03:03:15 PM CET.  Removed 0 objs from possibleDead
[Info]: SweepWsUnion ending possibleDead size = 0
  Finished sweeping possible dead at 12/07/2010 03:03:15 PM CET.
  possible dead size=0  notDead size=5262338  
Starting doSweepWsUnion at 12/07/2010 03:16:30 PM CET
  Starting values: WSU size=46912  PD size=5264924
MarkSweepArgsSType::(): buffer size= 320 pages
GarRemoveNotDead: removing live from PD. live size: 5264924
  Finished first GarRemoveNotDead at 12/07/2010 03:16:35 PM CET. Removed 5264924 objects from possibleDead
[Info]: WsUnion during first sweep size = 21
[Info]: Finished second GarRemoveNotDead at 12/07/2010 03:16:35 PM CET.  Removed 0 objs from possibleDead
[Info]: SweepWsUnion ending possibleDead size = 0
  Finished sweeping possible dead at 12/07/2010 03:16:35 PM CET.
  possible dead size=0  notDead size=5264924  

and this in maintenance_gem.log:

Before commit: true
2010-12-07T06:09:10.70276093482971-08:00
Unregistering...
Begin AutoTransaction: true
Finish expiration loop: true
...Expired: 0 sessions.
Before commit: true
2010-12-07T06:10:10.70736598968506-08:00
Unregistering...
Begin AutoTransaction: true
Finish expiration loop: true
...Expired: 1 sessions.
Before commit: true
2010-12-07T06:11:10.72887492179871-08:00
Unregistering...
Begin AutoTransaction: true
Finish expiration loop: true
...Expired: 0 sessions.
Before commit: true
2010-12-07T06:12:10.73608803749084-08:00
Unregistering...
Begin AutoTransaction: true
Finish expiration loop: true
...Expired: 1 sessions.
Before commit: true
2010-12-07T06:13:10.76529002189636-08:00
Unregistering...
Begin AutoTransaction: true

> If objects are "possible dead" then the other gems et. al have the chance to veto the collection of objects. I would say it is unlikely that all of the 5.000.000 objects are not really dead. Do you know that markForCollection does just a "mark for collection"? It does not garbage collect the objects. Log out all of your DataCurator sessions and wait a few minutes. If things are normal you should see the empty space in the extent grow. If not, complain again.

It doesn't.

Do I need to have more running than
glass    14143  0.0  4.5 691288 370744 ?       Ssl  Nov23   0:28 /opt/gemstone/GemStone64Bit2.4.4.1-x86_64.Linux/sys/stoned seaside -z/opt/gemstone/product/seaside/data/system.conf -l/opt/gemstone/log/seaside.log
glass    14144  0.0  1.9 602620 163120 ?       Sl   Nov23   8:31 /opt/gemstone/product/sys/shrpcmonitor 'seaside@127.0.1.1' setlocationstr 31250 setnumpages 49 setnumprocs 0 setmemlock 5000 setspinlockcount 3906 settargetfflim
glass    14150  0.0  1.6 579240 136456 ?       S    Nov23   0:10 /opt/gemstone/product/sys/pgsvrmain seaside@127.0.1.1 0 1 -1
glass    14151  0.0  5.7 579520 472232 ?       S    Nov23   0:29 /opt/gemstone/product/sys/pgsvrmain TCP 48160 90
glass    14160  0.0  1.5 615732 124176 ?       Sl   Nov23  13:41 /opt/gemstone/product/sys/gem pagemanager 'seaside' -T 5000
glass    14165  0.0  5.7 617520 467352 ?       Sl   Nov23   0:27 /opt/gemstone/product/sys/gem reclaimgcgem 'seaside' 0 0 -T 5000
glass    14166  0.0  5.6 617396 464204 ?       Sl   Nov23  13:39 /opt/gemstone/product/sys/gem admingcgem 'seaside' -T 5000
glass    14167  0.0  1.6 632712 133976 ?       Sl   Nov23   0:21 /opt/gemstone/product/sys/gem symbolgem 'seaside' -T 10000
glass    14339  0.0  5.5 1232224 456512 ?      Sl   Nov23   3:00 /opt/gemstone/product/bin/topaz -l -T200000
glass    14468  0.0  0.0  28840  2096 ?        Ss   Nov23   0:00 /opt/gemstone/GemStone64Bit2.4.4.1-x86_64.Linux/sys/netldid gs64ldi -g -n -aglass -l/opt/gemstone/log/gs64ldi.log

Do I need to manually add a task to the admin process? I'm using Seaside 2.8.

Stephan



Re: Determining where I leak memory, Issue 207

Dale Henrichs
Stephan,

The possible dead size of 5M does mean that you have that many objects
that are not directly connected into the persistent object graph.

The WSUnion sweep removes from the possible dead set objects that have been
referenced by objects written since the MFC completed AND (and this is an
important and) objects that were voted down by gems. Your admingcgem.log above
shows exactly that: essentially all ~5.26M possible dead were removed during
the sweep, so something is referencing or voting for nearly all of them.

Since you are presumably running a fairly standard Seaside setup (minus
the maintenance vm), one of the Seaside gems (or a GemTools gem) is
voting down all of the dead, probably because the vm has a root reference
of some sort stuck in its head ... which is likely to be a
manifestation of Issue 136, and you are just unlucky that an object stuck
in a vm's head is keeping 5M objects alive ...

There are two things that we could do:

   1. recycle all of your gems and run another MFC
   2. start a statmonitor, run another MFC and see if we can figure out
      which gem (or gems) are voting down objects ...

Since it only takes one well placed object to keep a bunch of objects
alive, we may not have much success in tracking down _the culprit_....

So: restart all of the gems that are attached to the stone (including
GemTools gems), then run an MFC followed by a reclaimAll (which
should force the immediate reclaim of the dead objects):

run
SystemRepository markForCollection
%
run
SystemRepository postReclaimAll: SystemRepository reclaimAll
%
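
If you prefer to stay in GemTools, the same two steps can be run as ordinary do-its (identical methods, just without the topaz wrapper):

SystemRepository markForCollection.
SystemRepository postReclaimAll: SystemRepository reclaimAll.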

Note that reclaimAll expects all gems to vote, so it is best to close
any development gems while running it (the Seaside gems will vote
automatically).

This should get most, if not all, of your 5M dead disposed of...

Dale



Re: Determining where I leak memory, Issue 207

Stephan Eggermont-3
Thanks Dale,

Found the culprit: this topaz session, which had been running since Nov 23:

>> glass    14339  0.0  5.5 1232224 456512 ?      Sl   Nov23   3:00 /opt/gemstone/product/bin/topaz -l -T200000

made the difference. After killing it, I'm back to 803566 live objects and 142052 possible dead (12784680 bytes).
After making a full backup and restoring from it, I'm back to a file size of 130,023,424 with 2015232 (bytes?) free.
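
For the record, the backup and free-space check look roughly like this from GemTools (fullBackupTo: and fileSizeReport are the stock Repository calls; the backup path is just an example, and the restore itself needs a fresh extent plus a commitRestore, see the System Administration Guide):

"Write a full backup; adjust the path for your machine."
SystemRepository fullBackupTo: '/opt/gemstone/backups/seaside.bak'.

"After restoring into a fresh extent, check extent size and free space."
SystemRepository fileSizeReport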

Stephan

Re: Determining where I leak memory, Issue 207

Dale Henrichs

Stephan,

Excellent ... I'm glad we got that one figured out:)

Dale