the test test - an experience report

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
24 messages Options
12
Reply | Threaded
Open this post in threaded view
|

the test test - an experience report

Stefan Schmiedl
You might recall my building a tool for online testing
with Iliad and GNU Smalltalk. Today was "the" day, and
here is what happened.

I had gst running on one of my webservers, students
accessed it through an apache proxy set up to serve
static files and forward the remaining requests to
the app. As testers we had 5 classes with about 25
students each. The test they took had 13 exercises
with a total of 95 questions.

The clients were about 25 PCs in a private subnet,
going through a NAT-ing DSL router with something
between 200 kbit/s and 500 kbit/s upstream and
2 Mbit/s and 6 Mbit/s downstream (IIRC), shared with a few
other users across the rest of the school grounds.
Response time was snappy, iliad's "ajax loader"
mostly only flashed in the corner, without a chance
of showing off the animation.

From the users' point of view, it was a nice change
of things, and they quite liked it, once we got over
the initial hurdle (below).

I watched the app through atop and paid attention to
the RSIZE, i.e. the amount of physical RAM the gst-remote
process was using
1) after startup, before any access: 19 MB
2) after displaying the registration page with a single widget
   for 25 clients: 33 MB
3) after everybody started working, i.e. with 25 full blown
   test widgets: 65 MB
4) after finishing the test: between 100 MB and 110 MB

I have saved the images of every run
  gst-remote --eval="ObjectMemory snapshot"
and will try to find out what has caused this growth.

CPU load on the server varied upto 90% during the initial phase
(more below) and upto 30% during test execution. The numbers are
not really reliable, as the machine is hosting a few other apps,
which might have contributed. The machine is a "small" single
core 64 bit AMD64 3700+ with 1GB RAM.

Now for the interesting problem, that managed to make me
a bit nervous... and would have been a total showstopper,
had it not happened in this "experimental testing session".

The students of class 6a logged into their Windows domain accounts,
started Firefox and entered the URL for the test (stage 1 above).
Then they entered their names into the registration page (stage 2)
and clicked on the button to access the test. Shortly after server
CPU load went to 100% with the following error message being repeated
as fast as the remote terminal could cope with:

  "Socket accept error: Error while trying to accept a socket connection"

Client side a one-liner 500 error message was reported.

Time for pkill gst-remote ... I rebuilt the image and started the
server again. This time we staged the 25 "almost simultaneous" login
attempts into four batches of 6 each and things worked fine from that point on.

After finishing the test, the students logged off and the next class, 6b ...
had the exact same experience ... and 6c and 6d, too.

For the final group I tried a different approach:
They logged on, opened the URL, and sat on their hands.
I killed gst-remote, rebuilt the image, restarted gst-remote and told them
to reload the page. They then entered their names and started clicking on
the answers and the Socket error of Doom appeared again. Kill, rebuild,
restart. Everybody loads the registration page (not staged, just 25 students
clicking when they're ready), enters their name and works on the test as it
should be. No hiccup.

I am very open to any suggestions as to what could have caused this
misbehavior. I don't think iliad is concerned (besides generating ajax requests),
so swazoo and the gst socket implementation are my next suspects.

My vague-feeling-in-the-gut-proved-by-handwaving hypothesis is that it
might be the combination of building a fairly large widget tree *and*
creating a bunch of new socket connections at the same time that's causing
the trouble.

I'll try to build a test bed to reproduce the disaster in a controlled
setting, but it will be a few days before I can really get to this.

Any ideas?

s.


_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk
Reply | Threaded
Open this post in threaded view
|

Re: the test test - an experience report

Paolo Bonzini-3
On 07/16/2009 03:33 PM, Stefan Schmiedl wrote:
> You might recall my building a tool for online testing
> with Iliad and GNU Smalltalk. Today was "the" day, and
> here is what happened.

Thanks for the experience!

>  From the users' point of view, it was a nice change
> of things, and they quite liked it, once we got over
> the initial hurdle (below).

What were they using before?

> I have saved the images of every run
>    gst-remote --eval="ObjectMemory snapshot"
> and will try to find out what has caused this growth.

What is the actual size of the images?

> The students of class 6a logged into their Windows domain accounts,
> started Firefox and entered the URL for the test (stage 1 above).
> Then they entered their names into the registration page (stage 2)
> and clicked on the button to access the test. Shortly after server
> CPU load went to 100% with the following error message being repeated
> as fast as the remote terminal could cope with:
>
>    "Socket accept error: Error while trying to accept a socket connection"

Do you have the backtrace?

> I'll try to build a test bed to reproduce the disaster in a controlled
> setting, but it will be a few days before I can really get to this.
>
> Any ideas?

You could use Ruby's Mechanize for that.
http://mechanize.rubyforge.org/mechanize/ -- just to show we're not
zealots! :-)

Paolo


_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk
Reply | Threaded
Open this post in threaded view
|

Re: the test test - an experience report

Stefan Schmiedl
On Fri, 17 Jul 2009 13:12:45 +0200
Paolo Bonzini <[hidden email]> wrote:

> Thanks for the experience!

I left out the horrific part where the usual last minute fix from the
night before broke half of the system ...

>
> >  From the users' point of view, it was a nice change
> > of things, and they quite liked it, once we got over
> > the initial hurdle (below).
>
> What were they using before?

usual school equipment: paper and pencils :-)

> > I have saved the images of every run
> >    gst-remote --eval="ObjectMemory snapshot"
> > and will try to find out what has caused this growth.
>
> What is the actual size of the images?

stefan@g128 testdrive $ ls -l ../ot.im *.im
-rw-r--r-- 1 stefan stefan  6234160 16. Jul 17:14 ../ot.im   <- "clean image"
-r--r--r-- 1 stefan stefan 76702080 16. Jul 09:30 ot6a.im    <- "used image"
-r--r--r-- 1 stefan stefan 92434208 16. Jul 10:35 ot6b.im    <- "special case"
-r--r--r-- 1 stefan stefan 74360512 16. Jul 11:24 ot6c.im    <- "used image"
-r--r--r-- 1 stefan stefan 79733448 16. Jul 11:59 ot6d.im    <- "used image"
-r--r--r-- 1 stefan stefan 79306448 16. Jul 12:33 ot6e.im    <- "used image"

The "special case" was a group of students who were sitting there somewhat
longer, probably clicking around some more.

> > The students of class 6a logged into their Windows domain accounts,
> > started Firefox and entered the URL for the test (stage 1 above).
> > Then they entered their names into the registration page (stage 2)
> > and clicked on the button to access the test. Shortly after server
> > CPU load went to 100% with the following error message being
> > repeated as fast as the remote terminal could cope with:
> >
> >    "Socket accept error: Error while trying to accept a socket
> > connection"
>
> Do you have the backtrace?

There's not much in there. I only have two logs, one contains:

Loading package Sockets
Loading package Sport
Loading package Iconv
Loading package Iliad-Core
Loading package Iliad-More-UI
Loading package Magritte
Loading package Iliad-More-Widgets
Loading package Iliad-More-Magritte
Loading package Swazoo
Loading package Iliad-More-Examples
Loading package Iliad-More
Loading package Iliad-Swazoo
Loading package Iliad
Loading package OnlineTester
gst-remote server started.
Socket accept error: Error while trying to accept a socket connection.
... repeated 34461 times until pkill gst-remote ...
/usr/local/share/smalltalk/kernel/../scripts/Remote.st:281: Terminated
(ip 12)HomedAssociation(Association)>>#=
(ip 38)Set>>#findIndex:
(ip 6)Set(HashedCollection)>>#findIndexOrNil:
(ip 4)Set(HashedCollection)>>#remove:ifAbsent:
(ip 8)Set(Collection)>>#mourn:
(ip 6)HomedAssociation>>#mourn
(ip 4)[] in ProcessorScheduler>>#initialize
(ip 24)Array(SequenceableCollection)>>#do:
(ip 6)[] in ProcessorScheduler>>#initialize
(ip 4)[] in BlockClosure>>#newProcessWith:
(ip 52)[] in Process>>#onBlock:at:suspend:
(ip 10)<unwind> BlockClosure>>#on:do:
(ip 14)[] in Process>>#onBlock:at:suspend:
(ip 4)<unwind> BlockClosure>>#ensure:
(ip 10)[] in Process>>#onBlock:at:suspend:
(ip 42)[] in BlockClosure>>#asContext:
(ip 12)BlockContext class>>#fromClosure:parent:

and the other ends with the "Remote.st:281: Terminated" line after 40417
"Socket accept error" lines. In that case, I'm fairly sure that the time between
the first error and the pkill was below 1 min.

> > I'll try to build a test bed to reproduce the disaster in a
> > controlled setting, but it will be a few days before I can really
> > get to this.
> >
> > Any ideas?
>
> You could use Ruby's Mechanize for that.
> http://mechanize.rubyforge.org/mechanize/ -- just to show we're not
> zealots! :-)

huh ... ruby, gst, what's the difference?

ducking and covering,
s.


_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk
Reply | Threaded
Open this post in threaded view
|

Re: the test test - a more detailed analysis of image contents

Stefan Schmiedl
I really should make a blog post out of this ... heck it's enough
for another mini-series :-)

Executive Summary:

There are MANY objects in the image related to Iliad.
How do we get rid of them?


On Fri, 17 Jul 2009 15:09:21 +0200
Stefan Schmiedl <[hidden email]> wrote:

> > What is the actual size of the images?  
>
> stefan@g128 testdrive $ ls -l ../ot.im *.im
> -rw-r--r-- 1 stefan stefan  6234160 16. Jul 17:14 ../ot.im   <- "clean image"
> -r--r--r-- 1 stefan stefan 76702080 16. Jul 09:30 ot6a.im    <- "used image"
> -r--r--r-- 1 stefan stefan 92434208 16. Jul 10:35 ot6b.im    <- "special case"
> -r--r--r-- 1 stefan stefan 74360512 16. Jul 11:24 ot6c.im    <- "used image"
> -r--r--r-- 1 stefan stefan 79733448 16. Jul 11:59 ot6d.im    <- "used image"
> -r--r--r-- 1 stefan stefan 79306448 16. Jul 12:33 ot6e.im    <- "used image"

ot.im is the preloaded image, my OnlineTester package, depending
on Iliad. The other images were saved via
  gst-remote --eval="ObjectMemory snapshot"
after finishing the test session.

I ran a bit of code on each of these images:

Object subclassesDo: [ :c | |s|
  s := c allInstances size.
  s > 0 ifTrue: [
    c name print.
    ': ' display.
    s printNl
  ]
]

Then I realized, that I was missing all the interesting classes,
but manually comparing allSubclassesDo output is really too much...
so let's gain some hacking points.

Collect the "clean" state in ot.im:
  usage := IdentityDictionary new.
  Object allSubclassesDo: [ :cls | |count|
    count := cls allInstances size.
    usage at: cls put: count
  ].
  s := FileStream fopen: 'counts.dump' mode: #write.
  [ ObjectDumper dump: usage to: s ] ensure: [ s close ]

The dictionary usage contains 1237 items and we can now
compare it to the "used" state in the other images:
  s := FileStream fopen: 'counts.dump' mode: #read.
  [ usage := ObjectDumper loadFrom: s ] ensure: [ s close ]
  Object allSubclassesDo: [ :cls | |count orig|
    count := cls allInstances size.
    orig := usage at: cls ifAbsent: [ 0 ].
    ( count > 1 and: [count > (orig * 2)] ) ifTrue: [
      cls display.
      ': ' display.
      orig print.
      ' -> ' display.
      count printNl.
    ]
  ].

ot6a.im has run with an error in the application code, so
the data might be a little bit off, let's look at some
impressive changes in ot6b.im. I've grouped them into some
chunks of related (in my point of view) classes.


"My" object counts match expections:
  OnlineTester.OTApplication: 0 -> 27
  OnlineTester.OTAufgabe: 0 -> 13
  OnlineTester.OTFrage: 0 -> 95
  OnlineTester.OTRegisterWidget: 0 -> 27
  OnlineTester.OTTestWidget: 0 -> 27
  OnlineTester.OTAufgabeWidget: 0 -> 351     = 13 * 27
  OnlineTester.OTFrageWidget: 0 -> 2565      = 95 * 27 (I guess)


Swazoo and Socket objects don't seem to be out of range, either,
considering the Process count.
  Process: 5 -> 258
  Swazoo.ContentTypeField: 0 -> 246
  Swazoo.GenericHeaderField: 0 -> 2704
  Swazoo.HTTPAcceptField: 0 -> 246
  Swazoo.HTTPConnection: 0 -> 246
  Swazoo.HTTPConnectionField: 0 -> 492
  Swazoo.HTTPContentLengthField: 0 -> 246
  Swazoo.HTTPDateField: 0 -> 246
  Swazoo.HTTPGet: 0 -> 246
  Swazoo.HTTPHeaders: 0 -> 492
  Swazoo.HTTPHostField: 0 -> 246
  Swazoo.HTTPRefererField: 0 -> 201
  Swazoo.HTTPRequestLine: 0 -> 246
  Swazoo.HTTPResponse: 0 -> 246
  Swazoo.HTTPServerField: 0 -> 246
  Swazoo.HTTPSetCookieField: 0 -> 206
  Swazoo.HTTPUserAgentField: 0 -> 246
  Swazoo.SwazooBuffer: 0 -> 246
  Swazoo.SwazooSocket: 0 -> 248
  Swazoo.SwazooStream: 0 -> 246
  Swazoo.SwazooTask: 0 -> 246
  Swazoo.SwazooURI: 0 -> 447
  Swazoo.URIResolution: 0 -> 246
  Sockets.IPAddress: 0 -> 500
  Sockets.ReadBuffer: 0 -> 247
  Sockets.ServerSocket: 0 -> 3
  Sockets.StreamSocket: 0 -> 246
  Sockets.TCPSocketImpl: 0 -> 251
  SpIPAddress: 0 -> 2
  SpSocket: 0 -> 248
  SpTimestamp: 0 -> 492


Iliad object counts seem to be a bit high, except for Session,
and they stay that way, even after a garbage collection.
  Iliad.Session: 0 -> 28
  Iliad.Action: 0 -> 16618
  Iliad.BreakElement: 0 -> 16613
  Iliad.Decorator: 0 -> 2999
  Iliad.DivElement: 0 -> 15498
  Iliad.FormElement: 0 -> 5166
  Iliad.Id: 0 -> 3054
  Iliad.InputElement: 0 -> 5166
  Iliad.LabelElement: 0 -> 16613
  Iliad.RadioButtonElement: 0 -> 16613
  Iliad.RawHtmlElement: 0 -> 26945
  Iliad.SpanElement: 0 -> 5166

I guess the following counts are closely related to
the Iliad objects:
  BlockClosure: 1254 -> 19541
  BlockContext: 304 -> 23255
  String: 14635 -> 95456



The FileStream count shows that my apache proxy did not work
as expected. Most of those are pointing to static files I had
tried to keep out of swazoo's hair.
  FileStream: 3 -> 33


Some things have gone wrong during usage:
  Kernel.CoreException: 7 -> 19

Let's look at them:
  Kernel.CoreException allInstancesDo: [ :e |
    e signalClass printNl
  ]
    Error
    Exception
    Halt
    Iliad.CurrentContext
    Iliad.DispatchError
    Iliad.ResponseNotification
    Kernel.CoreException
    MessageNotUnderstood
    Notification
    Signal
    SpAbstractError
    SpError
    Swazoo.HTTPException
    Swazoo.SwazooStreamNoDataError
    SystemExceptions.FileError
    SystemExceptions.PrimitiveFailed
    SystemExceptions.ProcessBeingTerminated
    SystemExceptions.UnhandledException
    SystemExceptions.UserInterrupt
    SystemExceptions.VMError



The most shocking increases are found in the collection
hierarchy:

  IdentityDictionary: 17 -> 48
    let's count the sizes
      Bag(0:1 1:30 251:1 3:2 2:1 6:1 14:1 13:1 17:1 23:1 22:1 )
        What do I need 30 IdentityDictionaries with 1 item for?
          OnlineTester.OTApplication->an OTApplication
        ah ... must come from iliad sessions

  WeakArray: 3 -> 993
    hmm... they're gone now ... weaklings, afraid of being
    examined.

  HomedAssociation: 3 -> 548
    note to self: look up what that is.
    map temporary stuff to nil


  Association: 524 -> 178658
  Dictionary: 10 -> 108373
    the former are stored in the latter and are almost
    completely iliad stuff

  OrderedCollection: 60 -> 109188
    let's count sizes again
      Bag(0:70960 32:3 1:27178 97:1 3:5211 2:160 64:1
        7:85 4:57 6:57 5:3 288:1 135:1 14:1117 8:199
        13:30 9:28 11:4052 10:31 43:2 12:3 576:1 28:1
        17:1 92:1 18:1 3246:1 20:1 2304:1 1152:1 )
      ooh ... 70k empty OC's
      the 27k single element OC's have either a RawHtmlElement
      or a cookie
      the 5k 3-element OC's are either the mime-type collection
        ('application/json' ' text/javascript' ' */*' )
      or
        (a DivElement a FormElement a DivElement )
      NB the singel OC with 3246 elements contains either
      classes or (quite often) nil ... what's that?



_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk
Reply | Threaded
Open this post in threaded view
|

Re: the test test - a more detailed analysis of image contents

Paolo Bonzini-2

> ot.im is the preloaded image, my OnlineTester package, depending
> on Iliad. The other images were saved via
>   gst-remote --eval="ObjectMemory snapshot"
> after finishing the test session.

Can you place them somewhere I can download them?  They are interesting.

>   [ ObjectDumper dump: usage to: s ] ensure: [ s close ]

Nice :-)

> Swazoo and Socket objects don't seem to be out of range, either,
> considering the Process count.
>   Process: 5 -> 258

A bit too many?  Maybe something is holding onto them.

> Iliad object counts seem to be a bit high, except for Session,
> and they stay that way, even after a garbage collection.

Then we have to see what's the root for them.  Maybe some Process?

>     Iliad.CurrentContext
>     Iliad.ResponseNotification
>     Notification
>     SystemExceptions.ProcessBeingTerminated

These are notifications and are not a problem.

>   HomedAssociation: 3 -> 548
>     note to self: look up what that is.
>     map temporary stuff to nil

Related to weak collections and finalizations.  Probably sockets.

>   Association: 524 -> 178658
>   Dictionary: 10 -> 108373
>     the former are stored in the latter and are almost
>     completely iliad stuff

100000 dictionaries are a bit too many.  And anyway changing them to
LookupTables would save 178658 * 12 bytes for the associations.

>   OrderedCollection: 60 -> 109188
>     let's count sizes again
>       Bag(0:70960 32:3 1:27178 97:1 3:5211 2:160 64:1
>         7:85 4:57 6:57 5:3 288:1 135:1 14:1117 8:199
>         13:30 9:28 11:4052 10:31 43:2 12:3 576:1 28:1
>         17:1 92:1 18:1 3246:1 20:1 2304:1 1152:1 )
>       ooh ... 70k empty OC's

70000 * approx 40 bytes here to save too if you can use lazy
initialization...

>       the 5k 3-element OC's are either the mime-type collection
>         ('application/json' ' text/javascript' ' */*' )

Probably some result of parsing Accept fields?  For many
OrderedCollections, adding asArray strategically would save memory at
the cost of creating more garbage.

Summary: there's definitely something to optimize in code that creates
too many collections.

Paolo


_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk
Reply | Threaded
Open this post in threaded view
|

Re: the test test - a more detailed analysis of image contents

Paolo Bonzini-2

> Yes, definitely. I'm a bit afraid by all these actions, because I know
> that many of them must be obsolete. I think it's time to introduce a new
> class, something like ActionRegistry, to manage actions a bit more
> finely, per widget probably.

That would be nice.  Still, congratulations (to both Nico and Stefan,
for different reasons) for the usage of Iliad in production.  Iliad
showed it was way more than a proof of concept, now it's time to
optimize it!

(Still, I'm curious about all those processes.  I'll look at them soon).

Paolo


_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk
Reply | Threaded
Open this post in threaded view
|

Re: the test test - a more detailed analysis of image contents

Paolo Bonzini-3
In reply to this post by Stefan Schmiedl
A couple more data points to give Nicolas some work.

Let's get a random OC:

st> a := (OrderedCollection allInstances at: 14000)
OrderedCollection ()

and let's see where it comes from.  Some objects fail to print because
they point to files on Stefan's machine, so..

st> (b := a allOwners) collect: [ :e | e class ]
(Iliad.RawHtmlElement MethodContext CallinProcess Array Array Array
Array Array Array MethodContext MethodContext Association MethodContext
CallinProcess BlockClosure MethodContext MethodContext MethodContext )

st> b first inspect
An instance of Iliad.RawHtmlElement
   children: OrderedCollection ()
   attributes: Dictionary (
)
   contents: 'falsch'
a RawHtmlElement

Doh.  Here go ~400 bytes:

st> b first attributes basicSize
32
st> b first children basicSize
16

13 megabytes:

st> Iliad.RawHtmlElement allInstances count: [ :e | e children size = 0
& e attributes size = 0 ]
26945

plus 6 more for Iliad.BreakElement:

st> Iliad.Element allSubclasses asArray collect: [ :e | e->(e
allInstances count: [ :e | e children size = 0 & e attributes size = 0 ]) ]
.... Iliad.BreakElement->16613 ....

Regarding children, there are 5166 InputElements and 16613
RadioButtonElement that would save 150 bytes each.

and so on.

I'll look at Accept later.

Paolo


_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk
Reply | Threaded
Open this post in threaded view
|

Re: the test test - a more detailed analysis of image contents

Stefan Schmiedl
On Mon, 20 Jul 2009 21:34:32 +0200
Paolo Bonzini <[hidden email]> wrote:

> st> (b := a allOwners) collect: [ :e | e class ]

allOwners ... good thing I didn't start writing a reference finder :-)

> st> b first attributes basicSize
> 32

is basicSize the RAM allocated for "the object itself", i.e. mostly the
pointers referencing other objects?

those are two very valuable tools ... thanks.
 

> st> b first children basicSize
> 16
>
> 13 megabytes:
>
> st> Iliad.RawHtmlElement allInstances count: [ :e | e children size = 0
> & e attributes size = 0 ]
> 26945
>
> plus 6 more for Iliad.BreakElement:

heh ... 6MB for <br/>? that's about all the breaks I need in a lifetime :-)

> I'll look at Accept later.

This is great. If Nico's reorganization action works out and if we can
further reduce the amount of RAM per element, Iliad will run really well
even with scarce resources.

I'll be tied up with other work until Wednesday, but hope to look at this
again then.

Thanks for all your effort,
s.


_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk
Reply | Threaded
Open this post in threaded view
|

Re: the test test - a more detailed analysis of image contents

Paolo Bonzini-3
In reply to this post by Stefan Schmiedl
Oops, hit send too early.

Let's see where they come from.

st> (c := b first allOwners) collect: [ :e | e class ]
(OrderedCollection Array Array Array Array Array Array Array Array Array
Array Array Array Array Array Array Array Array Array Array Array Array
Array Array Array Array Array Array Array Array Array Array Array Array
Array Array Array Array Array )
st> (d := c first allOwners) collect: [ :e | e class ]
(Iliad.LabelElement Array Array Array Array Array Array Array )
st> (e := d first allOwners first allOwners) collect: [ :e | e class ]
(Iliad.FormElement Array Array Array Array Array Array Array )
st> (f := e first allOwners first allOwners) collect: [ :e | e class ]
(Iliad.DivElement Array Array Array Array Array Array Array )
st> (g := f first allOwners first allOwners) collect: [ :e | e class ]
(BlockContext BlockContext BlockContext BlockContext Array Array )

st> g first method
[] in OnlineTester.OTFrageWidget>>contents
st> g second method
[] in OnlineTester.OTFrageWidget>>contents
st> g third method
[] in OnlineTester.OTFrageWidget>>contents
st> (h := g first allOwners) collect: [ :e | e class ]
(BlockClosure Array )
st> (i := h first allOwners) collect: [ :e | e class ]
(Iliad.Action Array Array )
st> (j := i first  allOwners) collect: [ :e | e class ]
(MethodContext CallinProcess MethodContext MethodContext MethodContext
Association BlockClosure BlockClosure BlockClosure BlockClosure
BlockClosure BlockClosure MethodContext CallinProcess MethodContext Array )

Uh-oh.  Quite a few owners, in particular:

st> (j at: 6)
'100058'->an Action
st> (k := (j at: 6) allOwners) collect: [ :e | e class ]
...
st> (k at: 8) class
Dictionary
st> (k at: 8) size
681

681 actions means that using a Dictionary for them was not bad (just a
LookupTable would save memory).  Sorry Nicolas for your ActionRegistry. :-)

At least, the number of widgets is what you'd expect:

st> actions := (k at: 8) asArray
st> widgets := (IdentitySet withAll: (actions collect: [:e |e block
receiver]) asArray
st> widgets size
95
st> widgets first frage aufgabe test aufgaben inject: 0 into: [ :o :a |
o + a fragen size ]
95

... and tells us that the dictionary covered the Actions in a single
session as expected (there are 95*27 OTFrageWidgets, 95 per pupil).  The
number of actions overall is also correct, so there is no leak here: it
should be roughly 2*number of answers per quiz (the AJAX request
rebuilds the tree and creates new actions), and we get

st> widgets first frage aufgabe test aufgaben inject: 0 into: [ :o :a |
o + (a fragen size * a antworten size * 2) ]
610

Probably sometimes (s)he tried different answers: :-)

st> (widgets collect: [:w| actions count: [:e | e block receiver == w]])
asBag
Bag(6:61 4:1 12:7 8:14 16:1 9:11 )

Again, this is for one pupil only.  Indeed we have overall over 16000
actions (as pointed out by Stefan in his first analysis), and these had
better stay in the Dictionary.

Instead for the

st> ((Iliad.Element allSubclasses collect: [:e|e allInstances size])
st>    fold: [:a :b|a+b])
64222

elements given the back-of-the-envelope analysis of OrderedCollection
and Dictionary usage, we are spending at least ~32 megabytes out of 90
(probably more like 40-50).

        >> IDEAS: I would probably replace the children
        OrderedCollection with a tail+next circular linked list
        representation.  At the beginning tail := nil, if it is not nil
        the head is accessible with tail next.  Inserting at the tail
        is easy.  All can be encapsulated into #childrenDo: of course,
        and it saves a slot compared to head+tail+next.  Consider that
        the OrderedCollection also costs two slots (one slot for the OC
        instance variable and one slot in the OC itself) so you'll save
        memory for the empty OC slots (probably like 100 bytes per
        element, totalling ~6 MB).

        For the attributes dictionary, reusing the RBSmallDictionary
        code should work well, or maybe lazy initalization would be
        okay coupled with moving a few attributes out of the dictionary:

        st> ((Dictionary allInstances select: [ :e | e size = 1 ])
                collect: [ :e | e keys ]) asBag
        Bag(Set ('class' ):15498 Set (#created ):1 Set ('for' ):16613
        Set (#selectorsMap ):1 )

        st> ((Dictionary allInstances select: [ :e | e size = 2 ])
        collect: [ :e | e keys ]) asBag
        Bag(Set (#expirySeconds #charset ):1 Set (#modified #created
         ):27 Set ('class' 'id' ):5166 )

        st> ((Dictionary allInstances select: [ :e | e size = 3 ])
        collect: [ :e | e keys ]) asBag
        Bag(Set ('value' 'name' 'type' ):5166 )

        st> ((Dictionary allInstances select: [ :e | e size = 4 ])
        collect: [ :e | e keys ]) asBag
        Bag(Set ('id' #value 'name' 'type' ):11447 )

        st> ((Dictionary allInstances select: [ :e | e size = 5 ])
        collect: [ :e | e keys ]) asBag
        Bag(... Set ('accept-charset' 'class' 'method' 'action'
        'onsubmit' ):5166 )

        Again, the break-even point is very low.   Even storing
        5 attributes in all HTML elements including <br> would be
        okay, since a Dictionary costs 4 words for flags+class+
        size+object pointer, and 1 word for the tally.  But of course
        value/name/type would be stored only in an InputElement etc.,
        only id+class+lazily initialized children would be in the
        superclass.  At this point, even using a special SmallDictionary
        class might be premature optimization...

        Since forms will be common in AJAX applications, saving
        memory there is nice.  Maybe accept-charset could be left into
        attributes, but without the #ifAbsentPut: here:

     beforePrintHtml [
        <category: 'printing'>
        self attributeAt: 'action' ifAbsentPut: [self url asString].
        self attributeAt: 'method' ifAbsent: [self usePost].
        self attributeAt: 'accept-charset'
                ifAbsentPut: [self session charset].
        ...
     ]

        since 99.99% it will be sane to use the default taken from the
        session.  Probably there are more such cases where #ifAbsentPut:
        is used.

Strings are many (mostly from Action ids and from Swazoo's HTTP header
parser), but they consume little memory:

st> String allInstances inject: 0 into: [ :o :i | o + (i size + 7
bitAnd: -8) + 32 ]
347616

A final note regarding Swazoo,

st> a := OrderedCollection allInstances select: [:e | e size = 3 and: [
e first isString ]]

Shows some OrderedCollection ('application/json' ' text/javascript' '
*/*') indeed, but only 43.

Paolo


_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk
Reply | Threaded
Open this post in threaded view
|

Re: the test test - a more detailed analysis of image contents

Paolo Bonzini-3
In reply to this post by Stefan Schmiedl
On 07/20/2009 09:59 PM, Stefan Schmiedl wrote:

> On Mon, 20 Jul 2009 21:34:32 +0200
> Paolo Bonzini<[hidden email]>  wrote:
>
>> st>  (b := a allOwners) collect: [ :e | e class ]
>
> allOwners ... good thing I didn't start writing a reference finder :-)
>
>> st>  b first attributes basicSize
>> 32
>
> is basicSize the RAM allocated for "the object itself", i.e. mostly the
> pointers referencing other objects?

No, it's the number of indexed instance variables ("slots" in the
OrderedCollection or Dictionary, elements in a non-growable collection
like Array or String).

Paolo


_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk
Reply | Threaded
Open this post in threaded view
|

Re: the test test - a more detailed analysis of image contents

Paolo Bonzini-3
In reply to this post by Stefan Schmiedl
On 07/20/2009 11:04 PM, nicolas petton wrote:

> Hi,
>
> I just committed ActionRegistry.
>
> It's still young code, but so far it works nicely.
> An ActionRegistry is now responsible for holding all actions related to
> its owner. This way widgets release all their obsolete actions each time
> they are rebuilt. For each normal request, all registries are also
> cleaned up.
>
> This way there shouldn't be obsolete actions which blocks keep
> references to Iliad elements. I ran some tests with the blog and counter
> examples, and the difference is impressive.

I would still use a Dictionary there (the online tester has ~300 actions
active at the same time, IIUC, out of ~700 created through the entire
navigation).

Nice though!

Paolo


_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk
Reply | Threaded
Open this post in threaded view
|

Re: the test test - a more detailed analysis of image contents

Paolo Bonzini-3

> Ok, but would you use a dictionary in the ActionRegistry, or would you
> drop ActionRegistry too?
> Right now actions are stored in OrderedCollection in registries.

The former (a LookupTable actually).

Paolo


_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk
Reply | Threaded
Open this post in threaded view
|

Re: the test test - a more detailed analysis of image contents

Paolo Bonzini-2
In reply to this post by Paolo Bonzini-3

>            >> IDEAS: I would probably replace the children
>            OrderedCollection with a tail+next circular linked list
>            representation.  At the beginning tail := nil, if it is not nil
>            the head is accessible with tail next.  Inserting at the tail
>            is easy.  All can be encapsulated into #childrenDo: of course,
>            and it saves a slot compared to head+tail+next.  Consider that
>            the OrderedCollection also costs two slots (one slot for the OC
>            instance variable and one slot in the OC itself) so you'll save
>            memory for the empty OC slots (probably like 100 bytes per
>            element, totalling ~6 MB).
>
>            For the attributes dictionary, reusing the RBSmallDictionary
>
> What is the difference between LookupTable and RBSmallDIctionary? What
> would be the more appropriate, and in which cases?

RBSmallDictionary gives up hashing, and just does a linear search over
an OrderedCollection.

LookupTable is the same as a Dictionary, but it doesn't use Associations
and this makes it use less memory.

Paolo


_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk
Reply | Threaded
Open this post in threaded view
|

Re: the test test - a more detailed analysis of image contents

Paolo Bonzini-2
nicolas petton wrote:
> Using a LookupTable in ActionRegistry seems to slow down Iliad (a few ms
> per request). More importantly, gst does a *lot* of garbage collections
> (every 5 requests or so). This is very annoying, because it really slows
> down the response time.
>
> Any hint?

Can you send me the patch to ActionRegistry and the LookupTable version
of the file, so I can try to track where the allocations come from?

Also, does the same happen with Dictionary?

Paolo


_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk
Reply | Threaded
Open this post in threaded view
|

Re: the test test - a more detailed analysis of image contents

Paolo Bonzini-2
Paolo Bonzini wrote:

> nicolas petton wrote:
>> Using a LookupTable in ActionRegistry seems to slow down Iliad (a few ms
>> per request). More importantly, gst does a *lot* of garbage collections
>> (every 5 requests or so). This is very annoying, because it really slows
>> down the response time.
>>
>> Any hint?
>
> Can you send me the patch to ActionRegistry and the LookupTable version
> of the file, so I can try to track where the allocations come from?

Nevermind, it's in the svn repository.  This however still holds:

> Also, does the same happen with Dictionary?

Paolo


_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk
Reply | Threaded
Open this post in threaded view
|

Re: the test test - a more detailed analysis of image contents

Paolo Bonzini-2
In reply to this post by Paolo Bonzini-2
nicolas petton wrote:
> Using a LookupTable in ActionRegistry seems to slow down Iliad (a few ms
> per request). More importantly, gst does a *lot* of garbage collections
> (every 5 requests or so). This is very annoying, because it really slows
> down the response time.

I can only guess that this:

2003-06-25  Paolo Bonzini  <[hidden email]>

        * kernel/LookupTable.st: Remove values instance variable,
        store key/value pairs in adjacent slots.

was a bad idea.

But, the garbage collection still puzzles me.

Can you try adding a _gst_show_backtrace(stdout) in _gst_global_gc and
see what is printed?

Alternatively, I see you also have OrderedCollection in Session for the
per-widget Action registries.  Can you test the performance of
Dictionary vs. LookupTable there, too?

Paolo


_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk
Reply | Threaded
Open this post in threaded view
|

Re: the test test - an experience report

Stefan Schmiedl
In reply to this post by Stefan Schmiedl
Hi Paolo, Nico.

Another piece of the puzzle?

On Thu, 16 Jul 2009 15:33:16 +0200
Stefan Schmiedl <[hidden email]> wrote:

>
> The students of class 6a logged into their Windows domain accounts,
> started Firefox and entered the URL for the test (stage 1 above).
> Then they entered their names into the registration page (stage 2)
> and clicked on the button to access the test. Shortly after server
> CPU load went to 100% with the following error message being repeated
> as fast as the remote terminal could cope with:
>
>   "Socket accept error: Error while trying to accept a socket connection"
>
> Client side a one-liner 500 error message was reported.
>
> Time for pkill gst-remote ... I rebuilt the image and started the
> server again. This time we staged the 25 "almost simultaneous" login
> attempts into four batches of 6 each and things worked fine from that point on.
>
> After finishing the test, the students logged off and the next class, 6b ...
> had the exact same experience ... and 6c and 6d, too.
>
> For the final group I tried a different approach:
> They logged on, opened the URL, and sat on their hands.
> I killed gst-remote, rebuilt the image, restarted gst-remote and told them
> to reload the page. They then entered their names and started clicking on
> the answers and the Socket error of Doom appeared again. Kill, rebuild,
> restart. Everybody loads the registration page (not staged, just 25 students
> clicking when they're ready), enters their name and works on the test as it
> should be. No hiccup.

While I have not yet managed to reproduce the error message through
a ruby mechanize script, I have noticed something suspicious:

Start the server, check sockets on the server
  server # netstat -n | grep 4080
  server #

Run a mechanize script performing a few requests on the client.
The script fetches the first page and the referenced css and js files.
  client $ ruby mech.rb 1
  client $

Look at sockets on client
  client $ netstat -n | grep 4080
  tcp        0      0 192.168.1.5:37021       88.198.5.34:4080        FIN_WAIT2  

Look at sockets on server
  server # netstat -n | grep 4080
  tcp        0      0 88.198.5.34:4080        93.223.36.238:37021     CLOSE_WAIT

Wait about 10 min .... (typing this text)

Look at sockets on client
  client $ netstat -n | grep 4080
  client $

Look at sockets on server
  server # netstat -n | grep 4080
  tcp        0      0 88.198.5.34:4080        93.223.36.238:37021     CLOSE_WAIT

Run mechanize script again:
  client $ ruby mech.rb 1
  client $

Sockets on client:
  client $ netstat -n | grep 4080
  tcp        0      0 192.168.1.5:57747       88.198.5.34:4080        FIN_WAIT2  

Sockets on server:
server # netstat -n | grep 4080
tcp        0      0 88.198.5.34:4080        93.223.36.238:37021     CLOSE_WAIT
tcp        0      0 88.198.5.34:4080        93.223.36.238:57747     CLOSE_WAIT

soooo.... the problem described above has nothing to do with timing issues,
but instead resource exhaustion due to _too many_ open sockets in CLOSE_WAIT
state.

Note also that the problem is heavily exacerbated when the app is accessed
through an apache proxy as was done in the test session. In this scenario,
running the same requests as above
  client $ ruby mech.rb 1

results in the following server-side mess:
  server # netstat -n | grep 4080
  tcp        0      0 127.0.0.1:4080          127.0.0.1:57163         CLOSE_WAIT
  tcp        0      0 127.0.0.1:4080          127.0.0.1:57157         CLOSE_WAIT
  tcp        0      0 127.0.0.1:57155         127.0.0.1:4080          FIN_WAIT2  
  tcp        0      0 127.0.0.1:4080          127.0.0.1:57156         CLOSE_WAIT
  tcp        0      0 127.0.0.1:57163         127.0.0.1:4080          FIN_WAIT2  
  tcp        0      0 127.0.0.1:57157         127.0.0.1:4080          FIN_WAIT2  
  tcp        0      0 127.0.0.1:4080          127.0.0.1:57161         CLOSE_WAIT
  tcp        0      0 127.0.0.1:57153         127.0.0.1:4080          FIN_WAIT2  
  tcp        0      0 127.0.0.1:57161         127.0.0.1:4080          FIN_WAIT2  
  tcp        0      0 127.0.0.1:4080          127.0.0.1:57155         CLOSE_WAIT
  tcp        0      0 127.0.0.1:57162         127.0.0.1:4080          TIME_WAIT  
  tcp        0      0 127.0.0.1:4080          127.0.0.1:57159         CLOSE_WAIT
  tcp        0      0 127.0.0.1:57154         127.0.0.1:4080          FIN_WAIT2  
  tcp        0      0 127.0.0.1:4080          127.0.0.1:57158         CLOSE_WAIT
  tcp        0      0 127.0.0.1:57156         127.0.0.1:4080          FIN_WAIT2  
  tcp        0      0 127.0.0.1:57160         127.0.0.1:4080          FIN_WAIT2  
  tcp        0      0 127.0.0.1:57158         127.0.0.1:4080          FIN_WAIT2  
  tcp        0      0 127.0.0.1:4080          127.0.0.1:57160         CLOSE_WAIT
  tcp        0      0 127.0.0.1:57159         127.0.0.1:4080          FIN_WAIT2  
  tcp        0      0 127.0.0.1:4080          127.0.0.1:57153         CLOSE_WAIT
  tcp        0      0 127.0.0.1:4080          127.0.0.1:57154         CLOSE_WAIT

The FIN_WAIT2 sockets will disappear after a while, the *10* CLOSE_WAIT sockets won't.
And since they are already closed, they won't be reused either, AFAICT.

Now look at what google found for me:

http://www.sunmanagers.org/pipermail/summaries/2006-January/007068.html

I think, one of swazoo/sport/socket needs a behavioral readjustment.

s.


_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk
Reply | Threaded
Open this post in threaded view
|

Re: the test test - an experience report

Paolo Bonzini-3

> While I have not yet managed to reproduce the error message through
> a ruby mechanize script, I have noticed something suspicious:

Very suspicious.

Go slowly = time to do GCs = more finalization = closed sockets.

Paolo


_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk
Reply | Threaded
Open this post in threaded view
|

Re: the test test - an experience report

Paolo Bonzini-3
In reply to this post by Stefan Schmiedl
I can reproduce this with "strace -e accept,socket,close" on your image.
  Lots of accepts, very few closes...

Paolo


_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk
Reply | Threaded
Open this post in threaded view
|

Re: the test test - an experience report

Stefan Schmiedl
In reply to this post by Stefan Schmiedl
On Thu, 23 Jul 2009 00:48:10 +0200
Paolo Bonzini <[hidden email]> wrote:

> Please send the script so I can try to reproduce this on your image. :-)

attached
s.

_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk

mech.rb (1K) Download Attachment
12