I was looking at some old slides of Joe Armstrong on Concurrency-oriented programming. He set the following challenge:
Put N processes in a ring:
Send a simple message round the ring M times.
Increase N until the system crashes.
How long did it take to start the ring?
How long did it take to send a message?
When did it crash?

He gave graphs comparing Erlang, Java and C#. I decided to compare Pharo. Here is what I got; the creation times are PER PROCESS and the messaging times are PER MESSAGE.

first run

procs   creation/µs   msg/µs
200     0             7.0
500     0             9.7
1000    2             15.4
2000    1             21.6
5000    13            31.5
10000   19.9          40.7
20000   46.5          55.4
50000   130.9         98.0

second run

procs   creation/µs   msg/µs
200     0.0           7.0
500     0.0           10.12
1000    0.0           16.53
2000    1.5           24.26
5000    12.8          32.15
10000   28.1          39.497
20000   58.15         52.0295
50000   75.1          95.581

third run

procs   creation/µs   msg/µs
200     0.0           7.0
500     0.0           8.6
1000    2.0           11.0
2000    1.0           16.55
5000    10.2          21.76
10000   12.0          49.57
20000   52.35         65.035
50000   91.76         117.1

Each process is a Pharo object (an instance of ERringElement) that contains a counter, a reference to the next ERringElement, and an "ErlangProcess": a Process that holds a reference to an instance of SharedQueue (its "mailbox").

The good news is that up to 50k processes, it didn't crash. But it did run with increasing sloth.

I can imagine that the increasing process-creation time is due to beating on the memory manager. But why does the message-sending time increase with the number of processes? (Recall that exactly one process is runnable at any given time.) I'm wondering if the scheduler is somehow getting overwhelmed by all of the non-runnable processes that are blocked on Semaphores in SharedQueue. Any ideas?

(My code is on SqueakSource in project Erlang. But be warned that there is a simulation of the Erlang "universal server" in there too. To run this code, look for class ErlangRingTest.)
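For those who don't want to fetch the SqueakSource package, here is a minimal sketch of the shape of the benchmark, written with plain blocks rather than my actual ERringElement / ErlangProcess classes; all names and details below are illustrative, not the real code:

```smalltalk
"Illustrative sketch only. Each ring slot is a SharedQueue mailbox plus a
forked Process that blocks on it; a token counts down hops as it circulates.
The forked processes stay alive (blocked on their mailboxes) afterwards,
just as in the real benchmark."
| n m mailboxes done |
n := 1000.                               "number of processes in the ring"
m := 10.                                 "laps round the ring"
mailboxes := (1 to: n) collect: [:i | SharedQueue new].
done := Semaphore new.
1 to: n do: [:i |
    | myBox nextBox |
    myBox := mailboxes at: i.
    nextBox := mailboxes at: (i \\ n + 1).   "wrap around at the end"
    [[ | hops |
       hops := myBox next.                   "block until the token arrives"
       hops > 0
           ifTrue: [nextBox nextPut: hops - 1]
           ifFalse: [done signal]] repeat] fork].
(mailboxes at: 1) nextPut: n * m.            "inject the token"
done wait.
```

Timing the creation loop separately from injecting the token and waiting on `done` gives the per-process and per-message figures above. Evaluate it in a playground; wrap the whole thing in a fork if you don't want to block the UI process while it runs.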
Hi Andrew,
On Wed, Aug 17, 2011 at 1:50 PM, Andrew P. Black <[hidden email]> wrote:
> I was looking at some old slides of Joe Armstrong on Concurrency-oriented programming. He set the following challenge:

If you're using Cog then one reason performance falls off with the number of processes is context-to-stack mapping; see "08 Under Cover Contexts and the Big Frame-Up". Once there are more processes than stack pages, every process switch faults out at least one frame to a heap context, and faults in a heap context to a frame. You can experiment by changing the number of stack pages (see vmAttributeAt:put:), but you can't have thousands of stack pages; they use too much C stack memory. I think the default is 64 pages, and Teleplace uses ~112. Each stack page can hold up to approximately 50 activations.
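For the record, the stack-page counts are exposed through the generic VM-parameter primitive. The indices below (42 for the current number of stack pages, 43 for the desired number) are my recollection from the Cog source, so treat them as assumptions and verify them against your VM before relying on them:

```smalltalk
"Assumed parameter indices (from Cog's primitiveVMParameter comment):
 42 = number of stack pages available, 43 = desired number of stack pages.
 Verify against your VM before use."
SmalltalkImage current vmParameterAt: 42.           "how many stack pages this session has"
SmalltalkImage current vmParameterAt: 43 put: 160.  "request more pages; takes effect on the next startup"
```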
But to be sure what the cause of the slowdown is, one could use my VMProfiler. Has anyone ported this to Pharo yet?
best,
Eliot
> But to be sure what the cause of the slowdown is one could use my VMProfiler. Has anyone ported this to Pharo yet?

Not that I know. Where is the code? :)
In reply to this post by Eliot Miranda-2
On 18 August 2011 00:02, Eliot Miranda <[hidden email]> wrote:
> Hi Andrew,
>
> On Wed, Aug 17, 2011 at 1:50 PM, Andrew P. Black <[hidden email]> wrote:
>> [original challenge and benchmark tables snipped]
>
> If you're using Cog then one reason performance falls off with number of processes is context-to-stack mapping, see 08 Under Cover Contexts and the Big Frame-Up. Once there are more processes than stack pages every process switch faults out at least one frame to a heap context and faults in a heap context to a frame.
>
> But to be sure what the cause of the slowdown is one could use my VMProfiler. Has anyone ported this to Pharo yet?

Hmm, to me that doesn't explain why messages/second degrades linearly with the number of processes. After hitting a certain limit (all stack pages full), the messages/time ratio should not degrade any further. Given your explanation, I would expect something like the following:

10000 - 49.57
... somewhere here we hit the stack-pages limit ...
20000 - 65.035
30000 - 65.035
40000 - 65.035
50000 - 65.035

--
Best regards,
Igor Stasenko AKA sig.
In reply to this post by Prof. Andrew P. Black
Hi Andrew:
On 17 Aug 2011, at 22:50, Andrew P. Black wrote:
> I can imagine that the increasing process-creation time is due to beating on the memory manager. But why the increasing message-sending time as the number of processes increases? (Recall that exactly one process is runnable at any given time). I'm wondering if the scheduler is somehow getting overwhelmed by all of the non-runnable processes that are blocked on Semaphores in SharedQueue. Any ideas?

This is vague, from memory, and I might be confusing things (we have tried too many different things here ;)): shouldn't the standard interpreter VM remove the process from the scheduler list when it is waiting on a semaphore? I think that when there is only a single runnable process at a given priority, the list contains only that one.

Best regards
Stefan

--
Stefan Marr
Software Languages Lab
Vrije Universiteit Brussel
Pleinlaan 2 / B-1050 Brussels / Belgium
http://soft.vub.ac.be/~smarr
Phone: +32 2 629 2974
Fax: +32 2 629 3525
In reply to this post by Igor Stasenko
On 18 August 2011 00:20, Igor Stasenko <[hidden email]> wrote:
> Hmm, to me it doesn't explain why messages/second degrades linearly with the number of processes. After hitting a certain limit (all stack pages full), the messages/time ratio should not degrade any further.

Yes... except for one thing: since contexts are allocated on the heap, more processes mean more work for the GC, and that would explain the linear degradation.

Andrew, can you play with the GC parameters, e.g. increase the number of allocations between incremental GCs?

--
Best regards,
Igor Stasenko AKA sig.
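A hedged sketch of what I mean, via the generic VM-parameter interface. Parameter 5 ("allocations between incremental GCs") and 6 ("tenuring threshold") are my recollection of the pre-Spur interpreter/Cog numbering, so double-check them in the "vm parameters" method category of SmalltalkImage before trusting them:

```smalltalk
"Assumed indices for the pre-Spur VMs of this era:
 5 = allocations between incremental GCs (read-write),
 6 = tenuring threshold. Verify before relying on them."
SmalltalkImage current vmParameterAt: 5.              "current setting"
SmalltalkImage current vmParameterAt: 5 put: 100000.  "run the incremental GC less often"
```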
In reply to this post by Stefan Marr-4
On 18 August 2011 00:24, Stefan Marr <[hidden email]> wrote:
> Shouldn't the standard interpreter VM remove the process from the scheduler list when it is waiting on a semaphore?
> I think, when there is only a single runnable process for a given priority, then the list only contains that one.

Yes, but as far as I remember, removing from the head of the list and adding to its tail are O(1) operations. They should not depend linearly on the size of the list.

--
Best regards,
Igor Stasenko AKA sig.
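One way to sanity-check that claim empirically (a sketch I have not run against the actual benchmark; timings will of course vary):

```smalltalk
"Park many processes on their own semaphores, then time signal/wait pairs
on an unrelated semaphore. If suspended processes cost the scheduler
nothing, the timing should be unaffected by the number parked."
| parked sem ms |
parked := (1 to: 50000) collect: [:i |
    | s |
    s := Semaphore new.
    [s wait] fork.       "this process suspends immediately"
    s].
sem := Semaphore new.
ms := [1 to: 100000 do: [:i | sem signal. sem wait]] timeToRun.
parked do: [:s | s signal].   "let the parked processes finish"
ms
```

Repeating this with different sizes of `parked` would show directly whether blocked processes slow down semaphore operations.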
In reply to this post by Stefan Marr-4
On Wed, Aug 17, 2011 at 2:24 PM, Stefan Marr <[hidden email]> wrote:
> Hi Andrew:
> Shouldn't the standard interpreter VM remove the process from the scheduler list when it is waiting on a semaphore?

Yes, you're right. And any flavour of VM will do this. All the processes in the ring except the running one are waiting on the semaphore.

> I think, when there is only a single runnable process for a given priority, then the list only contains that one.

best,
Eliot
In reply to this post by Igor Stasenko
On 17 Aug 2011, at 14:24, Igor Stasenko wrote:
> yes.. except one thing: since contexts are allocated on heap, it means more work for GC, then it explains that it degrading linearly.
>
> Andrew, can you play with GC parameters, like increase number of allocations between incremental GCs etc?

How do I do that? Is there a GC monitor that will show me how much time the GC is taking, or how many times the incremental GC is running?

Andrew
On Tue, Aug 23, 2011 at 11:11 AM, Andrew P. Black <[hidden email]> wrote:
> How do I do that?

Check the method category "vm parameters" in SmalltalkImage.

> Is there a GC monitor that will show me how much time the GC is taking, or how many times the incremental GC is running?

Yes, open "System" -> "VM statistics".

--
Mariano
http://marianopeck.wordpress.com
For me,
SmalltalkImage current vmStatisticsReportString

...works best. If you run this twice, the second time it prints the diff compared to the previous run. So if you evaluate the following three lines

SmalltalkImage current vmStatisticsReportString.
self runSomeCode.
SmalltalkImage current vmStatisticsReportString

you'll get the number and duration of full and incremental GC cycles while running your code (of course there may have been other processes active in between, which may affect the results). It's also a good idea to force a full GC beforehand.

Adrian