Blair, Niall
Blair McGlashan wrote:
> If you use Rosetta (http://www.metaprog.com/Rosetta), you can get the code
> over in .pac format now...
> Niall

Cheers

--
Joseph Pelrine [ | ]
MetaProg GmbH
Email: [hidden email]
Web: http://www.metaprog.com

"If you don't live on the edge, you're taking up too much space" - Doug Robinson
In reply to this post by Chris Uppal-3
> So I come back to my point. If your code is running about ~20 times faster on
> Java than Dolphin, then I think much of the difference is down to the
> primitive types.

No, I don't use them much. I think it is the message sends. In this small "benchmark" it is less apparent than in real-world ST apps, where everything you do is a message send.

> alloc = 1.5
> get (index) = 5.7
> get (iterator) = 2.2
> A significant difference, but not *vast*.
> ...
> P.S. for interest: I did compare against a "warmed up" hotspot server.
> ...
> For the other tests, FWIW, it was about double the Hotspot client speed.

So you have a factor of five to ten here. This IS vast, at least to me. Of course this depends on the application, but for the actual project I participate in, we would have chosen C++ if the Java VM were only 2 times slower than it is. We have to compete with C software in the market.

Here is a short C++ test I did (VC6 on a 2GHz P4 with 512MB):

Alloc = 1015
Access (index st style) = 63
Access (index java style) = 47
double mul = 16

Compared to hotspot server/client:

time needed alloc = 1438/984
time needed get (index) = 47/63
double mul = 31/16

Apparently, at least in this micro-benchmark, there is not much difference between C++ and Java anymore (unless one uses C in C++).

Ciao

...Jochen
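The Java side of this micro-benchmark is never shown in the thread; the following is a hedged sketch of its likely shape (class and method names are mine, not Jochen's):

```java
// Hypothetical reconstruction of the alloc / get (index) / double mul
// micro-benchmark shape discussed above; not the poster's original code.
import java.util.ArrayList;

public class MicroBench {
    // Allocation phase: fill a list with n small heap objects.
    static ArrayList<int[]> alloc(int n) {
        ArrayList<int[]> list = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            list.add(new int[4]); // stand-in for a Rectangle's four coordinates
        }
        return list;
    }

    // Indexed get phase: read every element back by index.
    static long indexedSum(ArrayList<int[]> list) {
        long sum = 0;
        for (int i = 0; i < list.size(); i++) {
            sum += list.get(i)[0];
        }
        return sum;
    }

    // Double multiply phase: a tight loop of floating-point multiplies.
    static double doubleMul(int n) {
        double acc = 1.0;
        double x = 1.0000001;
        for (int i = 0; i < n; i++) {
            acc *= x;
        }
        return acc;
    }

    public static void main(String[] args) {
        long t0 = System.currentTimeMillis();
        ArrayList<int[]> list = alloc(1_000_000);
        System.out.println("alloc: " + (System.currentTimeMillis() - t0) + " ms");

        t0 = System.currentTimeMillis();
        long sum = indexedSum(list);
        System.out.println("get (index): " + (System.currentTimeMillis() - t0) + " ms");

        t0 = System.currentTimeMillis();
        doubleMul(1_000_000);
        System.out.println("double mul: " + (System.currentTimeMillis() - t0) + " ms, sum=" + sum);
    }
}
```

Run once warm (HotSpot) before trusting any numbers; as the thread notes, "warmed up" and cold results differ considerably.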
In reply to this post by Eliot Miranda
> Alloc time: 13684 / 8.9 1537.53
> Get (index) time: 686 / 8.9 77.0787
> Iterate (do) time: 531 / 8.9 59.6629
> Double mul time: 2369 / 8.9 266.18
> GC time: 979 / 8.9 110.0
> Overall runtime: 18250 / 8.9 2050.56
>
> but I doubt the memory times would scale anything like as well as the
> Get & Index times...

The iterate times are very nice. The double mul time is probably showing the boxed-float effect Chris has mentioned.

Ciao

...Jochen
In reply to this post by Jochen Riekhof
"Jochen Riekhof" <[hidden email]> wrote in message
news:3e5ba126$[hidden email]...

> [re: Rectangle new]
> Yep, this is all correct. For the VW to work, I also used the origin
> selector instead of the top selector used in Dolphin, because the instance
> vars were all nil. I was too tired to continue yesterday, though :-).

OK, so in that case you really should have rerun equivalent tests on Dolphin to give a fair comparison. You should use #basicNew to create an uninitialized Rectangle on both, and the #origin selector on both.

For the record, when I do this on D5 I get a repeatable timing on the alloc test of ~1240mS for the first run, and ~670mS on the second and subsequent runs. On VW7 "out of the box" I get a wide range of times between 1354mS and 5690mS. However, I think this is probably not a fair result. Had you reconfigured the memory policy on VW at all when you ran the test?

Using #origin, which is just an accessor, unlike #top which does computation, speeds up the two iteration tests by 30% or more.

> > > statement: "Performance is about factor twenty lower than Java HotSpot
> > > VM, ..." On this test at least, that would appear to be FUD, right? :-)
> > [Frankly, though, I think you really need some more "macro" benchmarks,
> > i.e. closer to an actual application, to draw any real performance
> > conclusions]
>
> The factor twenty is (for me) the real number, as it stems from some
> algorithms I ported from ST to Java without further optimizations. The
> first was a "windowizing" algorithm, that basically puts a large amount of
> small rectangles into a number of equally sized much bigger rectangles -
> the number of big rectangles should be minimal.
> This involves a lot of allocations, a lot of reordering and collection
> searches and iterations.
> ...

To me that's a lot more interesting. Do you have figures for VW on that test which might give us some point of comparison?

> ...
> The second was on images, and invoked many byteAtOffset: calls to access
> pixels of bitmaps.
> I got comparable results - factor 20 roughly.

Frankly, I'm not surprised about that.

> Surely there is no larger area of interpretation than on benchmarking, and
> my numbers were not as concise as they could have been. Fortunately Chris
> made up for this :-). Also, both the Java and the ST code can definitely be
> optimized (thereby making it much less maintainable and readable). I do not
> intend to do that, as it is (in Java) fast enough. When having the Designer
> hat on, I do not care about the implementation of a Rectangle class, I just
> use it. If it is by design slower in ST, not my problem...

That is a very fair point, and one I would usually agree with, but in this case you were attempting (I think) to make a micro comparison, so it is pertinent.

> ... This is the price you pay for "everything is an object". I pay the same
> price the opposite way in Java, e.g. when creating tons of syntactical crap
> in the form of wrapper classes around integers to use them as Dictionary
> keys. Where performance is important, the current choice IMO must be a
> dynamic compilation VM. No one I know uses interpreted Java at all. There
> might be a use for it, e.g. when writing scripts that run for only a very
> short time.
>
> I will inform you of further "closer to an actual application" relations
> when I have to prototype something again.

Great. In the end those comparisons are much more interesting, even if more difficult to achieve.

Regards

Blair
> > The second was on images, and invoked many byteAtOffset: calls to access
> > pixels of bitmaps.
> > I got comparable results - factor 20 roughly.
>
> Frankly, I'm not surprised about that.

I would be very interested to know if there is a more efficient way to access bmp data from Dolphin. I would like to keep all code in ST, but I have to modify the bmp pixels as this is what has to be rendered. Everything works great so far, except that the pixel access could be a bit faster.

Also, it reminds me of the awkward code I had to write to check for top-down bitmaps. Top-down bmps have a negative height in BITMAPINFOHEADER, but on reading this structure the height is always positive regardless of bottom-up or top-down. This could well be a Windows bug, though.

> That is a very fair point, and one I would usually agree with, but in this
> case you were attempting (I think) to make a micro comparison, so it is
> pertinent.

No, my intent was to find out more about where the factor 20 might originate. From the small and immature tests I am now guessing that GC and alloc might contribute something, and beyond that, message sends probably gain more and more importance as the code gets more complex.

> Great. In the end those comparisons are much more interesting, even if more
> difficult to achieve.

And also more difficult to interpret, as the port definitely gets "different" in more and more places. But I will report anyway :-)

Ciao

...Jochen
In reply to this post by Jochen Riekhof
Dear Jochen and group!
I've done some testing for fun and here are the results.

                DPRO    Java    Squeak  VA      VSE     VW7NC
alloc           2160    32500   13400   15800   8100    5000
get (index)     610     180     775     140     115     100
get (iterator)  310     320     580     100     110     59
double mul      240     100     535     160     110     215

The code has been slightly altered to match the Java code as much as possible. In other words, we create a rectangle with 0 coordinates and later access one of its variables (that's what Java's getWidth() method does). I've tried to be fair, so here is the code.

VW, DPRO, Squeak, VA:

[newOrigin := 0@0.
newCorner := 0@0.
oc := OrderedCollection new.
1 to: 1000000 do: [:each | oc add: (Rectangle origin: newOrigin corner: newCorner)]].

[1 to: 1000000 do: [:each | (oc at: each) origin]].

[oc do: [:each | each origin]].

VSE:

[leftTopPoint := 0@0.
rightBottomPoint := 0@0.
oc := OrderedCollection new: 1000002.
1 to: 1000000 do: [:each | oc add: (Rectangle leftTop: leftTopPoint rightBottom: rightBottomPoint)]].

[1 to: 1000000 do: [:each | (oc at: each) leftTop]].

[oc do: [:each | each leftTop]].

My machine is an IBM A31 laptop running Win2K on an Intel P4-1800MHz with 512 MB RAM. The Java platform is IBM VisualAge for Java 4.0 (the normal free edition downloaded from the net). It's way too difficult to compile Java files and then invoke java.exe with the correct classpath and a million other parameters to get a small test running. IBM's VA Java resembles Smalltalk a lot - that's why I like it. Does it run as fast as Java HotSpot?

Well, the conclusion is that Java is the slowest when it comes to allocating objects (no surprise here). Dolphin has some strange problem the first time it allocates many objects. If allocating 1 mil. rectangles as in the example above, it goes smoothly (actually the fastest one!). But if I try to allocate 3 mil. rectangles, then I have to wait minutes before it finishes. The subsequent run of the code runs faster. So there must be something rotten in the memory allocation algorithm when it reaches a certain size.

VW does the job well. It's an out-of-the-box installation, so it's not tuned. I guess if a VW guru tweaks the image a little, the numbers will be better. VSE is somewhere in the middle. I was surprised to see VA being the slowest ST when it comes to object allocation. After all, it's very common to allocate a lot of work/temp objects in Smalltalk.

When it comes to iteration, the JIT'ed Smalltalks are faster than Java. Dolphin could probably benefit a lot in performance if they implemented a JIT VM. VW has a very fast #do: for OrderedCollection!

On floating point, Java is fastest. But good old VSE is not much behind Java. VW and Dolphin could probably do a little better.

Well, that's all folks! I would like to see if other people get the same results, especially those who can find out how to start a Java VM other than IBM VAJ.
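The Java counterpart of the Smalltalk benchmark above is not shown in the thread; a plausible sketch (my guess at its shape, using java.awt.Rectangle and the getWidth() access mentioned above) would be:

```java
// Hedged Java counterpart of the Smalltalk benchmark above: build a million
// zero-sized rectangles, then read one field back via getWidth(). This is a
// reconstruction of the test's shape, not the original posted code.
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.List;

public class RectBench {
    // Like: oc add: (Rectangle origin: newOrigin corner: newCorner)
    static List<Rectangle> allocate(int n) {
        List<Rectangle> rects = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            rects.add(new Rectangle(0, 0, 0, 0));
        }
        return rects;
    }

    // Like: 1 to: 1000000 do: [:each | (oc at: each) origin]
    static double sumWidthsByIndex(List<Rectangle> rects) {
        double sum = 0;
        for (int i = 0; i < rects.size(); i++) {
            sum += rects.get(i).getWidth();
        }
        return sum;
    }

    // Like: oc do: [:each | each origin]
    static double sumWidthsByIterator(List<Rectangle> rects) {
        double sum = 0;
        for (Rectangle r : rects) {
            sum += r.getWidth();
        }
        return sum;
    }

    public static void main(String[] args) {
        long t0 = System.currentTimeMillis();
        List<Rectangle> rects = allocate(1_000_000);
        System.out.println("alloc: " + (System.currentTimeMillis() - t0) + " ms");
        t0 = System.currentTimeMillis();
        sumWidthsByIndex(rects);
        System.out.println("get (index): " + (System.currentTimeMillis() - t0) + " ms");
        t0 = System.currentTimeMillis();
        sumWidthsByIterator(rects);
        System.out.println("get (iterator): " + (System.currentTimeMillis() - t0) + " ms");
    }
}
```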
In reply to this post by Chris Uppal-3
I wrote:
> I see the same effect. I think there's something very screwy going
> on. Either a bug or a most unfortunate interaction with the OS.

I just noticed yet a third weirdness. In a freshly started image:

Time secondsToRun: [(1 to: 3000000) collect: [:i | Array new: 2]]  --> 13.164

whereas in the same image (restarted again):

Time secondsToRun: [(1 to: 3000000) collect: [:i | Point new]]  --> 542.435

Note: times are in seconds, and I'm allocating 3M 2-slot objects, rather than 1M Rectangles.

-- chris
In reply to this post by Todor Todorov-2
Todor Todorov wrote:
[snip]
> VW does the job well. It's out of the box installation so it's not tuned. I
> guess if a VW guru tweaks the image a little the numbers will be better.

When you first drive a new, hired or borrowed car, do you adjust the seat or just leave it as you found it? The VW memory defaults are indeed poor defaults and we will change them in the next release, but they are easy to change and don't require guru level sophistication to do so. I would appreciate a slightly higher standard in the publishing of benchmark results.

--
_______________,,,^..^,,,____________________________
Eliot Miranda              Smalltalk - Scene not herd
> Todor Todorov wrote:
> [snip]
> > VW does the job well. It's out of the box installation so it's not tuned.
> > I guess if a VW guru tweaks the image a little the numbers will be better.
>
> When you first drive a new, hired or borrowed car do you adjust the seat
> or just leave it as you found it? The VW memory defaults are indeed
> poor defaults and we will change them in the next release, but they are
> easy to change and don't require guru level sophistication to do so. I
> would appreciate a slightly higher standard in the publishing of
> benchmark results.

First, the results were not _published_; they were posted on an open forum, with a clear disclaimer indicating that your product might have been slighted. Instead of making condescending comments about the effort, why don't you request the code and/or help the poster tune it?

Sincerely,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]
In reply to this post by Todor Todorov-2
Todor Todorov wrote:
> IBM's VA Java resembles Smalltalk a lot - that's why I like it. Does
> it run as fast as Java HotSpot?

It is a lot like Smalltalk; in fact (I'm *told* it's a fact, though I have no personal knowledge) it uses essentially the same VM as VASt. However, for Java, it is nowhere near as fast as the current crop of JVMs.

> Well, the conclusion is that Java is the slowest when it comes to
> allocating objects (no surprise here).

For allocation, the test we've all been using is *wildly* unrepresentative. Generational or quasi-generational GC is the norm for VM implementers, and has been for years. That implementation approach depends on the idea that most objects will "die young" (that is, become eligible for GC very soon after they are first created). Allocating 1M long-lived Rectangles (or whatever) in a tight loop doesn't even closely represent the typical pattern.

I had a hack at making a better micro-benchmark that tries to reflect the pattern more closely -- it allocates far more objects than it keeps for long. It's not a *good* benchmark, even so, but it was fun to play with.

One example of the difference is that the IBM JVM for Windows (which is *not* the same as the VAJ VM) switched from being 2x faster than Sun's JVM on the straight loop to being 2x slower on my "benchmark" (much more interestingly, it showed that its relative performance degraded in proportion to the amount of non-garbage in the "image" -- which *really* surprised me).

FWIW, the Sun JVMs came out best (by a useful margin -- *if* it's real); Dolphin and the other JVMs fared somewhat worse. I still couldn't get VW to perform up to its reputation (well-earned, I believe, though -- again -- I have no personal knowledge) despite following Eliot's formula, and it trailed by an order of magnitude. I've decided that I don't trust the results enough to post them as numbers, not even in fun, but if anyone's interested in the details, or in the test code, then please feel free to drop me a line.

-- chris

P.S. I also looked at the memory footprint of the running processes (which should be moderately indicative of real programs' memory requirements in response to a given load). Dolphin, VW, and the Sun JVMs were closely bunched, with Dolphin first by a nose; the other JVMs trailed in at least a factor of 2 behind.
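Chris did not post his benchmark, but the "allocates far more objects than it keeps" pattern he describes can be sketched like this (a toy in the same spirit; the retention ratio and object size are arbitrary choices of mine):

```java
// Toy allocation benchmark where most objects die young, as a generational
// collector expects: only every keepEvery-th object survives the loop.
// Not Chris Uppal's actual benchmark, which was not posted to the thread.
import java.util.ArrayList;
import java.util.List;

public class DieYoungBench {
    // Allocates `total` small objects but retains only a small minority,
    // so the rest become garbage almost immediately after creation.
    public static int churn(int total, int keepEvery) {
        List<int[]> survivors = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            int[] obj = new int[4]; // short-lived by default
            obj[0] = i;
            if (i % keepEvery == 0) {
                survivors.add(obj); // the rare long-lived object
            }
        }
        return survivors.size();
    }

    public static void main(String[] args) {
        long t0 = System.currentTimeMillis();
        int kept = churn(3_000_000, 100);
        long elapsed = System.currentTimeMillis() - t0;
        System.out.println(kept + " survivors in " + elapsed + " ms");
    }
}
```

Contrast this with the 1M-retained-Rectangles loop used earlier in the thread, which keeps *everything* alive and so measures tenured-space behaviour rather than the nursery path a generational GC is optimized for.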
In reply to this post by Bill Schwab-2
Bill Schwab wrote:
> > > Todor Todorov wrote:
> > [snip]
> > > VW does the job well. It's out of the box installation so it's not
> > > tuned. I guess if a VW guru tweaks the image a little the numbers
> > > will be better.
> >
> > When you first drive a new, hired or borrowed car do you adjust the seat
> > or just leave it as you found it? The VW memory defaults are indeed
> > poor defaults and we will change them in the next release, but they are
> > easy to change and don't require guru level sophistication to do so. I
> > would appreciate a slightly higher standard in the publishing of
> > benchmark results.
>
> First, the results were not _published_, they were posted on an open forum,
> with a clear disclaimer indicating that your product might have been
> slighted. Instead of making condescending comments about the effort, why
> don't you request the code and/or help the poster tune it?

I already did, yesterday. I also posted results, but they were on a much slower machine. I provided scaled results based on the lowest ratio of the non-allocation benchmarks, with the caveat that I didn't expect the allocation benchmark to scale as well. With usenet archives like Google's, a posting is in some ways even better than a publication because it can be retrieved so easily.

I don't think my comments condescend. They merely use an analogy to help make the point.

--
_______________,,,^..^,,,____________________________
Eliot Miranda              Smalltalk - Scene not herd
In reply to this post by Eliot Miranda
Eliot, I am sorry if you woke up in a bad mood this morning. When I purchase, borrow or hire a new automobile, yes, I adjust the seat. But I do not adjust or tune the engine. Since I am not buying a race car, but just an ordinary vehicle, I assume the factory has adjusted and tuned the engine to a level where it satisfies most people and runs most stably. I would expect the same of a software product. But no hard feelings.

All I wanted to say with the expression "VW guru" is that my knowledge is not good enough to play with the VW memory parameters. I am sorry if you or others have misunderstood my statement. I never meant to say that a Ph.D. in VW memory management is required to tune VW. If you tell me exactly what to do to tune the VW memory for that benchmark, I will gladly rerun the test and post the new results.

-- Todor.

"Eliot Miranda" <[hidden email]> wrote in message news:[hidden email]...
> When you first drive a new, hired or borrowed car do you adjust the seat
> or just leave it as you found it? The VW memory defaults are indeed
> poor defaults and we will change them in the next release, but they are
> easy to change and don't require guru level sophistication to do so. I
> would appreciate a slightly higher standard in the publishing of
> benchmark results.
> --
> _______________,,,^..^,,,____________________________
> Eliot Miranda              Smalltalk - Scene not herd
In reply to this post by Eliot Miranda
I've now followed Eliot's instructions.
VW does it much better. The alloc time is down to about 1700 ms, so it's the fastest in my benchmark. Eliot claims that VW is even faster on Linux, but Bill has cast a spell on my laptop so it won't run Linux.

"Eliot Miranda" <[hidden email]> wrote in message news:[hidden email]...
> > Todor Todorov wrote:
> [snip]
> > VW does the job well. It's out of the box installation so it's not tuned.
> > I guess if a VW guru tweaks the image a little the numbers will be better.
>
> When you first drive a new, hired or borrowed car do you adjust the seat
> or just leave it as you found it? The VW memory defaults are indeed
> poor defaults and we will change them in the next release, but they are
> easy to change and don't require guru level sophistication to do so. I
> would appreciate a slightly higher standard in the publishing of
> benchmark results.
> --
> _______________,,,^..^,,,____________________________
> Eliot Miranda              Smalltalk - Scene not herd
In reply to this post by Chris Uppal-3
"Chris Uppal" <[hidden email]> wrote in message
news:3e5b71fc$1$59862$[hidden email]...

> Jochen Riekhof wrote:
>
> > time needed alloc (first time) = 36219 !!
> > time needed alloc = 1874
>
> I see the same effect. I think there's something very screwy going on.
> Either a bug or a most unfortunate interaction with the OS.

Actually, it's a simple bug, although not in the VM as one might expect, but in the Smalltalk code which maintains some simple statistics to aid the decision as to when to perform a GC at times of high allocation rates. If this bug is patched (see attached), then I think you will find that the allocation speed will scale pretty linearly for allocations of 1 million or 3 million objects, and there will be relatively little difference between first and subsequent runs.

Here are my JochenMark (:-)) results for allocation of 1 and 3 million Rectangles with #basicNew on D5, D6 and VWNC7, the latter being tuned as per Eliot's instructions.

            1M      3M
D6 1st      844     2776
D6 2nd      530     1851
D5 1st      1036    3290
D5 2nd      826     2657
VWNC7       1560    7415    (no sig. dif. between 1st and 2nd runs)

The tests were run on an Athlon 1900MP+ system with 512Mb. All times are in milliseconds.

I don't really consider these results too significant, since this sort of test is most unlike normal application behaviour, but I'm pleased to have tracked down an odd source of poor performance, so thanks Jochen and Chris.

Regards

Blair

------------------------

!MemoryManager methodsFor!

otOverflow: anInteger
	| now |
	now := Delay millisecondClockValue.
	now - lastGCTime > (lastGCDuration * 4)
		ifTrue: [
			lastGCTime := now.
			self primCollectGarbage: 1.
			lastGCDuration := Delay millisecondClockValue - now max: 10]! !
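For readers who don't read Smalltalk, the throttling logic in the patched method above can be rendered roughly as follows (hypothetical Java, all names mine): a collection is triggered only when at least four times the last collection's duration has elapsed since it started, with a 10ms floor on the recorded duration.

```java
// Illustrative rendering of the GC-throttle heuristic from the Smalltalk
// patch above. This is a sketch of the *logic*, not Dolphin's actual code.
public class GcThrottle {
    private long lastGcTime = 0;       // lastGCTime in the patch
    private long lastGcDuration = 10;  // lastGCDuration, floored at 10ms

    // Mirrors: now - lastGCTime > (lastGCDuration * 4)
    public boolean shouldCollect(long nowMillis) {
        return nowMillis - lastGcTime > lastGcDuration * 4;
    }

    // Mirrors the ifTrue: branch -- record when the GC started and how long
    // it took, never recording less than 10ms (the `max: 10` in the patch).
    public void recordCollection(long startMillis, long endMillis) {
        lastGcTime = startMillis;
        lastGcDuration = Math.max(endMillis - startMillis, 10);
    }

    public static void main(String[] args) {
        GcThrottle t = new GcThrottle();
        System.out.println("collect at t=100? " + t.shouldCollect(100));
        t.recordCollection(100, 130);
        System.out.println("collect at t=150? " + t.shouldCollect(150));
    }
}
```

The point of the fix is that without correct bookkeeping here, high allocation rates trigger far too many (or too few) collections, producing the huge first-run allocation times reported earlier in the thread.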
In reply to this post by Jochen Riekhof
"Jochen Riekhof" <[hidden email]> wrote in message
news:[hidden email]...

> > > The second was on images, and invoked many byteAtOffset: calls to
> > > access pixels of bitmaps.
> > > I got comparable results - factor 20 roughly.
> >
> > Frankly, I'm not surprised about that.
>
> I would be very interested to know if there is a more efficient way to
> access bmp data from Dolphin.

Perhaps, if you posted an example, we might know a better way. However, if it basically comes down to accessing the bytes of a DIBSection directly through its #imageBits, then you are talking about going through a relatively unoptimized primitive against an ExternalAddress object. This is considerably slower than accessing the bytes of a ByteArray through the #at: primitive, as the following example will demonstrate:

bytes := ByteArray newFixed: 1000000.
Time millisecondsToRun: [1 to: bytes size do: [:i | bytes at: i]].

pBytes := bytes yourAddress asExternalAddress.
Time millisecondsToRun: [0 to: bytes size - 1 do: [:i | pBytes byteAtOffset: i]].

The first loop runs about twice as fast as the second on my machine, even though #at: needs to do a more expensive bounds check. However, whatever you do, you aren't going to touch the speed of direct indexed access into a primitive array type. And even with that you aren't going to touch the speed of dedicated graphics in manipulating your bitmaps.

> ...
> Also, it reminds me of the awkward code I had to write to check for
> top-down bitmaps. Top-down bmps have a negative height in BITMAPINFOHEADER,
> but on reading this structure the height is always positive regardless of
> bottom-up or top-down. This could well be a Windows bug, though.

Not sure about that one, but if anyone has any ideas we'd like to hear about it.

> > That is a very fair point, and one I would usually agree with, but in
> > this case you were attempting (I think) to make a micro comparison, so
> > it is pertinent.
>
> No, my intent was to find out more about where the factor 20 might
> originate. From the small and immature tests I am now guessing that GC and
> alloc might contribute something, and beyond that, message sends probably
> gain more and more importance as the code gets more complex.

Actually, I doubt the allocation really has much to do with it. Please see Chris Uppal's recent postings, and my reply to him today in this thread.

BTW: I've also uncovered the reason for the pause you experienced on closing your workspace if you don't first nil out the variable, but unlike the slow initial allocation (which requires a small change to one Smalltalk method), the pause can only be avoided with a patched VM.

Regards

Blair
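The same indirection cost Blair describes shows up on the Java side: indexed access into a primitive byte[] is generally faster than reading the same data through a ByteBuffer view. A rough sketch mirroring the shape of his comparison (absolute timings will vary by machine and JIT warm-up):

```java
// Sketch comparing direct byte[] indexing with access through a ByteBuffer,
// analogous to the ByteArray-vs-ExternalAddress comparison above.
import java.nio.ByteBuffer;

public class ByteAccess {
    // Direct indexed access into a primitive array (bounds-checked, but
    // heavily optimized by the JIT).
    public static long sumArray(byte[] bytes) {
        long sum = 0;
        for (int i = 0; i < bytes.length; i++) {
            sum += bytes[i];
        }
        return sum;
    }

    // Access through the ByteBuffer indirection layer.
    public static long sumBuffer(ByteBuffer buf) {
        long sum = 0;
        for (int i = 0; i < buf.capacity(); i++) {
            sum += buf.get(i); // absolute get, no position change
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] bytes = new byte[1_000_000];
        ByteBuffer direct = ByteBuffer.allocateDirect(bytes.length);

        long t0 = System.nanoTime();
        sumArray(bytes);
        System.out.println("byte[]:     " + (System.nanoTime() - t0) / 1_000_000 + " ms");

        t0 = System.nanoTime();
        sumBuffer(direct);
        System.out.println("ByteBuffer: " + (System.nanoTime() - t0) / 1_000_000 + " ms");
    }
}
```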
In reply to this post by Blair McGlashan
Blair,
> If this bug is patched (see attached), then I think you will
> find that the allocation speed will scale pretty linearly for
> allocations of 1 million or 3 million objects, and there will be
> relatively little difference between first and subsequent runs.

Great! Works a treat.

BTW, the situation that provokes it isn't as wildly unnatural as I'd first thought. I checked one of my back-burner projects that stores a largish number of objects in STB format. I'd sort of shelved it after discovering that I'd need a much faster machine than I'm currently using. So I wondered how much difference the fix made to reading in the STB data.

My toy dataset has about 0.5M objects, which is at the lower end of the number of objects that would be affected by the bug. Without the fix it took 60 seconds to read in (after eliminating disk IO time); with it the time dropped to 42 seconds. Not a *big* deal, but then I was dealing with a toy dataset; the real thing would be 2 to 3 times larger and the bug would have had a calamitous effect.

So thank you for the fix.

-- chris
In reply to this post by Blair McGlashan
> Perhaps, if you posted an example we might know a better way. However if it
> basically comes down to accessing the bytes of a DIBSection directly
> through its #imageBits, then you are talking about going through a
> relatively unoptimized primitive against an ExternalAddress object. This is
> considerably slower than accessing the bytes of a ByteArray through the
> #at: primitive, as the following example will demonstrate:
> ...
> The first loop runs about twice as fast as the second on my machine, even
> though #at: needs to do a more expensive bounds check.

Yes, I use the ExternalAddress exposed by DIBSection>>imageBits. I have e.g. a pixelAt: method that reads:

pixelAt: aPoint
	^imageBits byteAtOffset: (rowOffsets at: aPoint y + 1) + aPoint x

If I understand you right, this is the fastest way?! (The rowOffsets seem a bit faster than a multiplication, but mainly help a lot in dealing with bottom-up and top-down data.)

> BTW: I've also uncovered the reason for the pause you experienced on
> closing your workspace if you don't first nil out the variable, but unlike
> the slow initial allocation (which requires a small change to one Smalltalk
> method), the pause can only be avoided with a patched VM.

The patch is already in the image :-). The freeze on ws-close is very artificial (as were my tests) and never happened to me in normal work.

Ciao

...Jochen
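The rowOffsets trick Jochen describes translates directly to other languages. A hedged Java sketch (assuming one byte per pixel for simplicity; class and field names are mine) of how precomputed row offsets absorb the bottom-up/top-down difference:

```java
// Sketch of row-offset pixel addressing, as described above: precomputing one
// offset per logical row lets the same pixelAt lookup serve both bottom-up
// DIBs (rows stored last-first) and top-down DIBs. One byte per pixel assumed.
public class Dib {
    private final byte[] imageBits;
    private final int[] rowOffsets; // byte offset of the start of each logical row

    public Dib(byte[] imageBits, int width, int height, boolean topDown) {
        this.imageBits = imageBits;
        this.rowOffsets = new int[height];
        for (int y = 0; y < height; y++) {
            // Bottom-up bitmaps store the bottom row first, so logical row y
            // lives at physical row (height - 1 - y).
            int physicalRow = topDown ? y : (height - 1 - y);
            rowOffsets[y] = physicalRow * width;
        }
    }

    // Analogous to: ^imageBits byteAtOffset: (rowOffsets at: y + 1) + x
    public int pixelAt(int x, int y) {
        return imageBits[rowOffsets[y] + x] & 0xFF; // unsigned 8-bit pixel
    }

    public static void main(String[] args) {
        byte[] bits = {0, 1, 2, 10, 11, 12}; // 3x2 image data
        Dib bottomUp = new Dib(bits, 3, 2, false);
        System.out.println("top-left of bottom-up image: " + bottomUp.pixelAt(0, 0));
    }
}
```

As Jochen notes, the lookup replaces a per-pixel multiply with an array read and, more importantly, hides the row-order question from every caller.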
In reply to this post by Blair McGlashan
Blair,
> However, whatever you do you aren't going to touch the speed of direct
> indexed access into a primitive array type. And even with that you aren't
> going to touch the speed of dedicated graphics in manipulating your
> bitmaps.

Most of the C++ programming I do is writing DLLs that do number crunching for Dolphin. The Smalltalk code allocates and frees memory and controls the logic, and the C++ does the numerics. Sometimes that's useful because there is a lot of existing C++ code; more often, it's for performance. This reminds me of a question that I've wanted to ask, but I'll start another thread for it.

> BTW: I've also uncovered the reason for the pause you experienced on
> closing your workspace if you don't first nil out the variable, but unlike
> the slow initial allocation (which requires a small change to one Smalltalk
> method), the pause can only be avoided with a patched VM.

Pause? VM patch? We don't like pauses :) Is this something that could be of general interest?

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]
In reply to this post by Blair McGlashan
> Here are my JochenMark (:-)) results for allocation of 1 and 3 million
> Rectangles with #basicNew on D5, D6 and VWNC7, the latter being tuned as
> per Eliot's instructions.
>
>             1M      3M
> D6 1st      844     2776
> D6 2nd      530     1851
> D5 1st      1036    3290
> D5 2nd      826     2657
> VWNC7       1560    7415    (no sig. dif. between 1st and 2nd runs)

Interesting! [or alternatively "Ouch!", ed]

The VW oldSpace allocator is used to allocate tenured objects, which is what this "let's keep tons of objects around" test stresses. In VW's case it is poor w.r.t. a classic blue-book implementation because VW doesn't organize its oldSpace free lists as an objectTableEntry (ote) holding onto an objectBody. Instead it keeps separate lists of free otes and objectBodies. So allocating an oldSpace object requires unlinking an ote from one free list and an objectBody from another. Further, the allocation code is not at all aggressively inlined and involves at least three procedure calls.

Blair, if you're comfortable discussing it, what oldSpace free-list organization does D5 use?

--
_______________,,,^..^,,,____________________________
Eliot Miranda              Smalltalk - Scene not herd
In reply to this post by Bill Schwab-2
"Bill Schwab" <[hidden email]> wrote in message
news:b3oce5$1njhim$[hidden email]...

> > BTW: I've also uncovered the reason for the pause you experienced on
> > closing your workspace if you don't first nil out the variable, but
> > unlike the slow initial allocation (which requires a small change to one
> > Smalltalk method), the pause can only be avoided with a patched VM.
>
> Pause? VM patch? We don't like pauses :) Is this something that could be
> of general interest?

You'd only experience a pause if a very large number of objects were collected by a single idle-time GC cycle - in this case, 3M objects in one collection. This is unlikely to occur in practice.

Regards

Blair