It would be nice to have some high-level profiling tools available in
Dolphin. Let me explain. Currently I can write some simple wrappers to try and profile low-level code (ie, function-level profiling). But I can't profile my code on-the-whole as it relates to the Dolphin VM. Right now, I have a pretty basic, trimmed down, Direct3D application up and running. Currently, rendering 2 (yes, TWO) triangles has a performance of about 90 FPS. That's insanely bad. I expected Dolphin to be a lot slower than C, but not by that much. At worst, the scene should be running at about 300 FPS. Regardless, there are several possible reasons for why the performance is so bad. But I can't profile these out to really find out which it is (or a combination of them): * Short-lived objects. Does Dolphin handle small memory allocations and short-lived objects well? Right now I'm actually creating vertex lists on-the-fly every frame. Yes, this is bad, but without profiling tools, I have no idea whether this is a bottleneck or not. * COM wrappers. What's the cost of type checking, fixing operand size (bignum -> integer), type changing (int -> float), etc? Right now I have no idea. But there certainly are a lot of these calls going on. * Window event handling. Currently (for ease of use) I'm subclassing ShellView for my window. But I don't know how much time is taken up doing [SessionManager inputState pumpMessages]. I may want to just use CreateWindow and PeekMessage myself. Maybe not. Still, without numbers, I don't know. * Message passing. I don't have a big hierarchy, or large numbers of objects by an stretch, but for factoring reasons, things are broken up quite a bit. What's the general cost of message passing? * Garbage collection. Related to short-lived objects, but how often is the GC running? When it does, how long does it take? This is what I mean by high-level profiling. It would be nice to be able to set a switch, and then get timings or at least percentage breakdowns of where my code is going (because obviously gathering this info would slow everything down). If COM wrappers are a problem area, I need to know that. I can make more globals instead of having short-lived objects local to a function if that's a problem. But at the moment I'm very much stabbing in the dark. Jeff M. |
Jeff,
> It would be nice to have some high-level profiling tools available in > Dolphin. You didn't mention Ian Bartholomew's profiler so I guess you haven't tried that yet. I don't know whether it would provide precisely the information you are looking for, but it would certainly help a lot. -- chris |
Chris,
>You didn't mention Ian Bartholomew's profiler so I guess you haven't tried that >yet. I don't know whether it would provide precisely the information you are >looking for, but it would certainly help a lot. It would help, but I don't think it would provide some of the precision that Jeff is looking for. There are two main problems with the Profiler as it stands... - it can only sample at a maximum of 1mS intervals. This was probably OK on the slower machines that were in use when the Profiler was first written but is too slow now. - it will only take a sample when Dolphin allows it to - basically before and after a "normal" message send. Anything unusual, api calls / optimised methods, get included in the enclosing sample. The original Dolphin Profiler, named Svengali, that appeared back in the early Dolphin beta could do more (it had a full execution tracer and a GC monitor IIRC) so it is possible to provide more information - just a bit difficult to implement. What I think it really needs is some sort of VM support. Something like a VM interrupt that, optionally, occurs after every n opcodes are evaluated? -- Ian The From address is valid - for the moment |
Ian,
Once I figured out the 2 lines required to use your profiler, it actually helped tremendously. I would still love the profiling ability that I mentioned in the original post from Andy and Blair, but your tool did a wonderful job (something this nice should be included with the distribution, IMO). It's always nice to see when 90% of the execution time is found to be in one location (isn't that always the case?). Regardless, FPS is up to > 1000. This is incredibly good news, and I huge worry is now gone. :-) Thanks for taking the time to write this great tool, btw! Jeff M. |
Hi Jeff,
> I would still love the profiling ability > that I mentioned in the original post from Andy and Blair, I agree, this would be a useful tool. > It's always nice to see when 90% of the execution time is found to be > in one location (isn't that always the case?). Regardless, FPS is up to > > 1000. This is incredibly good news, and I huge worry is now gone. :-) Can I ask what the problem was? Also, how are you driving your renderLoop ... are you using a process or WM_TIMER or ? When I read your original post I dug out my old DX8 image to have a play around. I remember switching from a renderLoop based on a forked process (similar to DoubleBufferedView) to a renderLoop based on WM_TIMER and seeing a big increase in frame rate. I never figured out why ... but in any case if you are getting 1000 fps ... you don't seem to have a problem! Steve -- [hidden email] |
Steve Alan Waring wrote:
> Hi Jeff, > > > It's always nice to see when 90% of the execution time is found to be > > in one location (isn't that always the case?). Regardless, FPS is up to > > > 1000. This is incredibly good news, and I huge worry is now gone. :-) > > Can I ask what the problem was? It was actually a stupid mistake on my part that is a decent performance hic-up in Direct3D period (state changes), but the same code in C++ isn't as big a performance hit as it was in Dolphin. Basically, at the start of every frame I was resetting all the render states. This actually has me concerned that the COM wrappers in Dolphin will be (eventually) a big performance sync, and that I'll have to buffer as much as I can. That change alone took the FPS up into the 4,000 region on my workstation at work (1K at home). > Also, how are you driving your renderLoop ... are you using a process > or WM_TIMER or ? Right now, I just create my own "main" loop. It looks something like this: GameEngine>>run [view isOpen] whileTrue: [self postRedisplay. SessionManager inputState pumpMessages. framecount := framecount + 1]. Probably not the way I'll do it at the end of the day, but it's simple to test and should give the best performance. Funny enough, after the above performance bottleneck was removed, the second biggest is "SessionManager inputState". Making that a local variable increases performance by another 100 or so FPS (that's a BIG deal - what could that possibly be doing that's so intensive?). Running with the profiler revealed several such instances (example: Keyboard default). Jeff M. |
Hi Jeff,
> code in C++ isn't as big a performance hit as it was in Dolphin. Just out of interest, for the kind of game you are building, if it was in C++, what is the percentage of time it would spend in "user" code vs directX itself? > This actually has me concerned that the COM wrappers in Dolphin > will be (eventually) a big performance sync, and that I'll have to > buffer as much as I can. You can optimize the creation of ExternalStructure instances fairly easily ... the base system does this in a number of places ... for example RECT class>>left:top:right:bottom: Classes like StructureArrays, which involve an additional layer of moving bytes around, can also be easily optimized by creating your own specialized constructors or specialized classes. I think the bottom line is that if you really need/want to, you can bypass the COM ExternalStructure wrappers and just work directly with ByteArrays ... which is relatively very fast for Dolphin. > "SessionManager inputState". Making that a local variable increases > performance by another 100 or so FPS (that's a BIG deal - what could > that possibly be doing that's so intensive?). Running with the profiler > revealed several such instances (example: Keyboard default). I am not sure. Ian's profiler is very useful, but when you get down to the leaves, it can be misleading at times as to what exactly is taking the time ... especially when external interfacing is involved. I have never worked on such performance critical code that I would notice the difference between a global access and a temporary ... but I have seen Blair and Andy do it in the base system. Steve -- [hidden email] |
In reply to this post by Ian B
Ian,
> - it can only sample at a maximum of 1mS intervals. This was probably > OK on the slower machines that were in use when the Profiler was first > written but is too slow now. A finer granularity would probably help, but I find the current 1ms interval to work OK. It's just a matter of ensuring that the code runs for a decent period (several seconds at least) under the profiler. That means that I have to run more loops now than in the old days -- but the test doesn't take any longer to execute. > What I think it really needs is some sort of VM support. Something > like a VM interrupt that, optionally, occurs after every n opcodes are > evaluated? That might be interesting too. But just a thought -- the profiler probably hasn't been working as well as it should on the current D6 VM since (as Blair mentioned a little while back) the VM wasn't checking for (speaking broadly) interrupt conditions as often as it should. -- chris |
In reply to this post by Jeff M.
Jeff,
> GameEngine>>run > [view isOpen] whileTrue: > [self postRedisplay. > SessionManager inputState pumpMessages. > framecount := framecount + 1]. > > Probably not the way I'll do it at the end of the day, but it's simple > to test and should give the best performance. Funny enough, after the > above performance bottleneck was removed, the second biggest is > "SessionManager inputState". Making that a local variable increases > performance by another 100 or so FPS (that's a BIG deal - what could > that possibly be doing that's so intensive?). That pattern: SessionManager inputState is used all over in Dolphin, and it's worth keeping an eye open for occurrences in the inner loops of performance-critical code. But it's not /that/ big a drag -- on my 1.5 GHz machine that expression takes 77 nanoseconds in D6[*], so it's hard to see why executing it a thousand or so times a second is making a detectable difference to your loop. Incidentally, Dolphin's external call mechanism is pretty fast, but in the usual pattern of using external libraries, with calls like: MyExternalLib default doSomethingWith: someValue. the overhead of: MyExternalLib default is larger than the overhead of the external call itself. Which is one reason why some of the most heavily used ExternalLibraries are subclassed from PermanentLibrary -- which has a faster, but less flexible, implementation of class-side #default. -- chris [*] Down from 235 nanos in D5 ! |
Chris,
Thanks for the input! I haven't been able to completely test the external calls and COM wrappers for timings yet, but your comments boost my confidence. Once I got this basic hic-up out of the way, I've been able to plow ahead quite successfully, and each addition from here on has had very little negative effect on framerate. Jeff M. |
In reply to this post by Ian B
> - it can only sample at a maximum of 1mS intervals. This was probably
> OK on the slower machines that were in use when the Profiler was first > written but is too slow now. > What is "sample interval"? I can't to find any help for it . :( Pavel |
Pavel,
>What is "sample interval"? I can't to find any help for it . :( The profiler takes a snapshot of the current execution stack sleeps for a bit, takes another snapshot, sleeps for a bit etc. The "sample interval" just defines the number of milliseconds it sleeps between taking snapshots. -- Ian The From address is valid - for the moment |
Free forum by Nabble | Edit this page |