I've been trying to get our sound framework going on linux and have got stuck. I wondered whether anyone has any experience getting round this problem on linux. It is complicated so excuse the lengthy explanation.
Our sound processing depends on a high-priority thread running for a short time every 20 milliseconds to push data through the sound processing paths (microphone recording, OpenAL for spatialization, and so on). On linux this approach is not immediately applicable because (*)
a) linux's pthreads implementation does not allow non-superuser programs to create threads with other than the SCHED_OTHER scheduler, and
b) linux's pthreads implementation does not support different priorities for the SCHED_OTHER scheduler
and so it is not possible for a non-superuser program to create a thread that runs at a higher priority than any other thread in the process. The approach I've explored so far has been to replace the high-priority thread with an interval timer, where a timer interrupt is delivered via a signal handler. This approach won't work because it can very easily deadlock. For example, there are non-reentrant locks in functions such as tz_convert (used within both localtime and localtime_r) such that if the signal is delivered during a call to localtime, any nested call to localtime, as is made by our logging code, will deadlock. I can avoid deadlock in code under our own control, but not in platform functions such as localtime upon which we depend.
Another approach would be (should be!) to still use two threads (which are unfortunately of the same priority because of the above thread priority restriction) but to cause the main VM thread to yield to the sound processing thread. In this approach the timer interrupt would signal a posix semaphore (via pthread_cond_signal) and then call pthread_yield. The sound processing thread would loop blocking on the semaphore via pthread_cond_wait. But...
On contemporary linux (including CERN's distro) pthread_yield is implemented as a straight call of the sched_yield(2) system call which does not yield to other threads within the process, but to other processes within the process's static priority. I quote:
NAME
    sched_yield - yield the processor

SYNOPSIS
    #include <sched.h>

    int sched_yield(void);

DESCRIPTION
    A process can relinquish the processor voluntarily without blocking by calling sched_yield(). The process will then be moved to the end of the queue for its static priority and a new process gets to run.

    Note: If the current process is the only process in the highest priority list at that time, this process will continue to run after a call to sched_yield().
So, although I have yet to confirm this, it looks like pthread_yield will not yield to the other thread. [I have confirmed that pthread_yield is a direct call of sched_yield(2) however]. The manual page could be incorrect, but my understanding is that on linux pthreads are a user-space implementation and therefore there isn't any way that a call to sched_yield is going to yield to another thread.
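For reference, the two-thread arrangement described above might look roughly like the following C sketch. It is only an illustration under stated assumptions: the names sound_pump, pump_loop and run_demo are hypothetical, the "sound work" is just a counter, and the ticks are posted from an ordinary loop rather than from a real timer. It uses sem_post/sem_wait rather than pthread_cond_signal because POSIX lists sem_post as async-signal-safe and does not list pthread_cond_signal. Posting only makes the sound thread runnable; it says nothing about when the kernel will actually run it, which is the open question here.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

/* Hypothetical state shared by the timer side and the sound thread. */
typedef struct {
    sem_t tick;      /* posted once per timer tick */
    int remaining;   /* ticks the demo thread will still serve */
    int pumps;       /* how many times the "sound pump" ran */
} sound_pump;

/* Sound thread: sleep on the semaphore, run the pump once per tick. */
static void *pump_loop(void *arg)
{
    sound_pump *p = arg;

    while (p->remaining-- > 0) {
        /* sem_wait can be interrupted by a signal; just retry. */
        while (sem_wait(&p->tick) == -1 && errno == EINTR)
            ;
        p->pumps++;  /* stand-in for pushing 20ms of audio */
    }
    return NULL;
}

/* Posts `ticks` wake-ups, as a timer handler would, and returns the
 * number of pump runs observed, or -1 if setup failed. */
int run_demo(int ticks)
{
    sound_pump p = { .remaining = ticks, .pumps = 0 };
    pthread_t thread;

    assert(ticks >= 0);
    if (sem_init(&p.tick, 0, 0) != 0)
        return -1;
    if (pthread_create(&thread, NULL, pump_loop, &p) != 0) {
        sem_destroy(&p.tick);
        return -1;
    }
    for (int i = 0; i < ticks; i++)
        sem_post(&p.tick);   /* the only call the timer side needs */
    pthread_join(thread, NULL);
    sem_destroy(&p.tick);
    return p.pumps;
}
```

The semaphore counts ticks, so a wake-up posted while the sound thread is still busy is not lost; it is served late instead.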
So my questions are
- is there any way I'm missing to create different priority threads in a non-superuser process?
- is there an implementation of pthread_yield's intended functionality I'm missing that will yield to other threads within the process?
best
Eliot

(*) see sched_setscheduler(2) and in particular that RLIMIT_RTPRIO for non-superuser programs is a maximum of 0, so one can't change the static priority of a non-superuser program either.
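The limit mentioned in the footnote can be checked from inside a program. The sketch below is a minimal, Linux-specific illustration; the function names max_rt_priority and sched_other_priority_levels are made up for this example. It reads the soft RLIMIT_RTPRIO limit with getrlimit(2) and asks the scheduler how many static priority levels SCHED_OTHER offers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/resource.h>

/* Highest real-time priority an unprivileged caller may request under
 * the current soft limit, or -1 if the limit cannot be read.  A result
 * of 0 means SCHED_FIFO and SCHED_RR are out of reach without extra
 * privileges. */
long max_rt_priority(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_RTPRIO, &lim) != 0)
        return -1;
    assert(lim.rlim_cur <= lim.rlim_max);  /* soft never exceeds hard */
    if (lim.rlim_cur == RLIM_INFINITY)
        return sched_get_priority_max(SCHED_FIFO);
    return (long)lim.rlim_cur;
}

/* Number of distinct static priorities available to SCHED_OTHER threads. */
int sched_other_priority_levels(void)
{
    return sched_get_priority_max(SCHED_OTHER)
         - sched_get_priority_min(SCHED_OTHER) + 1;
}
```

If max_rt_priority() reports 0 and sched_other_priority_levels() reports a single level, the program is in exactly the situation points a) and b) describe.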
|
On Wed, Feb 03, 2010 at 06:01:00PM -0800, Eliot Miranda wrote:
> So my questions are
> - is there any way I'm missing to create different priority threads in a
> non-superuser process?

Lower the priority of all other threads, which would be equivalent to running Squeak under nice(1), which generally works fine.

Dave |
In reply to this post by Eliot Miranda-2
On 03.02.2010, at 18:01, Eliot Miranda wrote:
> Our sound processing depends on a high-priority thread running for a short time every 20 milliseconds to push data through the sound processing paths (microphone recording, OpenAL for spatialization, and so on). On linux this approach is not immediately applicable

Scheduling in Linux sucked for a long time, only optimized for server work. Depending on the kernel your users have running it may still be totally unsuitable for anything multimedia. There was a big hubbub about that on LKML where the leading developers did not even acknowledge there was a problem with the scheduler. A rewritten scheduler demonstrated much better performance, but was not accepted as-is. Google for Con Kolivas to learn more. It supposedly has gotten better but only very recently.

You need to find someone to talk to who knows this stuff, I'm only watching from afar. Here's the general idea:

http://x264dev.multimedia.cx/?p=185

- Bert - |
In reply to this post by David T. Lewis
On Wed, Feb 3, 2010 at 6:38 PM, David T. Lewis <[hidden email]> wrote:
This doesn't work because in contemporary linux SCHED_OTHER threads are all of the same priority. One can neither raise nor lower their priority. nice changes a process's dynamic priority, not its static priority (see sched_setscheduler(2) and my footnote)
|
How about good old fork() then? On 4 February 2010 07:20, Eliot Miranda <[hidden email]> wrote: > > > > On Wed, Feb 3, 2010 at 6:38 PM, David T. Lewis <[hidden email]> wrote: >> >> On Wed, Feb 03, 2010 at 06:01:00PM -0800, Eliot Miranda wrote: >> > >> > So my questions are >> > - is there any way I'm missing to create different priority threads in a >> > non-superuser process? >> >> Lower the priority of all other threads, which would be equivalent to >> running Squeak under nice(1), which generally works fine. > > This doesn't work because in contemporary linux SCHED_OTHER threads are all of the same priority. One can neither raise nor lower their priority. nice changes a process's dynamic priority, not its static priority (see sched_setscheduler(2) and my footnote) > >> >> Dave >> > > > -- Best regards, Igor Stasenko AKA sig. |
In reply to this post by Bert Freudenberg
On Wed, Feb 3, 2010 at 6:40 PM, Bert Freudenberg <[hidden email]> wrote:
Thanks Bert! Plenty to investigate here, but building a special kernel is something I didn't bargain for ;)
I think one interim solution to explore is to run the sound pump from the VM thread in the VM's check for events code. In the Cog VMs a 1kHz heartbeat updates the wall clock and causes the VM to break out of native code and check for events (ideally the heartbeat is a separate thread). As mentioned, on linux the heartbeat is a software signal from an interval timer. The software interrupt also serves to interrupt any system calls (e.g. in relinquishProcessorForMilliseconds). I can extend the heartbeat to set a flag saying "run the sound pump" which will be checked by the VM when checking for events, and it will invoke the sound pump code. The problem here is that any long-running activities that aren't interruptible by the software interrupt (OpenGL rendering?) will delay the sound pump. So this is either doomed to failure or might just work depending on those other activities. Comments? Try it, waste of time?
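A rough C sketch of the flag idea follows. It is not the Cog VM's code: heartbeat, start_heartbeat, check_for_events and BEATS_PER_PUMP are hypothetical names, and the pump itself is left as a comment. The handler only touches sig_atomic_t variables, so it cannot deadlock on a library lock; the real work happens later on the VM thread.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>

#define BEATS_PER_PUMP 20   /* 1kHz heartbeat, pump wanted every 20ms */

static volatile sig_atomic_t beats;
static volatile sig_atomic_t pump_requested;

/* Signal handler: counts beats and raises the flag, nothing else. */
void heartbeat(int signo)
{
    (void)signo;
    if (++beats >= BEATS_PER_PUMP) {
        beats = 0;
        pump_requested = 1;
    }
}

/* Installs the handler and starts the interval timer.  SA_RESTART is
 * deliberately left off so that blocking system calls are interrupted.
 * Returns 0 on success, -1 on failure. */
int start_heartbeat(long interval_usec)
{
    struct sigaction sa;
    struct itimerval timer;

    assert(interval_usec > 0 && interval_usec < 1000000);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = heartbeat;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;
    timer.it_interval.tv_sec = 0;
    timer.it_interval.tv_usec = interval_usec;
    timer.it_value = timer.it_interval;
    return setitimer(ITIMER_REAL, &timer, NULL);
}

/* Called by the VM thread whenever it checks for events.
 * Returns 1 if the sound pump was due on this check, else 0. */
int check_for_events(void)
{
    if (!pump_requested)
        return 0;
    pump_requested = 0;
    /* the sound pump would be invoked here, outside signal context */
    return 1;
}
```

The weakness described above is visible in the sketch: the flag can be set on time, but the pump only runs when check_for_events is next reached.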
|
In reply to this post by Igor Stasenko
On Wed, Feb 3, 2010 at 9:24 PM, Igor Stasenko <[hidden email]> wrote:
Requires us to rewrite the entire sound pump using shared memory or the like. It's a lot of work and we hope to have an alpha next week :/
|
In reply to this post by Eliot Miranda-2
Well mmm, doesn't the checkForInterrupts code get called every 1 millisecond (or thereabouts)? Can't you add something there to ensure your sound logic doesn't starve? Hint: if checkForInterrupts is not being called every ms or so, how do you ensure delay accuracy is to a ms boundary? Next hint: if checkForInterrupts is being called 10 times a millisecond then isn't that wasteful? Should you not be grinding Smalltalk bytecodes, or the assembler instructions thereof?
On 2010-02-03, at 6:01 PM, Eliot Miranda wrote:
-- =========================================================================== John M. McIntosh <[hidden email]> Twitter: squeaker68882 Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com =========================================================================== |
In reply to this post by Eliot Miranda-2
Eliot Miranda wrote: > I think one interim solution to explore is to run the sound pump from > the VM thread in the VM's check for events code. In the Cog VMs a 1KHz > heartbeat updates the wall clock and causes the VM to break out of > native code and check for events (ideally the heartbeat is a separate > thread). An mentioned, on linux the heartbeat is a software signal from > an interval timer. The software interrupt serves also to interrupt any > system calls (e.g. in relinquishProcessorForMilliseconds). I can extend > the heartbeat to set a flag saying "run the sound pump" which will be > checked by the VM when checking for events and it will invoke the sound > pump code. The problem here is that any long-running activities that > aren't interruptable by the software interrupt (OpenGL rendering?) will > delay the sound pump. So this is either doomed to failure or might just > work depending on those other activities. Comments? Try it, waste of time? We know that SwapBuffers() can take north of 100msecs although we might be able to mitigate that a little by proper use of fences. I'd ask Josh when he's back if he thinks it's reasonable to up the sound pipe to 100msecs or more latency (I suspect the answer is no). The alternative would be to make such calls threaded. Perhaps the sound thread is susceptible to using the same wakeup technique as in the threaded FFI? I.e., if the software interrupt notices that you're still in the same primitive, call the sound pump regardless and join (if necessary) upon return from that long running primitive? Cheers, - Andreas |
In reply to this post by Eliot Miranda-2
On 4 February 2010 07:31, Eliot Miranda <[hidden email]> wrote: > > > > On Wed, Feb 3, 2010 at 6:40 PM, Bert Freudenberg <[hidden email]> wrote: >> >> On 03.02.2010, at 18:01, Eliot Miranda wrote: >> > Our sound processing depends on a high-prirority thread running for a short time every 20 milliseconds to push data through the sound processing paths (microphone recording, OpenAL for spatialization, and so on). On linux this approach is not immediately applicable >> >> Scheduling in Linux sucked for a long time, only optimized for server work. Depending on the kernel your users have running it may still be totally unsuitable for anything multimedia. There was a big hubbub about that on LKML where the leading developers did not even acknowledge there was a problem with the scheduler. A rewritten scheduler demonstrated much better performance, but was not accepted as-is. Google for Con Kolivas to learn more. It supposedly has gotten better but only very recently. >> >> You need to find someone to talk to who knows this stuff, I'm only watching from afar. Here's the general idea: >> >> http://x264dev.multimedia.cx/?p=185 > > Thanks Bert! Plenty to investigate here, but building a special kernel is something I didn't bargain for ;) > > I think one interim solution to explore is to run the sound pump from the VM thread in the VM's check for events code. In the Cog VMs a 1KHz heartbeat updates the wall clock and causes the VM to break out of native code and check for events (ideally the heartbeat is a separate thread). An mentioned, on linux the heartbeat is a software signal from an interval timer. The software interrupt serves also to interrupt any system calls (e.g. in relinquishProcessorForMilliseconds). I can extend the heartbeat to set a flag saying "run the sound pump" which will be checked by the VM when checking for events and it will invoke the sound pump code. 
> The problem here is that any long-running activities that aren't interruptable by the software interrupt (OpenGL rendering?) will delay the sound pump. So this is either doomed to failure or might just work depending on those other activities. Comments? Try it, waste of time?

Well, if you need to run the sound pump every 20ms or so, and the heartbeat interval is 1ms, there is a good chance that it will work fine. Another thing to try is making sure that the buffer holds enough data to play sound for 30ms or so, while refilling it every 20ms (or more frequently).

-- Best regards, Igor Stasenko AKA sig. |
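To put numbers on the buffering suggestion, here is a small C sketch of the arithmetic. The 44100 Hz sample rate is an assumption made only for this example, and the function names are hypothetical.

```c
#include <assert.h>

/* Frames needed to cover `ms` milliseconds at `sample_rate` Hz,
 * rounded up to a whole frame. */
long frames_for_ms(long sample_rate, long ms)
{
    assert(sample_rate > 0 && ms >= 0);
    return (sample_rate * ms + 999) / 1000;
}

/* How late (in ms) a refill may arrive before playback runs dry, for a
 * buffer holding `buffered_ms` of audio that is refilled every
 * `refill_ms`. */
long tolerated_delay_ms(long buffered_ms, long refill_ms)
{
    return buffered_ms - refill_ms;
}
```

With 30ms buffered and a 20ms refill period the slack is 10ms, so this helps with small scheduling jitter but would not cover the pauses of 100ms or more mentioned elsewhere in the thread.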
In reply to this post by Andreas.Raab
On 4 February 2010 07:46, Andreas Raab <[hidden email]> wrote:
> Eliot Miranda wrote:
>> I think one interim solution to explore is to run the sound pump from the VM thread in the VM's check for events code. In the Cog VMs a 1kHz heartbeat updates the wall clock and causes the VM to break out of native code and check for events (ideally the heartbeat is a separate thread). As mentioned, on linux the heartbeat is a software signal from an interval timer. The software interrupt serves also to interrupt any system calls (e.g. in relinquishProcessorForMilliseconds). I can extend the heartbeat to set a flag saying "run the sound pump" which will be checked by the VM when checking for events and it will invoke the sound pump code. The problem here is that any long-running activities that aren't interruptible by the software interrupt (OpenGL rendering?) will delay the sound pump. So this is either doomed to failure or might just work depending on those other activities. Comments? Try it, waste of time?
>
> We know that SwapBuffers() can take north of 100msecs although we might be able to mitigate that a little by proper use of fences. I'd ask Josh when he's back if he thinks it's reasonable to up the sound pipe to 100msecs or more latency (I suspect the answer is no).

Wow.. 100ms? Sounds like you can't have more than 10 frames per second. Weird. Can it be because it waits for vsync?

> The alternative would be to make such calls threaded. Perhaps the sound thread is susceptible to using the same wakeup technique as in the threaded FFI? I.e., if the software interrupt notices that you're still in the same primitive, call the sound pump regardless and join (if necessary) upon return from that long running primitive?
>
> Cheers,
> - Andreas

-- Best regards, Igor Stasenko AKA sig. |
Igor Stasenko wrote: > Wow.. 100ms? Sounds like you can't have more than 10 frames per > second. Weird. Can it be because it waits for vsync? No it's because it got stuff queued up that the GPU hasn't processed yet. SwapBuffers() forces it to process the pending stuff. What we've done to fix this is to issue a fence and a flush to let the GPU tell us when it's done processing without blocking. This brings the time in SwapBuffers() down but it requires that your GPU and drivers support the fence instructions (not all do; in particular Intel doesn't). I have no clue what the situation on Linux is but I expect the worst. Cheers, - Andreas |
On 4 February 2010 07:56, Andreas Raab <[hidden email]> wrote:
> Igor Stasenko wrote:
>> Wow.. 100ms? Sounds like you can't have more than 10 frames per
>> second. Weird. Can it be because it waits for vsync?
>
> No it's because it got stuff queued up that the GPU hasn't processed yet. SwapBuffers() forces it to process the pending stuff. What we've done to fix this is to issue a fence and a flush to let the GPU tell us when it's done processing without blocking. This brings the time in SwapBuffers() down but it requires that your GPU and drivers support the fence instructions (not all do; in particular Intel doesn't). I have no clue what the situation on Linux is but I expect the worst.

Hmm.. i still think this is too much.. because i can't imagine the GL pipeline being that slow on modern hardware. Simply because you can't get 300 fps by having 100ms delays :)

> Cheers,
> - Andreas

-- Best regards, Igor Stasenko AKA sig. |
Igor Stasenko wrote: > On 4 February 2010 07:56, Andreas Raab <[hidden email]> wrote: >> Igor Stasenko wrote: >>> Wow.. 100ms? Sounds like you can't have more than 10 frames per >>> second. Weird. Can it be because it waits for vsync? >> No it's because it got stuff queued up that the GPU hasn't processed yet. >> SwapBuffers() forces it to process the pending stuff. What we've done to fix >> this is to issue a fence and a flush to let the GPU tell us when it's done >> processing without blocking. This brings the time in SwapBuffers() down but >> it requires that your GPU and drivers support the fence instructions (not >> all do; in particular Intel doesn't). I have no clue what the situation on >> Linux is but I expect the worst. >> > Hmm.. i still think this is too much.. because i can't imagine the GL > pipeline being that slow on modern hardware. > Simply because you can't get 300 fps by having 100ms delays :) You only get 300fps if you run glxGears or other toy demos. Real applications deal with lots of complex meshes and lots of big textures. It's very easy to throw more at the GPU than it can handle. All it takes is a single Ron in your meeting :-) Cheers, - Andreas |
In reply to this post by Eliot Miranda-2
Replying to my own and John's suggestion of moving the pump to the VM's check-for-events loop.
On Wed, Feb 3, 2010 at 9:31 PM, Eliot Miranda <[hidden email]> wrote:
On Wed, Feb 3, 2010 at 9:44 PM, John M McIntosh <[hidden email]> wrote:
There is no guarantee that checkForInterrupts is being called every millisecond. There is only a guarantee that the VM will be asked to checkForInterrupts every millisecond. It will actually check as soon as it next checks. So if it is in a long-running primitive or, more importantly, a full garbage collection, it will check only when that has finished. The GC pauses are the killer. For us every now and then there is a >100ms pause for GC and so the sound pump will occasionally be starved.
Essentially the sound pump has to run as an interrupt or in another thread. I am going to try one horrible hack. Right now the deadlock I'm seeing is in the sound pump's logging code that calls ctime_r to grab an ASCII string of the local time to timestamp log entries. The VM can keep a copy of ctime_r's result and update it synchronously every second or so, instead of calling ctime_r on every log entry. This won't remove the possibility of lock-up due to other library calls, but it may reduce the probability enough that our alpha demo will work sufficiently and give us time to work on a proper solution (i.e. investigating Bert's direction to improved linux schedulers) after the demo.
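One way the cached-timestamp hack could be arranged is sketched below in C. This is a guess at the shape of it, not the VM's actual code: refresh_log_timestamp and log_timestamp are hypothetical names. The sketch assumes the logger runs in a signal handler that interrupts the same thread that refreshes the cache; with a genuinely separate sound thread it would need real synchronization. Two buffers are used so the handler never sees a half-written string.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <time.h>

#define STAMP_LEN 26   /* ctime_r needs a buffer of at least 26 bytes */

static char stamps[2][STAMP_LEN] = { "(no timestamp yet)", "" };
static volatile sig_atomic_t current;      /* index of the readable buffer */
static time_t cached_second = (time_t)-1;

/* Called synchronously from the VM thread (never from the handler),
 * e.g. once per second.  Formats into the spare buffer, then flips. */
void refresh_log_timestamp(time_t now)
{
    int next = current ? 0 : 1;

    assert(now != (time_t)-1);   /* time() failed; -1 is also the sentinel */
    if (now == cached_second)
        return;                  /* already up to date for this second */
    if (ctime_r(&now, stamps[next]) == NULL)
        return;
    stamps[next][strcspn(stamps[next], "\n")] = '\0';  /* drop newline */
    cached_second = now;
    current = next;              /* publish only the completed string */
}

/* Called by the logger, possibly from the signal handler.  Takes no
 * lock and calls no library time function. */
const char *log_timestamp(void)
{
    return stamps[current];
}
```

The logger's timestamps can then be up to a second stale, which seems an acceptable trade for not re-entering the time conversion code.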
|
In reply to this post by johnmci
On Wed, Feb 3, 2010 at 9:44 PM, John M McIntosh <[hidden email]> wrote:
Damn right :) But it is only called at a maximum of once every millisecond or two (I think 2ms is an adequate delay resolution). On current machines one can't run the heartbeat much more often than every 100 usecs because that's how long each heartbeat takes (it does a thread switch, updates the clock and sets some flags).
Of course Cog is grinding out those bytecodes a lot faster than the normal Squeak VM so a 500Hz heartbeat is reasonably affordable :) Perhaps I should explain how this works in Cog. Cog does not call checkForInterrupts in normal processing of bytecodes. Instead it checks the stack limit on every frame-building send (i.e. not quick methods that answer constants & inst vars, and not primitives that succeed). It has to check the stack limit because the VM runs Smalltalk code on small stack pages (~64 pages, 4k bytes each), "paging out" a stack page worth of frames to heap contexts when there is no free page available. When stack overflow occurs it breaks out of machine code to handle the switch from the current stack page to a new one, and as part of that processing it checks for events. The heartbeat works by setting the stack limit to a value that will cause all stack limit checks to fail (this being 16rFFFFFFFF because stacks grow down).
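The stack-limit trick can be modelled in a few lines of C. This is a toy model with hypothetical names (vm_state, heartbeat_tick, frame_building_send), not Cog's generated machine code, and it only counts page switches rather than performing them. It shows how a single compare on the send path doubles as the event-check trigger.

```c
#include <assert.h>
#include <stdint.h>

/* A limit no real stack pointer can satisfy (16rFFFFFFFF on 32 bits). */
#define FORCE_BREAK UINTPTR_MAX

typedef struct {
    volatile uintptr_t stack_limit;  /* compared against sp on every send */
    uintptr_t page_limit;            /* true low bound of the current page */
    int event_checks;
    int page_switches;
} vm_state;

/* What the heartbeat does: make the next limit check fail. */
void heartbeat_tick(vm_state *vm)
{
    vm->stack_limit = FORCE_BREAK;
}

/* The check made on each frame-building send.  Stacks grow down, so a
 * frame fits while sp stays at or above the limit.  Returns 1 if the
 * VM left the fast path, 0 otherwise. */
int frame_building_send(vm_state *vm, uintptr_t sp)
{
    assert(vm->page_limit != FORCE_BREAK);
    if (sp >= vm->stack_limit)
        return 0;                      /* fast path: one compare */
    vm->stack_limit = vm->page_limit;  /* restore the genuine limit */
    vm->event_checks++;                /* events are checked on the way */
    if (sp < vm->page_limit)
        vm->page_switches++;           /* a real overflow: switch pages */
    return 1;
}
```

The fast path pays nothing extra for the heartbeat, since the comparison is needed anyway for stack page overflow.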
The HotSpot Java VM uses a similar technique, but their stacks are not organized as pages (because Java doesn't have contexts, one can map the Java stack to a native thread stack quite easily, whereas the paged organization works well for marrying contexts and stack frames). On frame build they write to a location on a guard page. Since there is no load dependency for a write, the write doesn't stall the processor. When they want to break out of Java machine code they remove write permission from the page and handle the protection violation exception. Of course this only works on systems where changing a page's protection is very cheap compared to the breakout frequency. It wouldn't work for a 1kHz beat; you'd spend all of your time twiddling page protections. But a neat technique all the same.
best Eliot
|
In reply to this post by Eliot Miranda-2
On 2010-02-04, at 8:45 AM, Eliot Miranda wrote:
Well I have been doing some work for a client where we adjust the oops allocation counter GC trigger to float up/down based on the microsecond timer and a target millisecond time for an incremental GC. The full GC could be a special case where you have code in the GC loop markAndTrace:. Oh, a counter and compare; technically that could come at no cost. Actually there *is* a counter in there.... statMarkCountLocal. Yes, yes, it's a local that then is assigned to a global, because on powerpc it would become a register var, versus a memory access. Likely compilers today might just optimize it away.. Mmm, you could pick some ceiling based on the variable object allocation count which is targeting a 1ms incremental GC cycle. The next hassle is the long-running primitive calls. But don't you have an audit list of them, and how rare are they? If there are one or two, then can you be clever.. Or set up a timer interrupt or something just when the primitive is running?
-- =========================================================================== John M. McIntosh <[hidden email]> Twitter: squeaker68882 Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com =========================================================================== |
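The floating GC trigger might be sketched in C as below. The proportional scaling rule, the clamp bounds and the name adjust_gc_trigger are all assumptions made for illustration; the post does not say how the adjustment is actually computed.

```c
#include <assert.h>
#include <stdint.h>

#define MIN_TRIGGER 100u       /* assumed floor on allocations per GC */
#define MAX_TRIGGER 1000000u   /* assumed ceiling */

/* Returns the next allocation-count trigger, scaled so that a GC that
 * overran the target time happens sooner (with less to do) and one that
 * finished early happens later. */
uint32_t adjust_gc_trigger(uint32_t trigger,
                           uint32_t measured_usecs,
                           uint32_t target_usecs)
{
    uint64_t scaled;

    assert(target_usecs > 0);
    if (measured_usecs == 0)
        measured_usecs = 1;    /* avoid dividing by zero */
    scaled = (uint64_t)trigger * target_usecs / measured_usecs;
    if (scaled < MIN_TRIGGER)
        return MIN_TRIGGER;
    if (scaled > MAX_TRIGGER)
        return MAX_TRIGGER;
    return (uint32_t)scaled;
}
```

For example, with a 1000 usec target, a collection that took 2000 usecs halves the trigger and one that took 500 usecs doubles it.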