NativeBoost : optimisation of the machine code generation

Thomas Bany
Hi everyone,

I'm trying to reduce the computation time of the following pseudo-code:

- memory allocation (~40 doubles)
- object heap to C heap copying
- NativeBoost call (nbCall:)
- memory freeing

The time profiling results are below:

- 24*3600 calls: > 1 minute
- 24*3600 calls with only memory allocation and copying: < 1 second
- 1 call with a 24*3600 loop inside the C code: < 1 second

So it appears that the very costly step is the transition from Pharo to C. I was wondering whether it is possible to drastically reduce this time by doing something like generating the machine code once and calling it multiple times?

Thanks in advance !

Thomas.

Re: NativeBoost : optimisation of the machine code generation

Luc Fabresse
Hi Thomas,

The machine code for the marshalling of the arguments is generated once and for all, so the penalty does not come from there.

Please send the code you wrote for these micro-benchmarks so I can better understand what happens.

Luc
 


Re: NativeBoost : optimisation of the machine code generation

abergel
In reply to this post by Thomas Bany
Hi Thomas,

Please share with us how it goes. Your experience is important to us.

Alexandre



Re: NativeBoost : optimisation of the machine code generation

kilon.alios
In reply to this post by Thomas Bany
I think that if you posted the code, preferably a version that contains only the problem, it would be easier to test, debug and investigate.




Re: NativeBoost : optimisation of the machine code generation

Thomas Bany
@ Alexandre: sure, no problem !

@ Luc:

I'm not sure how much code I can provide without being too specific, but here is how it goes:

  • Let's say I have the Smalltalk code below:
MyClass>>withNBCall
   externalArray := NBExternalArrayOfDoubles new: self internalArray size.
   output := NBExternalArrayOfDoubles new: 4.
   [self actualNBCallWith: externalArray address storeResultIn: output address] ensure: [externalArray free. output free].

MyClass>>withNBCallCommented
   externalArray := NBExternalArrayOfDoubles new: self internalArray size.
   output := NBExternalArrayOfDoubles new: 4.
   ["self actualNBCallWith: externalArray address storeResultIn: output address"] ensure: [externalArray free. output free].

MyClass>>actualNBCallWith: externalArray storeResultIn: output
   <primitive: 'primitiveNativeCall' module: 'NativeBoostPlugin' error: errorCode>
   ^self nbCall: #(void callToC(double * externalArray, double * output)) module: 'lib/myModule.dll'

  • And the C code below:
void callToC(double * externalArray, double * output) {
   computationWith(externalArray, output);
}

void specialCallToC(double * externalArray, double * output) {
   unsigned int i;
   for (i = 0; i < 24*3600; i++)
      computationWith(externalArray, output);
}

  • Now I have the following code typed in the Time Profiler tool:
object := (MyClass new) variousInitialization; yourself.
24*3600 timesRepeat: [object withNBCall]
>> Over 1 minute of computation time, of which over 99% is spent in primitives. Also, I don't see nbCall: in the tree.

object := (MyClass new) variousInitialization; yourself.
24*3600 timesRepeat: [object withNBCallCommented]
>> Less than 1 second.

object := (MyClass new) variousInitialization; yourself.
object withNBCall
>> Less than 1 millisecond.

object := (MyClass new) variousInitialization; yourself.
object withNBSpecialCall "This time, I use the specialCallToC() function"
>> Around 20 milliseconds.


All right, that's a pile of code, but I hope it helps :)

On a side note:
  • Pharo 3, Win 7 32-bit
  • I'm not at work anymore and don't have my code with me, so I will double-check tomorrow that I didn't provide false information, but I think it accurately reflects what I do.

Again, thanks for the interest in my issue!

Thomas.
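
The contrast between callToC() and specialCallToC() in the message above can be generalised by passing the repetition count as an argument, so that a single Pharo-to-C transition covers the whole loop. What follows is only a minimal sketch: computationWith() here is a stand-in (the real one is not shown in the thread), the sizes are taken from the "~40 doubles" and "new: 4" mentioned earlier, and repeatedCallToC() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Number of input doubles; the thread mentions roughly 40 (assumed here). */
#define INPUT_SIZE 40
/* Number of output doubles, matching the `new: 4` allocation on the Smalltalk side. */
#define OUTPUT_SIZE 4

/* Stand-in for the real computation, which is not shown in the thread:
   it stores the sum of the inputs in output[0] and zeroes the other slots. */
static void computationWith(const double *externalArray, double *output) {
    size_t i;
    double sum = 0.0;
    for (i = 0; i < INPUT_SIZE; i++)
        sum += externalArray[i];
    output[0] = sum;
    for (i = 1; i < OUTPUT_SIZE; i++)
        output[i] = 0.0;
}

/* Hypothetical batched entry point: one foreign call, `repetitions` computations. */
void repeatedCallToC(const double *externalArray, double *output,
                     unsigned int repetitions) {
    unsigned int i;
    assert(externalArray != NULL && output != NULL);
    for (i = 0; i < repetitions; i++)
        computationWith(externalArray, output);
}
```

On the Pharo side, the nbCall: signature would need a matching third argument; that binding is not shown here and would have to be checked against the NativeBoost type names.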




Re: NativeBoost : optimisation of the machine code generation

Thomas Bany
I forgot the copying of the data from the object heap to the C heap:

MyClass>>withNBCall
   externalArray := NBExternalArrayOfDoubles new: self internalArray size.
   output := NBExternalArrayOfDoubles new: 4.
   1 to: self internalArray size do: [ :index | externalArray at: index put: (self internalArray at: index) ].
   [self actualNBCallWith: externalArray address storeResultIn: output address] ensure: [externalArray free. output free].

MyClass>>withNBCallCommented
   externalArray := NBExternalArrayOfDoubles new: self internalArray size.
   output := NBExternalArrayOfDoubles new: 4.
   1 to: self internalArray size do: [ :index | externalArray at: index put: (self internalArray at: index) ].
   ["self actualNBCallWith: externalArray address storeResultIn: output address"] ensure: [externalArray free. output free].

Thomas.




Re: NativeBoost : optimisation of the machine code generation

stepharo
In reply to this post by Thomas Bany
NativeBoost methods compile native code only once (the first time they are executed), and again when the sessionId changes (because you may be on a different platform).
So this is already the case: the assembly code is cached in the method literal.

Stef
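
The caching described in this message can be pictured as a generate-once cache keyed by a session identifier. The sketch below only illustrates that idea in plain C; it is not NativeBoost's implementation, and CachedCall, generate_stub() and lookup_stub() are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*Stub)(int);

/* Hypothetical cache slot, playing the role of the method literal. */
typedef struct {
    Stub stub;                /* cached "native code", NULL until first use */
    unsigned int session;     /* session for which the stub was generated */
    unsigned int generations; /* how many times generation ran (illustration only) */
} CachedCall;

/* Trivial function standing in for the generated machine code. */
static int add_one(int x) { return x + 1; }

/* Stand-in for the expensive code generation step. */
static Stub generate_stub(CachedCall *cache) {
    cache->generations++;
    return add_one;
}

/* Return the cached stub, regenerating it only on first use
   or when the session identifier has changed. */
Stub lookup_stub(CachedCall *cache, unsigned int currentSession) {
    assert(cache != NULL);
    if (cache->stub == NULL || cache->session != currentSession) {
        cache->stub = generate_stub(cache);
        cache->session = currentSession;
    }
    return cache->stub;
}
```

Under this scheme, repeated calls within one session pay the generation cost once, which is why the per-call overhead has to be looked for elsewhere.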

Re: NativeBoost : optimisation of the machine code generation

Thomas Bany
Okay, I found the issue, and it was me benchmarking lazily: I had forgotten a debug printing function in the C code, which I removed between the benchmarks.

Thanks again for your time !

Thomas.
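
One way to keep such a debug print from leaking into a benchmark build is to guard it with a compile-time switch. A minimal sketch, assuming a hypothetical DEBUG_TRACE macro, ENABLE_DEBUG_TRACE flag and tracedComputation() function (none of them come from the code in this thread):

```c
#include <assert.h>
#include <stdio.h>

/* Define ENABLE_DEBUG_TRACE (for example with -DENABLE_DEBUG_TRACE)
   only for debugging builds; otherwise the trace compiles to nothing. */
#ifdef ENABLE_DEBUG_TRACE
#define DEBUG_TRACE(...) fprintf(stderr, __VA_ARGS__)
#else
#define DEBUG_TRACE(...) ((void)0)
#endif

/* Hypothetical computation: stores the sum of `count` inputs in output[0]. */
void tracedComputation(const double *input, int count, double *output) {
    int i;
    double sum = 0.0;
    assert(input != NULL && output != NULL);
    for (i = 0; i < count; i++) {
        sum += input[i];
        /* Costs I/O on every iteration when tracing is enabled. */
        DEBUG_TRACE("partial sum after %d elements: %f\n", i + 1, sum);
    }
    output[0] = sum;
}
```

With the flag undefined, the trace arguments are never evaluated, so the benchmarked build and the debugging build share the same source without sharing the printing cost.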

