Time primHighResClock truncated to 32 bits in 64 bits VMs.


Time primHighResClock truncated to 32 bits in 64 bits VMs.

Juan Vuletich-3
 
Hi Folks,

In 32 bit Cog VMs, `Time primHighResClock` answers LargePositiveInteger,
presumably up to 64 bits. This would mean a rollover in 167 years on a
3.5GHz machine.

But on 64 bit Cog and Stack Spur VMs, it answers a SmallInteger that is
truncated to 32 bits. This means a rollover in about one second.

I guess this is a bug. Answering a SmallInteger, truncating the CPU 64
bit counter to 60 bits would be ok. I think it makes sense to restrict
the answer to SmallInteger to avoid allocation, and a rollover every 41
years is not too much :)
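
The rollover figures can be checked with a few lines of C. This is only a sketch: the function names are made up for illustration, the 3.5 GHz tick rate is the one assumed in this post, and a year is taken as 365.25 days. Under those assumptions a 32-bit counter wraps in about 1.2 seconds and a 64-bit one in about 167 years; a 60-bit counter comes out at roughly 10 years, and the 41-year figure corresponds to 62 bits.

```c
#include <stdio.h>

/* Assumption: one year is 365.25 days. */
#define SECONDS_PER_YEAR (365.25 * 24.0 * 3600.0)

/* Seconds until a counter that is `bits` wide wraps around when it is
   incremented `hz` times per second (intended for 1 <= bits <= 64). */
double rollover_seconds(unsigned bits, double hz)
{
    double ticks = 1.0;
    for (unsigned i = 0; i < bits; i++)
        ticks *= 2.0;               /* powers of two are exact in a double */
    return ticks / hz;
}

/* The same period expressed in years. */
double rollover_years(unsigned bits, double hz)
{
    return rollover_seconds(bits, hz) / SECONDS_PER_YEAR;
}

/* Print the rollover period for the counter widths discussed here. */
void print_rollover_table(double hz)
{
    static const unsigned widths[] = { 32, 60, 62, 64 };
    for (unsigned i = 0; i < sizeof widths / sizeof widths[0]; i++)
        printf("%2u bits: %.3g s (%.3g years)\n", widths[i],
               rollover_seconds(widths[i], hz),
               rollover_years(widths[i], hz));
}
```

Calling `print_rollover_table(3.5e9)` from a `main` function prints one line per width.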

Thanks,

--
Juan Vuletich
www.cuis-smalltalk.org
https://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev
@JuanVuletich



Re: Time primHighResClock truncated to 32 bits in 64 bits VMs.

David T. Lewis
 
On Thu, Dec 28, 2017 at 09:32:46AM -0300, Juan Vuletich wrote:

>
> Hi Folks,
>
> In 32 bit Cog VMs, `Time primHighResClock` answers LargePositiveInteger,
> presumably up to 64 bits. This would mean a rollover in 167 years on a
> 3.5GHz machine.
>
> But on 64 bit Cog and Stack Spur VMs, it answers a SmallInteger that is
> truncated to 32 bits. This means a rollover in about one second.
>
> I guess this is a bug. Answering a SmallInteger, truncating the CPU 64
> bit counter to 60 bits would be ok. I think it makes sense to restrict
> answer to SmallInteger to avoid allocation, and a rollover every 41
> years is not too much :)
>
> Thanks,
>
> --
> Juan Vuletich
> www.cuis-smalltalk.org
> https://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev
> @JuanVuletich
>
>
Attached is the #primHighResClock accessor for Squeak/Pharo users.

I don't see anything obviously wrong with the primitive, although maybe it
involves the handling of positive64BitIntegerFor: in the 64-bit VM.

The primitive is:


primitiveHighResClock
        "Return the value of the high resolution clock if this system has any. The exact frequency of the high res clock is undefined specifically so that we can use processor dependent instructions (like RDTSC). The only use for the high res clock is for profiling where we can allocate time based on sub-msec resolution of the high res clock. If no high-resolution counter is available, the platform should return zero."
        <export: true>
        self pop: 1.
        self push: (self positive64BitIntegerFor: self ioHighResClock).


And the platform support code does this:

sqLong
ioHighResClock(void)
{
  /* return the value of the high performance counter */
  sqLong value = 0;
#if defined(__GNUC__) && ( defined(i386) || defined(__i386) || defined(__i386__)  \
                        || defined(i486) || defined(__i486) || defined (__i486__) \
                        || defined(intel) || defined(x86) || defined(i86pc) )
    __asm__ __volatile__ ("rdtsc" : "=A"(value));
#elif defined(__arm__) && (defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_7A__))
        /* tpr - do nothing for now; needs input from eliot to decide further */
#else
# error "no high res clock defined"
#endif
  return value;
}



 

Attachment: Time class-primHighResClock.st (1K)

Re: Time primHighResClock truncated to 32 bits in 64 bits VMs.

Eliot Miranda-2
In reply to this post by Juan Vuletich-3
 
Hi Juan,

On Thu, Dec 28, 2017 at 4:32 AM, Juan Vuletich <[hidden email]> wrote:

> Hi Folks,
>
> In 32 bit Cog VMs, `Time primHighResClock` answers LargePositiveInteger, presumably up to 64 bits. This would mean a rollover in 167 years on a 3.5GHz machine.
>
> But on 64 bit Cog and Stack Spur VMs, it answers a SmallInteger that is truncated to 32 bits. This means a rollover in about one second.

Are you sure?  What's a test case?  When I look at the source I don't see where this is happening:

platforms/Cross/vm/sq.h:sqLong ioHighResClock(void);

InterpreterPrimitives>>primitiveHighResClock
        "Return the value of the high resolution clock if this system has any. The exact frequency of the high res clock is undefined specifically so that we can use processor dependent instructions (like RDTSC). The only use for the high res clock is for profiling where we can allocate time based on sub-msec resolution of the high res clock. If no high-resolution counter is available, the platform should return zero."
        <export: true>
        self pop: 1.
        self push: (self positive64BitIntegerFor: self ioHighResClock).

And positive64BitIntegerFor: does not truncate to 32-bits:

StackInterpreter>>positive64BitIntegerFor: integerValue
        <api>
        <var: 'integerValue' type: #usqLong>
        <var: 'highWord' type: #'unsigned int'>
        "Answer a Large Positive Integer object for the given integer value.  N.B. will *not* cause a GC."
        | newLargeInteger highWord sz |
        objectMemory hasSixtyFourBitImmediates
            ifTrue:
                [(self cCode: [integerValue] inSmalltalk: [integerValue bitAnd: 1 << 64 - 1]) <= objectMemory maxSmallInteger ifTrue:
                    [^objectMemory integerObjectOf: integerValue].
                 sz := 8]
            ifFalse:
                [(highWord := integerValue >> 32) = 0 ifTrue:
                    [^self positive32BitIntegerFor: integerValue].
                 sz := 5.
                 (highWord := highWord >> 8) = 0 ifFalse:
                    [sz := sz + 1.
                     (highWord := highWord >> 8) = 0 ifFalse:
                        [sz := sz + 1.
                         (highWord := highWord >> 8) = 0 ifFalse:[sz := sz + 1]]]].
        newLargeInteger := objectMemory
                                eeInstantiateSmallClassIndex: ClassLargePositiveIntegerCompactIndex
                                format: (objectMemory byteFormatForNumBytes: sz)
                                numSlots: 8 / objectMemory bytesPerOop.
        objectMemory storeLong64: 0 ofObject: newLargeInteger withValue: (objectMemory byteSwapped64IfBigEndian: integerValue).
        ^newLargeInteger


So on my reading, on 64-bits this answers un-truncated non-negative SmallIntegers up to 60 bits in length, and then overflows into 8 byte LargePositiveIntegers.
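
To make that reading concrete, here is a small C sketch of the two decisions the method makes. The names are hypothetical and this is not the VM's generated code. It assumes that maxSmallInteger on a 64-bit image is 2^60 - 1, in line with the "up to 60 bits" reading, and it mirrors the nested shifts of the 32-bit branch with a loop.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: on a 64-bit image the largest SmallInteger is 2^60 - 1. */
#define MAX_SMALL_INTEGER_64 ((UINT64_C(1) << 60) - 1)

/* True when a 64-bit VM could answer the value as an immediate
   SmallInteger, i.e. without allocating a LargePositiveInteger. */
bool fits_small_integer_64(uint64_t value)
{
    return value <= MAX_SMALL_INTEGER_64;
}

/* Byte size the 32-bit branch picks for the LargePositiveInteger.
   Answers 0 when the upper word is zero, which is the case the method
   hands to positive32BitIntegerFor: instead. */
unsigned large_integer_byte_size_32(uint64_t value)
{
    uint32_t high_word = (uint32_t)(value >> 32);
    unsigned sz;

    if (high_word == 0)
        return 0;
    sz = 5;                          /* at least one byte above the low word */
    while ((high_word >>= 8) != 0)   /* one more byte per non-zero shift */
        sz++;
    return sz;
}
```

Neither path masks the value to 32 bits, which is why the truncation has to come from somewhere else.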


> I guess this is a bug. Answering a SmallInteger, truncating the CPU 64 bit counter to 60 bits would be ok. I think it makes sense to restrict answer to SmallInteger to avoid allocation, and a rollover every 41 years is not too much :)
>
> Thanks,
>
> --
> Juan Vuletich
> www.cuis-smalltalk.org
> https://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev
> @JuanVuletich





--
_,,,^..^,,,_
best, Eliot

Re: Time primHighResClock truncated to 32 bits in 64 bits VMs.

Eliot Miranda-2
In reply to this post by David T. Lewis
 
Hi David, Hi Juan,

On Thu, Dec 28, 2017 at 8:56 AM, David T. Lewis <[hidden email]> wrote:
 
> Attached is the #primHighResClock accessor for Squeak/Pharo users.
>
> I don't see anything obviously wrong with the primitive, although maybe it
> involves the handling of positive64BitIntegerFor: in the 64-bit VM.
>
> The primitive is:
>
>
> primitiveHighResClock
>         "Return the value of the high resolution clock if this system has any. The exact frequency of the high res clock is undefined specifically so that we can use processor dependent instructions (like RDTSC). The only use for the high res clock is for profiling where we can allocate time based on sub-msec resolution of the high res clock. If no high-resolution counter is available, the platform should return zero."
>         <export: true>
>         self pop: 1.
>         self push: (self positive64BitIntegerFor: self ioHighResClock).
>
>
> And the platform support code does this:
>
> sqLong
> ioHighResClock(void)
> {
>   /* return the value of the high performance counter */
>   sqLong value = 0;
> #if defined(__GNUC__) && ( defined(i386) || defined(__i386) || defined(__i386__)  \
>                         || defined(i486) || defined(__i486) || defined (__i486__) \
>                         || defined(intel) || defined(x86) || defined(i86pc) )
>     __asm__ __volatile__ ("rdtsc" : "=A"(value));
> #elif defined(__arm__) && (defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_7A__))
>         /* tpr - do nothing for now; needs input from eliot to decide further */
> #else
> # error "no high res clock defined"
> #endif
>   return value;
> }

 
Ah, OK.  So this is the problem.  On 64-bit systems this answers only %eax, which discards the upper 32 bits.  rdtsc loads %edx:%eax with the 64-bit time stamp, but on 64-bit systems the in-line asm will simply move %eax to %rax. We have to rewrite that in-line assembler to construct %rax correctly from %edx and %eax.
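
The difference can be shown without any inline assembly. This sketch uses plain C on two 32-bit halves so that it runs on any machine. The function names are hypothetical, and the first function only models the behaviour described above rather than reproducing what the compiler emits.

```c
#include <stdint.h>

/* Models the faulty 64-bit behaviour: only the low half (%eax)
   reaches the result, and the high half (%edx) is dropped. */
uint64_t tsc_low_half_only(uint32_t eax, uint32_t edx)
{
    (void)edx;                       /* the upper 32 bits are lost */
    return (uint64_t)eax;
}

/* The intended result: %edx supplies bits 32..63, %eax bits 0..31. */
uint64_t tsc_from_halves(uint32_t eax, uint32_t edx)
{
    return ((uint64_t)edx << 32) | (uint64_t)eax;
}
```

For the halves 0x89abcdef (low) and 0x01234567 (high), the first function answers 0x89abcdef while the second answers 0x0123456789abcdef.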



--
_,,,^..^,,,_
best, Eliot

Re: Time primHighResClock truncated to 32 bits in 64 bits VMs.

Tobias Pape
 
Hi all,

> On 28.12.2017, at 18:06, Eliot Miranda <[hidden email]> wrote:
>
> Hi David, Hi Juan,
>
> On Thu, Dec 28, 2017 at 8:56 AM, David T. Lewis <[hidden email]> wrote:
>  
> On Thu, Dec 28, 2017 at 09:32:46AM -0300, Juan Vuletich wrote:
> >
> > Hi Folks,
> >
> > In 32 bit Cog VMs, `Time primHighResClock` answers LargePositiveInteger,
> > presumably up to 64 bits. This would mean a rollover in 167 years on a
> > 3.5GHz machine.
> >
> > But on 64 bit Cog and Stack Spur VMs, it answers a SmallInteger that is
> > truncated to 32 bits. This means a rollover in about one second.
> >
> > I guess this is a bug. Answering a SmallInteger, truncating the CPU 64
> > bit counter to 60 bits would be ok. I think it makes sense to restrict
> > answer to SmallInteger to avoid allocation, and a rollover every 41
> > years is not too much :)
> >
> > Thanks,
> >
> > --
> > Juan Vuletich
> > www.cuis-smalltalk.org
> > https://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev
> > @JuanVuletich
> >
> >
>
> Attached is the #primHighResClock accessor for Squeak/Pharo users.
>
> I don't see anything obviously wrong with the primitive, although maybe it
> involves the handling of positive64BitIntegerFor: in the 64-bit VM.
>
> The primitive is:
>
>
> primitiveHighResClock
>         "Return the value of the high resolution clock if this system has any. The exact frequency of the high res clock is undefined specifically so that we can use processor dependent instructions (like RDTSC). The only use for the high res clock is for profiling where we can allocate time based on sub-msec resolution of the high res clock. If no high-resolution counter is available, the platform should return zero."
>         <export: true>
>         self pop: 1.
>         self push: (self positive64BitIntegerFor: self ioHighResClock).
>
>
> And the platform support code does this:
>
> sqLong
> ioHighResClock(void)
> {
>   /* return the value of the high performance counter */
>   sqLong value = 0;
> #if defined(__GNUC__) && ( defined(i386) || defined(__i386) || defined(__i386__)  \
>                         || defined(i486) || defined(__i486) || defined (__i486__) \
>                         || defined(intel) || defined(x86) || defined(i86pc) )
>     __asm__ __volatile__ ("rdtsc" : "=A"(value));
> #elif defined(__arm__) && (defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_7A__))
>         /* tpr - do nothing for now; needs input from eliot to decide further */
> #else
> # error "no high res clock defined"
> #endif
>   return value;
> }
>
>  
> Ah, OK.  So this is the problem.  On 64-bit systems this answers only %eax, which discards the upper 32 bits.  rdtsc loads %edx:%eax with the 64-bit time stamp, but on 64-bit systems the in-line asm will simply move %eax to %rax. We have to rewrite that in-line assembler to construct %rax correctly from %edx and %eax.
>

Wikipedia points to this code: https://web.archive.org/web/20161215213659/http://www.cs.wm.edu/~kearns/001lab.d/rdtsc.html

"
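   unsigned a, d;   /* low and high 32 bits of the time stamp */

   /* rdtsc leaves the low half in %eax ("=a") and the high half in %edx ("=d") */
   __asm__ volatile("rdtsc" : "=a" (a), "=d" (d));

   /* combine the two halves into one 64-bit value */
   return ((unsigned long long)a) | (((unsigned long long)d) << 32);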
"

Which seems reasonable.

Best regards
        -Tobias


>
>
> --
> _,,,^..^,,,_
> best, Eliot


Re: Time primHighResClock truncated to 32 bits in 64 bits VMs.

timrowledge
In reply to this post by David T. Lewis
 

> On 28-12-2017, at 8:56 AM, David T. Lewis <[hidden email]> wrote:
> And the platform support code does this:
>
> sqLong
> ioHighResClock(void)
> {
>  /* return the value of the high performance counter */
>  sqLong value = 0;
> #if defined(__GNUC__) && ( defined(i386) || defined(__i386) || defined(__i386__)  \
>                        || defined(i486) || defined(__i486) || defined (__i486__) \
>                        || defined(intel) || defined(x86) || defined(i86pc) )
>    __asm__ __volatile__ ("rdtsc" : "=A"(value));
> #elif defined(__arm__) && (defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_7A__))
>        /* tpr - do nothing for now; needs input from eliot to decide further */
> #else
> # error "no high res clock defined"
> #endif
>  return value;
> }
>

As a reminder to fix the ARM part someday, and because I am *so* not going to mess with git right now, here is an extract from
https://www.raspberrypi.org/forums/viewtopic.php?t=30821 "RDTSC on ARM"
=====================
Issue 1. The use of x86 assembly code.
The FFTW project has a whole bunch of equivalents to rdtsc as part of its own profiling code. See https://github.com/vesperix/FFTW-for-AR ... el/cycle.h
Replace the errant lines in nbeesrc-jan-10-2013/src/nbee/globals/profiling-functions.c (lines 49-50) with

volatile unsigned cc;
static int init = 0;
if(!init) {
  /* one-time setup of the performance monitor's cycle counter (cc) */
  __asm__ __volatile__ ("mcr p15, 0, %0, c9, c12, 2" :: "r"(1u<<31)); /* stop the cc */
  __asm__ __volatile__ ("mcr p15, 0, %0, c9, c12, 0" :: "r"(5));      /* initialize */
  __asm__ __volatile__ ("mcr p15, 0, %0, c9, c12, 1" :: "r"(1u<<31)); /* start the cc */
  init = 1;
}
__asm__ __volatile__ ("mrc p15, 0, %0, c9, c13, 0" : "=r"(cc)); /* read the 32-bit cc */
return cc;
====================

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
All wiyht.  Rho sritched mg kegtops awound?



Re: Time primHighResClock truncated to 32 bits in 64 bits VMs.

Juan Vuletich-3
In reply to this post by Eliot Miranda-2
 
Hi Eliot, Hi David,

On 12/28/2017 2:06 PM, Eliot Miranda wrote:
> Ah, OK.  So this is the problem.  On 64-bit systems this answers only %eax, which discards the upper 32 bits.  rdtsc loads %edx:%eax with the 64-bit time stamp, but on 64-bit systems the in-line asm will simply move %eax to %rax. We have to rewrite that in-line assembler to construct %rax correctly from %edx and %eax.

Thanks!

Cheers,
-- 
Juan Vuletich
www.cuis-smalltalk.org
https://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev
@JuanVuletich

Re: Time primHighResClock truncated to 32 bits in 64 bits VMs.

Tobias Pape
In reply to this post by timrowledge
 

> On 28.12.2017, at 20:07, tim Rowledge <[hidden email]> wrote:
>
>
>
>> On 28-12-2017, at 8:56 AM, David T. Lewis <[hidden email]> wrote:
>> And the platform support code does this:
>>
>> sqLong
>> ioHighResClock(void)
>> {
>> /* return the value of the high performance counter */
>> sqLong value = 0;
>> #if defined(__GNUC__) && ( defined(i386) || defined(__i386) || defined(__i386__)  \
>>                       || defined(i486) || defined(__i486) || defined (__i486__) \
>>                       || defined(intel) || defined(x86) || defined(i86pc) )
>>   __asm__ __volatile__ ("rdtsc" : "=A"(value));
>> #elif defined(__arm__) && (defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_7A__))
>>       /* tpr - do nothing for now; needs input from eliot to decide further */
>> #else
>> # error "no high res clock defined"
>> #endif
>> return value;
>> }
>>
>
> As a reminder to fix the ARM part someday, and because I am *so* not going to mess with git right now, here is an extract from
> https://www.raspberrypi.org/forums/viewtopic.php?t=30821 "RDTSC on ARM"
> =====================
> Issue 1. The use of x86 assembly code.
> The FFTW project has a whole bunch of equivalents to rdtsc as part of its own profiling code. See https://github.com/vesperix/FFTW-for-AR ... el/cycle.h
> Replace the errant lines in nbeesrc-jan-10-2013/src/nbee/globals/profiling-functions.c (lines 49-50) with
>
> volatile unsigned cc;
> static int init = 0;
> if(!init) {
>  __asm__ __volatile__ ("mcr p15, 0, %0, c9, c12, 2" :: "r"(1<<31)); /* stop the cc */
>  __asm__ __volatile__ ("mcr p15, 0, %0, c9, c12, 0" :: "r"(5));     /* initialize */
>  __asm__ __volatile__ ("mcr p15, 0, %0, c9, c12, 1" :: "r"(1<<31)); /* start the cc */
>  init = 1;
> }
> __asm__ __volatile__ ("mrc p15, 0, %0, c9, c13, 0" : "=r"(cc));
> return cc;
> ====================


Looks interesting, but since when is this supported?
When I look at the ARM1176JZF-S documentation (aka Raspberry Pi 1), http://infocenter.arm.com/help/topic/com.arm.doc.ddi0290g/Bihbeabc.html
the CP15 info (see p15 above) covers the c9 register, but not with c12 and c13 as the second register.

OTOH, for the ARM Cortex-A7 (aka Raspberry Pi 2), the registers _are_ documented as "Performance monitor Control": http://infocenter.arm.com/help/topic/com.arm.doc.ddi0464f/BABFIBHD.html

Do we want to be compatible with the Pi 2 and above only, or also with older models?

Best regards
        -Tobias


>
> tim
> --
> tim Rowledge; [hidden email]; http://www.rowledge.org/tim
> All wiyht.  Rho sritched mg kegtops awound?
>
>


Re: Time primHighResClock truncated to 32 bits in 64 bits VMs.

timrowledge
 

> On 31-12-2017, at 4:37 AM, Tobias Pape <[hidden email]> wrote:
>> [snip]
>> As a reminder to fix the ARM part someday, and because I am *so* not going to mess with git right now, here is an extract from
>> https://www.raspberrypi.org/forums/viewtopic.php?t=30821 "RDTSC on ARM"
[snip]
>
>
> Looks interesting, but since when is this supported?
> When I look for the ARM1176JZF-S Docu (aka Raspberry Pi 1), http://infocenter.arm.com/help/topic/com.arm.doc.ddi0290g/Bihbeabc.html
> the CP15 info (see p15 above) has info on the c9 register, but not with c12 and c13 as second register.
>
> OTOH, from the ARM Cortex-A7 (aka Raspberry Pi 2), the registers _are_ documented as "Performance monitor Control": http://infocenter.arm.com/help/topic/com.arm.doc.ddi0464f/BABFIBHD.html
>
> Do we want to be pi2-above compatible or also below?

Well, the older v6 cores are less interesting, at least in Pi-land, because they are rarer; about 65% of those sold are v7 and over 30% are v7/8. However, it would be lovely to give those older Pi 1’s and the newer Pi 0’s a hi-res clock, and I suspect that we’d use the c15/c12 cycle count register.

The problem is that there is only a 32-bit value (in either case), so we’d have to set up an interrupt to call on overflow and blah-blah-blah. So much more fun to get the v8 Cog done and use the PMU PMCCNTR_EL0 register, which is a proper 64-bit register. Ooh, and it can be set to tick every 64 cycles too, giving a larger range, and can even be set not to tick when various filters apply. I think that’s supposed to allow not counting the time taken to do a cache fiddle or whatever. Time for some smart person to pay me to get to work on the v8….
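
For what it's worth, here is a sketch of how a wrapping 32-bit counter could be widened to 64 bits in software. It is not the overflow-interrupt scheme mentioned above: it swaps in polling instead, and it is only correct if the counter is read at least once per wrap period. At an assumed 1 GHz clock that period is about 4.3 seconds, or about 275 seconds if the counter ticks every 64 cycles. The type and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical state for widening a wrapping 32-bit cycle counter. */
typedef struct {
    uint32_t last_raw;   /* most recent raw 32-bit reading */
    uint64_t high;       /* accumulated wraps, held in bits 32..63 */
} wide_counter;

/* Fold a new raw reading into the state and answer the 64-bit value.
   Assumes this is called at least once per wrap of the raw counter;
   a second wrap between two calls would go unnoticed. */
uint64_t wide_counter_update(wide_counter *c, uint32_t raw)
{
    if (raw < c->last_raw)           /* the raw counter wrapped around */
        c->high += UINT64_C(1) << 32;
    c->last_raw = raw;
    return c->high | raw;
}
```

A caller would start from a zeroed `wide_counter` and pass each raw reading of the cycle counter to `wide_counter_update`.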


tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful random insult:- Life by Norman Rockwell, but screenplay by Stephen King.