
Time primHighResClock truncated to 32 bits in 64 bits VMs.


Juan Vuletich-3

Time primHighResClock truncated to 32 bits in 64 bits VMs.

Hi Folks,

In 32 bit Cog VMs, `Time primHighResClock` answers LargePositiveInteger,
presumably up to 64 bits. This would mean a rollover in 167 years on a
3.5GHz machine.

But on 64 bit Cog and Stack Spur VMs, it answers a SmallInteger that is
truncated to 32 bits. This means a rollover in about one second.

I guess this is a bug. Answering a SmallInteger, with the 64-bit CPU
counter truncated to 60 bits, would be OK. I think it makes sense to
restrict the answer to SmallInteger to avoid allocation, and a rollover
roughly every 10 years (2^60 ticks at 3.5GHz) is not too much :)

Thanks,

--
Juan Vuletich
www.cuis-smalltalk.org
https://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev
@JuanVuletich
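
The rollover estimates in the post above can be checked with a few lines of C. This is only a sketch: it assumes the counter ticks once per cycle at a constant 3.5GHz, which real CPUs do not guarantee, and the function names are made up for illustration. Under that assumption a 32-bit counter wraps in about 1.2 seconds, a 60-bit one in about 10.4 years, a 62-bit one in about 41.8 years, and a full 64-bit one in about 167 years.

```c
/* Seconds until a counter of the given width wraps, assuming one tick
   per cycle at a constant clock_hz (an idealisation). */
double rollover_seconds(int bits, double clock_hz)
{
    double ticks = 1.0;
    int i;
    for (i = 0; i < bits; i++)
        ticks *= 2.0;               /* ticks = 2^bits, exact in a double */
    return ticks / clock_hz;
}

/* The same figure expressed in years of 365.25 days. */
double rollover_years(int bits, double clock_hz)
{
    return rollover_seconds(bits, clock_hz) / (365.25 * 24.0 * 3600.0);
}
```

For example, `rollover_years(60, 3.5e9)` gives the figure for a counter that fits a 64-bit Spur SmallInteger.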

David T. Lewis

Re: Time primHighResClock truncated to 32 bits in 64 bits VMs.

On Thu, Dec 28, 2017 at 09:32:46AM -0300, Juan Vuletich wrote:

> [snip]

Attached is the #primHighResClock accessor for Squeak/Pharo users.

I don't see anything obviously wrong with the primitive, although maybe it
involves the handling of positive64BitIntegerFor: in the 64-bit VM.

The primitive is:

primitiveHighResClock
    "Return the value of the high resolution clock if this system has any. The exact frequency of the high res clock is undefined specifically so that we can use processor dependent instructions (like RDTSC). The only use for the high res clock is for profiling where we can allocate time based on sub-msec resolution of the high res clock. If no high-resolution counter is available, the platform should return zero."
    <export: true>
    self pop: 1.
    self push: (self positive64BitIntegerFor: self ioHighResClock).

And the platform support code does this:

sqLong
ioHighResClock(void)
{
    /* return the value of the high performance counter */
    sqLong value = 0;
#if defined(__GNUC__) && ( defined(i386) || defined(__i386) || defined(__i386__) \
            || defined(i486) || defined(__i486) || defined (__i486__) \
            || defined(intel) || defined(x86) || defined(i86pc) )
    __asm__ __volatile__ ("rdtsc" : "=A"(value));
#elif defined(__arm__) && (defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_7A__))
    /* tpr - do nothing for now; needs input from eliot to decide further */
#else
# error "no high res clock defined"
#endif
    return value;
}

Attachment: Time class-primHighResClock.st (1K)

Eliot Miranda-2

Re: Time primHighResClock truncated to 32 bits in 64 bits VMs.

In reply to this post by Juan Vuletich-3

Hi Juan,

On Thu, Dec 28, 2017 at 4:32 AM, Juan Vuletich <[hidden email]> wrote:

> Hi Folks,
>
> In 32 bit Cog VMs, `Time primHighResClock` answers LargePositiveInteger, presumably up to 64 bits. This would mean a rollover in 167 years on a 3.5GHz machine.
>
> But on 64 bit Cog and Stack Spur VMs, it answers a SmallInteger that is truncated to 32 bits. This means a rollover in about one second.

Are you sure? What's a test case? When I look at the source I don't see where this is happening:

platforms/Cross/vm/sq.h:sqLong ioHighResClock(void);

InterpreterPrimitives>>primitiveHighResClock
    "Return the value of the high resolution clock if this system has any. The exact frequency of the high res clock is undefined specifically so that we can use processor dependent instructions (like RDTSC). The only use for the high res clock is for profiling where we can allocate time based on sub-msec resolution of the high res clock. If no high-resolution counter is available, the platform should return zero."
    <export: true>
    self pop: 1.
    self push: (self positive64BitIntegerFor: self ioHighResClock).

And positive64BitIntegerFor: does not truncate to 32-bits:

StackInterpreter>>positive64BitIntegerFor: integerValue
    <api>
    <var: 'integerValue' type: #usqLong>
    <var: 'highWord' type: #'unsigned int'>
    "Answer a Large Positive Integer object for the given integer value. N.B. will *not* cause a GC."
    | newLargeInteger highWord sz |
    objectMemory hasSixtyFourBitImmediates
        ifTrue:
            [(self cCode: [integerValue] inSmalltalk: [integerValue bitAnd: 1 << 64 - 1]) <= objectMemory maxSmallInteger ifTrue:
                [^objectMemory integerObjectOf: integerValue].
             sz := 8]
        ifFalse:
            [(highWord := integerValue >> 32) = 0 ifTrue:
                [^self positive32BitIntegerFor: integerValue].
             sz := 5.
             (highWord := highWord >> 8) = 0 ifFalse:
                [sz := sz + 1.
                 (highWord := highWord >> 8) = 0 ifFalse:
                    [sz := sz + 1.
                     (highWord := highWord >> 8) = 0 ifFalse:[sz := sz + 1]]]].
    newLargeInteger := objectMemory
                        eeInstantiateSmallClassIndex: ClassLargePositiveIntegerCompactIndex
                        format: (objectMemory byteFormatForNumBytes: sz)
                        numSlots: 8 / objectMemory bytesPerOop.
    objectMemory storeLong64: 0 ofObject: newLargeInteger withValue: (objectMemory byteSwapped64IfBigEndian: integerValue).
    ^newLargeInteger

So on my reading, on 64-bits this answers un-truncated non-negative SmallIntegers up to 60 bits in length, and then overflows into 8 byte LargePositiveIntegers.
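
The two branches of positive64BitIntegerFor: can be mirrored in a small C model, which may make the reading above easier to check by hand. This is a sketch only, not VM code: the function names are hypothetical, and it assumes that the largest SmallInteger on 64-bit Spur is 2^60 - 1 (61-bit signed immediates).

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumption: maxSmallInteger on 64-bit Spur is 2^60 - 1. */
#define MAX_SMALL_INTEGER_64 ((UINT64_C(1) << 60) - 1)

/* 64-bit branch: the value is answered as an immediate SmallInteger
   if it fits, otherwise as an 8-byte LargePositiveInteger. */
bool fits_small_integer_64(uint64_t value)
{
    return value <= MAX_SMALL_INTEGER_64;
}

/* 32-bit branch: byte size chosen for the LargePositiveInteger.
   Answers 0 when the high word is zero, i.e. when the real code
   delegates to positive32BitIntegerFor: instead. */
int large_integer_bytes_32(uint64_t value)
{
    uint32_t high = (uint32_t)(value >> 32);
    int sz;

    if (high == 0)
        return 0;
    sz = 5;                          /* at least one byte above bit 31 */
    if ((high >>= 8) != 0) {
        sz++;
        if ((high >>= 8) != 0) {
            sz++;
            if ((high >>= 8) != 0)
                sz++;
        }
    }
    return sz;
}
```

Neither branch masks the value to 32 bits, which is consistent with the truncation happening before the value reaches this method.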

_,,,^..^,,,_

best, Eliot
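
On the question of a test case: one possible check is to collect a series of clock readings spread over a few seconds and look at them afterwards. The sketch below is in C rather than Smalltalk and uses hypothetical helper names. It is a heuristic, not a proof: on a GHz-class counter that has been running for more than a couple of seconds, readings that never exceed 32 bits, or that step backwards, suggest truncation.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* True if no sample has any bit set above bit 31. */
bool samples_fit_in_32_bits(const uint64_t *samples, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++)
        if ((samples[i] >> 32) != 0)
            return false;
    return true;
}

/* Number of times a sample is smaller than its predecessor, which is
   what a wrapping counter looks like when sampled in time order. */
size_t count_backward_steps(const uint64_t *samples, size_t count)
{
    size_t i, steps = 0;
    for (i = 1; i < count; i++)
        if (samples[i] < samples[i - 1])
            steps++;
    return steps;
}
```

In an image, the samples would come from repeated sends of `Time primHighResClock` with a short delay between them.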

Eliot Miranda-2

Re: Time primHighResClock truncated to 32 bits in 64 bits VMs.

In reply to this post by David T. Lewis

Hi David, Hi Juan,

On Thu, Dec 28, 2017 at 8:56 AM, David T. Lewis <[hidden email]> wrote:

> [snip]
>
> And the platform support code does this:
>
> sqLong
> ioHighResClock(void)
> {
> /* return the value of the high performance counter */
> sqLong value = 0;
> #if defined(__GNUC__) && ( defined(i386) || defined(__i386) || defined(__i386__) \
> || defined(i486) || defined(__i486) || defined (__i486__) \
> || defined(intel) || defined(x86) || defined(i86pc) )
> __asm__ __volatile__ ("rdtsc" : "=A"(value));
> #elif defined(__arm__) && (defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_7A__))
> /* tpr - do nothing for now; needs input from eliot to decide further */
> #else
> # error "no high res clock defined"
> #endif
> return value;
> }

Ah, OK. So this is the problem. This will answer %eax on 64-bit systems, which discards the upper 32 bits. rdtsc loads %edx:%eax with the 64-bit time stamp, but on 64 bits the in-line asm will simply move %eax to %rax. We have to rewrite that in-line assembler to construct %rax correctly from %edx and %eax.

_,,,^..^,,,_

best, Eliot

Tobias Pape

Re: Time primHighResClock truncated to 32 bits in 64 bits VMs.

Hi all,

> On 28.12.2017, at 18:06, Eliot Miranda <[hidden email]> wrote:
>
> Hi David, Hi Juan,
>
> [snip]
>
> Ah, OK. So this is the problem. This will answer %eax on 64-bit systems, which discards the upper 32 bits. rdtsc loads %edx:%eax with the 64-bit time stamp, but on 64 bits the in-line asm will simply move %eax to %rax. We have to rewrite that in-line assembler to construct %rax correctly from %edx and %eax.
>

Wikipedia points to this code: https://web.archive.org/web/20161215213659/http://www.cs.wm.edu/~kearns/001lab.d/rdtsc.html

"
unsigned long long int x;
unsigned a, d;

__asm__ volatile("rdtsc" : "=a" (a), "=d" (d));

return ((unsigned long long)a) | (((unsigned long long)d) << 32);
"

Which seems reasonable.

Best regards
-Tobias
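
Putting the snippet above into the shape of the platform function might look like the following sketch. It is not the committed fix: `read_high_res_clock` and `combine_tsc_halves` are hypothetical names, `uint64_t` stands in for `sqLong`, and the fallback of answering 0 on other architectures follows the primitive's comment rather than the `#error` in the quoted source. The point is that the two halves are requested separately with the "=a" and "=d" constraints, so the same code works on 32-bit and 64-bit x86.

```c
#include <stdint.h>

/* Join the two 32-bit halves that rdtsc leaves in %eax (low)
   and %edx (high). */
uint64_t combine_tsc_halves(uint32_t low, uint32_t high)
{
    return ((uint64_t)high << 32) | low;
}

/* Hypothetical stand-in for ioHighResClock(). Answers 0 where no
   counter is wired up in this sketch. */
uint64_t read_high_res_clock(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t low, high;
    /* Ask for %eax and %edx explicitly instead of the "=A" pair. */
    __asm__ __volatile__ ("rdtsc" : "=a"(low), "=d"(high));
    return combine_tsc_halves(low, high);
#else
    return 0;
#endif
}
```

Keeping the combination step in its own function makes it easy to check with fixed inputs, independent of the hardware counter.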


timrowledge

Re: Time primHighResClock truncated to 32 bits in 64 bits VMs.

In reply to this post by David T. Lewis

> On 28-12-2017, at 8:56 AM, David T. Lewis <[hidden email]> wrote:
> And the platform support code does this:
>
> sqLong
> ioHighResClock(void)
> {
> /* return the value of the high performance counter */
> sqLong value = 0;
> #if defined(__GNUC__) && ( defined(i386) || defined(__i386) || defined(__i386__) \
> || defined(i486) || defined(__i486) || defined (__i486__) \
> || defined(intel) || defined(x86) || defined(i86pc) )
> __asm__ __volatile__ ("rdtsc" : "=A"(value));
> #elif defined(__arm__) && (defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_7A__))
> /* tpr - do nothing for now; needs input from eliot to decide further */
> #else
> # error "no high res clock defined"
> #endif
> return value;
> }
>

As a reminder to fix the ARM part someday, and because I am *so* not going to mess with git right now, here is an extract from
https://www.raspberrypi.org/forums/viewtopic.php?t=30821 "RDTSC on ARM"
=====================
Issue 1. The use of x86 assembly code.
The FFTW project has a whole bunch of equivalents to rdtsc as part of its own profiling code. See https://github.com/vesperix/FFTW-for-AR ... el/cycle.h
Replace the errant lines in nbeesrc-jan-10-2013/src/nbee/globals/profiling-functions.c (lines 49-50) with

    volatile unsigned cc;
    static int init = 0;
    if(!init) {
        __asm__ __volatile__ ("mcr p15, 0, %0, c9, c12, 2" :: "r"(1<<31)); /* stop the cc */
        __asm__ __volatile__ ("mcr p15, 0, %0, c9, c12, 0" :: "r"(5)); /* initialize */
        __asm__ __volatile__ ("mcr p15, 0, %0, c9, c12, 1" :: "r"(1<<31)); /* start the cc */
        init = 1;
    }
    __asm__ __volatile__ ("mrc p15, 0, %0, c9, c13, 0" : "=r"(cc));
    return cc;
====================

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
All wiyht. Rho sritched mg kegtops awound?

Juan Vuletich-3

Re: Time primHighResClock truncated to 32 bits in 64 bits VMs.

In reply to this post by Eliot Miranda-2

Hi Eliot, Hi David,

On 12/28/2017 2:06 PM, Eliot Miranda wrote:

> [snip]
>
> Ah, OK. So this is the problem. This will answer %eax on 64-bit systems, which discards the upper 32 bits. rdtsc loads %edx:%eax with the 64-bit time stamp, but on 64 bits the in-line asm will simply move %eax to %rax. We have to rewrite that in-line assembler to construct %rax correctly from %edx and %eax.

Thanks!

Cheers,

-- 
Juan Vuletich
www.cuis-smalltalk.org
https://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev
@JuanVuletich

Tobias Pape

Re: Time primHighResClock truncated to 32 bits in 64 bits VMs.

In reply to this post by timrowledge

> On 28.12.2017, at 20:07, tim Rowledge <[hidden email]> wrote:
>
> [snip]
>
> As a reminder to fix the ARM part someday, and because I am *so* not going to mess with git right now, here is an extract from
> https://www.raspberrypi.org/forums/viewtopic.php?t=30821 "RDTSC on ARM"
>
> [snip]

Looks interesting, but since when is this supported?
When I look at the ARM1176JZF-S documentation (aka Raspberry Pi 1), http://infocenter.arm.com/help/topic/com.arm.doc.ddi0290g/Bihbeabc.html
the CP15 info (see p15 above) covers the c9 register, but not with c12 and c13 as the second register.

OTOH, for the ARM Cortex-A7 (aka Raspberry Pi 2), the registers _are_ documented as "Performance monitor Control": http://infocenter.arm.com/help/topic/com.arm.doc.ddi0464f/BABFIBHD.html

Do we want to be compatible with the Pi 2 and above only, or with older models as well?

Best regards
-Tobias


timrowledge

Re: Time primHighResClock truncated to 32 bits in 64 bits VMs.

> On 31-12-2017, at 4:37 AM, Tobias Pape <[hidden email]> wrote:
>> [snip]
>> As a reminder to fix the ARM part someday, and because I am *so* not going to mess with git right now, here is an extract from
>> https://www.raspberrypi.org/forums/viewtopic.php?t=30821 "RDTSC on ARM"
[snip]
>
>
> Looks interesting, but since when is this supported?
> When I look for the ARM1176JZF-S Docu (aka Raspberry Pi 1), http://infocenter.arm.com/help/topic/com.arm.doc.ddi0290g/Bihbeabc.html
> the CP15 info (see p15 above) has info on the c9 register, but not with c12 and c13 as second register.
>
> OTOH, from the ARM Cortex-A7 (aka Raspberry Pi 2), the registers _are_ documented as "Performance monitor Control": http://infocenter.arm.com/help/topic/com.arm.doc.ddi0464f/BABFIBHD.html
>
> Do we want to be pi2-above compatible or also below?

Well the older v6 cores are less interesting, at least in Pi-land, because rarer; about 65% of those sold are v7 and over 30% are v7/8. However, it would be lovely to give those older Pi 1’s and the newer Pi 0’s a hires clock and I suspect that we’d use the c15/c12 cycle count register.

The problem is that there is only a 32-bit value (in either case), so we’d have to set up an interrupt to call on overflow and blah-blah-blah. So much more fun to get the v8 Cog done and use the PMU PMCCNTR_EL0 register, which is a proper 64-bit reg. Ooh, and it can be set to tick every 64 cycles too, giving a larger range, and can even be set not to tick when various filters are set. I think that’s supposed to allow not counting the time taken to do a cache fiddle or whatever. Time for some smart person to pay me to get to work on the v8….
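
As a point of comparison with the overflow-interrupt approach, a 32-bit cycle counter can also be widened in software by polling. The sketch below is an assumption about how that could be done, not anything from the VM sources, and the type and function names are hypothetical. It only works if the counter is read at least once per wrap period; a missed wrap goes unnoticed.

```c
#include <stdint.h>

typedef struct {
    uint32_t last_raw;   /* most recent 32-bit hardware reading */
    uint64_t high_part;  /* accumulated wraps, already shifted left by 32 */
} wide_counter;

/* Feed in a fresh 32-bit reading and get back a 64-bit value.
   A reading smaller than the previous one is taken to mean that
   the hardware counter wrapped exactly once in between. */
uint64_t wide_counter_update(wide_counter *c, uint32_t raw)
{
    if (raw < c->last_raw)
        c->high_part += UINT64_C(1) << 32;
    c->last_raw = raw;
    return c->high_part | raw;
}
```

A caller would start from a zeroed `wide_counter` and pass each new reading of the cycle count register through `wide_counter_update`.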

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful random insult:- Life by Norman Rockwell, but screenplay by Stephen King.