PetitSmalltalk: now with a self-contained grammar for number literals

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
38 messages Options
12
Reply | Threaded
Open this post in threaded view
|

PetitSmalltalk: now with a self-contained grammar for number literals

Frank Shearar-3
Hi,

I branched off PetitSmalltalk-lr.50 and (finally) completed a number
grammar. I also completed (*) the GNU Smalltalk grammar + parsing,
which is currently in a separate package.

So, my question: I have a version merging my changes with
PetitSmalltalk-lr.56: shall I submit that (together with the other
versions in my branch), or in what other form may I present my work
for submission?

Thanks,

frank

(*) I may have missed any special literals - I think GNU Smalltalk has
an @@(1 2 3) array literal, for instance.

Reply | Threaded
Open this post in threaded view
|

Re: PetitSmalltalk: now with a self-contained grammar for number literals

Lukas Renggli
Hi Frank,

Sounds cool, where did you post your changes?

Lukas

On 3 September 2011 16:40, Frank Shearar <[hidden email]> wrote:

> Hi,
>
> I branched off PetitSmalltalk-lr.50 and (finally) completed a number
> grammar. I also completed (*) the GNU Smalltalk grammar + parsing,
> which is currently in a separate package.
>
> So, my question: I have a version merging my changes with
> PetitSmalltalk-lr.56: shall I submit that (together with the other
> versions in my branch), or in what other form may I present my work
> for submission?
>
> Thanks,
>
> frank
>
> (*) I may have missed any special literals - I think GNU Smalltalk has
> an @@(1 2 3) array literal, for instance.
>
>



--
Lukas Renggli
www.lukas-renggli.ch

Reply | Threaded
Open this post in threaded view
|

Re: PetitSmalltalk: now with a self-contained grammar for number literals

Frank Shearar-3
Hi Lukas,

I haven't :) mainly because I'm unsure where to put it - is there
perhaps a PP Inbox, or shall I just post the merged version, or what's
your preference? (How about an mcd between my merge and PP's head?)

frank

On 3 September 2011 15:47, Lukas Renggli <[hidden email]> wrote:

> Hi Frank,
>
> Sounds cool, where did you post your changes?
>
> Lukas
>
> On 3 September 2011 16:40, Frank Shearar <[hidden email]> wrote:
>> Hi,
>>
>> I branched off PetitSmalltalk-lr.50 and (finally) completed a number
>> grammar. I also completed (*) the GNU Smalltalk grammar + parsing,
>> which is currently in a separate package.
>>
>> So, my question: I have a version merging my changes with
>> PetitSmalltalk-lr.56: shall I submit that (together with the other
>> versions in my branch), or in what other form may I present my work
>> for submission?
>>
>> Thanks,
>>
>> frank
>>
>> (*) I may have missed any special literals - I think GNU Smalltalk has
>> an @@(1 2 3) array literal, for instance.
>>
>>
>
>
>
> --
> Lukas Renggli
> www.lukas-renggli.ch
>
>

Reply | Threaded
Open this post in threaded view
|

Re: PetitSmalltalk: now with a self-contained grammar for number literals

Lukas Renggli
On 3 September 2011 16:51, Frank Shearar <[hidden email]> wrote:
> Hi Lukas,
>
> I haven't :) mainly because I'm unsure where to put it - is there
> perhaps a PP Inbox, or shall I just post the merged version, or what's
> your preference? (How about an mcd between my merge and PP's head?)

Just put the .mcz at some public URL (dropbox, squeak source, ...) or
attach it to a mail.

Lukas

--
Lukas Renggli
www.lukas-renggli.ch

Reply | Threaded
Open this post in threaded view
|

Re: PetitSmalltalk: now with a self-contained grammar for number literals

Frank Shearar-3
On 3 September 2011 15:56, Lukas Renggli <[hidden email]> wrote:
> On 3 September 2011 16:51, Frank Shearar <[hidden email]> wrote:
>> Hi Lukas,
>>
>> I haven't :) mainly because I'm unsure where to put it - is there
>> perhaps a PP Inbox, or shall I just post the merged version, or what's
>> your preference? (How about an mcd between my merge and PP's head?)
>
> Just put the .mcz at some public URL (dropbox, squeak source, ...) or
> attach it to a mail.

Ah, great - here it is. You'll see I've written the grammar as a
separate class. That was really more to make what I'd done more
obvious and to minimise the change to PPSmalltalkGrammar, but perhaps
it's not a bad idea anyway: it's easy to see the number literal
subgrammar.

frank

> Lukas
>
> --
> Lukas Renggli
> www.lukas-renggli.ch
>
>

PetitSmalltalk-fbs.57.mcz (46K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: PetitSmalltalk: now with a self-contained grammar for number literals

Lukas Renggli
I think it is a good idea to have the number parser separate, after
all it might also make sense to use it separately.

It seems that the new Smalltalk grammar is significantly slower. The
benchmark PPSmalltalkClassesTests class>>#benchmark: that uses the
source code of the collection hierarchy and does not especially target
number literals runs 30% slower.

Also I see that "Number readFrom: ..." is still used within the
grammar. This seems to be a bit strange, no?

Lukas


On 3 September 2011 17:18, Frank Shearar <[hidden email]> wrote:

> On 3 September 2011 15:56, Lukas Renggli <[hidden email]> wrote:
>> On 3 September 2011 16:51, Frank Shearar <[hidden email]> wrote:
>>> Hi Lukas,
>>>
>>> I haven't :) mainly because I'm unsure where to put it - is there
>>> perhaps a PP Inbox, or shall I just post the merged version, or what's
>>> your preference? (How about an mcd between my merge and PP's head?)
>>
>> Just put the .mcz at some public URL (dropbox, squeak source, ...) or
>> attach it to a mail.
>
> Ah, great - here it is. You'll see I've written the grammar as a
> separate class. That was really more to make what I'd done more
> obvious and to minimise the change to PPSmalltalkGrammar, but perhaps
> it's not a bad idea anyway: it's easy to see the number literal
> subgrammar.
>
> frank
>
>> Lukas
>>
>> --
>> Lukas Renggli
>> www.lukas-renggli.ch
>>
>>
>



--
Lukas Renggli
www.lukas-renggli.ch

Reply | Threaded
Open this post in threaded view
|

Re: PetitSmalltalk: now with a self-contained grammar for number literals

Frank Shearar-3
On 3 September 2011 18:50, Lukas Renggli <[hidden email]> wrote:

> I think it is a good idea to have the number parser separate, after
> all it might also make sense to use it separately.
>
> It seems that the new Smalltalk grammar is significantly slower. The
> benchmark PPSmalltalkClassesTests class>>#benchmark: that uses the
> source code of the collection hierarchy and does not especially target
> number literals runs 30% slower.
>
> Also I see that "Number readFrom: ..." is still used within the
> grammar. This seems to be a bit strange, no?

Yes: it's a double-parse, which is a bit lame. First, we parse the
literal with PPSmalltalkNumberParser, which ensures that the thing
given to Number class >> #readFrom: is a well-formed token (so, in
particular, Squeak's Number doesn't get to see anything other than a
well-formed token).

It sounds like you're happy with the basic concept, so maybe I should
remove the Number class >> #readFrom: stuff, see if I can't remove the
performance issues, and resubmit the patch.

frank

> Lukas
>
>
> On 3 September 2011 17:18, Frank Shearar <[hidden email]> wrote:
>> On 3 September 2011 15:56, Lukas Renggli <[hidden email]> wrote:
>>> On 3 September 2011 16:51, Frank Shearar <[hidden email]> wrote:
>>>> Hi Lukas,
>>>>
>>>> I haven't :) mainly because I'm unsure where to put it - is there
>>>> perhaps a PP Inbox, or shall I just post the merged version, or what's
>>>> your preference? (How about an mcd between my merge and PP's head?)
>>>
>>> Just put the .mcz at some public URL (dropbox, squeak source, ...) or
>>> attach it to a mail.
>>
>> Ah, great - here it is. You'll see I've written the grammar as a
>> separate class. That was really more to make what I'd done more
>> obvious and to minimise the change to PPSmalltalkGrammar, but perhaps
>> it's not a bad idea anyway: it's easy to see the number literal
>> subgrammar.
>>
>> frank
>>
>>> Lukas
>>>
>>> --
>>> Lukas Renggli
>>> www.lukas-renggli.ch
>>>
>>>
>>
>
>
>
> --
> Lukas Renggli
> www.lukas-renggli.ch
>
>

Reply | Threaded
Open this post in threaded view
|

Re: PetitSmalltalk: now with a self-contained grammar for number literals

Lukas Renggli
That sounds like a good plan.

I would expect the PetitParser performance to be slightly slower than
the one of the hand-written number parser. However, I would not expect
a significant performance difference on normal source code where most
tokens are not numbers.

Lukas

On 3 September 2011 20:07, Frank Shearar <[hidden email]> wrote:

> On 3 September 2011 18:50, Lukas Renggli <[hidden email]> wrote:
>> I think it is a good idea to have the number parser separate, after
>> all it might also make sense to use it separately.
>>
>> It seems that the new Smalltalk grammar is significantly slower. The
>> benchmark PPSmalltalkClassesTests class>>#benchmark: that uses the
>> source code of the collection hierarchy and does not especially target
>> number literals runs 30% slower.
>>
>> Also I see that "Number readFrom: ..." is still used within the
>> grammar. This seems to be a bit strange, no?
>
> Yes: it's a double-parse, which is a bit lame. First, we parse the
> literal with PPSmalltalkNumberParser, which ensures that the thing
> given to Number class >> #readFrom: is a well-formed token (so, in
> particular, Squeak's Number doesn't get to see anything other than a
> well-formed token).
>
> It sounds like you're happy with the basic concept, so maybe I should
> remove the Number class >> #readFrom: stuff, see if I can't remove the
> performance issues, and resubmit the patch.
>
> frank
>
>> Lukas
>>
>>
>> On 3 September 2011 17:18, Frank Shearar <[hidden email]> wrote:
>>> On 3 September 2011 15:56, Lukas Renggli <[hidden email]> wrote:
>>>> On 3 September 2011 16:51, Frank Shearar <[hidden email]> wrote:
>>>>> Hi Lukas,
>>>>>
>>>>> I haven't :) mainly because I'm unsure where to put it - is there
>>>>> perhaps a PP Inbox, or shall I just post the merged version, or what's
>>>>> your preference? (How about an mcd between my merge and PP's head?)
>>>>
>>>> Just put the .mcz at some public URL (dropbox, squeak source, ...) or
>>>> attach it to a mail.
>>>
>>> Ah, great - here it is. You'll see I've written the grammar as a
>>> separate class. That was really more to make what I'd done more
>>> obvious and to minimise the change to PPSmalltalkGrammar, but perhaps
>>> it's not a bad idea anyway: it's easy to see the number literal
>>> subgrammar.
>>>
>>> frank
>>>
>>>> Lukas
>>>>
>>>> --
>>>> Lukas Renggli
>>>> www.lukas-renggli.ch
>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Lukas Renggli
>> www.lukas-renggli.ch
>>
>>
>
>



--
Lukas Renggli
www.lukas-renggli.ch

Reply | Threaded
Open this post in threaded view
|

Re: PetitSmalltalk: now with a self-contained grammar for number literals

Nicolas Cellier
In reply to this post by Frank Shearar-3
2011/9/3 Frank Shearar <[hidden email]>:

> On 3 September 2011 18:50, Lukas Renggli <[hidden email]> wrote:
>> I think it is a good idea to have the number parser separate, after
>> all it might also make sense to use it separately.
>>
>> It seems that the new Smalltalk grammar is significantly slower. The
>> benchmark PPSmalltalkClassesTests class>>#benchmark: that uses the
>> source code of the collection hierarchy and does not especially target
>> number literals runs 30% slower.
>>
>> Also I see that "Number readFrom: ..." is still used within the
>> grammar. This seems to be a bit strange, no?
>
> Yes: it's a double-parse, which is a bit lame. First, we parse the
> literal with PPSmalltalkNumberParser, which ensures that the thing
> given to Number class >> #readFrom: is a well-formed token (so, in
> particular, Squeak's Number doesn't get to see anything other than a
> well-formed token).
>
> It sounds like you're happy with the basic concept, so maybe I should
> remove the Number class >> #readFrom: stuff, see if I can't remove the
> performance issues, and resubmit the patch.
>
> frank
>

Yes, a NumberParser is essentially parsing, and this duplication sounds useless.
The main feature of interest in NumberParser that I consider a
requirement and should find its equivalence in a PetitNumberParser is:
- round a decimal representation to nearest Float
It's simple, just convert a Fraction asFloat in a single final step to
avoid cumulating round off errors - see
#makeFloatFromMantissa:exponent:base:

The second feature of interest in NumberParser is the ability to
parser LargeInteger efficiently by avoiding (10 * largeValue +
digitValue) loops, and replacing them with a log(n) cost.
This would be a simple thing to implement in a functional language.

Nicolas

>> Lukas
>>
>>
>> On 3 September 2011 17:18, Frank Shearar <[hidden email]> wrote:
>>> On 3 September 2011 15:56, Lukas Renggli <[hidden email]> wrote:
>>>> On 3 September 2011 16:51, Frank Shearar <[hidden email]> wrote:
>>>>> Hi Lukas,
>>>>>
>>>>> I haven't :) mainly because I'm unsure where to put it - is there
>>>>> perhaps a PP Inbox, or shall I just post the merged version, or what's
>>>>> your preference? (How about an mcd between my merge and PP's head?)
>>>>
>>>> Just put the .mcz at some public URL (dropbox, squeak source, ...) or
>>>> attach it to a mail.
>>>
>>> Ah, great - here it is. You'll see I've written the grammar as a
>>> separate class. That was really more to make what I'd done more
>>> obvious and to minimise the change to PPSmalltalkGrammar, but perhaps
>>> it's not a bad idea anyway: it's easy to see the number literal
>>> subgrammar.
>>>
>>> frank
>>>
>>>> Lukas
>>>>
>>>> --
>>>> Lukas Renggli
>>>> www.lukas-renggli.ch
>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Lukas Renggli
>> www.lukas-renggli.ch
>>
>>
>
>

Reply | Threaded
Open this post in threaded view
|

Re: PetitSmalltalk: now with a self-contained grammar for number literals

Frank Shearar-3
On 3 September 2011 19:35, Nicolas Cellier
<[hidden email]> wrote:

> 2011/9/3 Frank Shearar <[hidden email]>:
>> On 3 September 2011 18:50, Lukas Renggli <[hidden email]> wrote:
>>> I think it is a good idea to have the number parser separate, after
>>> all it might also make sense to use it separately.
>>>
>>> It seems that the new Smalltalk grammar is significantly slower. The
>>> benchmark PPSmalltalkClassesTests class>>#benchmark: that uses the
>>> source code of the collection hierarchy and does not especially target
>>> number literals runs 30% slower.
>>>
>>> Also I see that "Number readFrom: ..." is still used within the
>>> grammar. This seems to be a bit strange, no?
>>
>> Yes: it's a double-parse, which is a bit lame. First, we parse the
>> literal with PPSmalltalkNumberParser, which ensures that the thing
>> given to Number class >> #readFrom: is a well-formed token (so, in
>> particular, Squeak's Number doesn't get to see anything other than a
>> well-formed token).
>>
>> It sounds like you're happy with the basic concept, so maybe I should
>> remove the Number class >> #readFrom: stuff, see if I can't remove the
>> performance issues, and resubmit the patch.
>>
>> frank
>>
>
> Yes, a NumberParser is essentially parsing, and this duplication sounds useless.

It's bad for performance but it's definitely not useless: it removes
the platform-dependent issues around Number class >> #readFrom:. In
Squeak we use the ExtendedNumberParser which will quite happily accept
'1.' as a valid number. This way, #readFrom: definitely gets a valid
Smalltalk number literal.

frank

> The main feature of interest in NumberParser that I consider a
> requirement and should find its equivalence in a PetitNumberParser is:
> - round a decimal representation to nearest Float
> It's simple, just convert a Fraction asFloat in a single final step to
> avoid cumulating round off errors - see
> #makeFloatFromMantissa:exponent:base:
>
> The second feature of interest in NumberParser is the ability to
> parser LargeInteger efficiently by avoiding (10 * largeValue +
> digitValue) loops, and replacing them with a log(n) cost.
> This would be a simple thing to implement in a functional language.
>
> Nicolas
>
>>> Lukas
>>>
>>>
>>> On 3 September 2011 17:18, Frank Shearar <[hidden email]> wrote:
>>>> On 3 September 2011 15:56, Lukas Renggli <[hidden email]> wrote:
>>>>> On 3 September 2011 16:51, Frank Shearar <[hidden email]> wrote:
>>>>>> Hi Lukas,
>>>>>>
>>>>>> I haven't :) mainly because I'm unsure where to put it - is there
>>>>>> perhaps a PP Inbox, or shall I just post the merged version, or what's
>>>>>> your preference? (How about an mcd between my merge and PP's head?)
>>>>>
>>>>> Just put the .mcz at some public URL (dropbox, squeak source, ...) or
>>>>> attach it to a mail.
>>>>
>>>> Ah, great - here it is. You'll see I've written the grammar as a
>>>> separate class. That was really more to make what I'd done more
>>>> obvious and to minimise the change to PPSmalltalkGrammar, but perhaps
>>>> it's not a bad idea anyway: it's easy to see the number literal
>>>> subgrammar.
>>>>
>>>> frank
>>>>
>>>>> Lukas
>>>>>
>>>>> --
>>>>> Lukas Renggli
>>>>> www.lukas-renggli.ch
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Lukas Renggli
>>> www.lukas-renggli.ch
>>>
>>>
>>
>>
>
>

Reply | Threaded
Open this post in threaded view
|

Re: PetitSmalltalk: now with a self-contained grammar for number literals

Nicolas Cellier
2011/9/3 Frank Shearar <[hidden email]>:

> On 3 September 2011 19:35, Nicolas Cellier
> <[hidden email]> wrote:
>> 2011/9/3 Frank Shearar <[hidden email]>:
>>> On 3 September 2011 18:50, Lukas Renggli <[hidden email]> wrote:
>>>> I think it is a good idea to have the number parser separate, after
>>>> all it might also make sense to use it separately.
>>>>
>>>> It seems that the new Smalltalk grammar is significantly slower. The
>>>> benchmark PPSmalltalkClassesTests class>>#benchmark: that uses the
>>>> source code of the collection hierarchy and does not especially target
>>>> number literals runs 30% slower.
>>>>
>>>> Also I see that "Number readFrom: ..." is still used within the
>>>> grammar. This seems to be a bit strange, no?
>>>
>>> Yes: it's a double-parse, which is a bit lame. First, we parse the
>>> literal with PPSmalltalkNumberParser, which ensures that the thing
>>> given to Number class >> #readFrom: is a well-formed token (so, in
>>> particular, Squeak's Number doesn't get to see anything other than a
>>> well-formed token).
>>>
>>> It sounds like you're happy with the basic concept, so maybe I should
>>> remove the Number class >> #readFrom: stuff, see if I can't remove the
>>> performance issues, and resubmit the patch.
>>>
>>> frank
>>>
>>
>> Yes, a NumberParser is essentially parsing, and this duplication sounds useless.
>
> It's bad for performance but it's definitely not useless: it removes
> the platform-dependent issues around Number class >> #readFrom:. In
> Squeak we use the ExtendedNumberParser which will quite happily accept
> '1.' as a valid number. This way, #readFrom: definitely gets a valid
> Smalltalk number literal.
>
> frank
>

My question was why using NumberParser at all?
I indicated one essential feature of interest that should be ported to
a PetitNumberParser (round to nearest Float), and another optional one
for optimization.

Nicolas

>> The main feature of interest in NumberParser that I consider a
>> requirement and should find its equivalence in a PetitNumberParser is:
>> - round a decimal representation to nearest Float
>> It's simple, just convert a Fraction asFloat in a single final step to
>> avoid cumulating round off errors - see
>> #makeFloatFromMantissa:exponent:base:
>>
>> The second feature of interest in NumberParser is the ability to
>> parser LargeInteger efficiently by avoiding (10 * largeValue +
>> digitValue) loops, and replacing them with a log(n) cost.
>> This would be a simple thing to implement in a functional language.
>>
>> Nicolas
>>
>>>> Lukas
>>>>
>>>>
>>>> On 3 September 2011 17:18, Frank Shearar <[hidden email]> wrote:
>>>>> On 3 September 2011 15:56, Lukas Renggli <[hidden email]> wrote:
>>>>>> On 3 September 2011 16:51, Frank Shearar <[hidden email]> wrote:
>>>>>>> Hi Lukas,
>>>>>>>
>>>>>>> I haven't :) mainly because I'm unsure where to put it - is there
>>>>>>> perhaps a PP Inbox, or shall I just post the merged version, or what's
>>>>>>> your preference? (How about an mcd between my merge and PP's head?)
>>>>>>
>>>>>> Just put the .mcz at some public URL (dropbox, squeak source, ...) or
>>>>>> attach it to a mail.
>>>>>
>>>>> Ah, great - here it is. You'll see I've written the grammar as a
>>>>> separate class. That was really more to make what I'd done more
>>>>> obvious and to minimise the change to PPSmalltalkGrammar, but perhaps
>>>>> it's not a bad idea anyway: it's easy to see the number literal
>>>>> subgrammar.
>>>>>
>>>>> frank
>>>>>
>>>>>> Lukas
>>>>>>
>>>>>> --
>>>>>> Lukas Renggli
>>>>>> www.lukas-renggli.ch
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Lukas Renggli
>>>> www.lukas-renggli.ch
>>>>
>>>>
>>>
>>>
>>
>>
>
>

Reply | Threaded
Open this post in threaded view
|

Re: PetitSmalltalk: now with a self-contained grammar for number literals

Frank Shearar-3
On 3 September 2011 20:43, Nicolas Cellier
<[hidden email]> wrote:

> 2011/9/3 Frank Shearar <[hidden email]>:
>> On 3 September 2011 19:35, Nicolas Cellier
>> <[hidden email]> wrote:
>>> 2011/9/3 Frank Shearar <[hidden email]>:
>>>> On 3 September 2011 18:50, Lukas Renggli <[hidden email]> wrote:
>>>>> I think it is a good idea to have the number parser separate, after
>>>>> all it might also make sense to use it separately.
>>>>>
>>>>> It seems that the new Smalltalk grammar is significantly slower. The
>>>>> benchmark PPSmalltalkClassesTests class>>#benchmark: that uses the
>>>>> source code of the collection hierarchy and does not especially target
>>>>> number literals runs 30% slower.
>>>>>
>>>>> Also I see that "Number readFrom: ..." is still used within the
>>>>> grammar. This seems to be a bit strange, no?
>>>>
>>>> Yes: it's a double-parse, which is a bit lame. First, we parse the
>>>> literal with PPSmalltalkNumberParser, which ensures that the thing
>>>> given to Number class >> #readFrom: is a well-formed token (so, in
>>>> particular, Squeak's Number doesn't get to see anything other than a
>>>> well-formed token).
>>>>
>>>> It sounds like you're happy with the basic concept, so maybe I should
>>>> remove the Number class >> #readFrom: stuff, see if I can't remove the
>>>> performance issues, and resubmit the patch.
>>>>
>>>> frank
>>>>
>>>
>>> Yes, a NumberParser is essentially parsing, and this duplication sounds useless.
>>
>> It's bad for performance but it's definitely not useless: it removes
>> the platform-dependent issues around Number class >> #readFrom:. In
>> Squeak we use the ExtendedNumberParser which will quite happily accept
>> '1.' as a valid number. This way, #readFrom: definitely gets a valid
>> Smalltalk number literal.
>>
>> frank
>>
>
> My question was why using NumberParser at all?
> I indicated one essential feature of interest that should be ported to
> a PetitNumberParser (round to nearest Float), and another optional one
> for optimization.

I'm not arguing at all. The first instance - turning the radix digits
into a number - is trivial to remove. The second use of NumberParser's
next on the hit list.

If you have any further suggestions on efficient conversions from
collection-of-digit-characters to numbers that aren't already in
NumberParser, I'm all ears.

My use of Number class >> #readFrom: was entirely a pragmatic first
step - it let me write the grammar while keeping the test suite green.

Now I need just address your and Lukas' suggestions, before I
resubmit. Most importantly, I know that the basic idea/structure's not
crazy.

frank

> Nicolas
>
>>> The main feature of interest in NumberParser that I consider a
>>> requirement and should find its equivalence in a PetitNumberParser is:
>>> - round a decimal representation to nearest Float
>>> It's simple, just convert a Fraction asFloat in a single final step to
>>> avoid cumulating round off errors - see
>>> #makeFloatFromMantissa:exponent:base:
>>>
>>> The second feature of interest in NumberParser is the ability to
>>> parser LargeInteger efficiently by avoiding (10 * largeValue +
>>> digitValue) loops, and replacing them with a log(n) cost.
>>> This would be a simple thing to implement in a functional language.
>>>
>>> Nicolas
>>>
>>>>> Lukas
>>>>>
>>>>>
>>>>> On 3 September 2011 17:18, Frank Shearar <[hidden email]> wrote:
>>>>>> On 3 September 2011 15:56, Lukas Renggli <[hidden email]> wrote:
>>>>>>> On 3 September 2011 16:51, Frank Shearar <[hidden email]> wrote:
>>>>>>>> Hi Lukas,
>>>>>>>>
>>>>>>>> I haven't :) mainly because I'm unsure where to put it - is there
>>>>>>>> perhaps a PP Inbox, or shall I just post the merged version, or what's
>>>>>>>> your preference? (How about an mcd between my merge and PP's head?)
>>>>>>>
>>>>>>> Just put the .mcz at some public URL (dropbox, squeak source, ...) or
>>>>>>> attach it to a mail.
>>>>>>
>>>>>> Ah, great - here it is. You'll see I've written the grammar as a
>>>>>> separate class. That was really more to make what I'd done more
>>>>>> obvious and to minimise the change to PPSmalltalkGrammar, but perhaps
>>>>>> it's not a bad idea anyway: it's easy to see the number literal
>>>>>> subgrammar.
>>>>>>
>>>>>> frank
>>>>>>
>>>>>>> Lukas
>>>>>>>
>>>>>>> --
>>>>>>> Lukas Renggli
>>>>>>> www.lukas-renggli.ch
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Lukas Renggli
>>>>> www.lukas-renggli.ch
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

Reply | Threaded
Open this post in threaded view
|

Re: PetitSmalltalk: now with a self-contained grammar for number literals

Stéphane Ducasse
In reply to this post by Lukas Renggli


> I think it is a good idea to have the number parser separate, after
> all it might also make sense to use it separately.

+1.
Modularity and composibility is a strong apsect of PP so we should use them.
Especially since in the future we want to use petitSmalltalk for pharo (to be checked speedwise).



>
> It seems that the new Smalltalk grammar is significantly slower. The
> benchmark PPSmalltalkClassesTests class>>#benchmark: that uses the
> source code of the collection hierarchy and does not especially target
> number literals runs 30% slower.
>
> Also I see that "Number readFrom: ..." is still used within the
> grammar. This seems to be a bit strange, no?
>
> Lukas
>
>
> On 3 September 2011 17:18, Frank Shearar <[hidden email]> wrote:
>> On 3 September 2011 15:56, Lukas Renggli <[hidden email]> wrote:
>>> On 3 September 2011 16:51, Frank Shearar <[hidden email]> wrote:
>>>> Hi Lukas,
>>>>
>>>> I haven't :) mainly because I'm unsure where to put it - is there
>>>> perhaps a PP Inbox, or shall I just post the merged version, or what's
>>>> your preference? (How about an mcd between my merge and PP's head?)
>>>
>>> Just put the .mcz at some public URL (dropbox, squeak source, ...) or
>>> attach it to a mail.
>>
>> Ah, great - here it is. You'll see I've written the grammar as a
>> separate class. That was really more to make what I'd done more
>> obvious and to minimise the change to PPSmalltalkGrammar, but perhaps
>> it's not a bad idea anyway: it's easy to see the number literal
>> subgrammar.
>>
>> frank
>>
>>> Lukas
>>>
>>> --
>>> Lukas Renggli
>>> www.lukas-renggli.ch
>>>
>>>
>>
>
>
>
> --
> Lukas Renggli
> www.lukas-renggli.ch
>


Reply | Threaded
Open this post in threaded view
|

Re: PetitSmalltalk: now with a self-contained grammar for number literals

Frank Shearar-3
In reply to this post by Nicolas Cellier
On 3 September 2011 19:35, Nicolas Cellier
<[hidden email]> wrote:

> 2011/9/3 Frank Shearar <[hidden email]>:
>> On 3 September 2011 18:50, Lukas Renggli <[hidden email]> wrote:
>>> I think it is a good idea to have the number parser separate, after
>>> all it might also make sense to use it separately.
>>>
>>> It seems that the new Smalltalk grammar is significantly slower. The
>>> benchmark PPSmalltalkClassesTests class>>#benchmark: that uses the
>>> source code of the collection hierarchy and does not especially target
>>> number literals runs 30% slower.
>>>
>>> Also I see that "Number readFrom: ..." is still used within the
>>> grammar. This seems to be a bit strange, no?
>>
>> Yes: it's a double-parse, which is a bit lame. First, we parse the
>> literal with PPSmalltalkNumberParser, which ensures that the thing
>> given to Number class >> #readFrom: is a well-formed token (so, in
>> particular, Squeak's Number doesn't get to see anything other than a
>> well-formed token).
>>
>> It sounds like you're happy with the basic concept, so maybe I should
>> remove the Number class >> #readFrom: stuff, see if I can't remove the
>> performance issues, and resubmit the patch.
>>
>> frank
>>
>
> Yes, a NumberParser is essentially parsing, and this duplication sounds useless.
> The main feature of interest in NumberParser that I consider a
> requirement and should find its equivalence in a PetitNumberParser is:
> - round a decimal representation to nearest Float
> It's simple, just convert a Fraction asFloat in a single final step to
> avoid cumulating round off errors - see
> #makeFloatFromMantissa:exponent:base:
>
> The second feature of interest in NumberParser is the ability to
> parser LargeInteger efficiently by avoiding (10 * largeValue +
> digitValue) loops, and replacing them with a log(n) cost.
> This would be a simple thing to implement in a functional language.
Hopefully this won't offend your sensibilities too much :). It does,
in fact, use 10* loops - I wrote an experimental "front half * rear
half" recursion, which was slower in my benchmarks.

This version has the grammar and parser doing no string->number
conversion at all. PPSmalltalkNumberMaker supplies a number of utility
methods designed to stop one from making malformed numbers. It also
supplies a builder interface that the parser uses to construct
numbers.

frank

> Nicolas
>
>>> Lukas
>>>
>>>
>>> On 3 September 2011 17:18, Frank Shearar <[hidden email]> wrote:
>>>> On 3 September 2011 15:56, Lukas Renggli <[hidden email]> wrote:
>>>>> On 3 September 2011 16:51, Frank Shearar <[hidden email]> wrote:
>>>>>> Hi Lukas,
>>>>>>
>>>>>> I haven't :) mainly because I'm unsure where to put it - is there
>>>>>> perhaps a PP Inbox, or shall I just post the merged version, or what's
>>>>>> your preference? (How about an mcd between my merge and PP's head?)
>>>>>
>>>>> Just put the .mcz at some public URL (dropbox, squeak source, ...) or
>>>>> attach it to a mail.
>>>>
>>>> Ah, great - here it is. You'll see I've written the grammar as a
>>>> separate class. That was really more to make what I'd done more
>>>> obvious and to minimise the change to PPSmalltalkGrammar, but perhaps
>>>> it's not a bad idea anyway: it's easy to see the number literal
>>>> subgrammar.
>>>>
>>>> frank
>>>>
>>>>> Lukas
>>>>>
>>>>> --
>>>>> Lukas Renggli
>>>>> www.lukas-renggli.ch
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Lukas Renggli
>>> www.lukas-renggli.ch
>>>
>>>
>>
>>
>
>

PetitSmalltalk-fbs.60.mcz (56K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: PetitSmalltalk: now with a self-contained grammar for number literals

Frank Shearar-3
Nicolas, you'd mentioned a log_2 n implementation for number parsing.
I think by that you mean the division of digits into two "packets"
that are processed and then added?

I tried a straight-forward fold (using the * 10 that you said you didn't like!)

Time millisecondsToRun: [10000 timesRepeat:
['123456789012345678901234567890' inject: 0 into: [:acc :each | 10 *
acc + each digitValue]]] => 818

and what I think you meant:

| f |
f := 1. "Just to shut the compiler up about the self-reference."
f := [:someDigits | | len |
        len := someDigits size.
        (len = 1)
                ifTrue: [someDigits first digitValue]
                ifFalse: [ | leftHalf rightHalf |
                        leftHalf := someDigits first: len // 2.
                        rightHalf := someDigits allButFirst: len // 2.
                        ((f value: (leftHalf)) * (10 raisedTo: rightHalf size)) + (f value:
rightHalf)]].
Time millisecondsToRun: [10000 timesRepeat: [f value:
'123456789012345678901234567890']] => 1508

I went with the straight-forward fold because it was simpler and my
(quite possibly wrong/inefficient) attempt at a log_2 n version was a
lot less readable.

I'd greatly appreciate being gently corrected!

frank

On 14 September 2011 20:26, Frank Shearar <[hidden email]> wrote:

> On 3 September 2011 19:35, Nicolas Cellier
> <[hidden email]> wrote:
>> 2011/9/3 Frank Shearar <[hidden email]>:
>>> On 3 September 2011 18:50, Lukas Renggli <[hidden email]> wrote:
>>>> I think it is a good idea to have the number parser separate, after
>>>> all it might also make sense to use it separately.
>>>>
>>>> It seems that the new Smalltalk grammar is significantly slower. The
>>>> benchmark PPSmalltalkClassesTests class>>#benchmark: that uses the
>>>> source code of the collection hierarchy and does not especially target
>>>> number literals runs 30% slower.
>>>>
>>>> Also I see that "Number readFrom: ..." is still used within the
>>>> grammar. This seems to be a bit strange, no?
>>>
>>> Yes: it's a double-parse, which is a bit lame. First, we parse the
>>> literal with PPSmalltalkNumberParser, which ensures that the thing
>>> given to Number class >> #readFrom: is a well-formed token (so, in
>>> particular, Squeak's Number doesn't get to see anything other than a
>>> well-formed token).
>>>
>>> It sounds like you're happy with the basic concept, so maybe I should
>>> remove the Number class >> #readFrom: stuff, see if I can't remove the
>>> performance issues, and resubmit the patch.
>>>
>>> frank
>>>
>>
>> Yes, a NumberParser is essentially parsing, and this duplication sounds useless.
>> The main feature of interest in NumberParser that I consider a
>> requirement and should find its equivalence in a PetitNumberParser is:
>> - round a decimal representation to nearest Float
>> It's simple, just convert a Fraction asFloat in a single final step to
>> avoid cumulating round off errors - see
>> #makeFloatFromMantissa:exponent:base:
>>
>> The second feature of interest in NumberParser is the ability to
>> parser LargeInteger efficiently by avoiding (10 * largeValue +
>> digitValue) loops, and replacing them with a log(n) cost.
>> This would be a simple thing to implement in a functional language.
>
> Hopefully this won't offend your sensibilities too much :). It does,
> in fact, use 10* loops - I wrote an experimental "front half * rear
> half" recursion, which was slower in my benchmarks.
>
> This version has the grammar and parser doing no string->number
> conversion at all. PPSmalltalkNumberMaker supplies a number of utility
> methods designed to stop one from making malformed numbers. It also
> supplies a builder interface that the parser uses to construct
> numbers.
>
> frank
>
>> Nicolas
>>
>>>> Lukas
>>>>
>>>>
>>>> On 3 September 2011 17:18, Frank Shearar <[hidden email]> wrote:
>>>>> On 3 September 2011 15:56, Lukas Renggli <[hidden email]> wrote:
>>>>>> On 3 September 2011 16:51, Frank Shearar <[hidden email]> wrote:
>>>>>>> Hi Lukas,
>>>>>>>
>>>>>>> I haven't :) mainly because I'm unsure where to put it - is there
>>>>>>> perhaps a PP Inbox, or shall I just post the merged version, or what's
>>>>>>> your preference? (How about an mcd between my merge and PP's head?)
>>>>>>
>>>>>> Just put the .mcz at some public URL (dropbox, squeak source, ...) or
>>>>>> attach it to a mail.
>>>>>
>>>>> Ah, great - here it is. You'll see I've written the grammar as a
>>>>> separate class. That was really more to make what I'd done more
>>>>> obvious and to minimise the change to PPSmalltalkGrammar, but perhaps
>>>>> it's not a bad idea anyway: it's easy to see the number literal
>>>>> subgrammar.
>>>>>
>>>>> frank
>>>>>
>>>>>> Lukas
>>>>>>
>>>>>> --
>>>>>> Lukas Renggli
>>>>>> www.lukas-renggli.ch
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Lukas Renggli
>>>> www.lukas-renggli.ch
>>>>
>>>>
>>>
>>>
>>
>>
>

Reply | Threaded
Open this post in threaded view
|

Re: PetitSmalltalk: now with a self-contained grammar for number literals

Nicolas Cellier
2011/9/25 Frank Shearar <[hidden email]>:

> Nicolas, you'd mentioned a log_2 n implementation for number parsing.
> I think by that you mean the division of digits into two "packets"
> that are processed and then added?
>
> I tried a straight-forward fold (using the * 10 that you said you didn't like!)
>
> Time millisecondsToRun: [10000 timesRepeat:
> ['123456789012345678901234567890' inject: 0 into: [:acc :each | 10 *
> acc + each digitValue]]] => 818
>
> and what I think you meant:
>
> | f |
> f := 1. "Just to shut the compiler up about the self-reference."
> f := [:someDigits | | len |
>        len := someDigits size.
>        (len = 1)
>                ifTrue: [someDigits first digitValue]
>                ifFalse: [ | leftHalf rightHalf |
>                        leftHalf := someDigits first: len // 2.
>                        rightHalf := someDigits allButFirst: len // 2.
>                        ((f value: (leftHalf)) * (10 raisedTo: rightHalf size)) + (f value:
> rightHalf)]].
> Time millisecondsToRun: [10000 timesRepeat: [f value:
> '123456789012345678901234567890']] => 1508
>
> I went with the straight-forward fold because it was simpler and my
> (quite possibly wrong/inefficient) attempt at a log_2 n version was a
> lot less readable.
>
> I'd greatly appreciate being gently corrected!
>
> frank
>

Hmm, let's see.

Suppose that SmallIntegers are on 3 digits, and Large begins at 4.
And let's compose a 32 digits number 12345678901234567890123456789012

If I do the naive loop the first 3 loops (6 ops) give small ints.
Then the next 29 loops (58 ops) are performed with large ints.

Now, if I perform these ops:

((1234 * (10**4) + 5678) * (10**8) + (9012 * (10**4) + 3456)) *
(10**16) + ((7890 * (10**16) + 1234) * (10**8) + (5678 * (10**4) +
9012))

This time, I have 24 loops with small int (48 ops) and 8 loops (16
ops) with large ints for evaluating the 4 digits ints.
Plus 14 operations to assemble these large ints into the final result.
Plus a number of operations required to evaluate the powers of 10:
without cache, this is

10**4 = 10 squared squared (1 small op + 1 large) * 4 times.
10**8 = 10 squared squared squared (1 small op + 2 large) * 2 times.
10**16 = 10 squared squared squared (1 small op + 3 large) * 1 time.

The total is 55 ops on small ints + 41 ops on large ints.
So the second algorithm costs more operations on total, but cheaper ones.
Suppose every operation on SmallInteger costs 1 and LargeInteger costs 10,
we have a cost of 586 for the first, 465 for the second.

Also, the cost of Large int operations is not constant.
It is proportional to max(length1,length2) for + and to length1*length2 for *.
So, we'll consider length(10**n)=n.
In first case, length2=1 and length1 is varying linearly between 4 and
32, thus a mean of 18*2 digit ops per loop * 29 loops.

In the second case, the cost in digit ops is:
1) (3*1+4) per loop * 8 loops for forming the 4 digits elementary large ints
2) (4*4+8) per loop * 4 loops for forming the 8 digits large ints
3) (8*8+16) per loop * 2 loops for forming the 16 digits large ints
4) (16*16+32) for the last iteration

Plus the costs of evaluating powers of 10:
1) (100 squared) is evaluated 7 times * (2*2) digit ops
2) (10000 squared) is evaluated 3 times * (4*4) digit ops
3) (100000000 squared) is evaluated 1 times * (8*8) digit ops

grand total is 18*2*29 = 1044 digit ops for the naive algorithm
For the divide and conquer form, that is ((3*1+4)*8) + ((4*4+8)*4) +
((8*8+16)*2) + (16*16+32) + (2*2*7) + (4*4*3) + (8*8)  = 740

In your 30 decimal digits example, large integers are formed of 10
decimal digits (SmallInteger maxVal + 1) printString size, so we would
have to translate above count...
But it's boring, we can lazily bench it:

| stream |
stream := '123456789012345678901234567890' readStream.
{
[stream reset; nextFloat] bench.
[stream reset. Number readFrom: stream] bench.
}
 #('15,600 per second.' '23,500 per second.') "Old 4.2 VM"
 #('56,100 per second.' '97,700 per second.') "recent COG VM"

Stream>>nextFloat is a naive loop with almost every operation inlined.
Number>>readFrom: traverses a bunch of sends before the real
operations take place.

The overhead equlibrates around 12 decimal digits in non COG VM !
That's why it was already faster for scanning a 15 digits Float (non
counting the fact that nextFloat fails to answer the nearest Float).
Now, in COG, it equilibrates around 20 digits.

If we want to draw the cost versus log of number of digits:

| gen counts serie cost1 cost2 |
gen := Generator on: [:g | [g yield: '1234567890' atRandom] repeat].
counts := (2 to: 13) collect: [:e | 2 raisedTo: e].
serie := counts collect: [:count |
        String withAll: (gen next: count)].
cost1 := serie collect: [:e |
        [1000 timesRepeat: [e readStream nextFloat]] timeToRun].
cost2 := serie collect: [:e |
        [1000 timesRepeat: [(Number readFrom: e readStream) asFloat]] timeToRun].
{counts. cost1. cost2.}
 #("NON COG VM"
 #(4 8 16 32 64 128 256 512 1024 2048 4096 8192)
 #(2 2 25 65 146 304 647 1412 3288 8393 24021 77189)
 #(6 6 21 51 94 177 361 742 1550 3872 8471 22647))

 #("COG VM"
#(4 8 16 32 64 128 256 512 1024 2048 4096 8192)
 #(0 2 3 20 56 139 334 838 2323 7010 23614 85925)
#(2 0 4 13 24 53 116 264 626 1646 4552 14191))

It's not a log_2 cost at the end, but it's already some gain.
It's possible that above costs are dominated by garbage collection,
but I don't know how to measure that accurately...

Nicolas

> On 14 September 2011 20:26, Frank Shearar <[hidden email]> wrote:
>> On 3 September 2011 19:35, Nicolas Cellier
>> <[hidden email]> wrote:
>>> 2011/9/3 Frank Shearar <[hidden email]>:
>>>> On 3 September 2011 18:50, Lukas Renggli <[hidden email]> wrote:
>>>>> I think it is a good idea to have the number parser separate, after
>>>>> all it might also make sense to use it separately.
>>>>>
>>>>> It seems that the new Smalltalk grammar is significantly slower. The
>>>>> benchmark PPSmalltalkClassesTests class>>#benchmark: that uses the
>>>>> source code of the collection hierarchy and does not especially target
>>>>> number literals runs 30% slower.
>>>>>
>>>>> Also I see that "Number readFrom: ..." is still used within the
>>>>> grammar. This seems to be a bit strange, no?
>>>>
>>>> Yes: it's a double-parse, which is a bit lame. First, we parse the
>>>> literal with PPSmalltalkNumberParser, which ensures that the thing
>>>> given to Number class >> #readFrom: is a well-formed token (so, in
>>>> particular, Squeak's Number doesn't get to see anything other than a
>>>> well-formed token).
>>>>
>>>> It sounds like you're happy with the basic concept, so maybe I should
>>>> remove the Number class >> #readFrom: stuff, see if I can't remove the
>>>> performance issues, and resubmit the patch.
>>>>
>>>> frank
>>>>
>>>
>>> Yes, a NumberParser is essentially parsing, and this duplication sounds useless.
>>> The main feature of interest in NumberParser that I consider a
>>> requirement and should find its equivalence in a PetitNumberParser is:
>>> - round a decimal representation to nearest Float
>>> It's simple, just convert a Fraction asFloat in a single final step to
>>> avoid cumulating round off errors - see
>>> #makeFloatFromMantissa:exponent:base:
>>>
>>> The second feature of interest in NumberParser is the ability to
>>> parser LargeInteger efficiently by avoiding (10 * largeValue +
>>> digitValue) loops, and replacing them with a log(n) cost.
>>> This would be a simple thing to implement in a functional language.
>>
>> Hopefully this won't offend your sensibilities too much :). It does,
>> in fact, use 10* loops - I wrote an experimental "front half * rear
>> half" recursion, which was slower in my benchmarks.
>>
>> This version has the grammar and parser doing no string->number
>> conversion at all. PPSmalltalkNumberMaker supplies a number of utility
>> methods designed to stop one from making malformed numbers. It also
>> supplies a builder interface that the parser uses to construct
>> numbers.
>>
>> frank
>>
>>> Nicolas
>>>
>>>>> Lukas
>>>>>
>>>>>
>>>>> On 3 September 2011 17:18, Frank Shearar <[hidden email]> wrote:
>>>>>> On 3 September 2011 15:56, Lukas Renggli <[hidden email]> wrote:
>>>>>>> On 3 September 2011 16:51, Frank Shearar <[hidden email]> wrote:
>>>>>>>> Hi Lukas,
>>>>>>>>
>>>>>>>> I haven't :) mainly because I'm unsure where to put it - is there
>>>>>>>> perhaps a PP Inbox, or shall I just post the merged version, or what's
>>>>>>>> your preference? (How about an mcd between my merge and PP's head?)
>>>>>>>
>>>>>>> Just put the .mcz at some public URL (dropbox, squeak source, ...) or
>>>>>>> attach it to a mail.
>>>>>>
>>>>>> Ah, great - here it is. You'll see I've written the grammar as a
>>>>>> separate class. That was really more to make what I'd done more
>>>>>> obvious and to minimise the change to PPSmalltalkGrammar, but perhaps
>>>>>> it's not a bad idea anyway: it's easy to see the number literal
>>>>>> subgrammar.
>>>>>>
>>>>>> frank
>>>>>>
>>>>>>> Lukas
>>>>>>>
>>>>>>> --
>>>>>>> Lukas Renggli
>>>>>>> www.lukas-renggli.ch
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Lukas Renggli
>>>>> www.lukas-renggli.ch
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
>

Reply | Threaded
Open this post in threaded view
|

Re: PetitSmalltalk: now with a self-contained grammar for number literals

Stéphane Ducasse
thanks for the analysis I ejoyed reading it.

Stef
On Sep 25, 2011, at 4:21 PM, Nicolas Cellier wrote:

> 2011/9/25 Frank Shearar <[hidden email]>:
>> Nicolas, you'd mentioned a log_2 n implementation for number parsing.
>> I think by that you mean the division of digits into two "packets"
>> that are processed and then added?
>>
>> I tried a straight-forward fold (using the * 10 that you said you didn't like!)
>>
>> Time millisecondsToRun: [10000 timesRepeat:
>> ['123456789012345678901234567890' inject: 0 into: [:acc :each | 10 *
>> acc + each digitValue]]] => 818
>>
>> and what I think you meant:
>>
>> | f |
>> f := 1. "Just to shut the compiler up about the self-reference."
>> f := [:someDigits | | len |
>>        len := someDigits size.
>>        (len = 1)
>>                ifTrue: [someDigits first digitValue]
>>                ifFalse: [ | leftHalf rightHalf |
>>                        leftHalf := someDigits first: len // 2.
>>                        rightHalf := someDigits allButFirst: len // 2.
>>                        ((f value: (leftHalf)) * (10 raisedTo: rightHalf size)) + (f value:
>> rightHalf)]].
>> Time millisecondsToRun: [10000 timesRepeat: [f value:
>> '123456789012345678901234567890']] => 1508
>>
>> I went with the straight-forward fold because it was simpler and my
>> (quite possibly wrong/inefficient) attempt at a log_2 n version was a
>> lot less readable.
>>
>> I'd greatly appreciate being gently corrected!
>>
>> frank
>>
>
> Hmm, let's see.
>
> Suppose that SmallIntegers are on 3 digits, and Large begins at 4.
> And let's compose a 32 digits number 12345678901234567890123456789012
>
> If I do the naive loop the first 3 loops (6 ops) give small ints.
> Then the next 29 loops (58 ops) are performed with large ints.
>
> Now, if I perform these ops:
>
> ((1234 * (10**4) + 5678) * (10**8) + (9012 * (10**4) + 3456)) *
> (10**16) + ((7890 * (10**16) + 1234) * (10**8) + (5678 * (10**4) +
> 9012))
>
> This time, I have 24 loops with small int (48 ops) and 8 loops (16
> ops) with large ints for evaluating the 4 digits ints.
> Plus 14 operations to assemble these large ints into the final result.
> Plus a number of operations required to evaluate the powers of 10:
> without cache, this is
>
> 10**4 = 10 squared squared (1 small op + 1 large) * 4 times.
> 10**8 = 10 squared squared squared (1 small op + 2 large) * 2 times.
> 10**16 = 10 squared squared squared (1 small op + 3 large) * 1 time.
>
> The total is 55 ops on small ints + 41 ops on large ints.
> So the second algorithm costs more operations on total, but cheaper ones.
> Suppose every operation on SmallInteger costs 1 and LargeInteger costs 10,
> we have a cost of 586 for the first, 465 for the second.
>
> Also, the cost of Large int operations is not constant.
> It is proportional to max(length1,length2) for + and to length1*length2 for *.
> So, we'll consider length(10**n)=n.
> In first case, length2=1 and length1 is varying linearly between 4 and
> 32, thus a mean of 18*2 digit ops per loop * 29 loops.
>
> In the second case, the cost in digit ops is:
> 1) (3*1+4) per loop * 8 loops for forming the 4 digits elementary large ints
> 2) (4*4+8) per loop * 4 loops for forming the 8 digits large ints
> 3) (8*8+16) per loop * 2 loops for forming the 16 digits large ints
> 4) (16*16+32) for the last iteration
>
> Plus the costs of evaluating powers of 10:
> 1) (100 squared) is evaluated 7 times * (2*2) digit ops
> 2) (10000 squared) is evaluated 3 times * (4*4) digit ops
> 3) (100000000 squared) is evaluated 1 times * (8*8) digit ops
>
> grand total is 18*2*29 = 1044 digit ops for the naive algorithm
> For the divide and conquer form, that is ((3*1+4)*8) + ((4*4+8)*4) +
> ((8*8+16)*2) + (16*16+32) + (2*2*7) + (4*4*3) + (8*8)  = 740
>
> In your 30 decimal digits example, large integers are formed of 10
> decimal digits (SmallInteger maxVal + 1) printString size, so we would
> have to translate above count...
> But it's boring, we can lazily bench it:
>
> | stream |
> stream := '123456789012345678901234567890' readStream.
> {
> [stream reset; nextFloat] bench.
> [stream reset. Number readFrom: stream] bench.
> }
> #('15,600 per second.' '23,500 per second.') "Old 4.2 VM"
> #('56,100 per second.' '97,700 per second.') "recent COG VM"
>
> Stream>>nextFloat is a naive loop with almost every operation inlined.
> Number>>readFrom: traverses a bunch of sends before the real
> operations take place.
>
> The overhead equlibrates around 12 decimal digits in non COG VM !
> That's why it was already faster for scanning a 15 digits Float (non
> counting the fact that nextFloat fails to answer the nearest Float).
> Now, in COG, it equilibrates around 20 digits.
>
> If we want to draw the cost versus log of number of digits:
>
> | gen counts serie cost1 cost2 |
> gen := Generator on: [:g | [g yield: '1234567890' atRandom] repeat].
> counts := (2 to: 13) collect: [:e | 2 raisedTo: e].
> serie := counts collect: [:count |
> String withAll: (gen next: count)].
> cost1 := serie collect: [:e |
> [1000 timesRepeat: [e readStream nextFloat]] timeToRun].
> cost2 := serie collect: [:e |
> [1000 timesRepeat: [(Number readFrom: e readStream) asFloat]] timeToRun].
> {counts. cost1. cost2.}
> #("NON COG VM"
> #(4 8 16 32 64 128 256 512 1024 2048 4096 8192)
> #(2 2 25 65 146 304 647 1412 3288 8393 24021 77189)
> #(6 6 21 51 94 177 361 742 1550 3872 8471 22647))
>
> #("COG VM"
> #(4 8 16 32 64 128 256 512 1024 2048 4096 8192)
> #(0 2 3 20 56 139 334 838 2323 7010 23614 85925)
> #(2 0 4 13 24 53 116 264 626 1646 4552 14191))
>
> It's not a log_2 cost at the end, but it's already some gain.
> It's possible that above costs are dominated by garbage collection,
> but I don't know how to measure that accurately...
>
> Nicolas
>
>> On 14 September 2011 20:26, Frank Shearar <[hidden email]> wrote:
>>> On 3 September 2011 19:35, Nicolas Cellier
>>> <[hidden email]> wrote:
>>>> 2011/9/3 Frank Shearar <[hidden email]>:
>>>>> On 3 September 2011 18:50, Lukas Renggli <[hidden email]> wrote:
>>>>>> I think it is a good idea to have the number parser separate, after
>>>>>> all it might also make sense to use it separately.
>>>>>>
>>>>>> It seems that the new Smalltalk grammar is significantly slower. The
>>>>>> benchmark PPSmalltalkClassesTests class>>#benchmark: that uses the
>>>>>> source code of the collection hierarchy and does not especially target
>>>>>> number literals runs 30% slower.
>>>>>>
>>>>>> Also I see that "Number readFrom: ..." is still used within the
>>>>>> grammar. This seems to be a bit strange, no?
>>>>>
>>>>> Yes: it's a double-parse, which is a bit lame. First, we parse the
>>>>> literal with PPSmalltalkNumberParser, which ensures that the thing
>>>>> given to Number class >> #readFrom: is a well-formed token (so, in
>>>>> particular, Squeak's Number doesn't get to see anything other than a
>>>>> well-formed token).
>>>>>
>>>>> It sounds like you're happy with the basic concept, so maybe I should
>>>>> remove the Number class >> #readFrom: stuff, see if I can't remove the
>>>>> performance issues, and resubmit the patch.
>>>>>
>>>>> frank
>>>>>
>>>>
>>>> Yes, a NumberParser is essentially parsing, and this duplication sounds useless.
>>>> The main feature of interest in NumberParser that I consider a
>>>> requirement and should find its equivalence in a PetitNumberParser is:
>>>> - round a decimal representation to nearest Float
>>>> It's simple, just convert a Fraction asFloat in a single final step to
>>>> avoid cumulating round off errors - see
>>>> #makeFloatFromMantissa:exponent:base:
>>>>
>>>> The second feature of interest in NumberParser is the ability to
>>>> parser LargeInteger efficiently by avoiding (10 * largeValue +
>>>> digitValue) loops, and replacing them with a log(n) cost.
>>>> This would be a simple thing to implement in a functional language.
>>>
>>> Hopefully this won't offend your sensibilities too much :). It does,
>>> in fact, use 10* loops - I wrote an experimental "front half * rear
>>> half" recursion, which was slower in my benchmarks.
>>>
>>> This version has the grammar and parser doing no string->number
>>> conversion at all. PPSmalltalkNumberMaker supplies a number of utility
>>> methods designed to stop one from making malformed numbers. It also
>>> supplies a builder interface that the parser uses to construct
>>> numbers.
>>>
>>> frank
>>>
>>>> Nicolas
>>>>
>>>>>> Lukas
>>>>>>
>>>>>>
>>>>>> On 3 September 2011 17:18, Frank Shearar <[hidden email]> wrote:
>>>>>>> On 3 September 2011 15:56, Lukas Renggli <[hidden email]> wrote:
>>>>>>>> On 3 September 2011 16:51, Frank Shearar <[hidden email]> wrote:
>>>>>>>>> Hi Lukas,
>>>>>>>>>
>>>>>>>>> I haven't :) mainly because I'm unsure where to put it - is there
>>>>>>>>> perhaps a PP Inbox, or shall I just post the merged version, or what's
>>>>>>>>> your preference? (How about an mcd between my merge and PP's head?)
>>>>>>>>
>>>>>>>> Just put the .mcz at some public URL (dropbox, squeak source, ...) or
>>>>>>>> attach it to a mail.
>>>>>>>
>>>>>>> Ah, great - here it is. You'll see I've written the grammar as a
>>>>>>> separate class. That was really more to make what I'd done more
>>>>>>> obvious and to minimise the change to PPSmalltalkGrammar, but perhaps
>>>>>>> it's not a bad idea anyway: it's easy to see the number literal
>>>>>>> subgrammar.
>>>>>>>
>>>>>>> frank
>>>>>>>
>>>>>>>> Lukas
>>>>>>>>
>>>>>>>> --
>>>>>>>> Lukas Renggli
>>>>>>>> www.lukas-renggli.ch
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Lukas Renggli
>>>>>> www.lukas-renggli.ch
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>


Reply | Threaded
Open this post in threaded view
|

Re: PetitSmalltalk: now with a self-contained grammar for number literals

Nicolas Cellier
And there is more: with second algorithm, you can use karatsuba
product (or more evolved ones Toom or FFT) and get a significant
speed-up, while that has no effect on naive algorithm.
I plotted the timings of Integer>>readFrom: for the two algorithms
versus number of decimal digits in log log scaled in attached pdf.

Nicolas

2011/9/25 Stéphane Ducasse <[hidden email]>:

> thanks for the analysis I ejoyed reading it.
>
> Stef
> On Sep 25, 2011, at 4:21 PM, Nicolas Cellier wrote:
>
>> 2011/9/25 Frank Shearar <[hidden email]>:
>>> Nicolas, you'd mentioned a log_2 n implementation for number parsing.
>>> I think by that you mean the division of digits into two "packets"
>>> that are processed and then added?
>>>
>>> I tried a straight-forward fold (using the * 10 that you said you didn't like!)
>>>
>>> Time millisecondsToRun: [10000 timesRepeat:
>>> ['123456789012345678901234567890' inject: 0 into: [:acc :each | 10 *
>>> acc + each digitValue]]] => 818
>>>
>>> and what I think you meant:
>>>
>>> | f |
>>> f := 1. "Just to shut the compiler up about the self-reference."
>>> f := [:someDigits | | len |
>>>        len := someDigits size.
>>>        (len = 1)
>>>                ifTrue: [someDigits first digitValue]
>>>                ifFalse: [ | leftHalf rightHalf |
>>>                        leftHalf := someDigits first: len // 2.
>>>                        rightHalf := someDigits allButFirst: len // 2.
>>>                        ((f value: (leftHalf)) * (10 raisedTo: rightHalf size)) + (f value:
>>> rightHalf)]].
>>> Time millisecondsToRun: [10000 timesRepeat: [f value:
>>> '123456789012345678901234567890']] => 1508
>>>
>>> I went with the straight-forward fold because it was simpler and my
>>> (quite possibly wrong/inefficient) attempt at a log_2 n version was a
>>> lot less readable.
>>>
>>> I'd greatly appreciate being gently corrected!
>>>
>>> frank
>>>
>>
>> Hmm, let's see.
>>
>> Suppose that SmallIntegers are on 3 digits, and Large begins at 4.
>> And let's compose a 32 digits number 12345678901234567890123456789012
>>
>> If I do the naive loop the first 3 loops (6 ops) give small ints.
>> Then the next 29 loops (58 ops) are performed with large ints.
>>
>> Now, if I perform these ops:
>>
>> ((1234 * (10**4) + 5678) * (10**8) + (9012 * (10**4) + 3456)) *
>> (10**16) + ((7890 * (10**16) + 1234) * (10**8) + (5678 * (10**4) +
>> 9012))
>>
>> This time, I have 24 loops with small int (48 ops) and 8 loops (16
>> ops) with large ints for evaluating the 4 digits ints.
>> Plus 14 operations to assemble these large ints into the final result.
>> Plus a number of operations required to evaluate the powers of 10:
>> without cache, this is
>>
>> 10**4 = 10 squared squared (1 small op + 1 large) * 4 times.
>> 10**8 = 10 squared squared squared (1 small op + 2 large) * 2 times.
>> 10**16 = 10 squared squared squared (1 small op + 3 large) * 1 time.
>>
>> The total is 55 ops on small ints + 41 ops on large ints.
>> So the second algorithm costs more operations on total, but cheaper ones.
>> Suppose every operation on SmallInteger costs 1 and LargeInteger costs 10,
>> we have a cost of 586 for the first, 465 for the second.
>>
>> Also, the cost of Large int operations is not constant.
>> It is proportional to max(length1,length2) for + and to length1*length2 for *.
>> So, we'll consider length(10**n)=n.
>> In first case, length2=1 and length1 is varying linearly between 4 and
>> 32, thus a mean of 18*2 digit ops per loop * 29 loops.
>>
>> In the second case, the cost in digit ops is:
>> 1) (3*1+4) per loop * 8 loops for forming the 4 digits elementary large ints
>> 2) (4*4+8) per loop * 4 loops for forming the 8 digits large ints
>> 3) (8*8+16) per loop * 2 loops for forming the 16 digits large ints
>> 4) (16*16+32) for the last iteration
>>
>> Plus the costs of evaluating powers of 10:
>> 1) (100 squared) is evaluated 7 times * (2*2) digit ops
>> 2) (10000 squared) is evaluated 3 times * (4*4) digit ops
>> 3) (100000000 squared) is evaluated 1 times * (8*8) digit ops
>>
>> grand total is 18*2*29 = 1044 digit ops for the naive algorithm
>> For the divide and conquer form, that is ((3*1+4)*8) + ((4*4+8)*4) +
>> ((8*8+16)*2) + (16*16+32) + (2*2*7) + (4*4*3) + (8*8)  = 740
>>
>> In your 30 decimal digits example, large integers are formed of 10
>> decimal digits (SmallInteger maxVal + 1) printString size, so we would
>> have to translate above count...
>> But it's boring, we can lazily bench it:
>>
>> | stream |
>> stream := '123456789012345678901234567890' readStream.
>> {
>> [stream reset; nextFloat] bench.
>> [stream reset. Number readFrom: stream] bench.
>> }
>> #('15,600 per second.' '23,500 per second.') "Old 4.2 VM"
>> #('56,100 per second.' '97,700 per second.') "recent COG VM"
>>
>> Stream>>nextFloat is a naive loop with almost every operation inlined.
>> Number>>readFrom: traverses a bunch of sends before the real
>> operations take place.
>>
>> The overhead equlibrates around 12 decimal digits in non COG VM !
>> That's why it was already faster for scanning a 15 digits Float (non
>> counting the fact that nextFloat fails to answer the nearest Float).
>> Now, in COG, it equilibrates around 20 digits.
>>
>> If we want to draw the cost versus log of number of digits:
>>
>> | gen counts serie cost1 cost2 |
>> gen := Generator on: [:g | [g yield: '1234567890' atRandom] repeat].
>> counts := (2 to: 13) collect: [:e | 2 raisedTo: e].
>> serie := counts collect: [:count |
>>       String withAll: (gen next: count)].
>> cost1 := serie collect: [:e |
>>       [1000 timesRepeat: [e readStream nextFloat]] timeToRun].
>> cost2 := serie collect: [:e |
>>       [1000 timesRepeat: [(Number readFrom: e readStream) asFloat]] timeToRun].
>> {counts. cost1. cost2.}
>> #("NON COG VM"
>> #(4 8 16 32 64 128 256 512 1024 2048 4096 8192)
>> #(2 2 25 65 146 304 647 1412 3288 8393 24021 77189)
>> #(6 6 21 51 94 177 361 742 1550 3872 8471 22647))
>>
>> #("COG VM"
>> #(4 8 16 32 64 128 256 512 1024 2048 4096 8192)
>> #(0 2 3 20 56 139 334 838 2323 7010 23614 85925)
>> #(2 0 4 13 24 53 116 264 626 1646 4552 14191))
>>
>> It's not a log_2 cost at the end, but it's already some gain.
>> It's possible that above costs are dominated by garbage collection,
>> but I don't know how to measure that accurately...
>>
>> Nicolas
>>
>>> On 14 September 2011 20:26, Frank Shearar <[hidden email]> wrote:
>>>> On 3 September 2011 19:35, Nicolas Cellier
>>>> <[hidden email]> wrote:
>>>>> 2011/9/3 Frank Shearar <[hidden email]>:
>>>>>> On 3 September 2011 18:50, Lukas Renggli <[hidden email]> wrote:
>>>>>>> I think it is a good idea to have the number parser separate, after
>>>>>>> all it might also make sense to use it separately.
>>>>>>>
>>>>>>> It seems that the new Smalltalk grammar is significantly slower. The
>>>>>>> benchmark PPSmalltalkClassesTests class>>#benchmark: that uses the
>>>>>>> source code of the collection hierarchy and does not especially target
>>>>>>> number literals runs 30% slower.
>>>>>>>
>>>>>>> Also I see that "Number readFrom: ..." is still used within the
>>>>>>> grammar. This seems to be a bit strange, no?
>>>>>>
>>>>>> Yes: it's a double-parse, which is a bit lame. First, we parse the
>>>>>> literal with PPSmalltalkNumberParser, which ensures that the thing
>>>>>> given to Number class >> #readFrom: is a well-formed token (so, in
>>>>>> particular, Squeak's Number doesn't get to see anything other than a
>>>>>> well-formed token).
>>>>>>
>>>>>> It sounds like you're happy with the basic concept, so maybe I should
>>>>>> remove the Number class >> #readFrom: stuff, see if I can't remove the
>>>>>> performance issues, and resubmit the patch.
>>>>>>
>>>>>> frank
>>>>>>
>>>>>
>>>>> Yes, a NumberParser is essentially parsing, and this duplication sounds useless.
>>>>> The main feature of interest in NumberParser that I consider a
>>>>> requirement and should find its equivalence in a PetitNumberParser is:
>>>>> - round a decimal representation to nearest Float
>>>>> It's simple, just convert a Fraction asFloat in a single final step to
>>>>> avoid cumulating round off errors - see
>>>>> #makeFloatFromMantissa:exponent:base:
>>>>>
>>>>> The second feature of interest in NumberParser is the ability to
>>>>> parser LargeInteger efficiently by avoiding (10 * largeValue +
>>>>> digitValue) loops, and replacing them with a log(n) cost.
>>>>> This would be a simple thing to implement in a functional language.
>>>>
>>>> Hopefully this won't offend your sensibilities too much :). It does,
>>>> in fact, use 10* loops - I wrote an experimental "front half * rear
>>>> half" recursion, which was slower in my benchmarks.
>>>>
>>>> This version has the grammar and parser doing no string->number
>>>> conversion at all. PPSmalltalkNumberMaker supplies a number of utility
>>>> methods designed to stop one from making malformed numbers. It also
>>>> supplies a builder interface that the parser uses to construct
>>>> numbers.
>>>>
>>>> frank
>>>>
>>>>> Nicolas
>>>>>
>>>>>>> Lukas
>>>>>>>
>>>>>>>
>>>>>>> On 3 September 2011 17:18, Frank Shearar <[hidden email]> wrote:
>>>>>>>> On 3 September 2011 15:56, Lukas Renggli <[hidden email]> wrote:
>>>>>>>>> On 3 September 2011 16:51, Frank Shearar <[hidden email]> wrote:
>>>>>>>>>> Hi Lukas,
>>>>>>>>>>
>>>>>>>>>> I haven't :) mainly because I'm unsure where to put it - is there
>>>>>>>>>> perhaps a PP Inbox, or shall I just post the merged version, or what's
>>>>>>>>>> your preference? (How about an mcd between my merge and PP's head?)
>>>>>>>>>
>>>>>>>>> Just put the .mcz at some public URL (dropbox, squeak source, ...) or
>>>>>>>>> attach it to a mail.
>>>>>>>>
>>>>>>>> Ah, great - here it is. You'll see I've written the grammar as a
>>>>>>>> separate class. That was really more to make what I'd done more
>>>>>>>> obvious and to minimise the change to PPSmalltalkGrammar, but perhaps
>>>>>>>> it's not a bad idea anyway: it's easy to see the number literal
>>>>>>>> subgrammar.
>>>>>>>>
>>>>>>>> frank
>>>>>>>>
>>>>>>>>> Lukas
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Lukas Renggli
>>>>>>>>> www.lukas-renggli.ch
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Lukas Renggli
>>>>>>> www.lukas-renggli.ch
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>
>

Cost-Integer-readFrom-COG.pdf (92K) Download Attachment
Cout-Integer-readFrom.numbers (171K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: PetitSmalltalk: now with a self-contained grammar for number literals

Eliot Miranda-2
In reply to this post by Nicolas Cellier


On Sun, Sep 25, 2011 at 7:21 AM, Nicolas Cellier <[hidden email]> wrote:
2011/9/25 Frank Shearar <[hidden email]>:
> Nicolas, you'd mentioned a log_2 n implementation for number parsing.
> I think by that you mean the division of digits into two "packets"
> that are processed and then added?
>
> I tried a straight-forward fold (using the * 10 that you said you didn't like!)
>
> Time millisecondsToRun: [10000 timesRepeat:
> ['123456789012345678901234567890' inject: 0 into: [:acc :each | 10 *
> acc + each digitValue]]] => 818
>
> and what I think you meant:
>
> | f |
> f := 1. "Just to shut the compiler up about the self-reference."
> f := [:someDigits | | len |
>        len := someDigits size.
>        (len = 1)
>                ifTrue: [someDigits first digitValue]
>                ifFalse: [ | leftHalf rightHalf |
>                        leftHalf := someDigits first: len // 2.
>                        rightHalf := someDigits allButFirst: len // 2.
>                        ((f value: (leftHalf)) * (10 raisedTo: rightHalf size)) + (f value:
> rightHalf)]].
> Time millisecondsToRun: [10000 timesRepeat: [f value:
> '123456789012345678901234567890']] => 1508
>
> I went with the straight-forward fold because it was simpler and my
> (quite possibly wrong/inefficient) attempt at a log_2 n version was a
> lot less readable.
>
> I'd greatly appreciate being gently corrected!
>
> frank
>

Hmm, let's see.

Suppose that SmallIntegers are on 3 digits, and Large begins at 4.
And let's compose a 32 digits number 12345678901234567890123456789012

If I do the naive loop the first 3 loops (6 ops) give small ints.
Then the next 29 loops (58 ops) are performed with large ints.

Now, if I perform these ops:

((1234 * (10**4) + 5678) * (10**8) + (9012 * (10**4) + 3456)) *
(10**16) + ((7890 * (10**16) + 1234) * (10**8) + (5678 * (10**4) +
9012))

This time, I have 24 loops with small int (48 ops) and 8 loops (16
ops) with large ints for evaluating the 4 digits ints.
Plus 14 operations to assemble these large ints into the final result.
Plus a number of operations required to evaluate the powers of 10:
without cache, this is

10**4 = 10 squared squared (1 small op + 1 large) * 4 times.
10**8 = 10 squared squared squared (1 small op + 2 large) * 2 times.
10**16 = 10 squared squared squared (1 small op + 3 large) * 1 time.

The total is 55 ops on small ints + 41 ops on large ints.
So the second algorithm costs more operations on total, but cheaper ones.
Suppose every operation on SmallInteger costs 1 and LargeInteger costs 10,
we have a cost of 586 for the first, 465 for the second.

Also, the cost of Large int operations is not constant.
It is proportional to max(length1,length2) for + and to length1*length2 for *.
So, we'll consider length(10**n)=n.
In first case, length2=1 and length1 is varying linearly between 4 and
32, thus a mean of 18*2 digit ops per loop * 29 loops.

In the second case, the cost in digit ops is:
1) (3*1+4) per loop * 8 loops for forming the 4 digits elementary large ints
2) (4*4+8) per loop * 4 loops for forming the 8 digits large ints
3) (8*8+16) per loop * 2 loops for forming the 16 digits large ints
4) (16*16+32) for the last iteration

Plus the costs of evaluating powers of 10:
1) (100 squared) is evaluated 7 times * (2*2) digit ops
2) (10000 squared) is evaluated 3 times * (4*4) digit ops
3) (100000000 squared) is evaluated 1 times * (8*8) digit ops

grand total is 18*2*29 = 1044 digit ops for the naive algorithm
For the divide and conquer form, that is ((3*1+4)*8) + ((4*4+8)*4) +
((8*8+16)*2) + (16*16+32) + (2*2*7) + (4*4*3) + (8*8)  = 740

In your 30 decimal digits example, large integers are formed of 10
decimal digits (SmallInteger maxVal + 1) printString size, so we would
have to translate above count...
But it's boring, we can lazily bench it:

| stream |
stream := '123456789012345678901234567890' readStream.
{
[stream reset; nextFloat] bench.
[stream reset. Number readFrom: stream] bench.
}
 #('15,600 per second.' '23,500 per second.') "Old 4.2 VM"
 #('56,100 per second.' '97,700 per second.') "recent COG VM"

Nice to see (since it is to be expected) that as the algorithm spends less time in LargeInteger primitives and more time in simple operations Cog provides more speedup over the interpreter.  56,100/15,600 =~ 3.6, 97,700/23,500 =~ 4.2.

Stream>>nextFloat is a naive loop with almost every operation inlined.
Number>>readFrom: traverses a bunch of sends before the real
operations take place.

The overhead equlibrates around 12 decimal digits in non COG VM !
That's why it was already faster for scanning a 15 digits Float (non
counting the fact that nextFloat fails to answer the nearest Float).
Now, in COG, it equilibrates around 20 digits.

If we want to draw the cost versus log of number of digits:

| gen counts serie cost1 cost2 |
gen := Generator on: [:g | [g yield: '1234567890' atRandom] repeat].
counts := (2 to: 13) collect: [:e | 2 raisedTo: e].
serie := counts collect: [:count |
       String withAll: (gen next: count)].
cost1 := serie collect: [:e |
       [1000 timesRepeat: [e readStream nextFloat]] timeToRun].
cost2 := serie collect: [:e |
       [1000 timesRepeat: [(Number readFrom: e readStream) asFloat]] timeToRun].
{counts. cost1. cost2.}
 #("NON COG VM"
 #(4 8 16 32 64 128 256 512 1024 2048 4096 8192)
 #(2 2 25 65 146 304 647 1412 3288 8393 24021 77189)
 #(6 6 21 51 94 177 361 742 1550 3872 8471 22647))

 #("COG VM"
#(4 8 16 32 64 128 256 512 1024 2048 4096 8192)
 #(0 2 3 20 56 139 334 838 2323 7010 23614 85925)
#(2 0 4 13 24 53 116 264 626 1646 4552 14191))

It's not a log_2 cost at the end, but it's already some gain.
It's possible that above costs are dominated by garbage collection,
but I don't know how to measure that accurately...

Nicolas

> On 14 September 2011 20:26, Frank Shearar <[hidden email]> wrote:
>> On 3 September 2011 19:35, Nicolas Cellier
>> <[hidden email]> wrote:
>>> 2011/9/3 Frank Shearar <[hidden email]>:
>>>> On 3 September 2011 18:50, Lukas Renggli <[hidden email]> wrote:
>>>>> I think it is a good idea to have the number parser separate, after
>>>>> all it might also make sense to use it separately.
>>>>>
>>>>> It seems that the new Smalltalk grammar is significantly slower. The
>>>>> benchmark PPSmalltalkClassesTests class>>#benchmark: that uses the
>>>>> source code of the collection hierarchy and does not especially target
>>>>> number literals runs 30% slower.
>>>>>
>>>>> Also I see that "Number readFrom: ..." is still used within the
>>>>> grammar. This seems to be a bit strange, no?
>>>>
>>>> Yes: it's a double-parse, which is a bit lame. First, we parse the
>>>> literal with PPSmalltalkNumberParser, which ensures that the thing
>>>> given to Number class >> #readFrom: is a well-formed token (so, in
>>>> particular, Squeak's Number doesn't get to see anything other than a
>>>> well-formed token).
>>>>
>>>> It sounds like you're happy with the basic concept, so maybe I should
>>>> remove the Number class >> #readFrom: stuff, see if I can't remove the
>>>> performance issues, and resubmit the patch.
>>>>
>>>> frank
>>>>
>>>
>>> Yes, a NumberParser is essentially parsing, and this duplication sounds useless.
>>> The main feature of interest in NumberParser that I consider a
>>> requirement and should find its equivalence in a PetitNumberParser is:
>>> - round a decimal representation to nearest Float
>>> It's simple, just convert a Fraction asFloat in a single final step to
>>> avoid cumulating round off errors - see
>>> #makeFloatFromMantissa:exponent:base:
>>>
>>> The second feature of interest in NumberParser is the ability to
>>> parser LargeInteger efficiently by avoiding (10 * largeValue +
>>> digitValue) loops, and replacing them with a log(n) cost.
>>> This would be a simple thing to implement in a functional language.
>>
>> Hopefully this won't offend your sensibilities too much :). It does,
>> in fact, use 10* loops - I wrote an experimental "front half * rear
>> half" recursion, which was slower in my benchmarks.
>>
>> This version has the grammar and parser doing no string->number
>> conversion at all. PPSmalltalkNumberMaker supplies a number of utility
>> methods designed to stop one from making malformed numbers. It also
>> supplies a builder interface that the parser uses to construct
>> numbers.
>>
>> frank
>>
>>> Nicolas
>>>
>>>>> Lukas
>>>>>
>>>>>
>>>>> On 3 September 2011 17:18, Frank Shearar <[hidden email]> wrote:
>>>>>> On 3 September 2011 15:56, Lukas Renggli <[hidden email]> wrote:
>>>>>>> On 3 September 2011 16:51, Frank Shearar <[hidden email]> wrote:
>>>>>>>> Hi Lukas,
>>>>>>>>
>>>>>>>> I haven't :) mainly because I'm unsure where to put it - is there
>>>>>>>> perhaps a PP Inbox, or shall I just post the merged version, or what's
>>>>>>>> your preference? (How about an mcd between my merge and PP's head?)
>>>>>>>
>>>>>>> Just put the .mcz at some public URL (dropbox, squeak source, ...) or
>>>>>>> attach it to a mail.
>>>>>>
>>>>>> Ah, great - here it is. You'll see I've written the grammar as a
>>>>>> separate class. That was really more to make what I'd done more
>>>>>> obvious and to minimise the change to PPSmalltalkGrammar, but perhaps
>>>>>> it's not a bad idea anyway: it's easy to see the number literal
>>>>>> subgrammar.
>>>>>>
>>>>>> frank
>>>>>>
>>>>>>> Lukas
>>>>>>>
>>>>>>> --
>>>>>>> Lukas Renggli
>>>>>>> www.lukas-renggli.ch
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Lukas Renggli
>>>>> www.lukas-renggli.ch
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
>




--
best,
Eliot

Reply | Threaded
Open this post in threaded view
|

Re: PetitSmalltalk: now with a self-contained grammar for number literals

Frank Shearar-3
In reply to this post by Nicolas Cellier
On 25 September 2011 15:21, Nicolas Cellier
<[hidden email]> wrote:

> 2011/9/25 Frank Shearar <[hidden email]>:
>> Nicolas, you'd mentioned a log_2 n implementation for number parsing.
>> I think by that you mean the division of digits into two "packets"
>> that are processed and then added?
>>
>> I tried a straight-forward fold (using the * 10 that you said you didn't like!)
>>
>> Time millisecondsToRun: [10000 timesRepeat:
>> ['123456789012345678901234567890' inject: 0 into: [:acc :each | 10 *
>> acc + each digitValue]]] => 818
>>
>> and what I think you meant:
>>
>> | f |
>> f := 1. "Just to shut the compiler up about the self-reference."
>> f := [:someDigits | | len |
>>        len := someDigits size.
>>        (len = 1)
>>                ifTrue: [someDigits first digitValue]
>>                ifFalse: [ | leftHalf rightHalf |
>>                        leftHalf := someDigits first: len // 2.
>>                        rightHalf := someDigits allButFirst: len // 2.
>>                        ((f value: (leftHalf)) * (10 raisedTo: rightHalf size)) + (f value:
>> rightHalf)]].
>> Time millisecondsToRun: [10000 timesRepeat: [f value:
>> '123456789012345678901234567890']] => 1508
>>
>> I went with the straight-forward fold because it was simpler and my
>> (quite possibly wrong/inefficient) attempt at a log_2 n version was a
>> lot less readable.
>>
>> I'd greatly appreciate being gently corrected!
>>
>> frank
>>
>
> Hmm, let's see.
>
> Suppose that SmallIntegers are on 3 digits, and Large begins at 4.
> And let's compose a 32 digits number 12345678901234567890123456789012
>
> If I do the naive loop the first 3 loops (6 ops) give small ints.
> Then the next 29 loops (58 ops) are performed with large ints.
>
> Now, if I perform these ops:
>
> ((1234 * (10**4) + 5678) * (10**8) + (9012 * (10**4) + 3456)) *
> (10**16) + ((7890 * (10**16) + 1234) * (10**8) + (5678 * (10**4) +
> 9012))
>
> This time, I have 24 loops with small int (48 ops) and 8 loops (16
> ops) with large ints for evaluating the 4 digits ints.
> Plus 14 operations to assemble these large ints into the final result.
> Plus a number of operations required to evaluate the powers of 10:
> without cache, this is
>
> 10**4 = 10 squared squared (1 small op + 1 large) * 4 times.
> 10**8 = 10 squared squared squared (1 small op + 2 large) * 2 times.
> 10**16 = 10 squared squared squared (1 small op + 3 large) * 1 time.
>
> The total is 55 ops on small ints + 41 ops on large ints.
> So the second algorithm costs more operations on total, but cheaper ones.
> Suppose every operation on SmallInteger costs 1 and LargeInteger costs 10,
> we have a cost of 586 for the first, 465 for the second.
>
> Also, the cost of Large int operations is not constant.
> It is proportional to max(length1,length2) for + and to length1*length2 for *.
> So, we'll consider length(10**n)=n.
> In first case, length2=1 and length1 is varying linearly between 4 and
> 32, thus a mean of 18*2 digit ops per loop * 29 loops.
>
> In the second case, the cost in digit ops is:
> 1) (3*1+4) per loop * 8 loops for forming the 4 digits elementary large ints
> 2) (4*4+8) per loop * 4 loops for forming the 8 digits large ints
> 3) (8*8+16) per loop * 2 loops for forming the 16 digits large ints
> 4) (16*16+32) for the last iteration
>
> Plus the costs of evaluating powers of 10:
> 1) (100 squared) is evaluated 7 times * (2*2) digit ops
> 2) (10000 squared) is evaluated 3 times * (4*4) digit ops
> 3) (100000000 squared) is evaluated 1 times * (8*8) digit ops
>
> grand total is 18*2*29 = 1044 digit ops for the naive algorithm
> For the divide and conquer form, that is ((3*1+4)*8) + ((4*4+8)*4) +
> ((8*8+16)*2) + (16*16+32) + (2*2*7) + (4*4*3) + (8*8)  = 740
>
> In your 30 decimal digits example, large integers are formed of 10
> decimal digits (SmallInteger maxVal + 1) printString size, so we would
> have to translate above count...
> But it's boring, we can lazily bench it:
>
> | stream |
> stream := '123456789012345678901234567890' readStream.
> {
> [stream reset; nextFloat] bench.
> [stream reset. Number readFrom: stream] bench.
> }
>  #('15,600 per second.' '23,500 per second.') "Old 4.2 VM"
>  #('56,100 per second.' '97,700 per second.') "recent COG VM"
>
> Stream>>nextFloat is a naive loop with almost every operation inlined.
> Number>>readFrom: traverses a bunch of sends before the real
> operations take place.

To be fair though, stream>>nextFloat is strictly less expressive than
either Number class>>readFrom: or the extension to PetitSmalltalk
under review:

    '16r1.0e10' readStream nextFloat 16.0
    Number readFrom: '16r1.0e10' 1.054931640625
    PPSmalltalkNumberParser new parse: '16r1.0e10' 1.054931640625.

(And Number class >> #readFrom: is precisely the thing I'm trying to
avoid: the Squeak and Pharo implementations have very different ideas
of what a number is.)

PPSmalltalkNumberParser2 uses the divide-and-conquer algorithm from above:
| stream pp pp2 |
stream := '123456789012345678901234567890' readStream.
pp := PPSmalltalkNumberParser new.
pp2 := PPSmalltalkNumberParser2 new.
{
[stream reset; nextFloat] bench.
[stream reset. Number readFrom: stream] bench.
[stream reset. pp parse: stream.] bench.
[stream reset. pp2 parse: stream.] bench
} #('10,500 per second.' '13,400 per second.' '1,560 per second.'
'1,390 per second.')

so right now my parser's rubbish: it's 10x slower. Most of that cost
will probably come from the nature of the separation of grammar from
evaluation: #nextFloat parses and creates the float in one go.

frank

> The overhead equlibrates around 12 decimal digits in non COG VM !
> That's why it was already faster for scanning a 15 digits Float (non
> counting the fact that nextFloat fails to answer the nearest Float).
> Now, in COG, it equilibrates around 20 digits.
>
> If we want to draw the cost versus log of number of digits:
>
> | gen counts serie cost1 cost2 |
> gen := Generator on: [:g | [g yield: '1234567890' atRandom] repeat].
> counts := (2 to: 13) collect: [:e | 2 raisedTo: e].
> serie := counts collect: [:count |
>        String withAll: (gen next: count)].
> cost1 := serie collect: [:e |
>        [1000 timesRepeat: [e readStream nextFloat]] timeToRun].
> cost2 := serie collect: [:e |
>        [1000 timesRepeat: [(Number readFrom: e readStream) asFloat]] timeToRun].
> {counts. cost1. cost2.}
>  #("NON COG VM"
>  #(4 8 16 32 64 128 256 512 1024 2048 4096 8192)
>  #(2 2 25 65 146 304 647 1412 3288 8393 24021 77189)
>  #(6 6 21 51 94 177 361 742 1550 3872 8471 22647))
>
>  #("COG VM"
> #(4 8 16 32 64 128 256 512 1024 2048 4096 8192)
>  #(0 2 3 20 56 139 334 838 2323 7010 23614 85925)
> #(2 0 4 13 24 53 116 264 626 1646 4552 14191))
>
> It's not a log_2 cost at the end, but it's already some gain.
> It's possible that above costs are dominated by garbage collection,
> but I don't know how to measure that accurately...
>
> Nicolas
>
>> On 14 September 2011 20:26, Frank Shearar <[hidden email]> wrote:
>>> On 3 September 2011 19:35, Nicolas Cellier
>>> <[hidden email]> wrote:
>>>> 2011/9/3 Frank Shearar <[hidden email]>:
>>>>> On 3 September 2011 18:50, Lukas Renggli <[hidden email]> wrote:
>>>>>> I think it is a good idea to have the number parser separate, after
>>>>>> all it might also make sense to use it separately.
>>>>>>
>>>>>> It seems that the new Smalltalk grammar is significantly slower. The
>>>>>> benchmark PPSmalltalkClassesTests class>>#benchmark: that uses the
>>>>>> source code of the collection hierarchy and does not especially target
>>>>>> number literals runs 30% slower.
>>>>>>
>>>>>> Also I see that "Number readFrom: ..." is still used within the
>>>>>> grammar. This seems to be a bit strange, no?
>>>>>
>>>>> Yes: it's a double-parse, which is a bit lame. First, we parse the
>>>>> literal with PPSmalltalkNumberParser, which ensures that the thing
>>>>> given to Number class >> #readFrom: is a well-formed token (so, in
>>>>> particular, Squeak's Number doesn't get to see anything other than a
>>>>> well-formed token).
>>>>>
>>>>> It sounds like you're happy with the basic concept, so maybe I should
>>>>> remove the Number class >> #readFrom: stuff, see if I can't remove the
>>>>> performance issues, and resubmit the patch.
>>>>>
>>>>> frank
>>>>>
>>>>
>>>> Yes, a NumberParser is essentially parsing, and this duplication sounds useless.
>>>> The main feature of interest in NumberParser that I consider a
>>>> requirement and should find its equivalence in a PetitNumberParser is:
>>>> - round a decimal representation to nearest Float
>>>> It's simple, just convert a Fraction asFloat in a single final step to
>>>> avoid cumulating round off errors - see
>>>> #makeFloatFromMantissa:exponent:base:
>>>>
>>>> The second feature of interest in NumberParser is the ability to
>>>> parser LargeInteger efficiently by avoiding (10 * largeValue +
>>>> digitValue) loops, and replacing them with a log(n) cost.
>>>> This would be a simple thing to implement in a functional language.
>>>
>>> Hopefully this won't offend your sensibilities too much :). It does,
>>> in fact, use 10* loops - I wrote an experimental "front half * rear
>>> half" recursion, which was slower in my benchmarks.
>>>
>>> This version has the grammar and parser doing no string->number
>>> conversion at all. PPSmalltalkNumberMaker supplies a number of utility
>>> methods designed to stop one from making malformed numbers. It also
>>> supplies a builder interface that the parser uses to construct
>>> numbers.
>>>
>>> frank
>>>
>>>> Nicolas
>>>>
>>>>>> Lukas
>>>>>>
>>>>>>
>>>>>> On 3 September 2011 17:18, Frank Shearar <[hidden email]> wrote:
>>>>>>> On 3 September 2011 15:56, Lukas Renggli <[hidden email]> wrote:
>>>>>>>> On 3 September 2011 16:51, Frank Shearar <[hidden email]> wrote:
>>>>>>>>> Hi Lukas,
>>>>>>>>>
>>>>>>>>> I haven't :) mainly because I'm unsure where to put it - is there
>>>>>>>>> perhaps a PP Inbox, or shall I just post the merged version, or what's
>>>>>>>>> your preference? (How about an mcd between my merge and PP's head?)
>>>>>>>>
>>>>>>>> Just put the .mcz at some public URL (dropbox, squeak source, ...) or
>>>>>>>> attach it to a mail.
>>>>>>>
>>>>>>> Ah, great - here it is. You'll see I've written the grammar as a
>>>>>>> separate class. That was really more to make what I'd done more
>>>>>>> obvious and to minimise the change to PPSmalltalkGrammar, but perhaps
>>>>>>> it's not a bad idea anyway: it's easy to see the number literal
>>>>>>> subgrammar.
>>>>>>>
>>>>>>> frank
>>>>>>>
>>>>>>>> Lukas
>>>>>>>>
>>>>>>>> --
>>>>>>>> Lukas Renggli
>>>>>>>> www.lukas-renggli.ch
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Lukas Renggli
>>>>>> www.lukas-renggli.ch
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>
>

12