Smalltalk › Squeak › Squeak - Dev

The Inbox: Collections-ct.851.mcz

_‹ Previous Topic Next Topic _›

Classic

List

Threaded

10 messages Options

commits-2

The Inbox: Collections-ct.851.mcz

A new version of Collections was added to project The Inbox:
http://source.squeak.org/inbox/Collections-ct.851.mcz

==================== Summary ====================

Name: Collections-ct.851
Author: ct
Time: 15 August 2019, 11:36:23.694735 pm
UUID: 6c7113c6-9d09-bf4c-9b32-e2001870bb56
Ancestors: Collections-ct.850

Refactor String>>#format: according to Text>>#format:

I could not find any significant performance impacts.

We have some duplication between both #format: implementations. Do you thing this is a problem at the current scale?

=============== Diff against Collections-ct.850 ===============

Item was changed:
----- Method: String>>format: (in category 'formatting') -----
+ format: arguments
+ "format the receiver with arguments
- format: aCollection
- "format the receiver with aCollection

simplest example:
'foo {1} bar' format: {Date today}.

complete example:
'\{ \} \\ foo {1} bar {2}' format: {12. 'string'}.
"
+ ^self class new: self size * 11 // 10 streamContents: [ :output |
- ^self class new: self size * 11 // 10 "+10%" streamContents: [ :output |
| lastIndex nextIndex |
lastIndex := 1.
[ (nextIndex := self indexOfAnyOf: FormatCharacterSet startingAt: lastIndex) = 0 ] whileFalse: [
nextIndex = lastIndex ifFalse: [
output next: nextIndex - lastIndex putAll: self startingAt: lastIndex ].
+ (self at: nextIndex) caseOf: {
+ [$\] -> [ output nextPut: (self at: (nextIndex := nextIndex + 1)) ].
+ [${] -> [
- (self at: nextIndex) == $\
- ifTrue: [ output nextPut: (self at: (nextIndex := nextIndex + 1)) ]
- ifFalse: [ "${"
"Parse the index - a positive integer in base 10."
+ | character collectionIndex |
- | digitValue collectionIndex |
collectionIndex := 0.
+ [ (character := self at: (nextIndex := nextIndex + 1)) isDigit ] whileTrue: [
+ collectionIndex := collectionIndex * 10 + character digitValue ].
+ character = $} ifFalse: [ self error: '$} expected' ].
+ output nextPutAll: (arguments at: collectionIndex) asString ] }.
- [ (digitValue := self basicAt: (nextIndex := nextIndex + 1)) between: 48 "$0 asciiValue" and: 57 "$9 asciiValue" ] whileTrue: [
- collectionIndex := collectionIndex * 10 + digitValue - 48. "$0 asciiValue" ].
- digitValue = 125 "$} asciiValue" ifFalse: [ self error: '$} expected' ].
- output nextPutAll: (aCollection at: collectionIndex) asString ].
lastIndex := nextIndex + 1 ].
lastIndex <= self size ifTrue: [
output next: self size - lastIndex + 1 putAll: self startingAt: lastIndex ] ]!

Levente Uzonyi

Re: The Inbox: Collections-ct.851.mcz

On Thu, 15 Aug 2019, [hidden email] wrote:

> A new version of Collections was added to project The Inbox:
> http://source.squeak.org/inbox/Collections-ct.851.mcz
>
> ==================== Summary ====================
>
> Name: Collections-ct.851
> Author: ct
> Time: 15 August 2019, 11:36:23.694735 pm
> UUID: 6c7113c6-9d09-bf4c-9b32-e2001870bb56
> Ancestors: Collections-ct.850
>
> Refactor String>>#format: according to Text>>#format:
>
> I could not find any significant performance impacts.

Difference only appears when there are substitutions to be done. The more
substitutions you have, the higher the difference in performance will be.
This new version takes up to ~40% longer than the current implementation,
which is still not optimal btw.
The new variant also changes the semantics, because it accepts unicode
digits. For example, the following won't raise an error but yield 9:

'{', (Character value: 16r1d7eb) asString, '}' format: (1 to: 10).

Levente

>
> We have some duplication between both #format: implementations. Do you thing this is a problem at the current scale?
>
> =============== Diff against Collections-ct.850 ===============
>
> Item was changed:
> ----- Method: String>>format: (in category 'formatting') -----
> + format: arguments
> + "format the receiver with arguments
> - format: aCollection
> - "format the receiver with aCollection
>
> simplest example:
> 'foo {1} bar' format: {Date today}.
>
> complete example:
> '\{ \} \\ foo {1} bar {2}' format: {12. 'string'}.
> "
> + ^self class new: self size * 11 // 10 streamContents: [ :output |
> - ^self class new: self size * 11 // 10 "+10%" streamContents: [ :output |
> | lastIndex nextIndex |
> lastIndex := 1.
> [ (nextIndex := self indexOfAnyOf: FormatCharacterSet startingAt: lastIndex) = 0 ] whileFalse: [
> nextIndex = lastIndex ifFalse: [
> output next: nextIndex - lastIndex putAll: self startingAt: lastIndex ].
> + (self at: nextIndex) caseOf: {
> + [$\] -> [ output nextPut: (self at: (nextIndex := nextIndex + 1)) ].
> + [${] -> [
> - (self at: nextIndex) == $\
> - ifTrue: [ output nextPut: (self at: (nextIndex := nextIndex + 1)) ]
> - ifFalse: [ "${"
> "Parse the index - a positive integer in base 10."
> + | character collectionIndex |
> - | digitValue collectionIndex |
> collectionIndex := 0.
> + [ (character := self at: (nextIndex := nextIndex + 1)) isDigit ] whileTrue: [
> + collectionIndex := collectionIndex * 10 + character digitValue ].
> + character = $} ifFalse: [ self error: '$} expected' ].
> + output nextPutAll: (arguments at: collectionIndex) asString ] }.
> - [ (digitValue := self basicAt: (nextIndex := nextIndex + 1)) between: 48 "$0 asciiValue" and: 57 "$9 asciiValue" ] whileTrue: [
> - collectionIndex := collectionIndex * 10 + digitValue - 48. "$0 asciiValue" ].
> - digitValue = 125 "$} asciiValue" ifFalse: [ self error: '$} expected' ].
> - output nextPutAll: (aCollection at: collectionIndex) asString ].
> lastIndex := nextIndex + 1 ].
> lastIndex <= self size ifTrue: [
> output next: self size - lastIndex + 1 putAll: self startingAt: lastIndex ] ]!

Christoph Thiede

Re: The Inbox: Collections-ct.851.mcz

Thanks for your thoughts!

> The new variant also changes the semantics, because it accepts unicode

> digits. For example, the following won't raise an error but yield 9:

> '{', (Character value: 16r1d7eb) asString, '}' format: (1 to: 10).

Interesting, but do you really think this is an undesired behavior?

I changed the method with the aim to increase the readability, which was also impaired by the hardcoded numbers.

In my eyes it is a nice side effect to support other kinds of Unicode values - NumberParser does the same.

Christoph

Von: Squeak-dev <[hidden email]> im Auftrag von Levente Uzonyi <[hidden email]>
Gesendet: Freitag, 16. August 2019 00:51:14
An: [hidden email]
Betreff: Re: [squeak-dev] The Inbox: Collections-ct.851.mcz

On Thu, 15 Aug 2019, [hidden email] wrote:

> A new version of Collections was added to project The Inbox:
> http://source.squeak.org/inbox/Collections-ct.851.mcz
>
> ==================== Summary ====================
>
> Name: Collections-ct.851
> Author: ct
> Time: 15 August 2019, 11:36:23.694735 pm
> UUID: 6c7113c6-9d09-bf4c-9b32-e2001870bb56
> Ancestors: Collections-ct.850
>
> Refactor String>>#format: according to Text>>#format:
>
> I could not find any significant performance impacts.

Difference only appears when there are substitutions to be done. The more
substitutions you have, the higher the difference in performance will be.
This new version takes up to ~40% longer than the current implementation,
which is still not optimal btw.
The new variant also changes the semantics, because it accepts unicode
digits. For example, the following won't raise an error but yield 9:

         '{', (Character value: 16r1d7eb) asString, '}' format: (1 to: 10).

Levente

>
> We have some duplication between both #format: implementations. Do you thing this is a problem at the current scale?
>
> =============== Diff against Collections-ct.850 ===============
>
> Item was changed:
> ----- Method: String>>format: (in category 'formatting') -----
> + format: arguments
> +      "format the receiver with arguments
> - format: aCollection
> -      "format the receiver with aCollection
>
>        simplest example:
>        'foo {1} bar' format: {Date today}.
>
>        complete example:
>        '\{ \} \\ foo {1} bar {2}' format: {12. 'string'}.
>        "
> +      ^self class new: self size * 11 // 10 streamContents: [ :output |
> -      ^self class new: self size * 11 // 10 "+10%" streamContents: [ :output |
>                | lastIndex nextIndex |
>                lastIndex := 1.
>                [ (nextIndex := self indexOfAnyOf: FormatCharacterSet startingAt: lastIndex) = 0 ] whileFalse: [
>                        nextIndex = lastIndex ifFalse: [
>                                output next: nextIndex - lastIndex putAll: self startingAt: lastIndex ].
> +                      (self at: nextIndex) caseOf: {
> +                              [$\] -> [ output nextPut: (self at: (nextIndex := nextIndex + 1)) ].
> +                              [${] -> [
> -                      (self at: nextIndex) == $\
> -                              ifTrue: [ output nextPut: (self at: (nextIndex := nextIndex + 1)) ]
> -                              ifFalse: [ "${"
>                                        "Parse the index - a positive integer in base 10."
> +                                      | character collectionIndex |
> -                                      | digitValue collectionIndex |
>                                        collectionIndex := 0.
> +                                      [ (character := self at: (nextIndex := nextIndex + 1)) isDigit ] whileTrue: [
> +                                              collectionIndex := collectionIndex * 10 + character digitValue ].
> +                                      character = $} ifFalse: [ self error: '$} expected' ].
> +                                      output nextPutAll: (arguments at: collectionIndex) asString ] }.
> -                                      [ (digitValue := self basicAt: (nextIndex := nextIndex + 1)) between: 48 "$0 asciiValue" and: 57 "$9 asciiValue" ] whileTrue: [
> -                                              collectionIndex := collectionIndex * 10 + digitValue - 48. "$0 asciiValue" ].
> -                                      digitValue = 125 "$} asciiValue" ifFalse: [ self error: '$} expected' ].
> -                                      output nextPutAll: (aCollection at: collectionIndex) asString ].
>                        lastIndex := nextIndex + 1 ].
>                lastIndex <= self size ifTrue: [
>                        output next: self size - lastIndex + 1 putAll: self startingAt: lastIndex ] ]!

Carpe Squeak!

Levente Uzonyi

Re: The Inbox: Collections-ct.851.mcz

On Thu, 15 Aug 2019, Thiede, Christoph wrote:

>
> Thanks for your thoughts!
>
>
> > The new variant also changes the semantics, because it accepts unicode
>
> > digits. For example, the following won't raise an error but yield 9:
> >
> > '{', (Character value: 16r1d7eb) asString, '}' format: (1 to: 10).
>
> Interesting, but do you really think this is an undesired behavior?

It sounds fun until you notice the side effects.

> I changed the method with the aim to increase the readability, which was also impaired by the hardcoded numbers.

Indeed, that method is a prime example of trading legibility for performance,
even though the comments helped you out quite a bit, didn't they? :)

> In my eyes it is a nice side effect to support other kinds of Unicode values - NumberParser does the same.

IMO, it opens a can of worms:
- WideStrings use 4x as much memory as ByteStings, and they lack the VM support ByteStrings have, so many operations are significantly slower with them.
- WideStrings spread like plague:
- Wrote a WideString into a stream? your stream's buffer is now a WideString.
- Did some operation with a WideString, e.g. #,? The result is very likely a WideString.
- Why doesn't this string match my regex '.*[0-9].*'? There's clearly a 9 in there... Oh, wait, it's a WideString with a "Mathematical sans-serif digit nine".

Levente

>
> Christoph
>
> _________________________________________________________________________________________________________________________________________________________________________________________________________________________________
> Von: Squeak-dev <[hidden email]> im Auftrag von Levente Uzonyi <[hidden email]>
> Gesendet: Freitag, 16. August 2019 00:51:14
> An: [hidden email]
> Betreff: Re: [squeak-dev] The Inbox: Collections-ct.851.mcz
> On Thu, 15 Aug 2019, [hidden email] wrote:
>
> > A new version of Collections was added to project The Inbox:
> > http://source.squeak.org/inbox/Collections-ct.851.mcz
> >
> > ==================== Summary ====================
> >
> > Name: Collections-ct.851
> > Author: ct
> > Time: 15 August 2019, 11:36:23.694735 pm
> > UUID: 6c7113c6-9d09-bf4c-9b32-e2001870bb56
> > Ancestors: Collections-ct.850
> >
> > Refactor String>>#format: according to Text>>#format:
> >
> > I could not find any significant performance impacts.
>
> Difference only appears when there are substitutions to be done. The more
> substitutions you have, the higher the difference in performance will be.
> This new version takes up to ~40% longer than the current implementation,
> which is still not optimal btw.
> The new variant also changes the semantics, because it accepts unicode
> digits. For example, the following won't raise an error but yield 9:
>
>          '{', (Character value: 16r1d7eb) asString, '}' format: (1 to: 10).
>
> Levente
>
> >
> > We have some duplication between both #format: implementations. Do you thing this is a problem at the current scale?
> >
> > =============== Diff against Collections-ct.850 ===============
> >
> > Item was changed:
> > ----- Method: String>>format: (in category 'formatting') -----
> > + format: arguments
> > +      "format the receiver with arguments
> > - format: aCollection
> > -      "format the receiver with aCollection
> >
> >        simplest example:
> >        'foo {1} bar' format: {Date today}.
> >
> >        complete example:
> >        '\{ \} \\ foo {1} bar {2}' format: {12. 'string'}.
> >        "
> > +      ^self class new: self size * 11 // 10 streamContents: [ :output |
> > -      ^self class new: self size * 11 // 10 "+10%" streamContents: [ :output |
> >                | lastIndex nextIndex |
> >                lastIndex := 1.
> >                [ (nextIndex := self indexOfAnyOf: FormatCharacterSet startingAt: lastIndex) = 0 ] whileFalse: [
> >                        nextIndex = lastIndex ifFalse: [
> >                                output next: nextIndex - lastIndex putAll: self startingAt: lastIndex ].
> > +                      (self at: nextIndex) caseOf: {
> > +                              [$\] -> [ output nextPut: (self at: (nextIndex := nextIndex + 1)) ].
> > +                              [${] -> [
> > -                      (self at: nextIndex) == $\
> > -                              ifTrue: [ output nextPut: (self at: (nextIndex := nextIndex + 1)) ]
> > -                              ifFalse: [ "${"
> >                                        "Parse the index - a positive integer in base 10."
> > +                                      | character collectionIndex |
> > -                                      | digitValue collectionIndex |
> >                                        collectionIndex := 0.
> > +                                      [ (character := self at: (nextIndex := nextIndex + 1)) isDigit ] whileTrue: [
> > +                                              collectionIndex := collectionIndex * 10 + character digitValue ].
> > +                                      character = $} ifFalse: [ self error: '$} expected' ].
> > +                                      output nextPutAll: (arguments at: collectionIndex) asString ] }.
> > -                                      [ (digitValue := self basicAt: (nextIndex := nextIndex + 1)) between: 48 "$0 asciiValue" and: 57 "$9 asciiValue" ] whileTrue: [
> > -                                              collectionIndex := collectionIndex * 10 + digitValue - 48. "$0 asciiValue" ].
> > -                                      digitValue = 125 "$} asciiValue" ifFalse: [ self error: '$} expected' ].
> > -                                      output nextPutAll: (aCollection at: collectionIndex) asString ].
> >                        lastIndex := nextIndex + 1 ].
> >                lastIndex <= self size ifTrue: [
> >                        output next: self size - lastIndex + 1 putAll: self startingAt: lastIndex ] ]!
>
>
>

Jakob Reschke

Re: The Inbox: Collections-ct.851.mcz

Am Fr., 16. Aug. 2019 um 01:24 Uhr schrieb Levente Uzonyi <[hidden email]>:

On Thu, 15 Aug 2019, Thiede, Christoph wrote:
> In my eyes it is a nice side effect to support other kinds of Unicode values - NumberParser does the same.

IMO, it opens a can of worms:
- WideStrings use 4x as much memory as ByteStings, and they lack the VM support ByteStrings have, so many operations are significantly slower with them.
- WideStrings spread like plague:
- Wrote a WideString into a stream? your stream's buffer is now a WideString.
- Did some operation with a WideString, e.g. #,? The result is very likely a WideString.
- Why doesn't this string match my regex '.*[0-9].*'? There's clearly a 9 in there... Oh, wait, it's a WideString with a "Mathematical sans-serif digit nine".

Looks like the usual can of worms you get when you want to support international text. And if your regex wants both to be applied to unicode text and to find strings with any kind of number in it, then it is incomplete. :-) In general, treating the unicode digits as digits should actually alleviate this debugging confusion where you wonder why a digit was not processed as such, shouldn't it?

Question is: do the Smalltalk writers expect that their string, which incidentally contains '... {', (Mathematical sans-serif digit one), '} ...' (could be in part supplied by the user?), will have that sequence replaced by the first formatting argument or do they not expect it? Also: if user input is sanitized to escape format sequences before applying further formatting on the extended text later (imaginary scenario), this sanitization must now also support such unicode cases.

Levente Uzonyi

Re: The Inbox: Collections-ct.851.mcz

On Fri, 16 Aug 2019, Jakob Reschke wrote:

> Am Fr., 16. Aug. 2019 um 01:24 Uhr schrieb Levente Uzonyi <[hidden email]>:
> On Thu, 15 Aug 2019, Thiede, Christoph wrote:
> > In my eyes it is a nice side effect to support other kinds of Unicode values - NumberParser does the same.
>
> IMO, it opens a can of worms:
> - WideStrings use 4x as much memory as ByteStings, and they lack the VM support ByteStrings have, so many operations are significantly slower with them.
> - WideStrings spread like plague:
> - Wrote a WideString into a stream? your stream's buffer is now a WideString.
> - Did some operation with a WideString, e.g. #,? The result is very likely a WideString.
> - Why doesn't this string match my regex '.*[0-9].*'? There's clearly a 9 in there... Oh, wait, it's a WideString with a "Mathematical sans-serif digit nine".
>
>
> Looks like the usual can of worms you get when you want to support international text. And if your regex wants both to be applied to unicode text and to find strings with any kind of number in it, then it is incomplete. :-)

I guess you missed my point. You do not want to match unicode digits when
you write [0-9], but the unicode character may visually appear as a
regular digit, making it harder to debug your code.

> In general, treating the unicode digits as digits should actually alleviate this debugging confusion where you wonder why a digit was not processed as such, shouldn't it?

It depends on how you process those numbers.

>
> Question is: do the Smalltalk writers expect that their string, which incidentally contains '... {', (Mathematical sans-serif digit one), '} ...' (could be in part supplied by the user?), will have that sequence replaced by
> the first formatting argument or do they not expect it? Also: if user input is sanitized to escape format sequences before applying further formatting on the extended text later (imaginary scenario), this sanitization must
> now also support such unicode cases.

I personally see little value in having 63 ways to write a single
digit in my Smalltalk method.

Levente

>
>

Christoph Thiede

Re: The Inbox: Collections-ct.851.mcz

> I personally see little value in having 63 ways to write a single digit in my Smalltalk method.

But if we don't support this, we are breaking with the standards from NumberParser. Wouldn't this be inconsistent?

Christoph

Von: Levente Uzonyi

Gesendet: Freitag, 16. August, 13:41

Betreff: Re: [squeak-dev] The Inbox: Collections-ct.851.mcz

An: The general-purpose Squeak developers list

On Fri, 16 Aug 2019, Jakob Reschke wrote: > Am Fr., 16. Aug. 2019 um 01:24 Uhr schrieb Levente Uzonyi : > On Thu, 15 Aug 2019, Thiede, Christoph wrote: > > In my eyes it is a nice side effect to support other kinds of Unicode values - NumberParser does the same. > > IMO, it opens a can of worms: > - WideStrings use 4x as much memory as ByteStings, and they lack the VM support ByteStrings have, so many operations are significantly slower with them. > - WideStrings spread like plague: > - Wrote a WideString into a stream? your stream's buffer is now a WideString. > - Did some operation with a WideString, e.g. #,? The result is very likely a WideString. > - Why doesn't this string match my regex '.*[0-9].*'? There's clearly a 9 in there... Oh, wait, it's a WideString with a "Mathematical sans-serif digit nine". > > > Looks like the usual can of worms you get when you want to support international text. And if your regex wants both to be applied to unicode text and to find strings with any kind of number in it, then it is incomplete. :-) I guess you missed my point. You do not want to match unicode digits when you write [0-9], but the unicode character may visually appear as a regular digit, making it harder to debug your code. > In general, treating the unicode digits as digits should actually alleviate this debugging confusion where you wonder why a digit was not processed as such, shouldn't it? It depends on how you process those numbers. > > Question is: do the Smalltalk writers expect that their string, which incidentally contains '... {', (Mathematical sans-serif digit one), '} ...' (could be in part supplied by the user?), will have that sequence replaced by > the first formatting argument or do they not expect it? Also: if user input is sanitized to escape format sequences before applying further formatting on the extended text later (imaginary scenario), this sanitization must > now also support such unicode cases. I personally see little value in having 63 ways to write a single digit in my Smalltalk method. Levente > >

Carpe Squeak!

Jakob Reschke

Re: The Inbox: Collections-ct.851.mcz

In reply to this post by Levente Uzonyi

Am Fr., 16. Aug. 2019 um 13:41 Uhr schrieb Levente Uzonyi <[hidden email]>:

On Fri, 16 Aug 2019, Jakob Reschke wrote:
>
> Question is: do the Smalltalk writers expect that their string, which incidentally contains '... {', (Mathematical sans-serif digit one), '} ...' (could be in part supplied by the user?), will have that sequence replaced by
> the first formatting argument or do they not expect it? Also: if user input is sanitized to escape format sequences before applying further formatting on the extended text later (imaginary scenario), this sanitization must
> now also support such unicode cases.

I personally see little value in having 63 ways to write a single
digit in my Smalltalk method.

Me neither. I think it would rather surprise me if these higher-unicode digits were considered for the placeholders during the formatting. Since this construct of {0} and so forth is one for programmers, the input should only come from Smalltalk code. Since Smalltalk as a language does not support other numerals than the "basic" digits, I think it is reasonable to restrict the format method to just consider these digits.

If we decided one day that Squeak should also support Roman numerals or Han numbers in Smalltalk code, I would reconsider. ;-)

https://en.wikipedia.org/wiki/Numerals_in_Unicode

Jakob Reschke

Re: The Inbox: Collections-ct.851.mcz

In reply to this post by Christoph Thiede

Am Fr., 16. Aug. 2019 um 14:38 Uhr schrieb Thiede, Christoph <[hidden email]>:

> I personally see little value in having 63 ways to write a single digit in my Smalltalk method.

But if we don't support this, we are breaking with the standards from NumberParser. Wouldn't this be inconsistent?

What is the objective of NumberParser? Parse numbers from arbitrary sources of text? Then I would say it has a different scope than what is required for the formatting placeholders.

If SqNumberParser is the one that is used to parse numbers from Smalltalk expressions and if it supports non-ascii digits, then I wonder whether this is really desired behavior.

marcel.taeumel

Re: The Inbox: Collections-ct.851.mcz

Hi, all. :-)

I think that it would be better to only allow a small set of characters to be used as placeholders. So, -1 for Unicode support in {}-placeholders. Unicode applications should have to convert user-defined unicode patterns manually if they want to support unicode placeholders.

Here are some benchmarks. Not that much difference.

[ '{2} He\{ll\}o {1} World!' format: { 'Smalltalk' . 'Hey' } ] bench.

'BEFORE: 1,270,000 per second. 789 nanoseconds per run.'

'AFTER: 1,070,000 per second. 931 nanoseconds per run.'

Best,

Marcel

Am 16.08.2019 14:50:00 schrieb Jakob Reschke <[hidden email]>:

Am Fr., 16. Aug. 2019 um 14:38 Uhr schrieb Thiede, Christoph <[hidden email]>:

> I personally see little value in having 63 ways to write a single digit in my Smalltalk method.

But if we don't support this, we are breaking with the standards from NumberParser. Wouldn't this be inconsistent?

What is the objective of NumberParser? Parse numbers from arbitrary sources of text? Then I would say it has a different scope than what is required for the formatting placeholders.

If SqNumberParser is the one that is used to parse numbers from Smalltalk expressions and if it supports non-ascii digits, then I wonder whether this is really desired behavior.