A new version of Collections was added to project The Inbox:
http://source.squeak.org/inbox/Collections-ct.851.mcz ==================== Summary ==================== Name: Collections-ct.851 Author: ct Time: 15 August 2019, 11:36:23.694735 pm UUID: 6c7113c6-9d09-bf4c-9b32-e2001870bb56 Ancestors: Collections-ct.850 Refactor String>>#format: according to Text>>#format: I could not find any significant performance impacts. We have some duplication between both #format: implementations. Do you thing this is a problem at the current scale? =============== Diff against Collections-ct.850 =============== Item was changed: ----- Method: String>>format: (in category 'formatting') ----- + format: arguments + "format the receiver with arguments - format: aCollection - "format the receiver with aCollection simplest example: 'foo {1} bar' format: {Date today}. complete example: '\{ \} \\ foo {1} bar {2}' format: {12. 'string'}. " + ^self class new: self size * 11 // 10 streamContents: [ :output | - ^self class new: self size * 11 // 10 "+10%" streamContents: [ :output | | lastIndex nextIndex | lastIndex := 1. [ (nextIndex := self indexOfAnyOf: FormatCharacterSet startingAt: lastIndex) = 0 ] whileFalse: [ nextIndex = lastIndex ifFalse: [ output next: nextIndex - lastIndex putAll: self startingAt: lastIndex ]. + (self at: nextIndex) caseOf: { + [$\] -> [ output nextPut: (self at: (nextIndex := nextIndex + 1)) ]. + [${] -> [ - (self at: nextIndex) == $\ - ifTrue: [ output nextPut: (self at: (nextIndex := nextIndex + 1)) ] - ifFalse: [ "${" "Parse the index - a positive integer in base 10." + | character collectionIndex | - | digitValue collectionIndex | collectionIndex := 0. + [ (character := self at: (nextIndex := nextIndex + 1)) isDigit ] whileTrue: [ + collectionIndex := collectionIndex * 10 + character digitValue ]. + character = $} ifFalse: [ self error: '$} expected' ]. + output nextPutAll: (arguments at: collectionIndex) asString ] }. - [ (digitValue := self basicAt: (nextIndex := nextIndex + 1)) between: 48 "$0 asciiValue" and: 57 "$9 asciiValue" ] whileTrue: [ - collectionIndex := collectionIndex * 10 + digitValue - 48. "$0 asciiValue" ]. - digitValue = 125 "$} asciiValue" ifFalse: [ self error: '$} expected' ]. - output nextPutAll: (aCollection at: collectionIndex) asString ]. lastIndex := nextIndex + 1 ]. lastIndex <= self size ifTrue: [ output next: self size - lastIndex + 1 putAll: self startingAt: lastIndex ] ]! |
On Thu, 15 Aug 2019, [hidden email] wrote:
> A new version of Collections was added to project The Inbox: > http://source.squeak.org/inbox/Collections-ct.851.mcz > > ==================== Summary ==================== > > Name: Collections-ct.851 > Author: ct > Time: 15 August 2019, 11:36:23.694735 pm > UUID: 6c7113c6-9d09-bf4c-9b32-e2001870bb56 > Ancestors: Collections-ct.850 > > Refactor String>>#format: according to Text>>#format: > > I could not find any significant performance impacts. Difference only appears when there are substitutions to be done. The more substitutions you have, the higher the difference in performance will be. This new version takes up to ~40% longer than the current implementation, which is still not optimal btw. The new variant also changes the semantics, because it accepts unicode digits. For example, the following won't raise an error but yield 9: '{', (Character value: 16r1d7eb) asString, '}' format: (1 to: 10). Levente > > We have some duplication between both #format: implementations. Do you thing this is a problem at the current scale? > > =============== Diff against Collections-ct.850 =============== > > Item was changed: > ----- Method: String>>format: (in category 'formatting') ----- > + format: arguments > + "format the receiver with arguments > - format: aCollection > - "format the receiver with aCollection > > simplest example: > 'foo {1} bar' format: {Date today}. > > complete example: > '\{ \} \\ foo {1} bar {2}' format: {12. 'string'}. > " > + ^self class new: self size * 11 // 10 streamContents: [ :output | > - ^self class new: self size * 11 // 10 "+10%" streamContents: [ :output | > | lastIndex nextIndex | > lastIndex := 1. > [ (nextIndex := self indexOfAnyOf: FormatCharacterSet startingAt: lastIndex) = 0 ] whileFalse: [ > nextIndex = lastIndex ifFalse: [ > output next: nextIndex - lastIndex putAll: self startingAt: lastIndex ]. > + (self at: nextIndex) caseOf: { > + [$\] -> [ output nextPut: (self at: (nextIndex := nextIndex + 1)) ]. > + [${] -> [ > - (self at: nextIndex) == $\ > - ifTrue: [ output nextPut: (self at: (nextIndex := nextIndex + 1)) ] > - ifFalse: [ "${" > "Parse the index - a positive integer in base 10." > + | character collectionIndex | > - | digitValue collectionIndex | > collectionIndex := 0. > + [ (character := self at: (nextIndex := nextIndex + 1)) isDigit ] whileTrue: [ > + collectionIndex := collectionIndex * 10 + character digitValue ]. > + character = $} ifFalse: [ self error: '$} expected' ]. > + output nextPutAll: (arguments at: collectionIndex) asString ] }. > - [ (digitValue := self basicAt: (nextIndex := nextIndex + 1)) between: 48 "$0 asciiValue" and: 57 "$9 asciiValue" ] whileTrue: [ > - collectionIndex := collectionIndex * 10 + digitValue - 48. "$0 asciiValue" ]. > - digitValue = 125 "$} asciiValue" ifFalse: [ self error: '$} expected' ]. > - output nextPutAll: (aCollection at: collectionIndex) asString ]. > lastIndex := nextIndex + 1 ]. > lastIndex <= self size ifTrue: [ > output next: self size - lastIndex + 1 putAll: self startingAt: lastIndex ] ]! |
Thanks for your thoughts!
> The new variant also changes the semantics, because it accepts unicode > digits. For example, the following won't raise an error but yield 9:
>
> '{', (Character value: 16r1d7eb) asString, '}' format: (1 to: 10).
Interesting, but do you really think this is an undesired behavior?
I changed the method with the aim to increase the readability, which was also impaired by the hardcoded numbers.
In my eyes it is a nice side effect to support other kinds of Unicode values - NumberParser does the same.
Christoph
Von: Squeak-dev <[hidden email]> im Auftrag von Levente Uzonyi <[hidden email]>
Gesendet: Freitag, 16. August 2019 00:51:14 An: [hidden email] Betreff: Re: [squeak-dev] The Inbox: Collections-ct.851.mcz On Thu, 15 Aug 2019, [hidden email] wrote:
> A new version of Collections was added to project The Inbox: > http://source.squeak.org/inbox/Collections-ct.851.mcz > > ==================== Summary ==================== > > Name: Collections-ct.851 > Author: ct > Time: 15 August 2019, 11:36:23.694735 pm > UUID: 6c7113c6-9d09-bf4c-9b32-e2001870bb56 > Ancestors: Collections-ct.850 > > Refactor String>>#format: according to Text>>#format: > > I could not find any significant performance impacts. Difference only appears when there are substitutions to be done. The more substitutions you have, the higher the difference in performance will be. This new version takes up to ~40% longer than the current implementation, which is still not optimal btw. The new variant also changes the semantics, because it accepts unicode digits. For example, the following won't raise an error but yield 9: '{', (Character value: 16r1d7eb) asString, '}' format: (1 to: 10). Levente > > We have some duplication between both #format: implementations. Do you thing this is a problem at the current scale? > > =============== Diff against Collections-ct.850 =============== > > Item was changed: > ----- Method: String>>format: (in category 'formatting') ----- > + format: arguments > + "format the receiver with arguments > - format: aCollection > - "format the receiver with aCollection > > simplest example: > 'foo {1} bar' format: {Date today}. > > complete example: > '\{ \} \\ foo {1} bar {2}' format: {12. 'string'}. > " > + ^self class new: self size * 11 // 10 streamContents: [ :output | > - ^self class new: self size * 11 // 10 "+10%" streamContents: [ :output | > | lastIndex nextIndex | > lastIndex := 1. > [ (nextIndex := self indexOfAnyOf: FormatCharacterSet startingAt: lastIndex) = 0 ] whileFalse: [ > nextIndex = lastIndex ifFalse: [ > output next: nextIndex - lastIndex putAll: self startingAt: lastIndex ]. > + (self at: nextIndex) caseOf: { > + [$\] -> [ output nextPut: (self at: (nextIndex := nextIndex + 1)) ]. > + [${] -> [ > - (self at: nextIndex) == $\ > - ifTrue: [ output nextPut: (self at: (nextIndex := nextIndex + 1)) ] > - ifFalse: [ "${" > "Parse the index - a positive integer in base 10." > + | character collectionIndex | > - | digitValue collectionIndex | > collectionIndex := 0. > + [ (character := self at: (nextIndex := nextIndex + 1)) isDigit ] whileTrue: [ > + collectionIndex := collectionIndex * 10 + character digitValue ]. > + character = $} ifFalse: [ self error: '$} expected' ]. > + output nextPutAll: (arguments at: collectionIndex) asString ] }. > - [ (digitValue := self basicAt: (nextIndex := nextIndex + 1)) between: 48 "$0 asciiValue" and: 57 "$9 asciiValue" ] whileTrue: [ > - collectionIndex := collectionIndex * 10 + digitValue - 48. "$0 asciiValue" ]. > - digitValue = 125 "$} asciiValue" ifFalse: [ self error: '$} expected' ]. > - output nextPutAll: (aCollection at: collectionIndex) asString ]. > lastIndex := nextIndex + 1 ]. > lastIndex <= self size ifTrue: [ > output next: self size - lastIndex + 1 putAll: self startingAt: lastIndex ] ]!
Carpe Squeak!
|
On Thu, 15 Aug 2019, Thiede, Christoph wrote:
> > Thanks for your thoughts! > > > > The new variant also changes the semantics, because it accepts unicode > > > digits. For example, the following won't raise an error but yield 9: > > > > '{', (Character value: 16r1d7eb) asString, '}' format: (1 to: 10). > > Interesting, but do you really think this is an undesired behavior? > I changed the method with the aim to increase the readability, which was also impaired by the hardcoded numbers. Indeed, that method is a prime example of trading legibility for performance, even though the comments helped you out quite a bit, didn't they? :) > In my eyes it is a nice side effect to support other kinds of Unicode values - NumberParser does the same. IMO, it opens a can of worms: - WideStrings use 4x as much memory as ByteStings, and they lack the VM support ByteStrings have, so many operations are significantly slower with them. - WideStrings spread like plague: - Wrote a WideString into a stream? your stream's buffer is now a WideString. - Did some operation with a WideString, e.g. #,? The result is very likely a WideString. - Why doesn't this string match my regex '.*[0-9].*'? There's clearly a 9 in there... Oh, wait, it's a WideString with a "Mathematical sans-serif digit nine". Levente > > Christoph > > _________________________________________________________________________________________________________________________________________________________________________________________________________________________________ > Von: Squeak-dev <[hidden email]> im Auftrag von Levente Uzonyi <[hidden email]> > Gesendet: Freitag, 16. August 2019 00:51:14 > An: [hidden email] > Betreff: Re: [squeak-dev] The Inbox: Collections-ct.851.mcz > On Thu, 15 Aug 2019, [hidden email] wrote: > > > A new version of Collections was added to project The Inbox: > > http://source.squeak.org/inbox/Collections-ct.851.mcz > > > > ==================== Summary ==================== > > > > Name: Collections-ct.851 > > Author: ct > > Time: 15 August 2019, 11:36:23.694735 pm > > UUID: 6c7113c6-9d09-bf4c-9b32-e2001870bb56 > > Ancestors: Collections-ct.850 > > > > Refactor String>>#format: according to Text>>#format: > > > > I could not find any significant performance impacts. > > Difference only appears when there are substitutions to be done. The more > substitutions you have, the higher the difference in performance will be. > This new version takes up to ~40% longer than the current implementation, > which is still not optimal btw. > The new variant also changes the semantics, because it accepts unicode > digits. For example, the following won't raise an error but yield 9: > > '{', (Character value: 16r1d7eb) asString, '}' format: (1 to: 10). > > Levente > > > > > We have some duplication between both #format: implementations. Do you thing this is a problem at the current scale? > > > > =============== Diff against Collections-ct.850 =============== > > > > Item was changed: > > ----- Method: String>>format: (in category 'formatting') ----- > > + format: arguments > > + "format the receiver with arguments > > - format: aCollection > > - "format the receiver with aCollection > > > > simplest example: > > 'foo {1} bar' format: {Date today}. > > > > complete example: > > '\{ \} \\ foo {1} bar {2}' format: {12. 'string'}. > > " > > + ^self class new: self size * 11 // 10 streamContents: [ :output | > > - ^self class new: self size * 11 // 10 "+10%" streamContents: [ :output | > > | lastIndex nextIndex | > > lastIndex := 1. > > [ (nextIndex := self indexOfAnyOf: FormatCharacterSet startingAt: lastIndex) = 0 ] whileFalse: [ > > nextIndex = lastIndex ifFalse: [ > > output next: nextIndex - lastIndex putAll: self startingAt: lastIndex ]. > > + (self at: nextIndex) caseOf: { > > + [$\] -> [ output nextPut: (self at: (nextIndex := nextIndex + 1)) ]. > > + [${] -> [ > > - (self at: nextIndex) == $\ > > - ifTrue: [ output nextPut: (self at: (nextIndex := nextIndex + 1)) ] > > - ifFalse: [ "${" > > "Parse the index - a positive integer in base 10." > > + | character collectionIndex | > > - | digitValue collectionIndex | > > collectionIndex := 0. > > + [ (character := self at: (nextIndex := nextIndex + 1)) isDigit ] whileTrue: [ > > + collectionIndex := collectionIndex * 10 + character digitValue ]. > > + character = $} ifFalse: [ self error: '$} expected' ]. > > + output nextPutAll: (arguments at: collectionIndex) asString ] }. > > - [ (digitValue := self basicAt: (nextIndex := nextIndex + 1)) between: 48 "$0 asciiValue" and: 57 "$9 asciiValue" ] whileTrue: [ > > - collectionIndex := collectionIndex * 10 + digitValue - 48. "$0 asciiValue" ]. > > - digitValue = 125 "$} asciiValue" ifFalse: [ self error: '$} expected' ]. > > - output nextPutAll: (aCollection at: collectionIndex) asString ]. > > lastIndex := nextIndex + 1 ]. > > lastIndex <= self size ifTrue: [ > > output next: self size - lastIndex + 1 putAll: self startingAt: lastIndex ] ]! > > > |
Am Fr., 16. Aug. 2019 um 01:24 Uhr schrieb Levente Uzonyi <[hidden email]>: On Thu, 15 Aug 2019, Thiede, Christoph wrote: Looks like the usual can of worms you get when you want to support international text. And if your regex wants both to be applied to unicode text and to find strings with any kind of number in it, then it is incomplete. :-) In general, treating the unicode digits as digits should actually alleviate this debugging confusion where you wonder why a digit was not processed as such, shouldn't it? Question is: do the Smalltalk writers expect that their string, which incidentally contains '... {', (Mathematical sans-serif digit one), '} ...' (could be in part supplied by the user?), will have that sequence replaced by the first formatting argument or do they not expect it? Also: if user input is sanitized to escape format sequences before applying further formatting on the extended text later (imaginary scenario), this sanitization must now also support such unicode cases. |
On Fri, 16 Aug 2019, Jakob Reschke wrote:
> Am Fr., 16. Aug. 2019 um 01:24 Uhr schrieb Levente Uzonyi <[hidden email]>: > On Thu, 15 Aug 2019, Thiede, Christoph wrote: > > In my eyes it is a nice side effect to support other kinds of Unicode values - NumberParser does the same. > > IMO, it opens a can of worms: > - WideStrings use 4x as much memory as ByteStings, and they lack the VM support ByteStrings have, so many operations are significantly slower with them. > - WideStrings spread like plague: > - Wrote a WideString into a stream? your stream's buffer is now a WideString. > - Did some operation with a WideString, e.g. #,? The result is very likely a WideString. > - Why doesn't this string match my regex '.*[0-9].*'? There's clearly a 9 in there... Oh, wait, it's a WideString with a "Mathematical sans-serif digit nine". > > > Looks like the usual can of worms you get when you want to support international text. And if your regex wants both to be applied to unicode text and to find strings with any kind of number in it, then it is incomplete. :-) you write [0-9], but the unicode character may visually appear as a regular digit, making it harder to debug your code. > In general, treating the unicode digits as digits should actually alleviate this debugging confusion where you wonder why a digit was not processed as such, shouldn't it? It depends on how you process those numbers. > > Question is: do the Smalltalk writers expect that their string, which incidentally contains '... {', (Mathematical sans-serif digit one), '} ...' (could be in part supplied by the user?), will have that sequence replaced by > the first formatting argument or do they not expect it? Also: if user input is sanitized to escape format sequences before applying further formatting on the extended text later (imaginary scenario), this sanitization must > now also support such unicode cases. I personally see little value in having 63 ways to write a single digit in my Smalltalk method. Levente > > |
> I personally see little value in having 63 ways to write a single digit in my Smalltalk method.
But if we don't support this, we are breaking with the standards from NumberParser. Wouldn't this be inconsistent?
Christoph
Von: Levente Uzonyi
Gesendet: Freitag, 16. August, 13:41
Betreff: Re: [squeak-dev] The Inbox: Collections-ct.851.mcz
An: The general-purpose Squeak developers list
On Fri, 16 Aug 2019, Jakob Reschke wrote: > Am Fr., 16. Aug. 2019 um 01:24 Uhr schrieb Levente Uzonyi : > On Thu, 15 Aug 2019, Thiede, Christoph wrote: > > In my eyes it is a nice side effect to support other kinds of Unicode values - NumberParser does the
same. > > IMO, it opens a can of worms: > - WideStrings use 4x as much memory as ByteStings, and they lack the VM support ByteStrings have, so many operations are significantly slower with them. > - WideStrings spread like plague: > - Wrote a WideString
into a stream? your stream's buffer is now a WideString. > - Did some operation with a WideString, e.g. #,? The result is very likely a WideString. > - Why doesn't this string match my regex '.*[0-9].*'? There's clearly a 9 in there... Oh, wait, it's
a WideString with a "Mathematical sans-serif digit nine". > > > Looks like the usual can of worms you get when you want to support international text. And if your regex wants both to be applied to unicode text and to find strings with any kind of number in
it, then it is incomplete. :-) I guess you missed my point. You do not want to match unicode digits when you write [0-9], but the unicode character may visually appear as a regular digit, making it harder to debug your code. > In general, treating the unicode
digits as digits should actually alleviate this debugging confusion where you wonder why a digit was not processed as such, shouldn't it? It depends on how you process those numbers. > > Question is: do the Smalltalk writers expect that their string, which
incidentally contains '... {', (Mathematical sans-serif digit one), '} ...' (could be in part supplied by the user?), will have that sequence replaced by > the first formatting argument or do they not expect it? Also: if user input is sanitized to escape format
sequences before applying further formatting on the extended text later (imaginary scenario), this sanitization must > now also support such unicode cases. I personally see little value in having 63 ways to write a single digit in my Smalltalk method. Levente
> >
Carpe Squeak!
|
In reply to this post by Levente Uzonyi
Am Fr., 16. Aug. 2019 um 13:41 Uhr schrieb Levente Uzonyi <[hidden email]>: On Fri, 16 Aug 2019, Jakob Reschke wrote: Me neither. I think it would rather surprise me if these higher-unicode digits were considered for the placeholders during the formatting. Since this construct of {0} and so forth is one for programmers, the input should only come from Smalltalk code. Since Smalltalk as a language does not support other numerals than the "basic" digits, I think it is reasonable to restrict the format method to just consider these digits. If we decided one day that Squeak should also support Roman numerals or Han numbers in Smalltalk code, I would reconsider. ;-) |
In reply to this post by Christoph Thiede
Am Fr., 16. Aug. 2019 um 14:38 Uhr schrieb Thiede, Christoph <[hidden email]>:
What is the objective of NumberParser? Parse numbers from arbitrary sources of text? Then I would say it has a different scope than what is required for the formatting placeholders. If SqNumberParser is the one that is used to parse numbers from Smalltalk expressions and if it supports non-ascii digits, then I wonder whether this is really desired behavior. |
Hi, all. :-) I think that it would be better to only allow a small set of characters to be used as placeholders. So, -1 for Unicode support in {}-placeholders. Unicode applications should have to convert user-defined unicode patterns manually if they want to support unicode placeholders. Here are some benchmarks. Not that much difference. [ '{2} He\{ll\}o {1} World!' format: { 'Smalltalk' . 'Hey' } ] bench. 'BEFORE: 1,270,000 per second. 789 nanoseconds per run.' 'AFTER: 1,070,000 per second. 931 nanoseconds per run.' Best, Marcel
|
Free forum by Nabble | Edit this page |