[ENH] isSeparator

Christoph Thiede
Hi all,

here is one tiny changeset for you: isSeparator.cs adds proper encoding-aware support for testing of separator chars. As opposed to the former implementation, non-ASCII characters such as the no-break space (U+00A0) will be identified correctly now, too.

Please review and merge! :-)

Best,
Christoph

Attachment: isSeparator.cs.gz (600 bytes)
Carpe Squeak!
Re: [ENH] isSeparator

Christoph Thiede

Hi all,


here is another tiny changeset, depending on isSeparator.cs: withAllBlanksTrimmed.cs uses the encoding-aware #isSeparator implementation to trim all kinds of spaces correctly from a string.


Best,
Christoph

In reply to Christoph Thiede's post of Thursday, 6 May 2021, 22:27:57.

Attachment: withAllBlanksTrimmed.1.cs (2K)
Re: [ENH] isSeparator

Christoph Thiede

Community support: Inlined changesets


--- isSeparator.1.cs ---

'From Squeak6.0alpha of 29 April 2021 [latest update: #20483] on 6 May 2021 at 10:21:24 pm'!

!Character methodsFor: 'testing' stamp: 'ct 5/6/2021 21:41'!
isSeparator
"Answer whether the receiver is a separator such as space, cr, tab, line feed, or form feed."

^ self encodedCharSet isSeparator: self! !


!EncodedCharSet class methodsFor: 'character classification' stamp: 'ct 5/6/2021 21:46'!
isSeparator: char
"Answer whether char has the code of a separator in this encoding."

^ self isSeparatorCode: char charCode! !

!EncodedCharSet class methodsFor: 'character classification' stamp: 'ct 5/6/2021 21:39'!
isSeparatorCode: anInteger
"Answer whether anInteger is the code of a separator."

^ Character separators includesCode: anInteger! !


!Unicode class methodsFor: 'character classification' stamp: 'ct 5/6/2021 21:51'!
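isSeparatorCode: charCode
"Answer whether charCode is a control character (general category Cc) or falls into one of the categories Zl through Zs."

| cat |
cat := self generalCategoryOf: charCode.
^ cat = Cc or: [cat >= Zl and: [cat <= Zs]]! !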
------

--- withAllBlanksTrimmed.1.cs ---
'From Squeak6.0alpha of 29 April 2021 [latest update: #20483] on 6 May 2021 at 10:24:39 pm'!

!String methodsFor: 'converting' stamp: 'ct 5/6/2021 21:56'!
withBlanksTrimmed
"Return a copy of the receiver from which leading and trailing blanks have been trimmed."

| first last |
first := (self findFirst: [:character | character isSeparator not]).
first = 0 ifTrue: [^ ''].
"no non-separator character"
last := self findLast: [:character | character isSeparator not].
last = 0 ifTrue: [last := self size].
(first = 1 and: [last = self size]) ifTrue: [^ self copy].
^ self copyFrom: first to: last! !


!StringTest methodsFor: 'tests - converting' stamp: 'ct 5/6/2021 22:00'!
testWithBlanksTrimmed

| s |
self assert: ' abc  d   ' withBlanksTrimmed = 'abc  d'.
self assert: 'abc  d   ' withBlanksTrimmed = 'abc  d'.
self assert: ' abc  d' withBlanksTrimmed = 'abc  d'.
self assert: (((0 to: 255) collect: [:each | each asCharacter] thenSelect: [:each | each isSeparator]) as: String) withBlanksTrimmed = ''.
self assert: ' nbsps around ' withBlanksTrimmed = 'nbsps around'.
s := 'abcd'.
self assert: s withBlanksTrimmed = s.
self assert: s withBlanksTrimmed ~~ s! !


!Text methodsFor: 'converting' stamp: 'ct 5/6/2021 21:57'!
withBlanksTrimmed
"Return a copy of the receiver from which leading and trailing blanks have been trimmed."

| first last |
first := string findFirst: [:character | character isSeparator not].
first = 0 ifTrue: [^ ''].
"no non-separator character"
last := string findLast: [:character | character isSeparator not].
last = 0 ifTrue: [last := self size].
(first = 1 and: [last = self size]) ifTrue: [^ self copy].
^ self copyFrom: first to: last! !
------


In reply to Christoph Thiede's post of Thursday, 6 May 2021, 22:30:56.


Re: [ENH] isSeparator

Levente Uzonyi
In reply to this post by Christoph Thiede
Hi Christoph,

There was a discussion on this subject before:
http://forum.world.st/The-Trunk-Collections-topa-806-mcz-td5084658.html
Main concerns are:
- definition: What is a separator?
- consistency: CharacterSet separators would differ from the rest with your change set.
- performance: I haven't measured it, but I wouldn't be surprised if #isSeparator became an order of magnitude slower with that implementation.
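
The performance concern could be checked with a quick workspace measurement. The following is only a minimal sketch, assuming an image in which BlockClosure>>bench is available; it would be evaluated once before and once after filing in the changeset, and the two reports compared by hand.

```smalltalk
"Workspace sketch: report the throughput of #isSeparator for an ASCII and a non-ASCII character."
| ascii nbsp |
ascii := Character space.
nbsp := Character value: 16rA0.	"no-break space"
Transcript
	show: 'space: ', [ascii isSeparator] bench; cr;
	show: 'nbsp:  ', [nbsp isSeparator] bench; cr.
```

The numbers depend on the machine and the VM, so only the ratio between the two runs is meaningful.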


Levente

Re: [ENH] isSeparator

Christoph Thiede

Hi Levente,


thanks for the pointer. As far as I can see from the linked discussion, Tobias' proposal has never been rejected but only postponed due to the upcoming release. I also see your point about performance, but IMHO correctness is more important than performance. If necessary, we could still hard-code the relevant code points into #isSeparator.


> - consistency: CharacterSet separators would differ from the rest with your change set.


Fair point, but I think we should instead fix the definitions of Character(Set) constants to respect the encoding as well ... By the way, Character alphabet and Character allCharacters also don't do this at the moment.

Of course, all your concerns are valid points and need to be discussed, but I would be sorry if we failed to - finally - establish current standards in our Character library. I doubt that any modern parser for JSON or whatever would treat Unicode space characters incorrectly, and still, they are satisfyingly fast. I think we should be able to keep pace with them in Squeak as well. :-)

Best,
Christoph

In reply to Levente Uzonyi's post of Friday, 7 May 2021, 22:01:18.



Re: [ENH] isSeparator

Levente Uzonyi
On Fri, 7 May 2021, Thiede, Christoph wrote:

> Hi Levente,
>
> thanks for the pointer. As far as I can see from the linked discussion, Tobias' proposal has never been rejected but only postponed due to the upcoming release. I also see your point about performance, but IMHO correctness is more important than performance. If necessary, we could still hard-code the relevant code points into #isSeparator.
>
> > - consistency: CharacterSet separators would differ from the rest with your change set.
>
> Fair point, but I think we should instead fix the definitions of Character(Set) constants to respect the encoding as well ... By the way, Character alphabet and Character allCharacters also don't do this at the moment.
>
> Of course, all your concerns are valid points and need to be discussed, but I would be sorry if we failed to - finally - establish current standards in our Character library. I doubt that any modern parser for JSON or whatever would treat Unicode space characters incorrectly, and still, they are satisfyingly fast. I think we should be able to keep pace with them in Squeak as well. :-)

Well, you ignored my question "What is a separator?". IMO a separator is a whitespace character that separates tokens in the source code. Would you like to use zero-width space as a separator? Not likely. #isSeparator is deeply buried in the system. Changing it would mean changing other code your changeset doesn't touch, e.g. the parsers.

The method you propose is welcome, but IMO it shouldn't be called #isSeparator. #isWhitespace is a much better fit.
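
The renaming suggested above could look roughly like the sketch below, which reuses the logic of the posted changeset under new selectors and leaves #isSeparator untouched. The selectors #isWhitespace and #isWhitespaceCode: are hypothetical; they are not part of the posted changesets or, as far as this thread shows, of the image.

```smalltalk
!Character methodsFor: 'testing'!
isWhitespace
	"Hypothetical selector: answer whether the receiver counts as whitespace in its encoding."

	^ self encodedCharSet isWhitespaceCode: self charCode! !

!EncodedCharSet class methodsFor: 'character classification'!
isWhitespaceCode: anInteger
	"Hypothetical default for encodings without category data: fall back to the classic separators."

	^ Character separators includesCode: anInteger! !

!Unicode class methodsFor: 'character classification'!
isWhitespaceCode: charCode
	"Hypothetical selector: same category test as the posted isSeparatorCode:."

	| cat |
	cat := self generalCategoryOf: charCode.
	^ cat = Cc or: [cat >= Zl and: [cat <= Zs]]! !
```

With this split, parsers and other existing senders of #isSeparator would keep their current behavior, and only code that opts in to #isWhitespace would see the Unicode-aware answer.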


Levente

Re: [ENH] isSeparator

marcel.taeumel
This reminds me of our #asNumber (or number parser) discussion where we agreed to not parse number-like appearances in Unicode to Integer. :-)

Instead of modifying CharacterSet etc., one could maybe extend TextConverter to support encoding-aware identification of separators etc. and also provide an encoding-aware #trim.

Best,
Marcel

In reply to Levente Uzonyi's post of 8 May 2021, 04:12:21.


Re: [ENH] isSeparator

Christoph Thiede

Hi Marcel, hi Levente.


> Well, you ignored my question "What is a separator?".


Sorry for that. Well, I don't completely agree with your definition; I would rather follow the semantics of Unicode. The Unicode character category "separators" includes line separators (Zl), paragraph separators (Zp), and space separators (Zs), and nbsp is, like Character space, part of the space separators group. In other contexts like the web, afaik they are used largely interchangeably anyway, so I can hardly imagine any scenario where something like #trimWhitespace should ignore these characters.
(By the way, zero-width space is not a space character at all according to Unicode but rather a formatting character (Cf).)
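
The two classifications mentioned above can be inspected in a workspace. This is a sketch under the assumption that the Unicode class keeps its category constants in class variables named Zs and Cf (the changeset refers to Cc, Zl and Zs in this way) and that the image's category table follows the Unicode standard.

```smalltalk
"Workspace sketch: look up the general category of two code points."
| zs cf |
zs := Unicode classPool at: #Zs.	"space separator"
cf := Unicode classPool at: #Cf.	"format character"
Transcript
	show: 'U+00A0 is Zs: ', ((Unicode generalCategoryOf: 16r00A0) = zs) printString; cr;
	show: 'U+200B is Cf: ', ((Unicode generalCategoryOf: 16r200B) = cf) printString; cr.
```

If both lines report true, the proposed category test accepts the no-break space and rejects the zero-width space.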

> Instead of modifying CharacterSet etc., one could maybe extend TextConverter to support encoding-aware identification of separators etc and also provide encoding-aware #trim.

I see your arguments for maintaining backward compatibility, but this proposal scares me a bit. I would really like to see String being Unicode-aware by default (as in perhaps every other modern programming language) instead of providing a separate interface that "a few exotic clients that care about other encodings" can use. :-)

I originally stumbled upon this when I was using the HtmlReadWriter to parse a piece of HTML that contained a nbsp in its CSS. That is perfectly valid HTML/CSS, but #mapContainerTag: failed on it because #withBlanksTrimmed did not strip away this nbsp. Yes, of course you could ask TextConverter or Unicode or whatever else in that place, but this feels like the wrong approach to me. Unicode awareness should be opt-out nowadays, not opt-in.
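
The trimming behavior described above can be reproduced without the HTML parser. The snippet is only an illustration; the string 'color: red' is a made-up stand-in for the CSS value, not the input that originally failed.

```smalltalk
"Workspace sketch: trim a string that is padded with no-break spaces."
| nbsp padded |
nbsp := String with: (Character value: 16rA0).
padded := nbsp, 'color: red', nbsp.
Transcript show: padded withBlanksTrimmed size printString; cr.
```

Going by the thread, an image without the changesets leaves the no-break spaces in place, so the size stays at 12, whereas with the changesets loaded the result should shrink to the 10 characters of 'color: red'.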

Best,
Christoph


In reply to Marcel Taeumel's post of Wednesday, 12 May 2021, 07:37:17.
