The Trunk: Collections-topa.806.mcz

commits-2
Tobias Pape uploaded a new version of Collections to project The Trunk:
http://source.squeak.org/trunk/Collections-topa.806.mcz

==================== Summary ====================

Name: Collections-topa.806
Author: topa
Time: 12 September 2018, 3:28:40.687052 pm
UUID: 46b95db5-a773-4113-92f0-5ee905404b49
Ancestors: Collections-cmm.805

Fix separators to include U+00A0 (no break space)

Thanks Ron!

=============== Diff against Collections-cmm.805 ===============

Item was changed:
  ----- Method: Character class>>separators (in category 'instance creation') -----
  separators
+ "Answer a collection of space-like separator characters.
+ Note that we do not consider spaces in >8bit code points yet.
+ "
- "Answer a collection of the standard ASCII separator characters."
 
+ ^ #(9 "tab"
- ^ #(32 "space"
- 13 "cr"
- 9 "tab"
  10 "line feed"
+ 12 "form feed"
+ 13 "cr"
+ 32 "space"
+ 160 "non-breaking space, see Unicode Z general category")
+ collect: [:v | Character value: v] as: String
+ " To be considered:
+ 16r1680 OGHAM SPACE MARK
+ 16r2000 EN QUAD
+ 16r2001 EM QUAD
+ 16r2002 EN SPACE
+ 16r2003 EM SPACE
+ 16r2004 THREE-PER-EM SPACE
+ 16r2005 FOUR-PER-EM SPACE
+ 16r2006 SIX-PER-EM SPACE
+ 16r2007 FIGURE SPACE
+ 16r2008 PUNCTUATION SPACE
+ 16r2009 THIN SPACE
+ 16r200A HAIR SPACE
+ 16r2028 LINE SEPARATOR
+ 16r2029 PARAGRAPH SEPARATOR
+ 16r202F NARROW NO-BREAK SPACE
+ 16r205F MEDIUM MATHEMATICAL SPACE
+ 16r3000 IDEOGRAPHIC SPACE
+ "!
- 12 "form feed")
- collect: [:v | Character value: v] as: String!

Item was changed:
+ (PackageInfo named: 'Collections') postscript: 'CharacterSet cleanUp: false.'!
- (PackageInfo named: 'Collections') postscript: 'Character initializeClassificationTable'!



Re: The Trunk: Collections-topa.806.mcz

Levente Uzonyi
You're opening a can of worms with this. There are several other
separator/white space characters missing from that list.
Also, this change makes the various #*separator* implementations (e.g.
#isSeparator) inconsistent, so I strongly disagree with this change.

Levente



Re: The Trunk: Collections-topa.806.mcz

Tobias Pape

> On 13.09.2018, at 16:35, Levente Uzonyi <[hidden email]> wrote:
>
> You're opening a can of worms with this. There are several other separator/white space characters missing from that list.

Yeah, that's listed below in a comment. I am hesitant to add the others because of WideString, so I just put them in a comment.

> Also, this change makes the various #*separator* implementations (e.g. #isSeparator) inconsistent, so I strongly disagree with this change.

Hmm. But isSeparator is Wrong, then… because nbsp _is_ a separator, right?
See the discussion with Ron.
On a related note, is a very fast #isSeparator important?
Otherwise I'd just propose

        ^ #( 9 10 12 13 32 160 ) includes: self asInteger
for now…

All other *separator* messages fall back to either Character>>#isSeparator or #separators from CharacterSet, which in turn is based on Character class>>#separators.
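
A minimal workspace sketch for comparing the two notions of "separator" side by side. It assumes an image with Collections-topa.806 loaded and uses only Character class>>#separators, Character>>#isSeparator and the Transcript; the code points are the ones from the diff, and nothing here is meant as a proposed implementation.

```smalltalk
"Workspace sketch (assumption: Collections-topa.806 is loaded).
 For each code point from the diff, print what #isSeparator answers
 and whether the character is listed in Character class>>#separators."
#(9 10 12 13 32 160) do: [:codePoint |
	| character |
	character := Character value: codePoint.
	Transcript
		show: codePoint printString;
		show: ' isSeparator: ';
		show: character isSeparator printString;
		show: ' listed in Character separators: ';
		show: (Character separators includes: character) printString;
		cr]
```

Any line where the two answers differ marks a character on which the *separator* messages disagree.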






Re: The Trunk: Collections-topa.806.mcz

Levente Uzonyi
On Thu, 13 Sep 2018, Tobias Pape wrote:

>
>> On 13.09.2018, at 16:35, Levente Uzonyi <[hidden email]> wrote:
>>
>> You're opening a can of worms with this. There are several other separator/white space characters missing from that list.
>
> Yeah, that's listed below in a comment. I am hesitant to add the others because of WideString, so I just put them in a comment.

That list is still incomplete (e.g. zero width space), and you still have
to deal with the can of worms - aka answering "What is a separator?".

>
>> Also, this change makes the various #*separator* implementations (e.g. #isSeparator) inconsistent, so I strongly disagree with this change.
>
> Hmm. But isSeparator is Wrong, then… because nbsp _is_ a separator, right?
> See the discussion with Ron.
> On a related note, is a very fast #isSeparator important?

Yes, it is. It's used extensively by various parsers. For example, see the
senders of #isSeparator and #skipSeparators.
Also, consider how the change of behavior affects those methods (along
with other users, e.g. those methods which use the character sets).

> Otherwise I'd just propose
>
> ^ #( 9 10 12 13 32 160 ) includes: self asInteger
> for now…

According to my measurements, that would be 10-15x slower than the
current implementation. I optimized it for a reason, not just for fun.
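
A rough way to reproduce this kind of comparison in a workspace, as a sketch only. It assumes BlockClosure>>#bench is available in the image, and it times the literal-array lookup inside a block rather than as an installed Character>>#isSeparator, so the ratio it shows is only indicative and will vary with VM and image.

```smalltalk
"Workspace sketch: compare the existing #isSeparator with the
 proposed literal-array lookup. The variable names are illustrative."
| character codePoints |
character := $a.	"a non-separator, so #includes: scans the whole array"
codePoints := #(9 10 12 13 32 160).
Transcript
	show: 'isSeparator: ', [character isSeparator] bench; cr;
	show: 'includes:    ', [codePoints includes: character asInteger] bench; cr
```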

>
> All other *separator* messages fall back to either Character>>#isSeparator or #separators from CharacterSet, which in turn is based on Character class>>#separators.

That's true, but those are inconsistent now.

Levente



Re: The Trunk: Collections-topa.806.mcz

Chris Muller-3
I think Levente raises very good points, Squeak should present
a consistent implementation of what a separator is.  I've always
considered hard space and hard page break, etc. as "Word Processor"
characters, since they have "functionality", not merely "separators".

I think we should give more time for proper consideration, discussion
and full implementation (with consistent behaviors everywhere), and
testing, too.  IMO, this type of change is low-level enough that it
should not be a last-minute change put in merely minutes before the
5.2 release but we should discuss it for the next release.

Best,
  Chris



Re: The Trunk: Collections-topa.806.mcz

Frank Shearar-3
On Thu, 13 Sep 2018 at 12:00, Chris Muller <[hidden email]> wrote:
> I think Levente raises very good points, Squeak should present
> a consistent implementation of what a separator is.

That sounds like a category error. A _character set_ knows what a separator is. Unicode, ASCII, etc.

The question should, IMO at least, be "what character set should Squeak use" and, again IMO, that should be Unicode and, in particular, the UTF-8 encoding. (http://utf8everywhere.org/)
 
> I've always
> considered hard space and hard page break, etc. as "Word Processor"
> characters, since they have "functionality", not merely "separators".
>
> I think we should give more time for proper consideration, discussion
> and full implementation (with consistent behaviors everywhere), and
> testing, too.  IMO, this type of change is low-level enough that it
> should not be a last-minute change put in merely minutes before the
> 5.2 release but we should discuss it for the next release.

+1 to this. Even if everyone decided that UTF-8 is the perfect encoding to use, and we should proceed with alacrity towards using it, now is not the time to start. My impression was that 5.2 was in a feature-freeze, bugfix-only phase.

frank





Re: The Trunk: Collections-topa.806.mcz

Levente Uzonyi
On Thu, 13 Sep 2018, Frank Shearar wrote:

> On Thu, 13 Sep 2018 at 12:00, Chris Muller <[hidden email]> wrote:
>       I think Levente raises very good points, Squeak should present
>       a consistent implementation of what a separator is.
>
>
> That sounds like a category error. A _character set_ knows what a separator is. Unicode, ASCII, etc.
>
> The question should, IMO at least, be "what character set should Squeak use" and, again IMO, that should be Unicode and, in particular, the UTF-8 encoding. (http://utf8everywhere.org/)

My impression is that UTF-8 is slightly better and slightly worse at the
same time than the current UTF-32 (+leading char extension)
representation. So, I don't find it very tempting to make a huge change
for something "different".

Levente



Re: The Trunk: Collections-topa.806.mcz

timrowledge
>> The question should, IMO at least, be "what character set should Squeak use" and, again IMO, that should be Unicode and, in particular, the UTF-8 encoding. (http://utf8everywhere.org/)

We should probably have a proper UTF8String class so that at least we know that it is encoded and needs conversion to a 'real' String. During the NuScratch work I toiled mightily with string stuff and really ought to have done it then. The current WideString/ByteString stuff works quite well for most internal cases, though the cost of converting an entire string any time a big char is inserted could get annoying.

If one were making a word processor for large amounts of text, rather than a text editor with some prettiness tweaks for code editing etc., it might pay to have a form of text that allows for mixed byte & wide sub-parts. Perhaps it is even possible to use text attributes in yet another twisted and sneaky way? As we discovered in the Sophie Project, handling formatted texts is decidedly non-trivial. Especially when the customer can't even define a paragraph for you....


tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Strange OpCodes: PSM: Print and SMear