Smalltalk - Re: The Trunk: Collections-topa.806.mcz

Posted by Levente Uzonyi on Sep 13, 2018; 10:38pm
URL: https://forum.world.st/The-Trunk-Collections-topa-806-mcz-tp5084658p5084724.html

On Thu, 13 Sep 2018, Frank Shearar wrote:

> On Thu, 13 Sep 2018 at 12:00, Chris Muller <[hidden email]> wrote:
> I think Levente raises very good points, Squeak should present
> a consistent implementation of what a separator is.
>
>
> That sounds like a category error. A _character set_ knows what a separator is. Unicode, ASCII, etc.
>
> The question should, IMO at least, be "what character set should Squeak use" and, again IMO, that should be Unicode and, in particular, the UTF-8 encoding. (http://utf8everywhere.org/)

My impression is that UTF-8 is slightly better in some respects and
slightly worse in others than the current UTF-32 (+ leading char extension)
representation. So I don't find it very tempting to make a huge change
just for something "different".

Levente

>
> I've always
> considered hard space and hard page break, etc. as "Word Processor"
> characters, since they have "functionality", not merely "separators".
>
> I think we should give more time for proper consideration, discussion
> and full implementation (with consistent behaviors everywhere), and
> testing, too. IMO, this type of change is low-level enough that it
> should not be a last-minute change put in merely minutes before the
> 5.2 release but we should discuss it for the next release.
>
>
> +1 to this. Even if everyone decided that UTF-8 is the perfect encoding to use, and we should proceed with alacrity towards using it, now is not the time to start. My impression was that 5.2 was in a
> feature freeze, bugfix only phase.
>
> frank
>
> Best,
> Chris
>
> On Thu, Sep 13, 2018 at 12:13 PM Levente Uzonyi <[hidden email]> wrote:
> >
> > On Thu, 13 Sep 2018, Tobias Pape wrote:
> >
> > >
> > >> On 13.09.2018, at 16:35, Levente Uzonyi <[hidden email]> wrote:
> > >>
> > >> You're opening a can of worms with this. There are several other separator/white space characters missing from that list.
> > >
> > > > Yeah, that's listed below in a comment. I am hesitant to add the others because of WideString, so I just put them in a comment.
> >
> > That list is still incomplete (e.g. zero width space), and you still have
> > to deal with the can of worms - aka answering "What is a separator?".
> >
> > >
> > >> Also, this change makes the various #*separator* implementations (e.g. #isSeparator) inconsistent, so I strongly disagree with this change.
> > >
> > > Hmm. But isSeparator is Wrong, then… because nbsp _is_ a separator, right?
> > > See the discussion with Ron.
> > > On a related note, is a very fast #isSeparator important?
> >
> > Yes, it is. It's used extensively by various parsers. For example, see the
> > senders of #isSeparator and #skipSeparators.
> > Also, consider how the change of behavior affects those methods (along
> > with other users, e.g. those methods which use the character sets).
> >
> > > Otherwise I'd just propose
> > >
> > > ^ #( 9 10 12 13 32 160 ) includes: self asInteger
> > > for now…
> >
> > According to my measurements, that would be 10-15x slower than the
> > current implementation. I optimized it for a reason, not just for fun.
> >
> > >
> > > > All other *separator* messages fall back to either Character>>#isSeparator or #separators from CharacterSet, which in turn is based on Character class>>#separators.
> >
> > That's true, but those are inconsistent now.
> >
> > Levente
> >
> > >
> > >
> > >
> > >>
> > >> Levente
> > >>
> > >> On Wed, 12 Sep 2018, [hidden email] wrote:
> > >>
> > >>> Tobias Pape uploaded a new version of Collections to project The Trunk:
> > >>> http://source.squeak.org/trunk/Collections-topa.806.mcz
> > >>>
> > >>> ==================== Summary ====================
> > >>>
> > >>> Name: Collections-topa.806
> > >>> Author: topa
> > >>> Time: 12 September 2018, 3:28:40.687052 pm
> > >>> UUID: 46b95db5-a773-4113-92f0-5ee905404b49
> > >>> Ancestors: Collections-cmm.805
> > >>>
> > >>> Fix separators to include U+00A0 (no break space)
> > >>>
> > >>> Thanks Ron!
> > >>>
> > >>> =============== Diff against Collections-cmm.805 ===============
> > >>>
> > >>> Item was changed:
> > >>> ----- Method: Character class>>separators (in category 'instance creation') -----
> > >>> separators
> > >>> + "Answer a collection of space-like separator characters.
> > >>> + Note that we do not consider spaces in >8bit code points yet.
> > >>> + "
> > >>> - "Answer a collection of the standard ASCII separator characters."
> > >>> + ^ #(9 "tab"
> > >>> - ^ #(32 "space"
> > >>> - 13 "cr"
> > >>> - 9 "tab"
> > >>> 10 "line feed"
> > >>> + 12 "form feed"
> > >>> + 13 "cr"
> > >>> + 32 "space"
> > >>> + 160 "non-breaking space, see Unicode Z general category")
> > >>> + collect: [:v | Character value: v] as: String
> > >>> + " To be considered:
> > >>> + 16r1680 OGHAM SPACE MARK
> > >>> + 16r2000 EN QUAD
> > >>> + 16r2001 EM QUAD
> > >>> + 16r2002 EN SPACE
> > >>> + 16r2003 EM SPACE
> > >>> + 16r2004 THREE-PER-EM SPACE
> > >>> + 16r2005 FOUR-PER-EM SPACE
> > >>> + 16r2006 SIX-PER-EM SPACE
> > >>> + 16r2007 FIGURE SPACE
> > >>> + 16r2008 PUNCTUATION SPACE
> > >>> + 16r2009 THIN SPACE
> > >>> + 16r200A HAIR SPACE
> > >>> + 16r2028 LINE SEPARATOR
> > >>> + 16r2029 PARAGRAPH SEPARATOR
> > >>> + 16r202F NARROW NO-BREAK SPACE
> > >>> + 16r205F MEDIUM MATHEMATICAL SPACE
> > >>> + 16r3000 IDEOGRAPHIC SPACE
> > >>> + "!
> > >>> - 12 "form feed")
> > >>> - collect: [:v | Character value: v] as: String!
> > >>>
> > >>> Item was changed:
> > >>> + (PackageInfo named: 'Collections') postscript: 'CharacterSet cleanUp: false.'!
> > >>> - (PackageInfo named: 'Collections') postscript: 'Character initializeClassificationTable'!
> > >>
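
For anyone who wants to check the two claims discussed above in their own image (that Character>>#isSeparator and Character class>>#separators can disagree, and that a literal-array lookup is slower than the current #isSeparator), here is a minimal workspace sketch. It only illustrates the idea and is not the measurement quoted in the thread. The temporary names (listed, disagreements, candidates, sample) are made up for this example, and the results depend on the image version and VM.

```smalltalk
"Workspace sketch for Squeak. Evaluate with 'do it'; output goes to the Transcript."
| listed disagreements candidates sample |

"1. Code points in the 8-bit range where Character>>isSeparator and
    Character class>>separators give different answers."
listed := Character separators.
disagreements := (0 to: 255) select: [:code |
	| ch |
	ch := Character value: code.
	ch isSeparator ~= (listed includes: ch)].
Transcript show: 'Disagreeing code points: ', disagreements printString; cr.

"2. Rough timing of #isSeparator against the literal-array lookup
    proposed in the thread. #bench runs each block for a few seconds
    and answers a string describing runs per second."
candidates := #(9 10 12 13 32 160).
sample := $a.
Transcript show: 'isSeparator: ', [sample isSeparator] bench; cr.
Transcript show: 'includes:    ', [candidates includes: sample asInteger] bench; cr.
```

If the first list comes back empty, the two definitions agree for 8-bit characters in that image. A non-empty list names exactly the code points that parsers using #isSeparator and code using the separator character set would treat differently.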