Tobias Pape uploaded a new version of Collections to project The Trunk:
http://source.squeak.org/trunk/Collections-topa.806.mcz

==================== Summary ====================

Name: Collections-topa.806
Author: topa
Time: 12 September 2018, 3:28:40.687052 pm
UUID: 46b95db5-a773-4113-92f0-5ee905404b49
Ancestors: Collections-cmm.805

Fix separators to include U+00A0 (no break space)

Thanks Ron!

=============== Diff against Collections-cmm.805 ===============

Item was changed:
  ----- Method: Character class>>separators (in category 'instance creation') -----
  separators
+ 	"Answer a collection of space-like separator characters.
+ 	Note that we do not consider spaces in >8bit code points yet.
+ 	"
- 	"Answer a collection of the standard ASCII separator characters."

+ 	^ #(	9 "tab"
- 	^ #(	32 "space"
- 		13 "cr"
- 		9 "tab"
  		10 "line feed"
+ 		12 "form feed"
+ 		13 "cr"
+ 		32 "space"
+ 		160 "non-breaking space, see Unicode Z general category")
+ 			collect: [:v | Character value: v] as: String
+ "	To be considered:
+ 		16r1680 OGHAM SPACE MARK
+ 		16r2000 EN QUAD
+ 		16r2001 EM QUAD
+ 		16r2002 EN SPACE
+ 		16r2003 EM SPACE
+ 		16r2004 THREE-PER-EM SPACE
+ 		16r2005 FOUR-PER-EM SPACE
+ 		16r2006 SIX-PER-EM SPACE
+ 		16r2007 FIGURE SPACE
+ 		16r2008 PUNCTUATION SPACE
+ 		16r2009 THIN SPACE
+ 		16r200A HAIR SPACE
+ 		16r2028 LINE SEPARATOR
+ 		16r2029 PARAGRAPH SEPARATOR
+ 		16r202F NARROW NO-BREAK SPACE
+ 		16r205F MEDIUM MATHEMATICAL SPACE
+ 		16r3000 IDEOGRAPHIC SPACE
+ "!
- 		12 "form feed")
- 			collect: [:v | Character value: v] as: String!

Item was changed:
+ (PackageInfo named: 'Collections') postscript: 'CharacterSet cleanUp: false.'!
- (PackageInfo named: 'Collections') postscript: 'Character initializeClassificationTable'!

You're opening a can of worms with this. There are several other
separator/white space characters missing from that list.
Also, this change makes the various #*separator* implementations (e.g. #isSeparator) inconsistent, so I strongly disagree with this change.

Levente

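The inconsistency Levente points to can be checked in a workspace. This is only a minimal sketch, assuming an image with Collections-topa.806 loaded and Character>>#isSeparator left as it was before this commit. The values in the comments are what the diff and the thread imply, not measured output.

```smalltalk
| nbsp |
nbsp := Character value: 160.	"U+00A0, no-break space"

"The list from the diff, as code points; expected #(9 10 12 13 32 160)."
Character separators asArray collect: [:c | c asInteger].

"Expected true, because 160 is now part of the list."
Character separators includes: nbsp.

"Expected false if #isSeparator still only knows 9, 10, 12, 13 and 32
(an assumption about the unchanged method)."
nbsp isSeparator.
```

If the last two expressions disagree, code that tests single characters with #isSeparator and code that goes through the separator list or the character sets will treat the same text differently.
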
> On 13.09.2018, at 16:35, Levente Uzonyi <[hidden email]> wrote:
>
> You're opening a can of worms with this. There are several other separator/white space characters missing from that list.

Yeah, that's listed below in a comment. I am hesitating to add the others because of WideString, so I just put them in a comment.

> Also, this change makes the various #*separator* implementations (e.g. #isSeparator) inconsistent, so I strongly disagree with this change.

Hmm. But isSeparator is Wrong, then… because nbsp _is_ a separator, right?
See the discussion with Ron.
On a related note, is a very fast #isSeparator important?
Otherwise I'd just propose

	^ #( 9 10 12 13 32 160 ) includes: self asInteger

for now…

All other *separator* messages fall back to either Character>>#isSeparator or #separators from CharacterSet, which in turn is based on Character class>>#separators.

On Thu, 13 Sep 2018, Tobias Pape wrote:

>> On 13.09.2018, at 16:35, Levente Uzonyi <[hidden email]> wrote:
>>
>> You're opening a can of worms with this. There are several other separator/white space characters missing from that list.
>
> Yeah, thats listed below in a comment. I am hesitating to add the other because WideString, so I just put them in a comment.

That list is still incomplete (e.g. zero width space), and you still have
to deal with the can of worms - aka answering "What is a separator?".

>> Also, this change makes the various #*separator* implementations (e.g. #isSeparator) inconsistent, so I strongly disagree with this change.
>
> Hmm. But isSeparator is Wrong, then… because nbsp _is_ a separator, right?
> See the discussion with Ron.
> On a related note, is a very fast #isSeparator important?

Yes, it is. It's used extensively by various parsers. For example, see the
senders of #isSeparator and #skipSeparators.
Also, consider how the change of behavior affects those methods (along
with other users, e.g. those methods which use the character sets).

> Otherwise I'd just propose
>
> ^ #( 9 10 12 13 32 160 ) includes: self asInteger
> for now…

According to my measurements, that would be 10-15x slower than the
current implementation. I optimized it for a reason, not just for fun.

> All other *separator* messages fall back either to either Character>>#isSeparator or #separators from CharacterSet, which in turn is based on Character class>>#separators.

That's true, but those are inconsistent now.

Levente

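The 10-15x figure is Levente's own measurement. A rough way to repeat the comparison is BlockClosure>>#bench, as in the sketch below. It is not his benchmark: the proposed one-liner is evaluated in a workspace with a temporary `c` standing in for `self`, and the numbers will vary with VM, image and hardware.

```smalltalk
| c current proposed |
c := $a.	"a non-separator, the common case for a parser skipping over text"

"Each #bench answers a String describing runs per second."
current := [c isSeparator] bench.
proposed := [#(9 10 12 13 32 160) includes: c asInteger] bench.

Transcript
	show: 'isSeparator: ', current; cr;
	show: 'includes:    ', proposed; cr.
```

Repeating the run with `Character space` and `Character value: 160` as `c` would show whether the gap depends on the character being tested.
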
I think Levente raises very good points; Squeak should present
a consistent implementation of what a separator is. I've always
considered hard space and hard page break, etc. as "Word Processor"
characters, since they have "functionality", not merely "separators".

I think we should give more time for proper consideration, discussion
and full implementation (with consistent behaviors everywhere), and
testing, too. IMO, this type of change is low-level enough that it
should not be a last-minute change put in merely minutes before the
5.2 release; we should discuss it for the next release.

Best,
Chris

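On Thu, 13 Sep 2018 at 12:00, Chris Muller <[hidden email]> wrote:

> I think Levente raises very good points, Squeak should present
> a consistent implementation of what a separator is.

That sounds like a category error. A _character set_ knows what a separator is. Unicode, ASCII, etc.

The question should, IMO at least, be "what character set should Squeak use" and, again IMO, that should be Unicode and, in particular, the UTF-8 encoding. (http://utf8everywhere.org/)

> I've always
> considered hard space and hard page break, etc. as "Word Processor"
> characters, since they have "functionality", not merely "separators".
>
> I think we should give more time for proper consideration, discussion
> and full implementation (with consistent behaviors everywhere), and
> testing, too. IMO, this type of change is low-level enough that it
> should not be a last-minute change put in merely minutes before the
> 5.2 release but we should discuss it for the next release.

+1 to this. Even if everyone decided that UTF-8 is the perfect encoding to use, and we should proceed with alacrity towards using it, now is not the time to start. My impression was that 5.2 was in a feature freeze, bugfix only phase.

frank
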
On Thu, 13 Sep 2018, Frank Shearar wrote:

> On Thu, 13 Sep 2018 at 12:00, Chris Muller <[hidden email]> wrote:
>       I think Levente raises very good points, Squeak should present
>       a consistent implementation of what a separator is.
>
> That sounds like a category error. A _character set_ knows what a separator is. Unicode, ASCII, etc.
>
> The question should, IMO at least, be "what character set should Squeak use" and, again IMO, that should be Unicode and, in particular, the UTF-8 encoding. (http://utf8everywhere.org/)

My impression is that UTF-8 is slightly better and slightly worse at the
same time than the current UTF-32 (+leading char extension)
representation. So, I don't find it very tempting to make a huge change
for something "different".

Levente

>> The question should, IMO at least, be "what character set should Squeak use" and, again IMO, that should be Unicode and, in particular, the UTF-8 encoding. (http://utf8everywhere.org/)

We should probably have a proper UTF8String class so that at least we know that it is encoded and needs conversion to a 'real' String. During the NuScratch work I toiled mightily with string stuff and really ought to have done it then.

The current widestring/bytestring stuff works quite well for most internal cases, though the cost of converting an entire string anytime a big char is inserted could get annoying. If one were making a word processor for large amounts of text, rather than a text editor with some prettiness tweaks for code editing etc, it might pay to have a form of text that allows for mixed byte & wide sub-parts. Perhaps it would even be possible to use text attributes in yet another twisted and sneaky way?

As we discovered in the Sophie Project, handling formatted texts is decidedly non-trivial. Especially when the customer can't even define a paragraph for you....

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Strange OpCodes: PSM: Print and SMear
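
The conversion tim mentions can be observed in a workspace. A minimal sketch, assuming Squeak's usual ByteString/WideString behaviour, where storing a character above 255 into a ByteString replaces the whole string with a WideString copy. The class names in the comments are expectations under that assumption, not captured output.

```smalltalk
| s |
s := 'separator' copy.	"copy, so that no literal is modified"
s class.	"expected ByteString: one byte per character"

"Storing a single wide character forces the entire string to be converted."
s at: 1 put: (Character value: 16r3000).	"IDEOGRAPHIC SPACE, from the 'to be considered' list"
s class.	"expected WideString: four bytes per character"
s size.	"still 9 characters"
```

For a long text, that one store copies every character, which is the cost a mixed byte and wide representation would try to avoid.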