Is this correct?
    (String with: 12 asCharacter with: 0 asCharacter) =
        (String with: 12 asCharacter with: 0 asCharacter with: 32 asCharacter)

Other string methods, like #copyAfter:, don't treat null the same way.

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass
My example and thread title were wrong. It skips null *and* various control chars entirely when comparing:
    (0 to: 255) select: [:each |
        (String with: $a with: $b) =
            (String with: $a with: each asCharacter with: $b)]

which yields:

    anArray( 0, 1, 2, 3, 4, 5, 6, 7, 8, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 127, 128, 129, 130, 131, 132, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 173)

The GS Prog Guide (p. 77) says the ICU lib handles string comparisons internally, and it seems to ignore these characters for the sake of normalization.

But that means it's possible for two Strings to be #= while having different #sizes and indexable characters, and that comparisons between Strings containing binary data aren't reliable, and that other String methods aren't consistent with #=:

    | one two |
    one := String with: $a with: 0 asCharacter with: $b.
    two := String with: $a with: $b.
    one = two
        and: [(one at: 1 equals: two) not
            and: [(two at: 1 equals: one) not]]

And since GsFile #next and #contents are character based:

    (GsFile open: 'bin.one' mode: 'wb' onClient: false)
        nextPutAll: #[100 25 200];
        close.
    (GsFile open: 'bin.two' mode: 'wb' onClient: false)
        nextPutAll: #[100 200];
        close.
    (GsFile open: 'bin.one' mode: 'rb' onClient: false) contents =
        (GsFile open: 'bin.two' mode: 'rb' onClient: false) contents.

Consider this more as a "heads-up" for users than a bug report, since this is apparently the intended, documented behavior.

> Sent: Friday, January 26, 2018 at 2:20 AM
> From: "monty via Glass" <[hidden email]>
> To: [hidden email]
> Subject: [Glass] Possible Bug: String>>#= treats nulls as a terminator
>
> Is this correct?
>
> (String with: 12 asCharacter with: 0 asCharacter) =
> (String with: 12 asCharacter with: 0 asCharacter with: 32 asCharacter)
>
> Other string methods, like #copyAfter:, don't treat null the same way.
Monty,
Good points ... this "unexpected" behavior of Unicode strings with respect to control characters has been hard for us to grapple with internally as well, but this is Unicode being Unicode. I did notice that, with the exception of code point 173, all of the code points you list are indeed control characters according to the Unicode character table[1].

Code point 173 is a "Soft Hyphen"[2] and doesn't really seem to fit the description of a control character, so I'm now curious if we might have a bug here, either in our implementation, the implementation of libICU, or my understanding :)

I'm curious how you ran across this behavior? The control characters wouldn't seem to be a normal part of strings intended for display ...

I'm asking because if there is a use case for providing the old literal byte comparison operators, we can make them available.

Dale

[1] https://unicode-table.com/en/#control-character
[2] https://unicode-table.com/en/00AD/

On 01/27/2018 01:57 AM, monty via Glass wrote:
> My example and thread title were wrong. It skips null *and* various control chars entirely when comparing:
> (0 to: 255) select: [:each |
> (String with: $a with: $b) =
> (String with: $a with: each asCharacter with: $b)]
>
> which yields:
> anArray( 0, 1, 2, 3, 4, 5, 6, 7, 8, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 127, 128, 129, 130, 131, 132, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 173)
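The question about code point 173 can be cross-checked against the Unicode general categories. The sketch below is written in Python rather than GemStone Smalltalk and uses only the standard `unicodedata` module; it does not call ICU or GemStone. It assumes, as a guess, that the skipped code points are the control (Cc) and format (Cf) characters, minus the whitespace-like controls TAB, LF, VT, FF, CR and NEL. That exclusion set is hard-coded to match the reported list and is not taken from ICU. The names `WHITESPACE_CONTROLS`, `general_category`, `candidates` and `reported` are illustrative only.

```python
import unicodedata

# Assumption: these whitespace-like controls are NOT skipped, because they are
# absent from the reported list (TAB, LF, VT, FF, CR, NEL).
WHITESPACE_CONTROLS = {9, 10, 11, 12, 13, 133}


def general_category(code_point):
    """Return the two-letter Unicode general category of a code point."""
    return unicodedata.category(chr(code_point))


# Code points 0..255 that are control (Cc) or format (Cf) characters,
# excluding the whitespace-like controls above.
candidates = [
    cp for cp in range(256)
    if general_category(cp) in ("Cc", "Cf") and cp not in WHITESPACE_CONTROLS
]

# The list reported in the thread, written as ranges for readability.
reported = (
    list(range(0, 9)) + list(range(14, 32)) + list(range(127, 133))
    + list(range(134, 160)) + [173]
)

print(candidates == reported)
print(general_category(173), unicodedata.name(chr(173)))
```

Under that assumption the computed list matches the reported one, and U+00AD SOFT HYPHEN is classified as a format character (Cf) rather than a control character (Cc). That would be consistent with it being ignored alongside the controls, although it does not by itself show what libICU does.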
I was writing tests for stream converter classes that do encoding/decoding from various encodings. But any use of Strings to store binary data is a use case. ByteArray is more appropriate, but GsFile is still byte-character based by default, even when you open files in binary mode (which I assume just disables line ending normalization on Windows).
> Sent: Saturday, January 27, 2018 at 12:18 PM
> From: "Dale Henrichs via Glass" <[hidden email]>
> To: [hidden email]
> Subject: Re: [Glass] Possible Bug: String>>#= treats nulls as a terminator
>
> I'm curious how you ran across this behavior? The control characters
> wouldn't seem to be a normal part of strings intended for display ...
>
> I'm asking because if there is a use case for providing the old literal
> byte comparison operators we can make them available.
On 01/29/2018 01:16 AM, monty via Glass wrote:
> I was writing tests for stream converter classes that do encoding/decoding from various encodings. But any use of Strings to store binary data is a use case. ByteArray is more appropriate, but GsFile is still byte-character based by default, even when you open files in binary mode (which I assume just disables line ending normalization on Windows).

This seems like a GemStone bug at the end of the day ... ByteArray and Utf8 are the two classes that _should_ be used, but if GsFile is not handling them well, then that is an issue for us ... I will check this out ...

Thanks,

Dale
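The distinction between byte containers and character strings can be illustrated outside GemStone. The following is a minimal Python sketch, not GemStone code: Python's `bytes` type stands in for ByteArray, the file names and byte values are borrowed from the earlier GsFile example, and the helper `read_bytes` is hypothetical. It says nothing about how GsFile itself behaves.

```python
import os
import tempfile


def read_bytes(path):
    """Read a whole file as raw bytes, with no character decoding."""
    with open(path, "rb") as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as directory:
    path_one = os.path.join(directory, "bin.one")
    path_two = os.path.join(directory, "bin.two")

    # Same payloads as the GsFile example: one file carries an extra byte 25.
    with open(path_one, "wb") as handle:
        handle.write(bytes([100, 25, 200]))
    with open(path_two, "wb") as handle:
        handle.write(bytes([100, 200]))

    contents_one = read_bytes(path_one)
    contents_two = read_bytes(path_two)

# Byte sequences compare element by element, so the extra byte is significant.
files_equal = contents_one == contents_two
print(files_equal, len(contents_one), len(contents_two))
```

Because the comparison is made on bytes and never on collated text, a control byte in the middle of the data cannot be skipped.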
The real problem is String>>#=. It's bizarre that two SequenceableCollections can be #= yet have different #sizes, and that "(one at: i) = (two at: i)" is not guaranteed to hold for every shared index i:
    | one two |

    one := String with: $a with: 25 asCharacter with: $b.
    two := one copyWithout: one second.
    one = two
        and: [one asArray ~= two asArray
            and: [
                (1 to: (one size min: two size)) anySatisfy: [:i |
                    (one at: i) ~= (two at: i)]]].

Java and C# model strings as immutable indexed collections of 16-bit UTF-16 code units (meaning surrogate pair-encoded code points require two units), and no normalization is done during comparisons. Instead there are special methods, like C#'s String.Normalize() or Java's Normalizer.normalize(), that convert a string into a chosen normalized form, and normalized comparisons can then be done on the converted strings. Ignoring the choice of UTF-16, this seems like a better, safer approach if you're still committed to treating strings as indexable character collections.

But I'm not sure how you can fix String or GsFile without breaking backwards compatibility.

> Sent: Monday, January 29, 2018 at 11:44 AM
> From: "Dale Henrichs via Glass" <[hidden email]>
> To: [hidden email]
> Subject: Re: [Glass] Possible Bug: String>>#= treats nulls as a terminator
>
> This seems like a GemStone bug at the end of the day ... ByteArray and
> Utf8 are the two classes that _should_ be used, but if GsFile is not
> handling them well, then that is an issue for us ... I will check this
> out ...
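The approach described for Java and C# (exact comparison by default, normalization only on request) can be sketched in Python, whose `str` comparison is also exact. This is an illustration of the idea, not GemStone, Java or C# code; the helpers `normalized_equal` and `utf16_code_units` are hypothetical names introduced here.

```python
import unicodedata


def normalized_equal(a, b, form="NFC"):
    """Compare two strings after converting both to the same normal form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


def utf16_code_units(text):
    """Count the 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Default comparison is exact: different code points, different strings.
exact_equal = precomposed == decomposed

# Normalized comparison is an explicit, opt-in operation.
opt_in_equal = normalized_equal(precomposed, decomposed)

# A code point outside the Basic Multilingual Plane needs a surrogate pair.
g_clef_units = utf16_code_units("\U0001D11E")   # MUSICAL SYMBOL G CLEF

print(exact_equal, opt_in_equal, g_clef_units)
```

With this split, equal strings always have equal sizes and equal elements at every index, and callers who want canonical equivalence ask for it by name.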
Another one:
    | one two |

    one := 'Köln'.
    two := String with: $K with: $o with: 16r308 asCharacter with: $l with: $n.
    one = two
        and: [one size ~= two size
            and: [(one endsWith: two) not
                and: [(one beginsWith: two) not
                    and: [(two endsWith: one) not
                        and: [(two beginsWith: one) not]]]]].

> Sent: Tuesday, January 30, 2018 at 2:34 AM
> From: "monty via Glass" <[hidden email]>
> To: [hidden email]
> Subject: Re: [Glass] Possible Bug: String>>#= treats nulls as a terminator
>
> The real problem is String>>#=. It's bizarre that two SequenceableCollections can be #= yet have different #sizes and that for every shared index i, it's not necessarily true that "(one at: i) = (two at: i)":
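The two 'Köln' strings in this example are canonically equivalent under Unicode normalization, which is presumably why a normalizing comparison reports them as equal. The Python sketch below (standard `unicodedata` only, no ICU or GemStone involved) rebuilds both strings and evaluates the same properties on exact code point sequences; the variable names mirror the Smalltalk snippet, and `checks` is an illustrative name.

```python
import unicodedata

one = "K\u00f6ln"    # precomposed: K, ö (U+00F6), l, n
two = "Ko\u0308ln"   # decomposed:  K, o, COMBINING DIAERESIS (U+0308), l, n

checks = {
    "exactly equal": one == two,
    "same size": len(one) == len(two),
    "one ends with two": one.endswith(two),
    "one begins with two": one.startswith(two),
    "two ends with one": two.endswith(one),
    "two begins with one": two.startswith(one),
    # Decomposing `one` yields `two`, and composing `two` yields `one`.
    "NFD(one) == two": unicodedata.normalize("NFD", one) == two,
    "NFC(two) == one": unicodedata.normalize("NFC", two) == one,
}

for label, result in checks.items():
    print(f"{label}: {result}")
```

Only the two normalization checks come out true. On raw code points the strings differ in size and share neither a prefix nor a suffix relationship, which matches the inconsistency between #= and the other String methods described above.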
Hi Monty,
> On 30.01.2018, at 08:34, monty via Glass <[hidden email]> wrote:
>
> The real problem is String>>#=. It's bizarre that two SequenceableCollections can be #= yet have different #sizes and that for every shared index i, it's not necessarily true that "(one at: i) = (two at: i)":

This is, however, in line with Unicode …
See this very on-point discussion of the matter:

https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/

Best regards
-Tobias

> | one two |
>
> one := String with: $a with: 25 asCharacter with: $b.
> two := one copyWithout: one second.
> one = two
>     and: [one asArray ~= two asArray
>         and: [
>             (1 to: (one size min: two size)) anySatisfy: [:i |
>                 (one at: i) ~= (two at: i)]]].
>
> Java and C# model strings as immutable indexed collections of UTF-16 16-bit code units (meaning surrogate pair-encoded code points require two units), and no normalization is done during comparisons. Instead there are special methods, like Normalize(), that convert a string into a chosen normalized form, and normalized comparisons can then be done on the converted strings. Ignoring the choice of UTF-16, this seems like a better, safer approach if you're still committed to treating strings as indexable character collections.
>
> But I'm not sure how you can fix String or GsFile without breaking backwards compatibility.
>
>> Sent: Monday, January 29, 2018 at 11:44 AM
>> From: "Dale Henrichs via Glass" <[hidden email]>
>> To: [hidden email]
>> Subject: Re: [Glass] Possible Bug: String>>#= treats nulls as a terminator
>>
>> On 01/29/2018 01:16 AM, monty via Glass wrote:
>>> I was writing tests for stream converter classes that do encoding/decoding from various encodings. But any use of Strings to store binary data is a use case. ByteArray is more appropriate, but GsFile is still byte-character based by default, even when you open files in binary mode (which I assume just disables line ending normalization on Windows).
>> This seems like a GemStone bug at the end of the day ... ByteArray and Utf8 are the two classes that _should_ be used, but if GsFile is not handling them well, then that is an issue for us ... I will check this out ...
>>
>> Thanks,
>>
>> Dale
>>
>>>> Sent: Saturday, January 27, 2018 at 12:18 PM
>>>> From: "Dale Henrichs via Glass" <[hidden email]>
>>>> To: [hidden email]
>>>> Subject: Re: [Glass] Possible Bug: String>>#= treats nulls as a terminator
>>>>
>>>> Monty,
>>>>
>>>> Good points ... this "unexpected" behavior of Unicode strings with respect to control characters has been hard for us to grapple with internally as well, but this is unicode being unicode. I did notice that with the exception of code point 173, all of the code points you list are indeed control characters according to the Unicode character table[1].
>>>>
>>>> Code point 173 is a "Soft Hyphen"[2] and doesn't really seem to fit the description of a control character, so I'm now curious if we might have a bug here, either in our implementation, the implementation of libICU or my understanding:)
>>>>
>>>> I'm curious how you ran across this behavior? The control characters wouldn't seem to be a normal part of strings intended for display ...
>>>>
>>>> I'm asking because if there is a use case for providing the old literal byte comparison operators we can make them available.
>>>>
>>>> Dale
>>>>
>>>> [1] https://unicode-table.com/en/#control-character
>>>> [2] https://unicode-table.com/en/00AD/
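As a cross-check of the observation quoted above (that every reported code point except 173 is a control character), here is a small sketch in Python, used purely for illustration rather than GemStone Smalltalk. It only looks up Unicode general categories with the standard unicodedata module and says nothing about what GemStone or ICU actually do. The list is copied from the post; the names SKIPPED and categories are made up for this example.

```python
import unicodedata

# Code points reported in the thread as being skipped by String>>#=.
SKIPPED = (
    list(range(0, 9)) + list(range(14, 32)) + [127]
    + list(range(128, 133)) + list(range(134, 160)) + [173]
)

def categories(code_points):
    """Map each code point to its Unicode general category (e.g. 'Cc', 'Cf')."""
    return {cp: unicodedata.category(chr(cp)) for cp in code_points}

cats = categories(SKIPPED)

# Reported code points that are NOT control characters (category Cc).
not_control = sorted(cp for cp, cat in cats.items() if cat != "Cc")

# Control characters in 0..255 that are absent from the reported list.
absent_controls = [
    cp for cp in range(256)
    if unicodedata.category(chr(cp)) == "Cc" and cp not in cats
]

print(not_control)      # [173]
print(cats[173])        # Cf (format character, not control)
print(absent_controls)  # [9, 10, 11, 12, 13, 133]
```

The control characters missing from the list (9 to 13 and 133) are the ones Unicode also classifies as white space; whether that is why the comparison does not skip them is a guess, not something the thread confirms.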
You misunderstood the issue. If you choose one string representation (like an indexed collection of code points) but use another (like normalized EGSs) when doing basic comparisons, you get these inconsistencies that arguably violate the underlying indexable collection interface contract (like #= being true while #beginsWith: and #endsWith: are false).
Perl 6, which your article mentions, models strings as indexed, _pre-normalized_ collections of EGSs[0]:

    "Köln" eq "Ko\x308ln" &&
        "Köln".chars == "Ko\x308ln".chars &&
        "Köln".codes == "Ko\x308ln".codes &&
        "Köln".starts-with("Ko\x308ln") &&
        "Köln".ends-with("Ko\x308ln")

('chars' is the length in EGSs, while 'codes' is the length in code points.)

The Java/C# approach is more basic, but it's still consistent, forcing you to manually normalize strings before comparing them by code unit, if you want a normalized comparison.

Anyway, I would recommend adding character (code point)-based comparison messages to String, and a #byteContents/#binaryContents message to GsFile, or even better, #ascii/#binary toggles like Pharo/Squeak have so you can set GsFile to #binary and use #next (instead of #nextByte) and #contents normally.

[0] https://github.com/MoarVM/MoarVM/blob/master/docs/strings.asciidoc#normalization

> Sent: Tuesday, January 30, 2018 at 3:48 AM
> From: "Tobias Pape" <[hidden email]>
> To: monty <[hidden email]>
> Cc: [hidden email]
> Subject: Re: [Glass] Possible Bug: String>>#= treats nulls as a terminator
>
> Hi Monty,
>
> > On 30.01.2018, at 08:34, monty via Glass <[hidden email]> wrote:
> >
> > The real problem is String>>#=. It's bizarre that two SequenceableCollections can be #= yet have different #sizes and that for every shared index i, it's not necessarily true that "(one at: i) = (two at: i)":
>
> This is, however in line with unicode …
> See this very on-point discussion of the matter:
>
> https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
>
> Best regards
> -Tobias
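The Java/C# style described above (code-point or code-unit comparison by default, with normalization as a separate, explicit step) can be sketched in Python. Python is used here only for illustration; this is not GemStone code. unicodedata.normalize is the standard library call, while the helper name normalized_equal is made up for this example.

```python
import unicodedata

composed = "K\u00f6ln"      # 'ö' as the single code point U+00F6
decomposed = "Ko\u0308ln"   # 'o' followed by U+0308 COMBINING DIAERESIS

def normalized_equal(a, b, form="NFC"):
    """Compare two strings after converting both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Plain == compares code points, so it always agrees with len() and indexing.
print(composed == decomposed)                  # False
print(len(composed), len(decomposed))          # 4 5

# The normalized comparison is a separate operation the caller opts into.
print(normalized_equal(composed, decomposed))  # True

# Once both sides are normalized, size and prefix tests agree with equality.
nfc = unicodedata.normalize("NFC", decomposed)
print(len(nfc) == len(composed) and nfc.startswith(composed))  # True
```

With this split, two strings that compare equal always have the same length and the same element at every index, which is the invariant the thread says String>>#= breaks.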
One more, and easily the worst:
    | one two |
    one := #[97 150 98] asString.
    two := #[97 98] asString.
    one = two and: [one hash ~= two hash]

> Sent: Monday, January 29, 2018 at 11:44 AM
> From: "Dale Henrichs via Glass" <[hidden email]>
> To: [hidden email]
> Subject: Re: [Glass] Possible Bug: String>>#= treats nulls as a terminator
>
> On 01/29/2018 01:16 AM, monty via Glass wrote:
> > I was writing tests for stream converter classes that do encoding/decoding from various encodings. But any use of Strings to store binary data is a use case. ByteArray is more appropriate, but GsFile is still byte-character based by default, even when you open files in binary mode (which I assume just disables line ending normalization on Windows).
> This seems like a GemStone bug at the end of the day ... ByteArray and Utf8 are the two classes that _should_ be used, but if GsFile is not handling them well, then that is an issue for us ... I will check this out ...
>
> Thanks,
>
> Dale
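To illustrate why a mismatch between #= and #hash matters: hashed collections locate elements by hash before they ever test equality, so two "equal" objects with different hashes can both sit in one Set, or be missed by a Dictionary lookup. The Python sketch below is illustrative only and is not GemStone code. LooseString, ConsistentString, strip_controls and toy_hash are hypothetical names; the toy hash (a sum of code points) is chosen only so the result is deterministic.

```python
import unicodedata

def strip_controls(text):
    """Drop characters whose Unicode general category is Cc (control)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")

def toy_hash(text):
    """Deterministic toy hash: the sum of the code points."""
    return sum(ord(ch) for ch in text)

class LooseString:
    """Hypothetical wrapper: equality ignores controls, the hash does not."""
    def __init__(self, text):
        self.text = text
    def __eq__(self, other):
        return (isinstance(other, LooseString)
                and strip_controls(self.text) == strip_controls(other.text))
    def __hash__(self):
        return toy_hash(self.text)  # inconsistent with __eq__ on purpose

class ConsistentString(LooseString):
    """Same equality, but the hash uses the key that equality compares."""
    def __hash__(self):
        return toy_hash(strip_controls(self.text))

one, two = LooseString("a\x96b"), LooseString("ab")   # bytes 97 150 98 vs 97 98
print(one == two, hash(one) == hash(two))   # True False
print(len({one, two}))                      # 2: "equal" elements stored twice
print(two in {one: "value"})                # False: lookup misses an equal key

c1, c2 = ConsistentString("a\x96b"), ConsistentString("ab")
print(len({c1, c2}), c2 in {c1: "value"})   # 1 True
```

The second class shows the usual remedy in general terms: derive the hash from exactly the data that equality compares. Whether GemStone's String>>#hash can be changed that way without breaking existing persistent collections is not something this sketch addresses.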