Hello,
I finished this stuff, and it's ready for adoption.

See http://bugs.squeak.org/view.php?id=7428

Does anyone want to help push it into the trunk update stream (using MC configs)?

It works fine on a recent trunk image.
On Pharo, however, I had some problems installing the changes because of some differences.

Tried on PharoCore-1.1-11106-ALPHA.image

phase2.1.cs
- do not file in the TextEditor changes, since Pharo core doesn't have it.
- do not file in the last line (reorganizing).

- tests fail because the Pharo String class implements neither
#squeakToUtf8
nor
#utf8ToSqueak

Do we have a uniform way to encode ANY String -> ByteString (utf8) and back?
What does the ANSI standard say about it? Maybe I'm using the wrong methods?

Still, I think we need this thing standardized and common to all dialects (not just Pharo/Squeak).

--
Best regards,
Igor Stasenko AKA sig.

_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
2009/12/20 Igor Stasenko <[hidden email]>:
> - tests fail because the Pharo String class implements neither
> #squeakToUtf8
> nor
> #utf8ToSqueak
>
> Do we have a uniform way to encode ANY String -> ByteString (utf8)
> and back? What does the ANSI standard say about it? Maybe I'm using the wrong methods?

Update:
- Fixed the utf8 stuff by using #convertToEncoding: / #convertFromEncoding:
- @Pharoers: do not file in the reorganize crap attached in the *phase* and *cleanup* changesets.

There is an issue with the #defaultMethodTrailer implementation, which I missed changing.
In trunk, I changed it in TCompilingBehavior, but in Pharo there's no such trait.
There is an additional .cs to fix that (see the notes on Mantis).

Pfff.. I hope I didn't miss anything this time :)

--
Best regards,
Igor Stasenko AKA sig.
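The conversion mentioned in the update above can be sketched as a round trip through the two selectors. This is a minimal workspace sketch, not the code from the changesets: the encoding name 'utf-8' and the sample string are assumptions, and the only conversion selectors used are #convertToEncoding: and #convertFromEncoding: as named in the update.

```smalltalk
"Workspace sketch. Assumption: 'utf-8' is an encoding name the image accepts."
| original encoded decoded |
"Three a-ring characters (code point 229), built without relying on source-file encoding."
original := String new: 3 withAll: (Character value: 229).
"Encode the internal string into a ByteString holding utf8 bytes."
encoded := original convertToEncoding: 'utf-8'.
"Decode the utf8 bytes back into the internal representation."
decoded := encoded convertFromEncoding: 'utf-8'.
"Print this; a lossless round trip should answer true."
decoded = original
```

If the round trip holds on both images, the pair can stand in for #squeakToUtf8 / #utf8ToSqueak where those selectors are missing, which is what the update describes.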
In reply to this post by Igor Stasenko
On 20.12.2009 20:04, Igor Stasenko wrote:
> Hello,
> I finished this stuff, and it's ready for adoption.

Nice!

> Do we have a uniform way to encode ANY String -> ByteString (utf8)
> and back? What does the ANSI standard say about it? Maybe I'm using the wrong methods?

"3.4.6.4 - It is erroneous if stringBody contains any characters that does not exist in the implementation defined execution character set used in the representation of character objects."
So, implementation defined.
Every internal String (in Squeak and Pharo) should (afaik) be either latin1 (ByteStrings) or utf32 with the high byte used to differentiate the language of the string.

To me, sending squeakToUtf8, then using StandardFileStream instead of FileStream seems safe.
As long as the ByteString's bytes are utf8, utf8ToSqueak works. (And in most other cases as well.)
In fact, for non-utf8 strings it's safer than UTF8Decoder, which does not perform the validity checks (it only reads the total number of bytes) when encountering bytes > 127.
The reason utf8ToSqueak seems mostly for internal use (to me) is that it silently falls back to assuming the string is already in latin1 (i.e., the "valid" ByteString format), instead of raising an error like the stream decoder does. (Which, by the way, would be much nicer if it were a MalformedUTF8Error or some such...)

ws := StandardFileStream newFileNamed: 'test.txt'.
"Save as latin1"
ws nextPutAll: 'ååå'.
ws close.

"Read with UTF8Decoder"
rs := FileStream oldFileNamed: 'test.txt'.
"Print this, gives a ?"
rs contents.
rs close.

"Read with Latin1Decoder"
rs := StandardFileStream oldFileNamed: 'test.txt'.
"Print this, gives ååå. Since it's not valid utf8, utf8ToSqueak assumes latin1"
rs contents utf8ToSqueak.
rs close.

> Still, I think we need this thing standardized and common to all
> dialects (not just Pharo/Squeak).

There's really only one way to store characters in a ByteArray (i.e. ByteString) and call it utf8 encoded.
As far as I can tell, Squeak seems to do the right thing :)
I believe Nicolas pushed for an implementation in Pharo some time ago; not sure what happened to that.

Cheers,
Henry
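The difference between a silent fallback and a strict decoder comes down to whether the bytes are checked for utf8 structure before decoding. The block below is a hypothetical helper (isWellFormedUtf8 is not part of Squeak or Pharo, and the 'utf-8' encoding name is an assumption). It only checks lead and continuation byte patterns and does not reject overlong forms or surrogates, so it is a sketch of the idea rather than a full validator.

```smalltalk
| isWellFormedUtf8 latin1 |
"Hypothetical helper: answers true if the string's bytes follow the utf8 lead/continuation pattern."
isWellFormedUtf8 := [:aString |
	| bytes index valid lead extra |
	bytes := aString asByteArray.
	index := 1.
	valid := true.
	[valid and: [index <= bytes size]] whileTrue: [
		lead := bytes at: index.
		"Number of continuation bytes announced by the lead byte; nil means an invalid lead byte."
		extra := nil.
		lead < 16r80 ifTrue: [extra := 0].
		(lead bitAnd: 16rE0) = 16rC0 ifTrue: [extra := 1].
		(lead bitAnd: 16rF0) = 16rE0 ifTrue: [extra := 2].
		(lead bitAnd: 16rF8) = 16rF0 ifTrue: [extra := 3].
		(extra isNil or: [index + extra > bytes size])
			ifTrue: [valid := false]
			ifFalse: [
				"Every continuation byte must look like 10xxxxxx."
				index + 1 to: index + extra do: [:i |
					((bytes at: i) bitAnd: 16rC0) = 16r80 ifFalse: [valid := false]].
				index := index + extra + 1]].
	valid].

"Three a-ring characters stored as raw latin1 bytes (229 each)."
latin1 := String new: 3 withAll: (Character value: 229).
"Print this; raw latin1 bytes above 127 should answer false."
isWellFormedUtf8 value: latin1.
"Print this; should answer true if the conversion produces utf8."
isWellFormedUtf8 value: (latin1 convertToEncoding: 'utf-8')
```

A caller could run such a check first and raise an error on failure instead of falling back to latin1; which behaviour is preferable depends on whether the input is trusted, as discussed above.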
2009/12/20 Henrik Sperre Johansen <[hidden email]>:
> There's really only one way to store characters in a ByteArray (i.e.
> ByteString) and call it utf8 encoded.
> As far as I can tell, Squeak seems to do the right thing :)
> I believe Nicolas pushed for an implementation in Pharo some time ago; not
> sure what happened to that.

It seems I solved this by using #convertToEncoding: / #convertFromEncoding:.
The tests work fine after that. However, I haven't yet tried using source with characters other than Latin1.

--
Best regards,
Igor Stasenko AKA sig.
On 20.12.2009 22:07, Igor Stasenko wrote:
> It seems I solved this by using #convertToEncoding: / #convertFromEncoding:.
> The tests work fine after that. However, I haven't yet tried using source
> with characters other than Latin1.

Converting to utf8 from ByteString/WideString should not be a problem, as long as you know the ByteString encoding is latin1. (Which it should be if you created it by any normal means.)
As long as you are SURE the string you are decoding is utf8 (like, when you've encoded them all yourself ;) ), convertFromEncoding: shouldn't be a problem either. (See my previous mail: it's the same decoder as used by FileStream, so it lacks the validity checks.)

Cheers,
Henry
2009/12/20 Henrik Sperre Johansen <[hidden email]>:
> Converting to utf8 from ByteString/WideString should not be a problem,
> as long as you know the ByteString encoding is latin1. (Which it should be
> if you created it by any normal means.)
> As long as you are SURE the string you are decoding is utf8 (like, when
> you've encoded them all yourself ;) ), convertFromEncoding: shouldn't be
> a problem either. (See my previous mail: it's the same decoder as used by
> FileStream, so it lacks the validity checks.)

I also found other places in Pharo where #( 0 0 0 0) is used as a trailer, in

addTraitSelector: aSymbol withMethod: aCompiledMethod

This needs to be fixed (as well as all other places which try to use arrays for defining a trailer).

--
Best regards,
Igor Stasenko AKA sig.