It seems that when inputting accented characters, they are not in UTF-8 by default.
Is that the case with Pharo 1.3?

Hilaire

--
Education 0.2 -- http://blog.ofset.org/hilaire
On Aug 5, 2011, at 1:14 35PM, Hilaire Fernandes wrote:
> It seems that when inputting accented characters, they are not in UTF-8 by default.
> Is that the case with Pharo 1.3?

I'm not sure what you mean.
Inside the image, all the way from InputEvents to the String representation, you only deal with Unicode code points.

Cheers,
Henry
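To make the distinction between code points and encoded bytes concrete, here is a minimal sketch. It is written in Python purely for illustration and does not use any Pharo or XMLParser API; the variable names are made up for the example. The text in memory is a sequence of code points, and bytes only appear once an encoding is chosen at the output boundary.

```python
# Illustration only (Python, not Pharo): in-memory text is code points;
# an encoding is applied only when the text is turned into bytes.
text = "\u00e9"  # precomposed small e with acute accent

# The in-memory view: one character whose code point is 233.
code_points = [ord(ch) for ch in text]

# Two different byte representations of the same code point.
utf8_bytes = text.encode("utf-8")      # two bytes
latin1_bytes = text.encode("latin-1")  # one byte, same value as the code point

print(code_points)
print(utf8_bytes.hex())
print(latin1_bytes.hex())
```

Seeing a one-byte value for é inside the image is therefore expected; what matters is which encoding is applied when the string is written out.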
On 05/08/2011 13:28, Henrik Johansen wrote:
> I'm not sure what you mean.
> Inside the image, all the way from InputEvents to the String representation, you only deal with Unicode code points.

It seems to be 8-bit characters: when exported through XMLParser, the result is an 8-bit string. I need to investigate further.

Hilaire
On Aug 5, 2011, at 3:41 54PM, Hilaire Fernandes wrote:
> It seems to be 8-bit characters: when exported through XMLParser, the result is an
> 8-bit string. I need to investigate further.

It is an 8-bit character, since the code point fits in one byte (see a).

Accented characters like é can be represented as either:
a) One Unicode code point: U+00E9 (decimal 233), small e with acute accent.
b) Two Unicode code points: U+0065 (decimal 101), small e, followed by U+0301 (decimal 769), combining acute accent.

Internally, you'd see strings with character values corresponding to the decimal values listed, i.e. the Unicode code points.
b) would be a WideString, as 769 does not fit in a byte.

However, if correctly converted to UTF-8, their representations should be:
a) 2 bytes: 16r C3A9
b) 3 bytes: 16r 65 CC81

I.e. it seems XMLParser does not encode the text properly to UTF-8 when exporting.
Note: this is perfectly legal if the document contains an encoding attribute specifying a one-byte encoding like iso-8859-1 or windows-1252
(i.e. it starts with <?xml version="1.0" encoding="windows-1252" ?> or some such).
Absent such an attribute, or a BOM indicating another Unicode encoding, though, it is a bug.

Cheers,
Henry
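The two representations of é and their UTF-8 bytes can be checked with a short sketch. This is again Python used only as an illustration, not Pharo or XMLParser code, and `is_valid_utf8` is a hypothetical helper introduced for the example. Note that in Unicode the combining accent follows the base letter, so the decomposed form encodes as 65 CC 81.

```python
import unicodedata

# a) one code point; b) base letter followed by the combining acute accent
precomposed = "\u00e9"
decomposed = "e\u0301"

# The two forms are canonically equivalent: normalization converts between them.
assert unicodedata.normalize("NFD", precomposed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(precomposed.encode("utf-8").hex())  # bytes for form a)
print(decomposed.encode("utf-8").hex())   # bytes for form b)

# A lone byte E9 is fine as iso-8859-1 or windows-1252, but it is not valid UTF-8.
exported = b"<name>\xe9</name>"
print(is_valid_utf8(exported))
print(exported.decode("latin-1"))
```

A check along these lines on the exported file would show whether the writer produced a one-byte encoding without declaring it, which is the situation described above as a bug.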
I had a look at the latest XMLParser, but the API is different and a lot of my code is now broken. Is XMLWriter class>>on: obsolete? The system complains about it, but the class and method are still there. Is this a Monticello trick I have forgotten about?
I don't even know how to port to the new API. Is there a porting guide?
I guess this is for the better, but it is still frustrating and distracting from the main task...

On 05/08/2011 16:23, Henrik Johansen wrote:
> It is an 8-bit character, since the codePoint fits in one byte. (see a)
> Ie. it seems XMLParser does not encode it properly to utf8 when exporting.
On Aug 5, 2011, at 4:41 PM, Hilaire Fernandes wrote:
> I had a look at the latest XMLParser, but the API is different and a lot of my code is now broken.
> I don't even know how to port to the new API. Is there a porting guide?
> I guess this is for the better, but it is still frustrating and distracting
> from the main task...

Indeed.
We should really invest in some main packages. For example, I worked on SOUP to add comments and new tests.
Right now we (the core team) do not have the energy to work on both the core and external packages.
I hope that will change once the core gets fixed.