Status: Accepted
Owner: [hidden email]

New issue 4187 by [hidden email]: ByteTextConverter's subclasses tables need to be checked (a.o. ISO88592TextConverter)
http://code.google.com/p/pharo/issues/detail?id=4187

It seems that some of the conversion tables of ByteTextConverter's subclasses are wrong. For example, ISO88592TextConverter class>>#byteToUnicodeSpec specifies values for the byte range 16r80 to 16r8F, which is undefined according to http://en.wikipedia.org/wiki/ISO-8859-2 and http://www.gymel.com/charsets/ISO8859-2.html.

I am posting this so that we do not forget. I can fix this later when I have some more time.

Sven

_______________________________________________
Pharo-bugtracker mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-bugtracker
Comment #1 on issue 4187 by [hidden email]:

However, in the mappings published by Unicode, these values are to be mapped to the same control characters as in 8859-1:
http://unicode.org/Public/MAPPINGS/ISO8859/8859-2.TXT

16r80 being the euro sign is probably a copy-paste error from 1253 though :)
Comment #2 on issue 4187 by [hidden email]:

These files are actually quite cool. Couldn't we try to use them directly? I mean, add the necessary tools to get the file over the internet, parse it, and generate the mapping table that we need. Of course, we would have to cache the table, but it would make the specs quite authoritative.

Sven
Comment #3 on issue 4187 by [hidden email]:

My idea is to add the following two methods to the ByteTextConverter class, so that the key #byteToUnicodeSpec method could be autogenerated from the official Unicode.org spec files.

ByteTextConverter class>>parseUnicodeOrgSpec: url
    "Parse and return a mapping from byte to unicode values from url."
    "self parseUnicodeOrgSpec: 'http://unicode.org/Public/MAPPINGS/ISO8859/8859-2.TXT'."

    | mapping |
    mapping := Dictionary new: 256.
    (ZnClient get: url) contents linesDo: [ :each |
        (each isEmpty or: [ each beginsWith: '#' ])
            ifFalse: [ | tokens hexReader |
                hexReader := [ :string | Integer readFrom: (string readStream skip: 2; yourself) base: 16 ].
                tokens := each findTokens: String tab.
                (tokens last = '<control>' or: [ tokens last = '#UNDEFINED' ])
                    ifFalse: [
                        mapping
                            at: (hexReader value: tokens first)
                            put: (hexReader value: tokens second) ] ] ].
    ^ mapping

ByteTextConverter class>>generateByteToUnicodeSpec: url
    "Return the formatted source code for an array mapping the top 128 byte to unicode values from a Unicode.org url"
    "self generateByteToUnicodeSpec: 'http://unicode.org/Public/MAPPINGS/ISO8859/8859-2.TXT'."

    | mapping |
    mapping := self parseUnicodeOrgSpec: url.
    ^ String streamContents: [ :stream |
        stream cr; << ' ^ #('.
        128 to: 255 do: [ :each | | unicode |
            each \\ 8 = 0 ifTrue: [ stream cr; tab ].
            (unicode := mapping at: each ifAbsent: [ nil ]) isNil
                ifTrue: [ stream print: nil; space ]
                ifFalse: [ stream << '16r' << (unicode printPaddedWith: $0 to: 4 base: 16); space ] ].
        stream nextPut: $); cr ]

The following code compares the new and the old mappings:

(Dictionary newFromPairs: {
    CP1250TextConverter. 'http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1250.TXT'.
    CP1252TextConverter. 'http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT'.
    CP1253TextConverter. 'http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1253.TXT'.
    Latin1TextConverter. 'http://unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT'.
    ISO88592TextConverter. 'http://unicode.org/Public/MAPPINGS/ISO8859/8859-2.TXT'.
    ISO88597TextConverter. 'http://unicode.org/Public/MAPPINGS/ISO8859/8859-7.TXT'.
    ISO885915TextConverter. 'http://unicode.org/Public/MAPPINGS/ISO8859/8859-15.TXT'.
    KOI8RTextConverter. 'http://unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT'.
    MacRomanTextConverter. 'http://unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT' })
    keysAndValuesDo: [ :encoderClass :url | | encoder mapping |
        Transcript cr; show: encoderClass; space.
        encoder := encoderClass new.
        mapping := ByteTextConverter parseUnicodeOrgSpec: url.
        0 to: 255 do: [ :each |
            (encoder byteToUnicode: each asCharacter) charCode = (mapping at: each ifAbsent: [ -1 ])
                ifFalse: [ Transcript print: each; space ] ] ].

The output shows the byte values for which the old and new mappings differ (a converter name with no numbers after it means no differences were found):

Latin1TextConverter 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
ISO88592TextConverter 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
ISO88597TextConverter 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 174 210 255
ISO885915TextConverter 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
KOI8RTextConverter
CP1253TextConverter 129 136 138 140 141 142 143 144 152 154 156 157 158 159 170 210 255
MacRomanTextConverter 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127 249 250 251 253 254
CP1250TextConverter 129 131 136 144 152
CP1252TextConverter

This has to be studied and resolved. Note that the parser skips entries marked <control>, so control characters always show up as differences; this accounts for the 128 to 159 ranges above.

Another problem is that ByteTextConverter cannot (yet) deal with 'holes' in the encoding tables. Also, ByteTextConverter silently ignores errors where, IMHO, it should throw exceptions (for example, when dealing with out-of-range or undefined values).

Anyone care to comment?

Sven
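To make the point about 'holes' concrete, here is a minimal workspace sketch of a lookup that signals an error for an undefined byte instead of passing it through silently. It is not the ByteTextConverter implementation; the names mapping and decode are hypothetical, and the two sample entries are taken from the CP1252 table, where 16r81 is undefined.

```smalltalk
"Workspace sketch, not ByteTextConverter code: mapping and decode are hypothetical names."
| mapping decode |
mapping := Dictionary new.
mapping at: 16r80 put: 16r20AC.   "CP1252: EURO SIGN"
mapping at: 16r82 put: 16r201A.   "CP1252: SINGLE LOW-9 QUOTATION MARK"
"16r81 is deliberately absent: it is a hole in CP1252."

decode := [ :byte |
    byte < 128
        ifTrue: [ byte ]   "the ASCII range maps to itself"
        ifFalse: [
            mapping
                at: byte
                ifAbsent: [ Error signal: 'Undefined byte value: ', byte printString ] ] ].

decode value: 16r41.   "65"
decode value: 16r80.   "8364, i.e. 16r20AC"
decode value: 16r81.   "signals an Error, because 16r81 is a hole"
```

Whether a hole should raise an error or fall back to some default is exactly the open design question raised above; the sketch only shows the strict variant.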
Comment #4 on issue 4187 by [hidden email]:

OK, I picked this up again ;-) I added backwards compatibility, so this is now more a cleanup than a refactoring.

The slice is in the SS3 inbox:

Name: SLICE-Issue-4187-ByteTextConverter-byteToUnicodeSpec-generation-from-external-files-SvenVanCaekenberghe.1
Author: SvenVanCaekenberghe
Time: 7 March 2012, 11:31:20 pm
UUID: 4e5fd869-c4ea-4660-9552-818e9cd57348
Ancestors:
Dependencies: Multilingual-TextConversion-SvenVanCaekenberghe.26

added ByteTextConverter class>>#parseUnicodeOrgSpec: and #generateByteToUnicodeSpec: so that all #byteToUnicode implementations for subclasses can be generated (statically) from official external specifications;
fixed #initializeTables to silently handle holes with an identity mapping for backwards compatibility (for now);
added uniform class #initialize to all subclasses;
removed unused FromTable dictionary and unused initialization from ISO88597TextConverter and ISO88592TextConverter;
execute ByteTextConverter initialize. after loading.
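As an illustration of what "handle holes with an identity mapping" could mean, the following workspace sketch fills a table for the top 128 byte values and lets a nil entry fall back to the byte value itself. The actual #initializeTables code is not shown in this thread, so the variable names spec and byteToUnicode and the overall structure are assumptions.

```smalltalk
"Workspace sketch under assumptions; this is not the #initializeTables source."
| spec byteToUnicode |
spec := Array new: 128.     "hypothetical spec for bytes 128..255; every entry starts as nil (a hole)"
spec at: 1 put: 16r20AC.    "index 1 stands for byte 16r80"

byteToUnicode := Array new: 128.
spec withIndexDo: [ :unicode :index |
    "index 1 corresponds to byte 128, so a hole maps to 127 + index, the byte value itself"
    byteToUnicode at: index put: (unicode ifNil: [ 127 + index ]) ].

byteToUnicode first.    "8364, the defined entry for byte 16r80"
byteToUnicode second.   "129, the hole at byte 16r81 falls back to itself"
```

The identity fallback keeps old behaviour intact, at the cost of hiding undefined values from the caller.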
Updates:
    Status: FixReviewNeeded
    Labels: Type-Cleanup Milestone-1.4

Comment #5 on issue 4187 by [hidden email]:

(No comment was entered for this change.)
Updates:
    Status: FixToInclude

Comment #6 on issue 4187 by [hidden email]:

(No comment was entered for this change.)
Updates:
    Status: Integrated

Comment #7 on issue 4187 by [hidden email]:

in 14392