Issue 4187 in pharo: ByteTextConverter's subclasses tables need to be checked (a.o. ISO88592TextConverter)

pharo
Status: Accepted
Owner: [hidden email]

New issue 4187 by [hidden email]: ByteTextConverter's subclasses  
tables need to be checked (a.o. ISO88592TextConverter)
http://code.google.com/p/pharo/issues/detail?id=4187

It seems that some of the conversion tables of ByteTextConverter's
subclasses are wrong.

For example, ISO88592TextConverter class>>#byteToUnicodeSpec specifies  
values for encoding between 16r80 and 16r8F, which are undefined according  
to http://en.wikipedia.org/wiki/ISO-8859-2 and  
http://www.gymel.com/charsets/ISO8859-2.html.

I am posting this so that we do not forget.

I can fix this later when I have some more time.

Sven




_______________________________________________
Pharo-bugtracker mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-bugtracker

Re: Issue 4187 in pharo: ByteTextConverter's subclasses tables need to be checked (a.o. ISO88592TextConverter)

pharo

Comment #1 on issue 4187 by [hidden email]: ByteTextConverter's  
subclasses tables need to be checked (a.o. ISO88592TextConverter)
http://code.google.com/p/pharo/issues/detail?id=4187

However, in the mappings published by Unicode, these byte values are mapped
to the same control characters as in 8859-1:

http://unicode.org/Public/MAPPINGS/ISO8859/8859-2.TXT

16r80 being the euro sign is probably a copy-paste error from 1253 though :)



Re: Issue 4187 in pharo: ByteTextConverter's subclasses tables need to be checked (a.o. ISO88592TextConverter)

pharo

Comment #2 on issue 4187 by [hidden email]: ByteTextConverter's  
subclasses tables need to be checked (a.o. ISO88592TextConverter)
http://code.google.com/p/pharo/issues/detail?id=4187

These files are actually quite cool.

Couldn't we use them directly? I mean, add the necessary tools to fetch the
file over the internet, parse it, and generate the mapping table that we
need. Of course, we would have to cache the table, but it would make the
specs quite authoritative.
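
A minimal sketch of the caching idea, meant for a Pharo workspace. The names cache, fetchCount and tableFor are hypothetical, and the download-and-parse step is replaced by a placeholder so that the snippet runs without network access:

```smalltalk
| cache fetchCount tableFor first second |
"Hypothetical cache: spec URL -> parsed mapping table"
cache := Dictionary new.
fetchCount := 0.
tableFor := [ :url |
        cache
                at: url
                ifAbsentPut: [
                        "Placeholder: the real download and parse would happen here"
                        fetchCount := fetchCount + 1.
                        Dictionary new ] ].
first := tableFor value: 'http://unicode.org/Public/MAPPINGS/ISO8859/8859-2.TXT'.
second := tableFor value: 'http://unicode.org/Public/MAPPINGS/ISO8859/8859-2.TXT'.
```

The second request is served from the cache, so the placeholder block runs only once and both variables refer to the same table object.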

Sven



Re: Issue 4187 in pharo: ByteTextConverter's subclasses tables need to be checked (a.o. ISO88592TextConverter)

pharo

Comment #3 on issue 4187 by [hidden email]: ByteTextConverter's  
subclasses tables need to be checked (a.o. ISO88592TextConverter)
http://code.google.com/p/pharo/issues/detail?id=4187

My idea is to add the following two methods to the ByteTextConverter class,
so that the key #byteToUnicodeSpec method can be generated automatically
from the official Unicode.org spec files:

ByteTextConverter class>>parseUnicodeOrgSpec: url
        "Parse and return a mapping from byte to unicode values from url."
        "self parseUnicodeOrgSpec: 'http://unicode.org/Public/MAPPINGS/ISO8859/8859-2.TXT'."

        | mapping |
        mapping := Dictionary new: 256.
        (ZnClient get: url) contents linesDo: [ :each |
                (each isEmpty or: [ each beginsWith: '#' ])
                        ifFalse: [ | tokens hexReader |
                                hexReader := [ :string |
                                        Integer readFrom: (string readStream skip: 2; yourself) base: 16 ].
                                tokens := each findTokens: String tab.
                                (tokens last = '<control>' or: [ tokens last = '#UNDEFINED' ])
                                        ifFalse: [
                                                mapping
                                                        at: (hexReader value: tokens first)
                                                        put: (hexReader value: tokens second) ] ] ].
        ^ mapping

ByteTextConverter class>>generateByteToUnicodeSpec: url
        "Return the formatted source code for an array mapping
        the top 128 byte to unicode values from a Unicode.org url"
        "self generateByteToUnicodeSpec: 'http://unicode.org/Public/MAPPINGS/ISO8859/8859-2.TXT'."

        | mapping |
        mapping := self parseUnicodeOrgSpec: url.
        ^ String streamContents: [ :stream |
                stream cr; << ' ^ #('.
                128 to: 255 do: [ :each | | unicode |
                        each \\ 8 = 0 ifTrue: [ stream cr; tab ].
                        (unicode := mapping at: each ifAbsent: [ nil ]) isNil
                                ifTrue: [ stream print: nil; space ]
                                ifFalse: [
                                        stream
                                                << '16r'
                                                << (unicode printPaddedWith: $0 to: 4 base: 16);
                                                space ] ].
                stream nextPut: $); cr ]
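
To make the assumed line format concrete, here is a small workspace sketch that applies the same tokenizing and hex-reading steps to a single sample line, without any network access. The sample line is hand-built to match the tab-separated layout of the Unicode.org files (byte value, code point, comment); the variable names are illustrative only:

```smalltalk
| hexReader line tokens byte unicode |
"Same hex reader as in parseUnicodeOrgSpec: it skips the leading 0x"
hexReader := [ :string |
        Integer readFrom: (string readStream skip: 2; yourself) base: 16 ].
"Hand-built sample line in the layout of 8859-2.TXT"
line := '0xA1' , String tab , '0x0104' , String tab , '#' , String tab ,
        'LATIN CAPITAL LETTER A WITH OGONEK'.
tokens := line findTokens: String tab.
byte := hexReader value: tokens first.
unicode := hexReader value: tokens second.
```

Here byte ends up as 16rA1 and unicode as 16r0104, which is the pair that the parser would store in the mapping.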

The following code compares the new and the old mappings:

(Dictionary newFromPairs: {
        CP1250TextConverter. 'http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1250.TXT'.
        CP1252TextConverter. 'http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT'.
        CP1253TextConverter. 'http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1253.TXT'.
        Latin1TextConverter. 'http://unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT'.
        ISO88592TextConverter. 'http://unicode.org/Public/MAPPINGS/ISO8859/8859-2.TXT'.
        ISO88597TextConverter. 'http://unicode.org/Public/MAPPINGS/ISO8859/8859-7.TXT'.
        ISO885915TextConverter. 'http://unicode.org/Public/MAPPINGS/ISO8859/8859-15.TXT'.
        KOI8RTextConverter. 'http://unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT'.
        MacRomanTextConverter. 'http://unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT' })
        keysAndValuesDo: [ :encoderClass :url | | encoder mapping |
                Transcript cr; show: encoderClass; space.
                encoder := encoderClass new.
                mapping := ByteTextConverter parseUnicodeOrgSpec: url.
                0 to: 255 do: [ :each |
                        (encoder byteToUnicode: each asCharacter) charCode
                                = (mapping at: each ifAbsent: [ -1 ])
                                ifFalse: [ Transcript print: each; space ] ] ].

The output shows the byte values for which the old and new mappings differ:

Latin1TextConverter 128 129 130 131 132 133 134 135 136 137 138 139 140 141  
142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
ISO88592TextConverter 128 129 130 131 132 133 134 135 136 137 138 139 140  
141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
ISO88597TextConverter 128 129 130 131 132 133 134 135 136 137 138 139 140  
141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159  
174 210 255
ISO885915TextConverter 128 129 130 131 132 133 134 135 136 137 138 139 140  
141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
KOI8RTextConverter
CP1253TextConverter 129 136 138 140 141 142 143 144 152 154 156 157 158 159  
170 210 255
MacRomanTextConverter 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20  
21 22 23 24 25 26 27 28 29 30 31 127 249 250 251 253 254
CP1250TextConverter 129 131 136 144 152
CP1252TextConverter

This has to be studied and resolved.

Another problem is that ByteTextConverter cannot (yet) deal with 'holes' in
the encoding tables.
Also, ByteTextConverter silently ignores errors where, IMHO, it should throw
exceptions (for example, when dealing with out-of-range or undefined values).

Does anyone care to comment?
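
As an illustration of the stricter behaviour, here is a workspace sketch of a lookup that signals an error for holes and out-of-range values instead of silently accepting them. It is not the existing ByteTextConverter code: the table layout (128 entries for bytes 128 to 255, nil marking a hole) and the names table and strictLookup are assumptions made for this example:

```smalltalk
| table strictLookup |
"Hypothetical top-half table: index 1 is byte 128, nil marks a hole"
table := Array new: 128.
table at: 16rA1 - 127 put: 16r0104.
strictLookup := [ :byte |
        (byte between: 0 and: 255)
                ifFalse: [ Error signal: 'Byte value out of range: ' , byte printString ].
        byte < 128
                ifTrue: [ byte ]
                ifFalse: [
                        (table at: byte - 127)
                                ifNil: [ Error signal: 'Undefined byte value: ' , byte printString ] ] ].
```

With this table, strictLookup value: 16rA1 answers 16r0104, while strictLookup value: 16r80 signals an Error because that entry is a hole.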

Sven






Re: Issue 4187 in pharo: ByteTextConverter's subclasses tables need to be checked (a.o. ISO88592TextConverter)

pharo

Comment #4 on issue 4187 by [hidden email]: ByteTextConverter's  
subclasses tables need to be checked (a.o. ISO88592TextConverter)
http://code.google.com/p/pharo/issues/detail?id=4187

OK, I picked this up again ;-)

I added backwards compatibility, so this is now more of a cleanup than a
refactoring.

The slice is in the SS3 inbox.

Name: SLICE-Issue-4187-ByteTextConverter-byteToUnicodeSpec-generation-from-external-files-SvenVanCaekenberghe.1
Author: SvenVanCaekenberghe
Time: 7 March 2012, 11:31:20 pm
UUID: 4e5fd869-c4ea-4660-9552-818e9cd57348
Ancestors:
Dependencies: Multilingual-TextConversion-SvenVanCaekenberghe.26

added ByteTextConverter class>>#parseUnicodeOrgSpec: and  
#generateByteToUnicodeSpec: so that all #byteToUnicode implementations for  
subclasses can be generated (statically) from official external  
specifications;
fixed #initializeTables to silently handle holes with an identity mapping  
for backwards compatibility (for now);
added uniform class #initialize to all subclasses;
removed unused FromTable dictionary and unused initialization from  
ISO88597TextConverter and ISO88592TextConverter;

execute
        ByteTextConverter initialize.
after loading.
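
The identity fallback for holes can be pictured with the following workspace sketch. It is not the actual #initializeTables implementation: the variable names and the table layout (a 128-entry spec for bytes 128 to 255, with nil marking a hole) are assumptions made for illustration:

```smalltalk
| spec byteToUnicode |
"Hypothetical spec for bytes 128 to 255, nil marks a hole"
spec := Array new: 128.
spec at: 16rA1 - 127 put: 16r0104.
"Full 256-entry table, index is byte value + 1"
byteToUnicode := Array new: 256.
"The lower half is plain ASCII, mapped to itself"
0 to: 127 do: [ :byte | byteToUnicode at: byte + 1 put: byte ].
"In the upper half, holes fall back to the identity mapping"
128 to: 255 do: [ :byte |
        byteToUnicode
                at: byte + 1
                put: ((spec at: byte - 127) ifNil: [ byte ]) ].
```

Defined entries keep their specified code point, while a hole such as 16r80 simply maps to itself, which preserves the old lenient behaviour.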




Re: Issue 4187 in pharo: ByteTextConverter's subclasses tables need to be checked (a.o. ISO88592TextConverter)

pharo
Updates:
        Status: FixReviewNeeded
        Labels: Type-Cleanup Milestone-1.4

Comment #5 on issue 4187 by [hidden email]: ByteTextConverter's  
subclasses tables need to be checked (a.o. ISO88592TextConverter)
http://code.google.com/p/pharo/issues/detail?id=4187

(No comment was entered for this change.)



Re: Issue 4187 in pharo: ByteTextConverter's subclasses tables need to be checked (a.o. ISO88592TextConverter)

pharo
Updates:
        Status: FixToInclude

Comment #6 on issue 4187 by [hidden email]: ByteTextConverter's  
subclasses tables need to be checked (a.o. ISO88592TextConverter)
http://code.google.com/p/pharo/issues/detail?id=4187

(No comment was entered for this change.)



Re: Issue 4187 in pharo: ByteTextConverter's subclasses tables need to be checked (a.o. ISO88592TextConverter)

pharo
Updates:
        Status: Integrated

Comment #7 on issue 4187 by [hidden email]: ByteTextConverter's  
subclasses tables need to be checked (a.o. ISO88592TextConverter)
http://code.google.com/p/pharo/issues/detail?id=4187

in 14392

