ASN1 encoding of UTF8

ASN1 encoding of UTF8

Alan Pinch
I am trying to map UTF-8 into an ASN.1 encoding, where a UTF-8 character
may take more than one byte. I am also interested in retaining these
UTF-8 characters in Squeak so that they interoperate well. What would be
my best approach to mapping to and from these bytes on a stream?

Alan



Re: ASN1 encoding of UTF8

Jakob Reschke-2
I just did a quick search on the web and it seems like ASN.1 has a UTF8String type (with tag 12) that just contains the sequence of bytes of the UTF-8-encoded string. Can you use that? See also this question on stackoverflow: https://stackoverflow.com/q/28929809

In Squeak, you can convert between UTF-8-encoded byte strings and decoded (Squeak-encoded) character strings with the help of UTF8TextConverter. Have a look at its class-side methods. Also, there are conversion methods in String, IIRC. Try to filter its instance-side methods by "utf8".
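
For the Java side of the interop discussed later in this thread, here is a minimal sketch of what a primitive UTF8String looks like on the wire: the tag byte 0x0C (12), a DER length, then the raw UTF-8 bytes. The class and method names are invented for illustration and do not come from Squeak's Cryptography package or from any Java library.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class Utf8StringDerSketch {
    static final int UTF8_STRING_TAG = 0x0C; // universal tag 12

    // Encode a string as a primitive UTF8String: tag, length, UTF-8 content bytes.
    static byte[] encode(String text) {
        byte[] content = text.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(UTF8_STRING_TAG);
        writeLength(out, content.length);
        out.write(content, 0, content.length);
        return out.toByteArray();
    }

    // DER length: one byte below 128, otherwise 0x80 | byteCount
    // followed by the length in big-endian order.
    static void writeLength(ByteArrayOutputStream out, int length) {
        if (length < 0x80) {
            out.write(length);
            return;
        }
        int byteCount = (32 - Integer.numberOfLeadingZeros(length) + 7) / 8;
        out.write(0x80 | byteCount);
        for (int shift = (byteCount - 1) * 8; shift >= 0; shift -= 8) {
            out.write((length >>> shift) & 0xFF);
        }
    }
}
```

Note that the length counts bytes, not characters, so a string with multi-byte characters has a length field larger than its character count.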

Does this answer your question or are you in search of something else?

Kind regards,
Jakob

On 18.09.2017 03:49, "Alan Pinch" <[hidden email]> wrote:
I am trying to map UTF-8 into an ASN.1 encoding, where a UTF-8 character
may take more than one byte. I am also interested in retaining these
UTF-8 characters in Squeak so that they interoperate well. What would be
my best approach to mapping to and from these bytes on a stream?

Alan





Re: ASN1 encoding of UTF8

Alan Pinch

I had found the same Stack Overflow question. It is the only place I found that mentions 0x0C as the tag for UTF8String.

I am currently encoding thus:

aString squeakToUtf8 asByteArray.

and decoding:

bytes asByteArray asString utf8ToSqueak.

Do you think this lays out the bytes as specified on this page? I gather from the Stack Overflow answer that this would be the encoded form of UTF-8 for ASN.1.

https://en.wikipedia.org/wiki/UTF-8#Description
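
One way to check the byte layout independently of any library converter is to build the bytes by hand from the bit patterns in that description. The following is a sketch in Java (the interop target mentioned later in the thread); the class name is made up, and the code assumes the input is a single valid Unicode code point.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper for illustration only.
final class Utf8LayoutSketch {
    // Encode one code point using the 1- to 4-byte UTF-8 layout.
    static byte[] encodeCodePoint(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode scalar value: " + cp);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (cp < 0x80) {                  // 0xxxxxxx
            out.write(cp);
        } else if (cp < 0x800) {          // 110xxxxx 10xxxxxx
            out.write(0xC0 | (cp >> 6));
            out.write(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {        // 1110xxxx 10xxxxxx 10xxxxxx
            out.write(0xE0 | (cp >> 12));
            out.write(0x80 | ((cp >> 6) & 0x3F));
            out.write(0x80 | (cp & 0x3F));
        } else {                          // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out.write(0xF0 | (cp >> 18));
            out.write(0x80 | ((cp >> 12) & 0x3F));
            out.write(0x80 | ((cp >> 6) & 0x3F));
            out.write(0x80 | (cp & 0x3F));
        }
        return out.toByteArray();
    }
}
```

Comparing these bytes with what `squeakToUtf8 asByteArray` produces for the same character is a quick sanity check on the Squeak side.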

Alan

On 09/18/2017 01:46 AM, Jakob Reschke wrote:
I just did a quick search on the web and it seems like ASN.1 has a UTF8String type (with tag 12) that just contains the sequence of bytes of the UTF-8-encoded string. Can you use that? See also this question on stackoverflow: https://stackoverflow.com/q/28929809

In Squeak, you can convert between UTF-8-encoded byte strings and decoded (Squeak-encoded) character strings with the help of UTF8TextConverter. Have a look at its class-side methods. Also, there are conversion methods in String, IIRC. Try to filter its instance-side methods by "utf8".

Does this answer your question or are you in search of something else?

Kind regards,
Jakob


Re: ASN1 encoding of UTF8

timrowledge
In reply to this post by Alan Pinch
We do have assorted string-encoding support in the current image, but the actual UTF-8 results of #squeakToUtf8 (for example) are just ByteStrings. That is rather confusing and annoying, because you now have no way to know which encoding applies other than by carefully keeping track manually. Normally, of course, within the image we have perfectly usable strings, because any time a Unicode character outside the 1-byte range is used the string becomes a WideString.

We need to do better. Look at TextEncoder and its hierarchy for more info.
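
The ambiguity described here is not specific to Squeak. As a rough illustration in Java (the language of the port mentioned later in the thread), the sketch below keeps the charset next to the bytes so the two cannot drift apart, and shows what happens when the same bytes are read with the wrong charset. `TaggedBytesSketch` is a made-up class, not an existing API in either language.

```java
import java.nio.charset.Charset;

// Hypothetical illustration only: bytes that remember how they were encoded.
final class TaggedBytesSketch {
    private final byte[] bytes;
    private final Charset charset;

    private TaggedBytesSketch(byte[] bytes, Charset charset) {
        this.bytes = bytes;
        this.charset = charset;
    }

    static TaggedBytesSketch encode(String text, Charset charset) {
        return new TaggedBytesSketch(text.getBytes(charset), charset);
    }

    // Decode with the charset recorded at encoding time.
    String decode() {
        return new String(bytes, charset);
    }

    // Decode with some other charset, as happens when the encoding is lost.
    String decodeAs(Charset other) {
        return new String(bytes, other);
    }

    int size() {
        return bytes.length;
    }
}
```

Encoding "ö" as UTF-8 and reading it back as ISO-8859-1 yields the two characters "Ã¶", which is the kind of silent corruption that manual bookkeeping is meant to avoid.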

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Strange OpCodes: RLBM: Ruin Logic Board Multiple




Re: ASN1 encoding of UTF8

Alan Pinch
I will explore; thank you for your thoughts. I tell myself we have to do better, yet the task list is long and somewhat disorganized.
And there are only so many seconds in each decade.

- Alan

On Sep 18, 2017, at 12:29, tim Rowledge <[hidden email]> wrote:

We do have assorted string-encoding support in the current image, but the actual UTF-8 results of #squeakToUtf8 (for example) are just ByteStrings. That is rather confusing and annoying, because you now have no way to know which encoding applies other than by carefully keeping track manually. Normally, of course, within the image we have perfectly usable strings, because any time a Unicode character outside the 1-byte range is used the string becomes a WideString.

We need to do better. Look at TextEncoder and its hierarchy for more info.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Strange OpCodes: RLBM: Ruin Logic Board Multiple






Re: ASN1 encoding of UTF8

Alan Pinch
I think the pieces are there; I merely need to figure out the correct ordering.

- Alan

On Sep 18, 2017, at 12:43, Alan Pinch <[hidden email]> wrote:

I will explore; thank you for your thoughts. I tell myself we have to do better, yet the task list is long and somewhat disorganized.
And there are only so many seconds in each decade.

- Alan


Re: ASN1 encoding of UTF8

Alan Pinch
In reply to this post by timrowledge
Here is the encode and decode code I am using, and a test that does not
yet exercise multi-byte UTF-8 encoding. I need ASN.1 bytes with non-trivial
characters as a baseline. Would anyone have some interesting UTF-8 bytes
handy?

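ASN1UTF8StringType>>#encodeValue: anObject withDERStream: derStream
     "Write the UTF-8 bytes of anObject as the content octets."

     derStream nextPutAll: anObject squeakToUtf8 asByteArray

ASN1UTF8StringType>>#decodeValueWithDERStream: derStream length: length
     "Read length content octets and decode them from UTF-8."

     ^ (derStream next: length) asByteArray asString utf8ToSqueak.

CryptoASN1Test>>#testConstructedUTF8String
     "44 (16r2C) is a constructed UTF8String of length 15 holding two
     primitive UTF8Strings (tag 12): 'Test ' and 'User 1'. ASCII only."

     | bytes obj testObj |
     bytes := #(44 15 12 5 84 101 115 116 32 12 6 85 115 101 114 32 49).
     testObj := 'Test User 1'.
     obj := ASN1InputStream decodeBytes: bytes.
     self assert: (obj = testObj).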
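
For comparison on the Java side, here is a sketch of how the constructed form in this test can be walked: 44 (0x2C) is tag 12 with the constructed bit 0x20 set, and its content is a run of primitive UTF8String segments. The class is hypothetical and handles short-form lengths only. It collects all segment bytes before decoding, because a multi-byte character could in principle be split across two segments. Note also that constructed string encodings are a BER feature; strict DER requires the primitive form.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; short-form lengths only.
final class ConstructedUtf8Sketch {
    // Convenience: build a byte[] from values written as 0..255.
    static byte[] fromUnsigned(int... values) {
        byte[] result = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            result[i] = (byte) values[i];
        }
        return result;
    }

    // Decode a primitive (0x0C) or constructed (0x2C) UTF8String.
    static String decode(byte[] data) {
        ByteArrayOutputStream content = new ByteArrayOutputStream();
        collect(data, 0, data.length, content);
        return new String(content.toByteArray(), StandardCharsets.UTF_8);
    }

    // Walk the TLVs in data[from, to) and append every primitive segment's bytes.
    private static void collect(byte[] data, int from, int to, ByteArrayOutputStream content) {
        int pos = from;
        while (pos < to) {
            if (pos + 2 > to) {
                throw new IllegalArgumentException("truncated header");
            }
            int tag = data[pos] & 0xFF;
            int length = data[pos + 1] & 0xFF;
            if (length >= 0x80) {
                throw new IllegalArgumentException("long-form length not handled in this sketch");
            }
            int start = pos + 2;
            if (start + length > to) {
                throw new IllegalArgumentException("truncated content");
            }
            if (tag == 0x2C) {
                collect(data, start, start + length, content);
            } else if (tag == 0x0C) {
                content.write(data, start, length);
            } else {
                throw new IllegalArgumentException("unexpected tag " + tag);
            }
            pos = start + length;
        }
    }
}
```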

Thank you for your consideration,
Alan



Re: ASN1 encoding of UTF8

Jakob Reschke-2
In reply to this post by timrowledge
2017-09-20 21:42 GMT+02:00 Alan Pinch <[hidden email]>:
> Would anyone have some interesting utf8 bytes
> handy?

http://xahlee.info/comp/unicode_drawing_shapes.html
:-)

🐭
┃┣━━━┳━━━━┳━━┓
┃┗┓┏┛┃╻╺━━┛╺┓┃
┣┓┃┗┓┗┻━━━┳╸┃┃
┃┃┣╸┣━━┳━┓┗━┛┃
┃┃┃┏┛┏╸┃╻┣━┳╸┃
┃┗━┫╻┣━━┫┗╸┃┏┫
┃┏━┫┃┃╺┓┗━┓┃┃┃
┃┃┃┃┃┗┓┗━┓┗┻╸┃
┗━┫┏┻━━━━┻━━━┛

Can be copy&pasted into a workspace. You only get to see question
marks, but the character values are correct.

theAbove squeakToUtf8 asByteArray
=>  #[226 148 131 226 148 163 226 148 129 226 148 129 226 148 129 226
148 179 226 148 129 226 148 129 226 148 129 226 148 129 226 148 179
226 148 129 226 148 129 226 148 147 13 226 148 131 226 148 151 226 148
147 226 148 143 226 148 155 226 148 131 226 149 187 226 149 186 226
148 129 226 148 129 226 148 155 226 149 186 226 148 147 226 148 131 13
226 148 163 226 148 147 226 148 131 226 148 151 226 148 147 226 148
151 226 148 187 226 148 129 226 148 129 226 148 129 226 148 179 226
149 184 226 148 131 226 148 131 13 226 148 131 226 148 131 226 148 163
226 149 184 226 148 163 226 148 129 226 148 129 226 148 179 226 148
129 226 148 147 226 148 151 226 148 129 226 148 155 226 148 131 13 226
148 131 226 148 131 226 148 131 226 148 143 226 148 155 226 148 143
226 149 184 226 148 131 226 149 187 226 148 163 226 148 129 226 148
179 226 149 184 226 148 131 13 226 148 131 226 148 151 226 148 129 226
148 171 226 149 187 226 148 163 226 148 129 226 148 129 226 148 171
226 148 151 226 149 184 226 148 131 226 148 143 226 148 171 13 226 148
131 226 148 143 226 148 129 226 148 171 226 148 131 226 148 131 226
149 186 226 148 147 226 148 151 226 148 129 226 148 147 226 148 131
226 148 131 226 148 131 13 226 148 131 226 148 131 226 148 131 226 148
131 226 148 131 226 148 151 226 148 147 226 148 151 226 148 129 226
148 147 226 148 151 226 148 187 226 149 184 226 148 131 13 226 148 151
226 148 129 226 148 171 226 148 143 226 148 187 226 148 129 226 148
129 226 148 129 226 148 129 226 148 187 226 148 129 226 148 129 226
148 129 226 148 155]

Alternatively, you could try some pseudo-German pseudo-names like:
'Björn-Thaddäus Düngerstraß' squeakToUtf8 asByteArray
=> #[66 106 195 182 114 110 45 84 104 97 100 100 195 164 117 115 32 68
195 188 110 103 101 114 115 116 114 97 195 159].
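
When comparing these values with a Java implementation, keep in mind that Java bytes are signed, so 195 shows up as -61 unless it is masked. A small sketch (the class name is made up) that produces the same 0..255 view that Squeak prints for a ByteArray:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class SqueakStyleBytesSketch {
    // UTF-8 encode the text and return each byte as a value in 0..255.
    static int[] utf8AsUnsigned(String text) {
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        int[] result = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            result[i] = raw[i] & 0xFF; // mask off the sign extension
        }
        return result;
    }
}
```

Calling `utf8AsUnsigned` on the pseudo-name above should give the same 30 values, and on the single character ┃ (U+2503) it gives 226 148 131, the triple that opens the long array.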


Re: ASN1 encoding of UTF8

Alan Pinch
Excellent, my thanks! Here is a now-working ASN1UTF8StringType test:


testConstructedUTF8String

     | bytes obj testObj |
     bytes := #[12 30 66 106 195 182 114 110 45 84 104 97 100 100 195
164 117 115 32 68 195 188 110 103 101 114 115 116 114 97 195 159].
     testObj := 'Björn-Thaddäus Düngerstraß'.
     obj := ASN1InputStream decodeBytes: bytes.
     self assert: (obj = testObj).
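
The same baseline can be checked from Java. Below is a sketch of a decoder for a single primitive UTF8String that also accepts long-form lengths; the class and its methods are invented for illustration. It does not enforce DER's minimal-length rule, so it accepts some BER encodings that a strict DER reader would reject.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class Utf8StringDecodeSketch {
    // Convenience: build a byte[] from values written as 0..255.
    static byte[] fromUnsigned(int... values) {
        byte[] result = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            result[i] = (byte) values[i];
        }
        return result;
    }

    // Decode exactly one primitive UTF8String occupying the whole array.
    static String decode(byte[] der) {
        if (der.length < 2 || (der[0] & 0xFF) != 0x0C) {
            throw new IllegalArgumentException("not a primitive UTF8String");
        }
        int pos = 1;
        int first = der[pos++] & 0xFF;
        int length;
        if (first < 0x80) {
            length = first;                      // short form
        } else {
            int count = first & 0x7F;            // long form: number of length bytes
            if (count == 0 || count > 4 || pos + count > der.length) {
                throw new IllegalArgumentException("unsupported length encoding");
            }
            length = 0;
            for (int i = 0; i < count; i++) {
                length = (length << 8) | (der[pos++] & 0xFF);
            }
            if (length < 0) {
                throw new IllegalArgumentException("length too large");
            }
        }
        if (der.length - pos != length) {
            throw new IllegalArgumentException("length does not match content");
        }
        return new String(der, pos, length, StandardCharsets.UTF_8);
    }
}
```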


On 09/20/2017 04:33 PM, Jakob Reschke wrote:
> 'Björn-Thaddäus Düngerstraß' squeakToUtf8 asByteArray
> => #[66 106 195 182 114 110 45 84 104 97 100 100 195 164 117 115 32 68
> 195 188 110 103 101 114 115 116 114 97 195 159].

--
Thank you for your consideration, Alan


Re: ASN1 encoding of UTF8

Alan Pinch
In reply to this post by Jakob Reschke-2
Here is the updated ParrotTalk spec with UTF8 string types.



--
Thank you for your consideration,
Alan




Attachment: ParrotTalkFrameDesign-3.4.pdf (162K)

Re: ASN1 encoding of UTF8

Alan Pinch
In reply to this post by timrowledge
I got BigIntegers working in the Java ASN1 code:

https://github.com/ZiroZimbarra/callistohouse
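
For readers wondering why BigInteger is a natural fit here: `BigInteger.toByteArray()` returns the minimal big-endian two's-complement form, which is the content layout DER uses for INTEGER (tag 2). The sketch below is not taken from the linked repository; the class and method names are made up.

```java
import java.io.ByteArrayOutputStream;
import java.math.BigInteger;

// Hypothetical helper for illustration only.
final class DerIntegerSketch {
    // Encode as DER INTEGER: tag 0x02, length, two's-complement content.
    static byte[] encode(BigInteger value) {
        byte[] content = value.toByteArray(); // minimal two's complement, big-endian
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0x02);
        if (content.length < 0x80) {
            out.write(content.length);        // short-form length
        } else {
            int byteCount = (32 - Integer.numberOfLeadingZeros(content.length) + 7) / 8;
            out.write(0x80 | byteCount);      // long-form length
            for (int shift = (byteCount - 1) * 8; shift >= 0; shift -= 8) {
                out.write((content.length >>> shift) & 0xFF);
            }
        }
        out.write(content, 0, content.length);
        return out.toByteArray();
    }

    // Turn INTEGER content octets (without tag and length) back into a value.
    static BigInteger decodeContent(byte[] content) {
        return new BigInteger(content);
    }
}
```

The leading zero byte in the encoding of 128 (02 02 00 80) is what keeps the value from being read back as -128.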



--
Thank you for your consideration,
Alan



Re: ASN1 encoding of UTF8

Alan Pinch
I asked a question on Stack Overflow regarding ASN.1 UTC time conversions in Java.

https://stackoverflow.com/questions/46419082/java-conversion-from-to-asn1-datetimes
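
Without restating what the linked question covers, one possible approach with `java.time` is sketched below. It assumes the ASN.1 UTCTime type in its `YYMMDDhhmmssZ` form and the 1950 to 2049 pivot for the two-digit year that X.509 profiles use; both are assumptions about the use case, and the class name is made up.

```java
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;

// Hypothetical helper for illustration only.
final class UtcTimeSketch {
    // Two-digit year resolved into 1950..2049, then MMddHHmmss and a literal Z.
    private static final DateTimeFormatter UTC_TIME = new DateTimeFormatterBuilder()
            .appendValueReduced(ChronoField.YEAR, 2, 2, 1950)
            .appendPattern("MMddHHmmss'Z'")
            .toFormatter();

    // Format an instant as UTCTime text, normalizing to UTC first.
    static String format(OffsetDateTime time) {
        LocalDateTime utc = time.withOffsetSameInstant(ZoneOffset.UTC).toLocalDateTime();
        if (utc.getYear() < 1950 || utc.getYear() > 2049) {
            throw new IllegalArgumentException("year not representable with a 1950 pivot");
        }
        return UTC_TIME.format(utc);
    }

    // Parse UTCTime text into a UTC date-time.
    static OffsetDateTime parse(String text) {
        return LocalDateTime.parse(text, UTC_TIME).atOffset(ZoneOffset.UTC);
    }
}
```

The text form is ASCII, so on the wire it is simply these characters as content octets under the UTCTime tag.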

I thought you might like to know.


On 09/25/2017 06:09 PM, Alan Pinch wrote:

> I got BigIntegers working in the Java ASN1 code:
>
> https://github.com/ZiroZimbarra/callistohouse

--
Thank you for your consideration,
Alan



Re: ASN1 encoding of UTF8

Alan Pinch
To share my good news: I just got a port of Cryptography's ASN1 to Java
passing its tests. Next is getting the PhaseHeaders encoding right, to
bring bit-compatible encryption between Squeak and Java online.

The Java port has almost 50% more code than the Squeak original; just
saying, we have a concrete example of the efficacy of Squeak over Java.
They should have left it as the Oak Project and called it a day. Our day
comes.

On 09/26/2017 05:08 PM, Alan Pinch wrote:

> I asked a question on Stack Overflow regarding ASN.1 UTC time
> conversions in Java.
>
> https://stackoverflow.com/questions/46419082/java-conversion-from-to-asn1-datetimes
>
> I thought you might like to know.

--
Thank you for your consideration,
Alan