Great job with IMAP interface

Re: Great job with IMAP interface

Seth Berman
Hi Lou,

To put the editor in UTF-8 mode, you get an instance and apply:
CwScintillaEditor>>setCodePage: 65001
And I didn't convert anything... it just goes into a mode where it interprets the bytes in that byte array correctly.

"How about my idea (maybe crazy idea) to subtract something from each chunk (3 or 4 bytes) to bring it down to a range we can work with?"
That's not going to work.  Except *mostly* for the first 128 chars, UTF-8 encoded bytes are not relatable to code pages.  Again, it's just a value assigned by a group of folks, so you can't really do anything mathematical to it to accomplish much with regard to conversion.
The first 4 bytes in your example are the UTF-8 encoded value for Mathematical Bold Capital M (https://www.compart.com/en/unicode/U+1D40C).
Its UTF-8 value is different from its UTF-16 value, which is different from its Unicode scalar value.  And they don't relate... it's just a value.
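
To make the three values concrete, here is a small sketch in Python. Python is used only because it ships with Unicode support; this is not VAST code, and the variable names are made up for the illustration. It prints the scalar value, the UTF-8 bytes and the UTF-16 code units for U+1D40C.

```python
# Illustration only: Python stands in for a Unicode-aware environment.
ch = "\U0001D40C"                  # MATHEMATICAL BOLD CAPITAL M

scalar = ord(ch)                   # the Unicode scalar value (code point)
utf8 = ch.encode("utf-8")          # four bytes
utf16 = ch.encode("utf-16-be")     # a surrogate pair: two 16-bit units, big-endian

print(f"scalar : U+{scalar:04X}")
print(f"UTF-8  : {utf8.hex(' ').upper()}")
print(f"UTF-16 : {utf16.hex(' ').upper()}")
```

The same character shows up as one number, four bytes, or two 16-bit units depending on the representation, and none of those is a byte in a code page.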

"I am just trying to hack my way around it for my limited case until you guys do the real conversion."
- We won't be doing any conversion.  As a first step, we are offering a UnicodeString container that can ingest those UTF-8 encoded bytes, do interesting things with them, and show them in our editor, which we will switch to UTF-8 mode.  This means a UnicodeString will need to convert itself to UTF-8 when it hands Scintilla bytes to display.
Next steps would be upgrading our CFS APIs to use them, followed by switching our APIs on Windows from narrow to wide APIs.  At that point, UTF-8 would still require conversion to UTF-16 if you are going to show the text in table cells that are Windows widgets.  But the new Unicode support library gives you a first-class container and plenty of easy APIs to make that feasible.
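
As a rough picture of the UTF-8 to UTF-16 step mentioned above, the sketch below transcodes by going through code points. It is written in Python purely for illustration; the function name `utf8_to_utf16le` is hypothetical and says nothing about how the VAST Unicode library will expose this.

```python
def utf8_to_utf16le(raw: bytes) -> bytes:
    """Transcode UTF-8 bytes to UTF-16LE by way of code points (illustrative helper)."""
    text = raw.decode("utf-8")         # bytes -> code points; raises on invalid UTF-8
    return text.encode("utf-16-le")    # code points -> 16-bit units, little-endian

sample = b"M " + bytes.fromhex("F09D908C")   # ASCII 'M', space, then bold M (U+1D40C)
wide = utf8_to_utf16le(sample)
print(len(sample), "UTF-8 bytes ->", len(wide) // 2, "UTF-16 code units")
```

Each ASCII byte widens to one 16-bit unit, while the four-byte sequence becomes a surrogate pair, so neither byte counts nor unit counts equal the number of characters.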

- Seth

On Monday, April 5, 2021 at 2:26:31 PM UTC-4 [hidden email] wrote:
Hi Seth,

If you dump those bytes to a file and open them in something like Notepad++ in UTF-8 mode...then it will be readable.

The program I need this for displays the subject as a column in a container in a window.  Joachim might be okay with sending the data to a file.
 
If you set the code page of the Smalltalk Scintilla editor to 65001, then it will be readable (see below)

How did you do that?  Can I call Scintilla to give me a converted string?

How about my idea (maybe crazy idea) to subtract something from each chunk (3 or 4 bytes) to bring it down to a range we can work with?
 
But those UTF-8 encoded code points are just numbers assigned by a group of people.  It doesn't know that it's technically a decorated ASCII M.  So you would need to have some program that knew how to perform that conversion to some code page, which is certainly not something that VAST does.

This was my guess, I am just trying to hack my way around it for my limited case until you guys do the real conversion.

Lou 

--
You received this message because you are subscribed to the Google Groups "VAST Community Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/va-smalltalk/c520d6f1-2027-4ce0-a9c6-dc5c7127074en%40googlegroups.com.

Re: Great job with IMAP interface

jtuchel
Hi Lou, Seth,


I've been offline for a bit over Easter, so I'm late to the discussion.
As you know, I also implemented a rather sketchy RFC2047 converter and have been using it for a few months already. It has some limitations and of course has nothing to offer for emojis, bold characters and whatnot. I freely admit I never cared about these edge cases, because my main purpose of using IMAP is to receive attachments and put them in a batch for processing. Most of the encoded headers are irrelevant for my use case.

I was sure my converter is not even close to being half-baked, so I kept it to myself for the time being. I will revisit it once I encounter a use case that requires more. It is a lot of work to "just" write some tool that implements the plethora of RFCs that you discover once you lift a corner of one of them....

I am glad you shared your experiences here and that Seth chimed in with his knowledge on the topic. UTF-x and Unicode are cans of worms, white areas on the map of a mortal application developer, the valley into which many young and brave knights rode but never came back... So there may be lots of beautiful princesses there or just an army of dragons, nobody knows ;-)

Why am I glad? Because I know I am not alone and - even better - that Instantiations is going to send their princes into UTF-x valley for us ;-)


Joachim


As a side note: my clumsy decoder returns this: '???????????? _???????????? _???????? _????. _???????????? _????????????' ;-)



Re: Great job with IMAP interface

Louis LaBrunda
Hi Everyone,

I have been able to squeeze the bold... section of UTF-8 (those four-byte sections that start with F0) down to displayable characters.  I'm still working on the three- and two-byte sections, those that start with E0 and C0.  This code is a hack, but I think it does what I need in my program.  I'm not sure if it will help Joachim, but he and anyone who wants should feel free to use it or any part of it.
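
For reference, the lead byte of a UTF-8 sequence determines its length: C2 to DF starts a two-byte sequence, E0 to EF a three-byte one, and F0 to F4 a four-byte one. The sketch below (Python, for illustration only; `decode_utf8_sequence` is a made-up name and has nothing to do with KscRFC2047Decoder.st) does the bit arithmetic that turns one sequence into its Unicode code point. Note that this arithmetic yields a code point, not a code-page character; mapping the code point to something displayable in a code page is a separate and lossy step.

```python
def decode_utf8_sequence(data: bytes, i: int = 0):
    """Return (code_point, length) for the UTF-8 sequence starting at data[i].

    Minimal sketch: it does not reject overlong forms or surrogate code points.
    """
    lead = data[i]
    if lead < 0x80:                      # 0xxxxxxx: ASCII, one byte
        return lead, 1
    if 0xC2 <= lead <= 0xDF:             # 110xxxxx: two bytes
        length, cp = 2, lead & 0x1F
    elif 0xE0 <= lead <= 0xEF:           # 1110xxxx: three bytes
        length, cp = 3, lead & 0x0F
    elif 0xF0 <= lead <= 0xF4:           # 11110xxx: four bytes
        length, cp = 4, lead & 0x07
    else:
        raise ValueError(f"invalid UTF-8 lead byte 0x{lead:02X}")
    if i + length > len(data):
        raise ValueError("truncated sequence")
    for b in data[i + 1 : i + length]:
        if b & 0xC0 != 0x80:             # continuation bytes look like 10xxxxxx
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)      # append six payload bits
    return cp, length

cp, n = decode_utf8_sequence(bytes.fromhex("F09D908C"))
print(f"U+{cp:04X} from {n} bytes")
```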

You can download my latest KscRFC2047Decoder.st.

Lou


Re: Great job with IMAP interface

Louis LaBrunda
Hi Everyone,

I spoke a little too soon when I said I had been able to squeeze the bold... section of UTF-8 (those four-byte sections that start with F0) down to displayable characters.  I thought I had been able to do some math on the value and map it to some ASCII character that would look close.  It turns out that works a little, but not as consistently as I hoped.  I had to resort to a table, and that seems to cover almost everything I'm interested in.  I was able to work out the math for the three- and two-byte sections, those that start with E0 and C0.  I consider all of the code to be a hack, but I think it does what I need in my program.  I'm not sure if it will help Joachim, but he and anyone who wants should feel free to use it or any part of it.
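
A table is in fact how Unicode itself handles this: the Unicode Character Database records a compatibility mapping from each "mathematical bold" letter to its plain letter, and compatibility normalization (NFKC) applies it. The sketch below shows the idea using Python's standard `unicodedata` module. It is only an illustration of the technique outside VAST; `to_plain_ascii` and the `?` placeholder are assumptions made up for the example.

```python
import unicodedata

def to_plain_ascii(text: str, placeholder: str = "?") -> str:
    """Fold compatibility characters to their plain forms, then force ASCII.

    Anything still outside ASCII after folding (curly quotes, emoji, ...)
    is replaced by the placeholder, so the result is lossy by design.
    """
    folded = unicodedata.normalize("NFKC", text)
    return "".join(ch if ord(ch) < 128 else placeholder for ch in folded)

bold = "\U0001D40C\U0001D428"        # bold capital M, bold small o
print(to_plain_ascii(bold))
```

Characters without a compatibility mapping, such as emoji, simply fall through to the placeholder.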

You can download my latest KscRFC2047Decoder.st.

Lou


Re: Great job with IMAP interface

Louis LaBrunda
Hi All,

Today I got an email with this subject:

'=?UTF-8?B?SGVhcnRmZWx0IGdpZnRzIGZvciBNb3RoZXLigJlzIERheSDwn5Kd?='

my conversion yields:

'Heartfelt gifts for Mother’s Day 💠'

an online converter yields:

Heartfelt gifts for Mother’s Day 💝

Clearly I'm misunderstanding something.  I don't expect to get the same result, but I thought the "B" encoding could be fed into "Base64Coder current decode:" and that would be enough.  How is one to know the funny quote and heart are present?  It seems 16rE28099 and 16rF09F929D just show up.  How does one know they aren't high-end ASCII, or would something that isn't in VA Smalltalk yet deal with them?
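
The answer to "how is one to know" is in the encoded word itself: an RFC 2047 encoded word has the shape `=?charset?encoding?text?=`, so the `UTF-8` token declares how the Base64-decoded bytes must be interpreted. Base64 decoding is only the first of two steps. The sketch below shows both steps in Python for illustration; `decode_b_word` is a made-up helper that handles a single B-encoded word and ignores Q encoding, language suffixes and multiple adjacent words.

```python
import base64
import re

# =?charset?encoding?encoded-text?=   (shape of an RFC 2047 encoded word)
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([BbQq])\?([^?]*)\?=")

def decode_b_word(word: str) -> str:
    """Decode a single B-encoded RFC 2047 word into text (illustrative helper)."""
    match = ENCODED_WORD.fullmatch(word)
    if match is None or match.group(2).upper() != "B":
        raise ValueError("not a B-encoded word")
    charset, _, payload = match.groups()
    raw = base64.b64decode(payload)   # step 1: Base64 text -> raw bytes
    return raw.decode(charset)        # step 2: raw bytes -> characters, per declared charset

subject = "=?UTF-8?B?SGVhcnRmZWx0IGdpZnRzIGZvciBNb3RoZXLigJlzIERheSDwn5Kd?="
decoded = decode_b_word(subject)
print(decoded)
```

Stopping after step 1 leaves the bytes E2 80 99 and F0 9F 92 9D in the result; treating them as single-byte code-page characters is what produces the odd output.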

I don't really need this for my program, but I am curious.

Lou


Re: Great job with IMAP interface

Seth Berman
Hi Lou,

This is kinda what I've been saying.  These are UTF-8 encoded bytes. You need a UTF-8 decoder to properly decode those bytes.

UTF-8 is an encoding scheme for Unicode code points.  The code page you are using (or any code page) can only represent a very small portion of what the Unicode code space captures.  And certainly, emoji are not included in that unless someone in recent times has been creating strange code pages I'm not familiar with.

Using the "Code Page Converter", you can properly convert most of that... but it's not going to know how to convert a 💝 to a code page byte, because there likely is no such mapping in any useful code page.

Conversion from Unicode to code page is a lossy conversion.  One can't possibly represent all forms of a 'character' from the valid set of Unicode code points in just a byte.
This is why I was asking earlier about what you plan to do with these incoming bytes.
If you need to just pass them on to someone else...then keep them in UTF-8 in a ByteArray (no conversion), and then just pass them on.
You could technically keep them in a String class, since it basically is a ByteArray...but I would advise against trying to assume (String at: <index>) is always going to return something useful.
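
Both points, the lossy conversion and the unreliable indexing, can be seen in a few lines. The sketch below is Python used only for illustration, and Windows-1252 (`cp1252`) is an assumed code page picked for the example; the thread does not say which code page is in use.

```python
raw = "Mother\u2019s Day \U0001F49D".encode("utf-8")   # the bytes as they arrive

# Indexing the raw bytes yields one byte of a multi-byte sequence, not a character.
sixth = raw[6]                      # 0xE2, the lead byte of the curly apostrophe

text = raw.decode("utf-8")          # 14 characters from 19 bytes

# Assumed code page: Windows-1252. It happens to contain U+2019 (as byte 0x92)
# but has no emoji, so the heart is replaced and cannot be recovered afterwards.
narrow = text.encode("cp1252", errors="replace")
round_trip = narrow.decode("cp1252")

print(len(raw), len(text), hex(sixth))
print(narrow)
print(round_trip)
```

Passing `raw` along untouched keeps everything intact; only the conversion to a code page throws information away.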

If you need to work with them in Smalltalk as a set of Characters...then we arrive at the inevitable conclusion that VAST is not yet Unicode-aware.
This whole discussion was the basis of my recent post showing what we are doing for VAST 2022
https://groups.google.com/g/va-smalltalk/c/-w7FVdc1gFM/m/gvngUys3CAAJ

- Seth

On Monday, April 12, 2021 at 10:32:22 AM UTC-4 [hidden email] wrote:
Hi All,

Today I got an email with this subject:

'=?UTF-8?B?SGVhcnRmZWx0IGdpZnRzIGZvciBNb3RoZXLigJlzIERheSDwn5Kd?='

my conversion yields:

'Heartfelt gifts for Mother’s Day 💠'

an online converter yealds:

Heartfelt gifts for Mother’s Day 💝

Clearly I'm misunderstanding something.  I don't expect to get the same result but I thought the "B" encoding could be fed into "Base64Coder current decode:" and that is enough.  How is one to know the funny quote and heart are present?  It seems 16rE28099 and 16rF09F929D just show up.  How does one know they aren't high end ASCII or would something that isn't in VA Smalltalk yet deal with them?

I don't really need this for my program but I curious. 

Lou

On Sunday, April 11, 2021 at 7:19:44 PM UTC-4 Louis LaBrunda wrote:
Hi Everyone,

I spoke a little too soon when I said I had been able to squeeze the bold... section of UTF-8 (those four byte sections that start with F0) down to displayable characters.  I thought I had been able to do some math on the value and map it to some ascii character that would look close.  It turns out that works a little but not as consistently as I hoped.  I had to go to the use of a table and that seems to cover almost everything I'm interested in.  I'm was able to work out the math for the three and two byte sections, those that start with E0 and C0.  I consider all of the code to be a hack but I think it does what I need in my program.  I'm not sure if it will help Joachim but he and anyone who wants should feel free to us it or any part of it.

You can download my latest KscRFC2047Decoder.st.

Lou

On Wednesday, April 7, 2021 at 6:19:34 PM UTC-4 Louis LaBrunda wrote:
Hi Everyone,

I have been able to squeeze the bold... section of UTF-8 (those four byte sections that start with F0) down to displayable characters.  I'm still working on the three and two byte sections, those that start with E0 and C0.  This code is a hack but I think it does what I need in my program.  I'm not sure if it will help Joachim but he and anyone who wants should feel free to us it or any part of it.

You can download my latest KscRFC2047Decoder.st.

Lou

On Tuesday, April 6, 2021 at 6:02:31 AM UTC-4 [hidden email] wrote:
Hi Lou, Seth,


I've been offline for a bit over easter. So I'm late to the discussion....
As you know, I also implemented a rather sketchy RFC2047 converter and had used it for a few months already. It has some limitiations and of course has nothing to offer for emojis, bold characters and whatnot. I freely admit I never cared about these edge cases, because my main purpose of using IMAP is to receive attachments and put them on some batch for processing. Most of the encoded headers are irrelevant for my use case.

I was sure my encoder is not even close to being half-baked, so I kept it for myself for the time being. I would revisit it once I encounter a use case that requires more. It is a lot of work to "just" write some tool that implements the plethora of RFCs that you discover once you lift a corner of one of them....

I am glad you shared your experiences here and that Seth chimed in with his knowledge on the topic. UTF-x and Unicode are cans of worms, white areas on the map of a mortal application developer, the valley where many young and brave knights rode into but never came back... So there may be lots of beautiful princesses there or just an army of Dragons, nobody knows ;-)

Why am I glad? Because I know I am not alone and - even better - that Instantiations is going to send their princes into UTF-x valley for us ;-)


Joachim


As a side note: my clumsy decoder returns this: '???????????? _???????????? _???????? _????. _???????????? _????????????' ;-)


On Monday, April 5, 2021 at 2:26:31 PM UTC-4 [hidden email] wrote:
Hi Seth,

"If you dump those bytes to a file and open them in something like Notepad++ in UTF-8 mode...then it will be readable."

The program I need this for displays the subject as a column in a container in a window.  Joachim might be okay with sending the data to a file.

"If you set the code page of the smalltalk scintilla editor to 65001, then it will be readable (see below)"

How did you do that?  Can I call scintilla to give me a converted string?

How about my idea (maybe crazy idea) to subtract something from each chunk (3 or 4 bytes) to bring it down to a range we can work with?

"But those UTF-8 encoded code points are just numbers assigned by a group of people.  It doesn't know that it's technically a decorated ASCII M.  So you would need to have some program that knew how to perform that conversion to some code page, which is certainly not something that VAST does."

This was my guess.  I am just trying to hack my way around it for my limited case until you guys do the real conversion.

Lou 


Re: Great job with IMAP interface

Louis LaBrunda
Hi Seth,

Thanks for the reply.  I understand everything you said.  What I'm curious about is how a code page or anything else knows that an F0 (which is a valid ASCII upper end character) is the start of a UTF-8 or Unicode four-byte thingy.

Lou

On Monday, April 12, 2021 at 11:04:49 AM UTC-4 Seth Berman wrote:
Hi Lou,

This is kinda what I've been saying.  These are UTF-8 encoded bytes. You need a UTF-8 decoder to properly decode those bytes.

UTF-8 is an encoding scheme for Unicode code points.  The code page you are using (or any code page) can only represent a very small portion of what the Unicode code space captures.  And certainly, emoji are not included in that unless someone in recent times has been creating strange code pages I'm not familiar with.

Using the "Code Page Converter", you can properly convert most of that...but it's not going to know how to convert a 💝 to a codepage byte...because there likely is no such mapping in any useful codepage.

Conversion from Unicode to code page is a lossy conversion.  One can't possibly represent all forms of a 'character' from the valid set of Unicode code points in just a byte.
This is why I was asking earlier about what you plan to do with these incoming bytes.
If you need to just pass them on to someone else...then keep them in UTF-8 in a ByteArray (no conversion), and then just pass them on.
You could technically keep them in a String class, since it basically is a ByteArray...but I would advise against trying to assume (String at: <index>) is always going to return something useful.

If you need to work with them in Smalltalk as a set of Characters...then we arrive at the inevitable conclusion that VAST is not yet Unicode-aware.
This whole discussion was the basis of my recent post showing what we are doing for VAST 2022:
https://groups.google.com/g/va-smalltalk/c/-w7FVdc1gFM/m/gvngUys3CAAJ

- Seth
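
To illustrate the two options described above (pass the UTF-8 bytes along untouched, or attempt a lossy conversion to a code page), here is a minimal sketch. It is written in Python rather than Smalltalk, it is not VAST's Code Page Converter, and the choice of Windows-1252 as the target code page is only an assumption for the example.

```python
# The decoded subject from this thread, written with escapes so the
# source stays plain ASCII.
subject = "Heartfelt gifts for Mother\u2019s Day \U0001F49D"

# Option 1: keep the data as UTF-8 bytes and pass it along. Nothing is lost.
utf8_bytes = subject.encode("utf-8")

# Indexing those bytes does not yield characters: position 26 is only the
# first of the three bytes that make up the curly apostrophe.
lead_of_apostrophe = utf8_bytes[26]

# Option 2: lossy conversion to a single-byte code page (Windows-1252 is
# an assumed target). The apostrophe has a slot in that code page; the
# emoji does not, so errors="replace" substitutes "?" for it.
cp1252_bytes = subject.encode("cp1252", errors="replace")

print(hex(lead_of_apostrophe))
print(cp1252_bytes)
```

The round trip back from `cp1252_bytes` cannot recover the emoji, which is the sense in which the conversion is lossy.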

On Monday, April 12, 2021 at 10:32:22 AM UTC-4 [hidden email] wrote:
Hi All,

Today I got an email with this subject:

'=?UTF-8?B?SGVhcnRmZWx0IGdpZnRzIGZvciBNb3RoZXLigJlzIERheSDwn5Kd?='

my conversion yields:

'Heartfelt gifts for Mother’s Day 💠'

an online converter yields:

Heartfelt gifts for Mother’s Day 💝

Clearly I'm misunderstanding something.  I don't expect to get the same result, but I thought the "B" encoding could be fed into "Base64Coder current decode:" and that would be enough.  How is one to know the funny quote and heart are present?  It seems 16rE28099 and 16rF09F929D just show up.  How does one know they aren't high-end ASCII, or would something that isn't in VA Smalltalk yet deal with them?

I don't really need this for my program, but I'm curious.

Lou
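
As a rough sketch of the two steps involved in the subject above: Base64 decoding only recovers raw bytes, and the charset named in the encoded word (UTF-8 here) says how to turn those bytes into characters. The example is in Python, not Smalltalk, and the simple split on "?" is an assumption that holds for a single well-formed encoded word like this one; it is not a full RFC 2047 parser.

```python
import base64

encoded_word = "=?UTF-8?B?SGVhcnRmZWx0IGdpZnRzIGZvciBNb3RoZXLigJlzIERheSDwn5Kd?="

# An encoded word has the shape =?charset?encoding?payload?=
_, charset, encoding, payload, _ = encoded_word.split("?")

# Step 1: the "B" encoding is Base64. The result is bytes, not text.
raw = base64.b64decode(payload)

# Step 2: interpret the bytes using the declared charset. Skipping this
# step and treating each byte as one character produces the garbled
# apostrophe and heart seen above.
text = raw.decode(charset)

print(raw[-4:].hex())  # the four bytes of the heart emoji
print(text)
```

The bytes E2 80 99 and F0 9F 92 9D are the UTF-8 sequences for the curly apostrophe and the heart; they are only recognizable as such because the header announces UTF-8.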


Re: Great job with IMAP interface

Seth Berman
Hi Lou,

Codepages don't know anything about UTF-8.  UTF-8 was designed so that the first 128 'characters' of UTF-8 match ASCII, so it may seem as if they are related.
The subject lines you are providing specify 'UTF-8'...and therefore you need to treat them as such.
If all chars in the subject are between 0-127, then you can just treat it as ASCII.  Otherwise, you will need to attempt lossy code page conversion, which may (or may not) be able to map the UTF-8 encoded codepoints to a byte in your current code page.

If you are curious about the UTF-8 encoding, you can just check out Wikipedia or something that will describe how the upper bits of the first byte determine how many UTF-8 bytes must be read to decode one Unicode code point.

- Seth
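
To make the point about the upper bits concrete, here is a small Python sketch (not VAST code; both function names are made up for this example). It reads the sequence length from the lead byte and then assembles the code point from the payload bits. It is deliberately minimal and does not reject overlong encodings or surrogates the way a full decoder would.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes the sequence starting with lead_byte occupies."""
    if (lead_byte & 0x80) == 0x00:   # 0xxxxxxx: plain ASCII
        return 1
    if (lead_byte & 0xE0) == 0xC0:   # 110xxxxx: two-byte sequence
        return 2
    if (lead_byte & 0xF0) == 0xE0:   # 1110xxxx: three-byte sequence
        return 3
    if (lead_byte & 0xF8) == 0xF0:   # 11110xxx: four-byte sequence
        return 4
    # 10xxxxxx is a continuation byte; anything else is not valid here.
    raise ValueError(f"0x{lead_byte:02X} cannot start a UTF-8 sequence")


def decode_first_code_point(data: bytes) -> int:
    """Decode the first UTF-8 sequence in data to a Unicode code point."""
    length = utf8_sequence_length(data[0])
    if len(data) < length:
        raise ValueError("truncated UTF-8 sequence")
    if length == 1:
        return data[0]
    # Keep only the payload bits of the lead byte.
    code_point = data[0] & (0xFF >> (length + 1))
    for byte in data[1:length]:
        if (byte & 0xC0) != 0x80:
            raise ValueError("expected a continuation byte")
        # Each continuation byte contributes its low six bits.
        code_point = (code_point << 6) | (byte & 0x3F)
    return code_point


print(hex(decode_first_code_point(bytes([0xE2, 0x80, 0x99]))))
print(hex(decode_first_code_point(bytes([0xF0, 0x9F, 0x92, 0x9D]))))
```

So an F0 byte is only "the start of a four-byte sequence" when the data is known to be UTF-8; in a single-byte code page the same value is just one character.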



Re: Great job with IMAP interface

Louis LaBrunda
Hi Seth,

I see my misunderstanding.  It is 0-127 of ASCII and UTF-8 that are the same.  Anything above 127 requires the 2-, 3- or 4-byte encodings.  So, if a character is between 0 and 127, ASCII is good.  Above 127 one needs to map to ASCII if one wants something that looks close (which is what I have been doing for the "Q" encoded stuff).  I didn't realize the "B" encoded stuff needed the extra step.  Thanks.

Lou
On Monday, April 12, 2021 at 1:40:32 PM UTC-4 Seth Berman wrote:
Hi Lou,

Codepages don't know anything about UTF-8.  UTF-8 was designed so that the first 128 'characters' of UTF-8 match ASCII, so it may seem as if they are related.
The subject lines you are providing specify 'UTF-8'...and therefore you need to treat it as such.
If all chars in the subject are between 0-127, then you can just treat it as ascii.  Otherwise, you will need to attempt lossy code page conversion which may (or may not) be able to map the UTF-8 encoded codepoints to a byte in your current code page.

If you are curious about the UTF-8 encoding, you can just checkout Wikipedia or something that will describe how the upper bits help determine how many UTF-8 bytes are required to read a decoded unicode code point.

- Seth


On Monday, April 12, 2021 at 1:24:39 PM UTC-4 [hidden email] wrote:
Hi Seth,

Thanks for the reply.  I understand everything you said.  What I'm curious about is how does a code page or anything know that an F0 (which is a valid ASCII upper end character) is the start of a UTF-8 or Unicode four byte thingy?

Lou

On Monday, April 12, 2021 at 11:04:49 AM UTC-4 Seth Berman wrote:
Hi Lou,

This is kinda what I've been saying.  This is UTF-8 encoded bytes. You need a UTF-8 decoder to properly decode those bytes.

UTF-8 is an encoding scheme for Unicode code points.  The code page you are using (or any code page) can only represent a very small portion of what the Unicode code space captures.  And certainly, emoji are not included in that unless someone in recent times has been creating strange code pages I'm not familiar with.

Using the "Code Page Converter", you can properly convert most of that...but its not going to know how to convert a 💝 to a codepage byte...because there likely is no such mapping in any useful codepage.

Conversion from Unicode to code page is a lossy conversion.  One can't possibly represent all forms of a 'character' from the valid set of Unicode code points in just a byte.
This is why I was asking earlier about what you plan to do with these incoming bytes.
If you need to just pass them on to someone else...then keep them in UTF-8 in a ByteArray (no conversion), and then just pass them on.
You could technically keep them in a String class, since it basically is a ByteArray...but I would advise against trying to assume (String at: <index>) is always going to return something useful.

If you need to work with them in Smalltalk as a set of Characters...then we arrive at the inevitable conclusion that VAST is not yet Unicode-aware.
This whole discussion was the basis of my recent post showing what we are doing for VAST 2022
https://groups.google.com/g/va-smalltalk/c/-w7FVdc1gFM/m/gvngUys3CAAJ

- Seth

On Monday, April 12, 2021 at 10:32:22 AM UTC-4 [hidden email] wrote:
Hi All,

Today I got an email with this subject:

'=?UTF-8?B?SGVhcnRmZWx0IGdpZnRzIGZvciBNb3RoZXLigJlzIERheSDwn5Kd?='

my conversion yields:

'Heartfelt gifts for Mother’s Day 💠'

an online converter yealds:

Heartfelt gifts for Mother’s Day 💝

Clearly I'm misunderstanding something.  I don't expect to get the same result but I thought the "B" encoding could be fed into "Base64Coder current decode:" and that is enough.  How is one to know the funny quote and heart are present?  It seems 16rE28099 and 16rF09F929D just show up.  How does one know they aren't high end ASCII or would something that isn't in VA Smalltalk yet deal with them?

I don't really need this for my program but I curious. 

Lou

On Sunday, April 11, 2021 at 7:19:44 PM UTC-4 Louis LaBrunda wrote:
Hi Everyone,

I spoke a little too soon when I said I had been able to squeeze the bold... section of UTF-8 (those four byte sections that start with F0) down to displayable characters.  I thought I had been able to do some math on the value and map it to some ascii character that would look close.  It turns out that works a little but not as consistently as I hoped.  I had to go to the use of a table and that seems to cover almost everything I'm interested in.  I'm was able to work out the math for the three and two byte sections, those that start with E0 and C0.  I consider all of the code to be a hack but I think it does what I need in my program.  I'm not sure if it will help Joachim but he and anyone who wants should feel free to us it or any part of it.

You can download my latest KscRFC2047Decoder.st.

Lou

On Wednesday, April 7, 2021 at 6:19:34 PM UTC-4 Louis LaBrunda wrote:
Hi Everyone,

I have been able to squeeze the bold... section of UTF-8 (those four byte sections that start with F0) down to displayable characters.  I'm still working on the three and two byte sections, those that start with E0 and C0.  This code is a hack but I think it does what I need in my program.  I'm not sure if it will help Joachim but he and anyone who wants should feel free to us it or any part of it.

You can download my latest KscRFC2047Decoder.st.

Lou

On Tuesday, April 6, 2021 at 6:02:31 AM UTC-4 [hidden email] wrote:
Hi Lou, Seth,


I've been offline for a bit over easter. So I'm late to the discussion....
As you know, I also implemented a rather sketchy RFC2047 converter and had used it for a few months already. It has some limitiations and of course has nothing to offer for emojis, bold characters and whatnot. I freely admit I never cared about these edge cases, because my main purpose of using IMAP is to receive attachments and put them on some batch for processing. Most of the encoded headers are irrelevant for my use case.

I was sure my encoder is not even close to being half-baked, so I kept it for myself for the time being. I would revisit it once I encounter a use case that requires more. It is a lot of work to "just" write some tool that implements the plethora of RFCs that you discover once you lift a corner of one of them....

I am glad you shared your experiences here and that Seth chimed in with his knowledge on the topic. UTF-x and Unicode are cans of worms, white areas on the map of a mortal application developer, the valley where many young and brave knights rode into but never came back... So there may be lots of beautiful princesses there or just an army of Dragons, nobody knows ;-)

Why am I glad? Because I know I am not alone and - even better - that Instantiations is going to send their princes into UTF-x valley for us ;-)


Joachim


As a side note: my clumsy decoder returns this: '???????????? _???????????? _???????? _????. _???????????? _????????????' ;-)


Seth Berman schrieb am Montag, 5. April 2021 um 20:43:44 UTC+2:
Hi Lou,

To put the editor in UTF-8 mode, you get an instance and apply:
And I didn't convert anything...it just goes into a mode where it interprets the bytes in that byte array correctly.
CwScintillaEditor>>setCodePage: 65001

"How about my idea (maybe crazy idea) to subtract something from each chunk (3 or 4 bytes) to bring it down to a range we can work with?"
That's not going to work.  Except *mostly* for the first 128 chars, UTF-8 encoded bytes are not relatable to code pages.  Again, its just a value
assigned by a group of folks, so you can't really do anything mathematical to it to accomplish much with regards to conversion.
The first 4 bytes in your example is the UTF-8 encoded value for Mathmatical Bold Capital M (https://www.compart.com/en/unicode/U+1D40C)
It's UTF-8 value is different than its UTF-16 value which is different from its Unicode Scalar value.  And they don't relate...its just a value.

"I am just trying to hack my way around it for my limited case until you guys do the real conversion."
- We won't be doing any conversion.  As a first step, what we are doing is offering a UnicodeString container that could ingest those UTF-8 encoded bytes and do interesting things with them and showing them in our editor which we will switch to UTF-8 mode.  This means a UnicodeString will need to convert itself to UTF-8 when it hands Scintilla bytes to display.
Next steps would be upgrading our CFS APIs to use them, followed by switching our APIS on Windows from narrow to wide APIs.  At that point, UTF-8 would still require conversion to UTF-16 if you are going to be showing them in table cells that are Windows widgets.  But the new Unicode support library gives you a first class container and plenty of easy APIs to make that feasible.

- Seth

On Monday, April 5, 2021 at 2:26:31 PM UTC-4 [hidden email] wrote:
Hi Seth,

If you dump those bytes to a file and open them in something like Notepad++ in UTF-8 mode...then it will be readable.

The program I need this for displays the subject as a column in a container is a window.  Joachim might be okay with sending the data to a file.
 
If you set the code page of the smalltalk scintilla editor to 65001, then it will be readable (see below)

How did you do that?  Can I call scintilla to give me a converted string?

How about my idea (maybe crazy idea) to subtract something from each chunk (3 or 4 bytes) to bring it down to a range we can work with?
 
But those UTF-8 encoded code points are just numbers assigned by a group of people.  It doesn't know that it's technically a decorated ASCII M.  So you would need to have some program that knew how to perform that conversion to some code page, which is certainly not something that VAST does.

This was my guess, I am just trying to hack my way around it for my limited case until you guys do the real conversion.

Lou 

--
You received this message because you are subscribed to the Google Groups "VAST Community Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/va-smalltalk/e7ff6372-f837-4d51-aafd-a8c7284636acn%40googlegroups.com.

Re: Great job with IMAP interface

Louis LaBrunda
Hi Guys,

I now know far more about UTF-8 than I ever wanted to.  I apologize for asking a lot of questions because I really didn't want to read too much of those UTF-8 and RFC2047 docs.

I have split my KscRFC2047Decoder.st in two and now have it and KscUtf8ToAscii.st.  KscUtf8ToAscii is a class that maps over 2000 UTF-8 values to single byte ASCII values that look a lot like the UTF-8 images but without fancy stuff like bold etc.  This allows me to see the intended meaning (of email subjects and such) without the indication that the UTF-8 couldn't map to ASCII, like a box or something.

To my mind, this is what a UTF-8 to ASCII code page converter should do.  I know that is asking a lot but it would be nice.

Lou

On Monday, April 12, 2021 at 3:23:02 PM UTC-4 Louis LaBrunda wrote:
Hi Seth,

I see my misunderstanding.  It is 0-127 of ASCII and UTF-8 that are the same.  Anything above 127 requires the 2, 3 or 4 byte encodings.  So, if a character is between 0 and 127, ASCII is good.  Above 127 one needs to map to ASCII if one wants something that looks close (which is what I have been doing for the "Q" encoded stuff).  I didn't realize the "B" encoded stuff needed the extra step.  Thanks.

Lou
On Monday, April 12, 2021 at 1:40:32 PM UTC-4 Seth Berman wrote:
Hi Lou,

Codepages don't know anything about UTF-8.  UTF-8 was designed so that the first 128 'characters' of UTF-8 match ASCII, so it may seem as if they are related.
The subject lines you are providing specify 'UTF-8'...and therefore you need to treat it as such.
If all chars in the subject are between 0-127, then you can just treat it as ascii.  Otherwise, you will need to attempt lossy code page conversion which may (or may not) be able to map the UTF-8 encoded codepoints to a byte in your current code page.

If you are curious about the UTF-8 encoding, you can just check out Wikipedia or something that will describe how the upper bits of the first byte determine how many UTF-8 bytes must be read to decode one Unicode code point.
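
As a rough sketch of that rule, written in Python rather than VAST Smalltalk, the helpers below read the upper bits of the lead byte to get the sequence length and then assemble the code point from the payload bits. The function names are made up for this illustration, and the sketch leaves out the overlong-form and surrogate checks a real decoder needs.

```python
# Illustration only: hypothetical helper names, not a VAST API.

def utf8_sequence_length(lead: int) -> int:
    """Return how many bytes the sequence starting with `lead` occupies."""
    if lead < 0x80:            # 0xxxxxxx: same as ASCII
        return 1
    if 0xC2 <= lead <= 0xDF:   # 110xxxxx: two-byte sequence
        return 2
    if 0xE0 <= lead <= 0xEF:   # 1110xxxx: three-byte sequence
        return 3
    if 0xF0 <= lead <= 0xF4:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError(f"not a valid UTF-8 lead byte: {lead:#04x}")


def decode_code_point(data: bytes) -> int:
    """Decode the first code point in `data` (no overlong/surrogate checks)."""
    n = utf8_sequence_length(data[0])
    if len(data) < n:
        raise ValueError("truncated sequence")
    if n == 1:
        return data[0]
    # Keep only the payload bits of the lead byte (5, 4 or 3 bits).
    cp = data[0] & (0xFF >> (n + 1))
    for b in data[1:n]:
        if b & 0xC0 != 0x80:   # continuation bytes look like 10xxxxxx
            raise ValueError("bad continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp


print(hex(decode_code_point(bytes.fromhex("e28099"))))    # right single quote
print(hex(decode_code_point(bytes.fromhex("f09f929d"))))  # heart with ribbon
```

In data declared as UTF-8, a byte of F0 is therefore never a character on its own. It only announces that three continuation bytes follow.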

- Seth


On Monday, April 12, 2021 at 1:24:39 PM UTC-4 [hidden email] wrote:
Hi Seth,

Thanks for the reply.  I understand everything you said.  What I'm curious about is how does a code page or anything know that an F0 (which is a valid ASCII upper end character) is the start of a UTF-8 or Unicode four byte thingy?

Lou

On Monday, April 12, 2021 at 11:04:49 AM UTC-4 Seth Berman wrote:
Hi Lou,

This is kinda what I've been saying.  These are UTF-8 encoded bytes. You need a UTF-8 decoder to properly decode those bytes.

UTF-8 is an encoding scheme for Unicode code points.  The code page you are using (or any code page) can only represent a very small portion of what the Unicode code space captures.  And certainly, emoji are not included in that unless someone in recent times has been creating strange code pages I'm not familiar with.

Using the "Code Page Converter", you can properly convert most of that...but it's not going to know how to convert a 💝 to a codepage byte...because there likely is no such mapping in any useful codepage.

Conversion from Unicode to code page is a lossy conversion.  One can't possibly represent all forms of a 'character' from the valid set of Unicode code points in just a byte.
This is why I was asking earlier about what you plan to do with these incoming bytes.
If you need to just pass them on to someone else...then keep them in UTF-8 in a ByteArray (no conversion), and then just pass them on.
You could technically keep them in a String class, since it basically is a ByteArray...but I would advise against trying to assume (String at: <index>) is always going to return something useful.
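
The lossy step can be seen in a few lines of Python (again only an illustration, not the VAST Code Page Converter). Windows-1252 stands in for "your current code page" here, which is an assumption. The curly apostrophe happens to exist in that code page, while the emoji does not and is replaced.

```python
# Illustration only: cp1252 is an assumed target code page.
text = "Heartfelt gifts for Mother\u2019s Day \U0001F49D"

utf8_bytes = text.encode("utf-8")                        # lossless
cp1252_bytes = text.encode("cp1252", errors="replace")   # lossy: unmappable -> "?"
round_trip = cp1252_bytes.decode("cp1252")

print(len(utf8_bytes), "UTF-8 bytes")
print(cp1252_bytes)
print(round_trip)
```

Decoding the UTF-8 bytes gives the original text back. The code page bytes cannot be turned back into the emoji.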

If you need to work with them in Smalltalk as a set of Characters...then we arrive at the inevitable conclusion that VAST is not yet Unicode-aware.
This whole discussion was the basis of my recent post showing what we are doing for VAST 2022
https://groups.google.com/g/va-smalltalk/c/-w7FVdc1gFM/m/gvngUys3CAAJ

- Seth

On Monday, April 12, 2021 at 10:32:22 AM UTC-4 [hidden email] wrote:
Hi All,

Today I got an email with this subject:

'=?UTF-8?B?SGVhcnRmZWx0IGdpZnRzIGZvciBNb3RoZXLigJlzIERheSDwn5Kd?='

my conversion yields:

'Heartfelt gifts for Mother’s Day 💠'

an online converter yields:

Heartfelt gifts for Mother’s Day 💝

Clearly I'm misunderstanding something.  I don't expect to get the same result but I thought the "B" encoding could be fed into "Base64Coder current decode:" and that is enough.  How is one to know the funny quote and heart are present?  It seems 16rE28099 and 16rF09F929D just show up.  How does one know they aren't high end ASCII or would something that isn't in VA Smalltalk yet deal with them?
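
For comparison, here is a minimal Python sketch of the two steps a "B" encoded word needs: Base64 decoding yields bytes, and those bytes must then be decoded using the charset named in the encoded word. It handles only a single, well-formed encoded word with "B" encoding. The function name is hypothetical, and this is not how VAST's Base64Coder works internally.

```python
# Illustration only: handles one well-formed "B" encoded word.
import base64


def decode_b_word(word: str) -> str:
    # An encoded word has the shape =?charset?encoding?payload?=
    _, charset, encoding, payload, _ = word.split("?")
    if encoding.upper() != "B":
        raise ValueError("only the B (Base64) encoding is handled here")
    raw = base64.b64decode(payload)   # step 1: Base64 -> raw bytes
    return raw.decode(charset)        # step 2: bytes -> text via the charset


subject = "=?UTF-8?B?SGVhcnRmZWx0IGdpZnRzIGZvciBNb3RoZXLigJlzIERheSDwn5Kd?="
print(decode_b_word(subject))
```

The bytes E2 80 99 and F0 9F 92 9D come out of step 1 unchanged. Only step 2, driven by the UTF-8 label in the header, turns them into the apostrophe and the heart.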

I don't really need this for my program but I'm curious.

Lou

On Sunday, April 11, 2021 at 7:19:44 PM UTC-4 Louis LaBrunda wrote:
Hi Everyone,

I spoke a little too soon when I said I had been able to squeeze the bold... section of UTF-8 (those four byte sections that start with F0) down to displayable characters.  I thought I had been able to do some math on the value and map it to some ASCII character that would look close.  It turns out that works a little but not as consistently as I hoped.  I had to go to the use of a table and that seems to cover almost everything I'm interested in.  I was able to work out the math for the three and two byte sections, those that start with E0 and C0.  I consider all of the code to be a hack but I think it does what I need in my program.  I'm not sure if it will help Joachim but he and anyone who wants should feel free to use it or any part of it.

You can download my latest KscRFC2047Decoder.st.

Lou

On Wednesday, April 7, 2021 at 6:19:34 PM UTC-4 Louis LaBrunda wrote:
Hi Everyone,

I have been able to squeeze the bold... section of UTF-8 (those four byte sections that start with F0) down to displayable characters.  I'm still working on the three and two byte sections, those that start with E0 and C0.  This code is a hack but I think it does what I need in my program.  I'm not sure if it will help Joachim but he and anyone who wants should feel free to use it or any part of it.

You can download my latest KscRFC2047Decoder.st.

Lou

On Tuesday, April 6, 2021 at 6:02:31 AM UTC-4 [hidden email] wrote:
Hi Lou, Seth,


I've been offline for a bit over Easter. So I'm late to the discussion....
As you know, I also implemented a rather sketchy RFC2047 converter and had used it for a few months already. It has some limitations and of course has nothing to offer for emojis, bold characters and whatnot. I freely admit I never cared about these edge cases, because my main purpose of using IMAP is to receive attachments and put them on some batch for processing. Most of the encoded headers are irrelevant for my use case.

I was sure my encoder is not even close to being half-baked, so I kept it for myself for the time being. I would revisit it once I encounter a use case that requires more. It is a lot of work to "just" write some tool that implements the plethora of RFCs that you discover once you lift a corner of one of them....

I am glad you shared your experiences here and that Seth chimed in with his knowledge on the topic. UTF-x and Unicode are cans of worms, white areas on the map of a mortal application developer, the valley where many young and brave knights rode into but never came back... So there may be lots of beautiful princesses there or just an army of Dragons, nobody knows ;-)

Why am I glad? Because I know I am not alone and - even better - that Instantiations is going to send their princes into UTF-x valley for us ;-)


Joachim


As a side note: my clumsy decoder returns this: '???????????? _???????????? _???????? _????. _???????????? _????????????' ;-)



Re: Great job with IMAP interface

Mariano Martinez Peck-2


On Sat, Apr 17, 2021 at 12:27 PM Louis LaBrunda <[hidden email]> wrote:
Hi Guys,

I now know far more about UTF-8 than I ever wanted to.  I apologize for asking a lot of questions because I really didn't want to read too much of those UTF-8 and RFC2047 docs.

I have split my KscRFC2047Decoder.st in two and now have it and KscUtf8ToAscii.st.  KscUtf8ToAscii is a class that maps over 2000 UTF-8 values to single byte ASCII values that look a lot like the UTF-8 images but without fancy stuff like bold etc.  This allows me to see the intended meaning (of email subjects and such) without the indication that the UTF-8 couldn't map to ASCII, like a box or something.

To my mind, this is what a UTF-8 to ASCII code page converter should do.  I know that is asking a lot but it would be nice.


It does, but only up to a certain extent. This is called "transliteration". iconv() provides a //TRANSLIT option and Windows offers a flag in the conversion API too. We have recently reified that (as well as other options) into a class EsCodePageConversionPolicy. See senders of #isTransliterateMode. We have even made transliterate mode the default. So...again, the code page converter tries to transliterate, but only up to a certain extent. And from the VAST point of view, we go as far as that too.
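
For readers who want a feel for table-free folding, the Python sketch below uses Unicode compatibility normalization (NFKD) to reduce decorated letters to their plain forms. This is a different technique from iconv's //TRANSLIT, from the Windows flag, and from whatever EsCodePageConversionPolicy does, and it only covers characters that carry a compatibility or canonical decomposition. The function name and the "?" placeholder are assumptions.

```python
# Illustration only: NFKD folding, not the VAST or iconv transliteration.
import unicodedata


def fold_to_ascii(text: str, placeholder: str = "?") -> str:
    decomposed = unicodedata.normalize("NFKD", text)
    out = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # drop accents split off by the decomposition
        out.append(ch if ord(ch) < 128 else placeholder)
    return "".join(out)


bold_mail = "\U0001D40C\U0001D41A\U0001D422\U0001D425"  # bold "Mail"
print(fold_to_ascii(bold_mail))
print(fold_to_ascii("caf\u00e9 \U0001F49D"))
```

Characters without a decomposition, such as the emoji or the curly apostrophe, still end up as the placeholder, so a hand-built table like the one in KscUtf8ToAscii remains useful for those.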

Best, 


--

Mariano Martinez Peck

Senior Software Engineer

 [hidden email]
 @MartinezPeck
 /mariano-martinez-peck
 instantiations.com
