Dear VAST team,

In Seth's ESUG talk about VAST 9.2 there was a slide mentioning the new Application EsCodePageUtilities and talking about libicu integration with the VM. Does this mean we'll get some of the functionality of libicu wrapped in Smalltalk? I'm especially interested in ucsdet_detect or ucsdet_detectAll. I can't find any reference to these names in the ECAP 9.2 build. How hard would it be to wrap these on my own, and will they be available to VAST on both Linux and Windows? Any hints?

Joachim

You received this message because you are subscribed to the Google Groups "VA Smalltalk" group. To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email]. To view this discussion on the web visit https://groups.google.com/d/msgid/va-smalltalk/ecf6fdec-2bb9-42a6-b90d-b58fdb6650cf%40googlegroups.com.
On Fri, Nov 8, 2019 at 6:11 AM Joachim Tuchel <[hidden email]> wrote:
Hi Joachim, The EsCodePageUtilities is ready for 9.2 and you can find it in ECAP.
Yes, the idea is to wrap ICU from Smalltalk and that's probably one of the first things we will start with in 9.3. But not for 9.2.
If I were you I would just wait for 9.3. We can do an ECAP as soon as we have ICU wrapped.
Not me. Maybe Seth has something to add.

Best,
Mariano Martinez Peck
Software Engineer, Instantiations Inc.
Email: [hidden email]
Twitter: https://twitter.com/MartinezPeck
Blog: https://marianopeck.wordpress.com/
Mariano,

thanks for this info. It's great to see Unicode making its way into VAST. I am not sure I can wait for 9.3, though, so I might have to implement some poor man's UTF-8 detection for at least some letters... The scope of my use case is relatively narrow...

Joachim

(In reply to Mariano Martinez Peck's message of Friday, November 8, 2019, 14:06:30 UTC+1.)
On Fri, Nov 8, 2019 at 11:19 AM Joachim Tuchel <[hidden email]> wrote:
If it is UTF-8-only detection AND the stream has a UTF-8 BOM, then it's very easy to implement. You just need to read the first 3 bytes and check whether they are the BOM (EF BB BF). If you don't have a BOM, or you need to detect all possible encodings, did you think about just running a UnixProcess with 'file' or a similar Unix command that gives you exactly that? Sure, this is not cross-platform...

Cheers,
Mariano Martinez Peck
Software Engineer, Instantiations Inc.
Email: [hidden email]
Twitter: https://twitter.com/MartinezPeck
Blog: https://marianopeck.wordpress.com/
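The BOM check Mariano describes can be sketched in a few lines. This is only an illustration, written in Python rather than Smalltalk so it can be run anywhere; the function names `has_utf8_bom` and `strip_utf8_bom` are made up for this example and are not part of VAST or EsCodePageUtilities.

```python
# The UTF-8 byte order mark is the three bytes EF BB BF.
UTF8_BOM = b"\xef\xbb\xbf"


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 BOM."""
    # Slicing is safe even when data is shorter than 3 bytes.
    return data[:3] == UTF8_BOM


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM, if one is present."""
    return data[3:] if has_utf8_bom(data) else data
```

A missing BOM says nothing about the encoding, so this check can only confirm UTF-8, never rule it out.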
Hi Mariano,

thanks for the tips. If I could rely on a BOM, things would be easy, as you say. Unfortunately, I can't. The streams I have to handle come from sources that do not respect any standards (i.e. German banks) ;-)

IIUC, 'file' will only look for magic numbers, right? There are no magic numbers in these files. They are really just text files that may or may not be encoded in UTF-8; the only chance to find out is to find a first occurrence of a UTF-8-encoded character, just to guess this might be UTF-8. The fact that I receive these files as uploads from the browser would mean I have to save them to disk to use 'file'. Plus, 'file' is not available on Windows.

So I really look forward to 9.3 ;-) For now, I'll brush up my little Stream-reading knowledge and implement some naive search for UTF-8-encoded German umlaut sequences in the uploads. Far from perfect, I know. Lots of "but that won't work for X (like French characters)".

Anyway, it's great that Instantiations has this area on their radar and we'll get a libicu-based solution in 9.3! Thanks for that, keep up the good work!

Joachim

(In reply to Mariano Martinez Peck's message of Friday, November 8, 2019, 19:30:06 UTC+1.)
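For illustration, the naive approach Joachim outlines might look roughly like the sketch below. It is written in Python only so that it is runnable as-is; the names `contains_utf8_umlaut` and `guess_encoding` are hypothetical, and the ISO-8859-1 fallback is an assumption, since the thread does not say which encoding the non-UTF-8 files use. The second function goes slightly beyond the umlaut search: it checks whether the whole upload is valid UTF-8, which also covers non-German characters.

```python
# UTF-8 byte sequences for the German letters ä ö ü Ä Ö Ü ß.
UMLAUT_SEQUENCES = [
    b"\xc3\xa4", b"\xc3\xb6", b"\xc3\xbc",  # ä ö ü
    b"\xc3\x84", b"\xc3\x96", b"\xc3\x9c",  # Ä Ö Ü
    b"\xc3\x9f",                            # ß
]


def contains_utf8_umlaut(data: bytes) -> bool:
    """Naive check: does any UTF-8-encoded German umlaut occur in the data?"""
    return any(seq in data for seq in UMLAUT_SEQUENCES)


def guess_encoding(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Guess 'ascii', 'utf-8', or the fallback (an assumed legacy encoding)."""
    if all(byte < 0x80 for byte in data):
        return "ascii"  # pure 7-bit data reads the same in either encoding
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
        return "utf-8"
    except UnicodeDecodeError:
        return fallback


if __name__ == "__main__":
    print(guess_encoding("Müller".encode("utf-8")))
    print(guess_encoding("Müller".encode("iso-8859-1")))
```

Both checks are heuristics. A Latin-1 file that happens to contain the two characters "Ã¤" produces the same bytes as a UTF-8 "ä", so a false positive is possible, although such sequences are rare in real German text.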