libicu ucsdet_detect in 9.2 ?

jtuchel
Dear VAST team,

In Seth's ESUG talk about VAST 9.2 there was a slide mentioning the new Application EsCodePageUtilities and talking about libicu integration with the VM.

Does this mean we'll get some of the functionality of libicu wrapped in Smalltalk? I'm especially interested in ucsdet_detect or ucsdet_detectAll. I can't find any reference to these names in the ECAP 9.2 build.

How hard would it be to wrap these on my own, and will they be available in VAST on both Linux and Windows?

Any hints?

Joachim

--
You received this message because you are subscribed to the Google Groups "VA Smalltalk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/va-smalltalk/ecf6fdec-2bb9-42a6-b90d-b58fdb6650cf%40googlegroups.com.

Re: libicu ucsdet_detect in 9.2 ?

Mariano Martinez Peck-2


On Fri, Nov 8, 2019 at 6:11 AM Joachim Tuchel <[hidden email]> wrote:
> Dear VAST team,
>
> In Seth's ESUG talk about VAST 9.2 there was a slide mentioning the new Application EsCodePageUtilities and talking about libicu integration with the VM.


Hi Joachim,

The EsCodePageUtilities is ready for 9.2 and you can find it in ECAP. 
 
> Does this mean we'll get some of the functionality of libicu wrapped in Smalltalk? I'm especially interested in ucsdet_detect or ucsdet_detectAll. I can't find any reference to these names in the ECAP 9.2 build.


Yes, the idea is to wrap ICU from Smalltalk and that's probably one of the first things we will start with in 9.3. But not for 9.2. 
 
> How hard would it be to wrap these on my own, and will these be available to a VAST on both Linux and Windows?


If I were you I would just wait for 9.3. We can do an ECAP as soon as we have ICU wrapped. 
 
> Any hints?


Not me. Maybe Seth has something to add.

Best,

--
Mariano Martinez Peck
Software Engineer, Instantiations Inc.

Re: libicu ucsdet_detect in 9.2 ?

jtuchel
Mariano,

Thanks for this info. It's great to see Unicode making its way into VAST.

I am not sure I can wait for 9.3, though, so I might have to implement some poor man's UTF-8 detection for at least some letters... The scope of my use case is relatively narrow...


Joachim

Re: libicu ucsdet_detect in 9.2 ?

Mariano Martinez Peck-2


On Fri, Nov 8, 2019 at 11:19 AM Joachim Tuchel <[hidden email]> wrote:
> Mariano,
>
> thanks for this info. It's great to see Unicode making its way into VAST.
>
> I am not sure I can wait for 9.3, though, so I might have to implement some poor man's UTF-8 detection for at least some letters... The scope of my use case is relatively narrow...


If you only need to detect UTF-8 AND the stream starts with a UTF-8 BOM, then it's very easy to implement. You just need to read the first 3 bytes and check whether they are the BOM (EF BB BF).
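
To make the BOM check concrete, here is a minimal sketch. It is written in Python only for illustration, not as VAST code; the function name `starts_with_utf8_bom` is made up for this example, and the same three-byte comparison would apply to the first bytes read from a Smalltalk stream.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if data begins with the UTF-8 BOM (bytes EF BB BF)."""
    # codecs.BOM_UTF8 is the constant b'\xef\xbb\xbf' from the standard library.
    return data[:3] == codecs.BOM_UTF8


if __name__ == "__main__":
    print(starts_with_utf8_bom(b"\xef\xbb\xbfHallo"))
    print(starts_with_utf8_bom(b"Hallo"))
```

A missing BOM says nothing either way: plenty of UTF-8 files are written without one, so this only settles the easy case.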

If you don't have a BOM, or you need to detect all possible encodings, did you think about just running a UnixProcess with 'file' or a similar Unix command that gives you exactly that? Sure... this is not cross-platform...
 
Cheers, 

--
Mariano Martinez Peck
Software Engineer, Instantiations Inc.

Re: libicu ucsdet_detect in 9.2 ?

jtuchel
Hi Mariano,


Thanks for the tips. If I could rely on a BOM, things would be easy, as you say. Unfortunately, I can't. The streams I have to handle come from sources that do not respect any standards (i.e. German banks) ;-) .

IIUC, 'file' will only look for magic numbers, right? There are no magic numbers in these files. They are really just text files that may or may not be encoded in UTF-8; the only way to find out is to look for the first occurrence of a UTF-8 encoded character and then guess that this might be UTF-8. Since I receive these files as uploads from the browser, I would have to save them to disk to use 'file'. Plus, 'file' is not available on Windows. So I really look forward to 9.3 ;-)
For now, I'll brush up on what little I know about reading Streams and implement a naive search for UTF-8 encoded German umlaut sequences in the uploads. Far from perfect, I know. Lots of "but that won't work for X (like French characters)".
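
A rough sketch of that kind of heuristic, in Python purely for illustration (this is not VAST code). All names and the returned labels are made up for this example. It shows the naive umlaut search next to a slightly more general variant that assumes the whole upload is in memory and tries a strict UTF-8 decode:

```python
# UTF-8 byte sequences for the German umlauts and sharp s.
UMLAUTS_UTF8 = tuple(ch.encode("utf-8") for ch in "äöüÄÖÜß")


def has_utf8_umlaut(data: bytes) -> bool:
    """Naive check: does data contain a UTF-8 encoded German umlaut or sharp s?"""
    return any(seq in data for seq in UMLAUTS_UTF8)


def guess_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8' or 'unknown-8bit' (labels are hypothetical)."""
    if data.isascii():
        return "ascii"  # no byte above 0x7F, so there is nothing to decide
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return "unknown-8bit"  # could be ISO-8859-1, Windows-1252, ...
    return "utf-8"


if __name__ == "__main__":
    print(guess_encoding("Überweisung".encode("utf-8")))
    print(guess_encoding("Überweisung".encode("latin-1")))
    print(has_utf8_umlaut("café".encode("utf-8")))
```

The strict decode also covers the "won't work for French characters" case, because it does not depend on a fixed list of letters. It is still only a guess: a short non-UTF-8 byte sequence can happen to be valid UTF-8, and a buffer cut off in the middle of a multi-byte character will fail the decode.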


Anyway, it's great that Instantiations has this area on their radar and that we'll get a libicu-based solution in 9.3! Thanks for that, keep up the good work!

Joachim