Problem with Traditional and Simplified Chinese parsing in Pharo

Clément Béra
Hi,

I am currently parsing Lua and JSON-like files in Pharo. They contain both Simplified and Traditional Chinese characters in comments and in strings displayed in the UI. The Lua files are parsed correctly; the JSON-like files are not.

I have attached one of the problematic files, which contains Simplified Chinese characters (I've also copied its contents at the end of this mail). The problem can be reproduced in Pharo as follows:

'schinese.txt' asFileReference readStream contents

Sending #contents raises a UTF8InvalidText error: 'Invalid utf8 input detected'.
However, my text editor displays the file correctly, and the Lua runtime parses it correctly (the program that parses it has been deployed in production for years and works fine).

What can I do to parse this file correctly from Pharo?

Thanks,

Below is the file content. Readers without a Chinese font installed may not be able to display the characters. Note that I have no idea what is written in Chinese (please don't hold me responsible if the contents are offensive):

"lang"
{
"Language" "Schinese"
"Tokens"
{
"text_store_cd" "贝壳商店的商品会每天随机刷新! 下次刷新冷却时间"
"text_cannot_huidaoguoqu" "现在不能使用回到过去~"
"tips1" "利用鼠标滚轮可以调节视角距离,方便你查看场地全貌~"
"tips2" "通过全部50关以后你还可以继续挑战更加有难度的无尽试炼模式!"
"tips3" "每天早晨贝壳商店会随机刷新和随机打折,留心你想要的商品!"
"tips4" "排行榜的前25名可以获得皇冠奖章,象征着你在塔防游戏中的卓越成绩!"
"tips5" "每10波敌人会有一个BOSS关卡,它比普通敌人更难击杀~"
"tips6" "飞行的敌人不会被石头阻挡,所以不能利用迷宫来增长怪物的线路~"
"tips7" "开局的时候点击左下方的英雄选择图标可以查看并选择你拥有的英雄!"
"tips8" "隐身的敌人必须通过蛋白系列塔的照明光环才能被发现!"
"tips9" "点击右侧的合成公式按钮可以打开合成面板,了解高级塔的合成以及当前配件状态~"
"tips10" "邀请好友一起游戏,可以互相协作共同对抗强大的敌人!"
"tips11" "石板会对踩上去的敌人产生效果,所以最好放置在敌人必经的路线上~"
"tips12" "为英雄购买美丽的特效,当你可以一回合合成的时候所有配件都会有特效提示!"
"tips13" "每回合伤害最高的塔将会获得最多10层的MVP光环,增加物理和魔法输出!"
"tips14" "每个月的最后一天是赛季结算日,将会根据你这赛季的排名颁发丰厚的贝壳奖励!"
"tips15" "如果不知道怎样造迷宫,你可以点击右侧的迷宫指南按钮查看或者分享推荐的迷宫~"
}
}



--
Clément Béra
Pharo consortium engineer
Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq

Attachment: schinese.txt (2K)
Re: Problem with Traditional and Simplified Chinese parsing in Pharo

Guillermo Polito
Are you sure that the file is encoded in utf8? Can you try

stream := ZnCharacterReadStream on: (File named: '...') readStream encoding: 'utf8'.
stream upToEnd.

?

If that does not work, it could mean that the file is in another encoding...

--

   

Guille Polito

Research Engineer

Centre de Recherche en Informatique, Signal et Automatique de Lille

CRIStAL - UMR 9189

French National Center for Scientific Research - http://www.cnrs.fr


Web: http://guillep.github.io

Phone: +33 06 52 70 66 13


Re: Problem with Traditional and Simplified Chinese parsing in Pharo

Sven Van Caekenberghe-2
In reply to this post by Clément Béra
Your file is not in UTF-8 but in UTF-16!

This will do:

(FileLocator desktop / 'schinese.txt') readStreamDo: [ :in |
  (ZnCharacterReadStream on: in binary encoding: #utf16) upToEnd ].
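For anyone who wants to see the difference without the attachment, here is a minimal sketch that works on an in-memory ByteArray instead of the file. The byte values are made up for illustration, and it assumes the real file starts with a little-endian byte order mark (FF FE), which is typical of files saved by Windows tools but was not confirmed in this thread.

```smalltalk
| bytes text utf8Failed |
"In-memory stand-in for the file (assumed layout): a little-endian
 byte order mark (FF FE), then $l, $a and the code point U+8D1D,
 two bytes each."
bytes := #[16rFF 16rFE 16r6C 16r00 16r61 16r00 16r1D 16r8D].

"Decoding as UTF-8 fails, because FF can never start a UTF-8 sequence."
utf8Failed := false.
[ bytes utf8Decoded ]
	on: ZnCharacterEncodingError
	do: [ :error | utf8Failed := true ].

"Decoding as UTF-16 succeeds; the byte order mark is consumed, not returned."
text := (ZnCharacterReadStream on: bytes readStream encoding: #utf16) upToEnd.
```

The same bytes that are rejected as UTF-8 decode to a three-character string once the stream is told to expect UTF-16.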

BTW, this is not valid JSON.

From Windows, for sure ...


Re: Problem with Traditional and Simplified Chinese parsing in Pharo

CyrilFerlicot
In reply to this post by Clément Béra
Hi,

Here is how we manage encodings in Moose, following Sven's advice.

We detect the encoding of a file with this code:

[ self fileReference binaryReadStreamDo: [ :in |
        (ZnCharacterEncoder detectEncoding: in upToEnd) identifier ] ]
        on: ZnCharacterEncodingError
        do: [ nil ]

It is not bulletproof, but I have never had a problem since we started using it.

Then, to read a file, we do this:

self fileReference
        binaryReadStreamDo: [ :in |
            (ZnCharacterReadStream on: in encoding: self encoding) upToEnd ]

Here, self encoding returns the result of the previous snippet.

Or you can use it this way:

ZnCharacterEncoder detectEncoding:
        ((FileLocator desktop / 'some.data') binaryReadStreamDo: [ :in | in upToEnd ]).

(FileLocator desktop / 'some.data') binaryReadStreamDo: [ :in |
        | bytes encoder |
        bytes := in upToEnd.
        encoder := ZnCharacterEncoder detectEncoding: bytes.
        encoder decodeBytes: bytes ].
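As a small illustration of the detect-then-decode pattern above, here is a sketch that runs on bytes built in memory rather than on a file. The sample string is taken from the posted file content; encoding it as UTF-8 first is only an assumption made to keep the example self-contained. Since detection is a heuristic, this shows the round trip for one input and is not a guarantee for arbitrary data.

```smalltalk
| original bytes encoder decoded |
"Sample text from the posted file, encoded as UTF-8 for the demo."
original := 'tips1 贝壳商店'.
bytes := original utf8Encoded.

"Ask Zinc to pick an encoder that can decode these bytes."
encoder := ZnCharacterEncoder detectEncoding: bytes.

"Decode with whatever encoder was detected."
decoded := encoder decodeBytes: bytes.
```

If detection picked a suitable encoder, decoded equals original.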

--
Cyril Ferlicot
https://ferlicot.fr



Re: Problem with Traditional and Simplified Chinese parsing in Pharo

Clément Béra
In reply to this post by Sven Van Caekenberghe-2
Thanks all, Sven's solution works.

Yes, this is not JSON; it's some kind of JSON-like format (is it YAML? I don't know, it might be proprietary).

I was naive and thought that files carried some metadata specifying the encoding used, and that #readStream on FileReference would automatically pick the correct decoder. Obviously that isn't the case.
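The closest thing to such metadata in a plain text file is an optional byte order mark at the very start. A rough sketch of checking for one follows; bomEncoding is a hypothetical helper block, not part of Pharo or Zinc, and many files (UTF-8 ones in particular) carry no mark at all, in which case it answers nil.

```smalltalk
| bomEncoding |
"Hypothetical helper: answers an encoding name based on a leading
 byte order mark, or nil when the bytes do not start with one."
bomEncoding := [ :bytes |
	(bytes size >= 3 and: [ (bytes first: 3) = #[16rEF 16rBB 16rBF] ])
		ifTrue: [ #utf8 ]
		ifFalse: [
			(bytes size >= 2 and: [
				(bytes first: 2) = #[16rFF 16rFE]
					or: [ (bytes first: 2) = #[16rFE 16rFF] ] ])
				ifTrue: [ #utf16 ]
				ifFalse: [ nil ] ] ].

bomEncoding value: #[16rFF 16rFE 16r6C 16r00].   "answers #utf16"
bomEncoding value: #[16r6C 16r61].               "answers nil"
```

When the answer is nil, the encoding still has to be known in advance or guessed.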

Thanks anyway.
