Moving from mc to tonel?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
21 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Moving from mc to tonel?

tinchodias
Hi all, I'm considering to move Epicea (+Ombu+Hiedra) to github. I read that tonel is the new format and there is Peter's git-migration, which could be the tool to do it. 

Is it good idea to move now?
Is it the right tool?

Bests,
Martín
Reply | Threaded
Open this post in threaded view
|

Re: Moving from mc to tonel?

Peter Uhnak
Hi Martin,

several people successfully migrated with git-migration... so I guess it works. 

> Is it good idea to move now?

There is no risk in trying to use it. If it fails, you report a bug. :-)

Is it the right tool?

Yes.

As for Tonel... git-migration uses FileTree, but I am looking into making it configurable.

(I have filetree-tonel migration tool now... but that is a bit more hardcore and potentially destructive :) (unlike git-migration)).

Cheers,
Peter





On Mon, Nov 6, 2017 at 1:42 AM, Martin Dias <[hidden email]> wrote:
Hi all, I'm considering to move Epicea (+Ombu+Hiedra) to github. I read that tonel is the new format and there is Peter's git-migration, which could be the tool to do it. 

Is it good idea to move now?
Is it the right tool?

Bests,
Martín

Reply | Threaded
Open this post in threaded view
|

Re: Moving from mc to tonel?

tinchodias
Hi:

1. thanks for the tool and information.
2. I forgot to tell you that I tried the tool some weeks ago and it ran out of memory. I should try again with more memory or somehting different...

Cheers,
Martin

On Mon, Nov 6, 2017 at 5:49 AM, Peter Uhnák <[hidden email]> wrote:
Hi Martin,

several people successfully migrated with git-migration... so I guess it works. 

> Is it good idea to move now?

There is no risk in trying to use it. If it fails, you report a bug. :-)

Is it the right tool?

Yes.

As for Tonel... git-migration uses FileTree, but I am looking into making it configurable.

(I have filetree-tonel migration tool now... but that is a bit more hardcore and potentially destructive :) (unlike git-migration)).

Cheers,
Peter





On Mon, Nov 6, 2017 at 1:42 AM, Martin Dias <[hidden email]> wrote:
Hi all, I'm considering to move Epicea (+Ombu+Hiedra) to github. I read that tonel is the new format and there is Peter's git-migration, which could be the tool to do it. 

Is it good idea to move now?
Is it the right tool?

Bests,
Martín


Reply | Threaded
Open this post in threaded view
|

Re: Moving from mc to tonel?

Peter Uhnak
That sounds like a bug. Is the repo you are trying to migrate public?

On Thu, Nov 30, 2017 at 6:01 PM, Martin Dias <[hidden email]> wrote:
Hi:

1. thanks for the tool and information.
2. I forgot to tell you that I tried the tool some weeks ago and it ran out of memory. I should try again with more memory or somehting different...

Cheers,
Martin

On Mon, Nov 6, 2017 at 5:49 AM, Peter Uhnák <[hidden email]> wrote:
Hi Martin,

several people successfully migrated with git-migration... so I guess it works. 

> Is it good idea to move now?

There is no risk in trying to use it. If it fails, you report a bug. :-)

Is it the right tool?

Yes.

As for Tonel... git-migration uses FileTree, but I am looking into making it configurable.

(I have filetree-tonel migration tool now... but that is a bit more hardcore and potentially destructive :) (unlike git-migration)).

Cheers,
Peter





On Mon, Nov 6, 2017 at 1:42 AM, Martin Dias <[hidden email]> wrote:
Hi all, I'm considering to move Epicea (+Ombu+Hiedra) to github. I read that tonel is the new format and there is Peter's git-migration, which could be the tool to do it. 

Is it good idea to move now?
Is it the right tool?

Bests,
Martín



Reply | Threaded
Open this post in threaded view
|

Re: Moving from mc to tonel?

Julien Delplanque-2
In reply to this post by tinchodias
If you use Traits and want to have Epicea loadable from GitHub on Pharo 6.1, it will not work.

There is an issue when you try to load a Trait in Tonel format in the image.

---
Julien Delplanque
Doctorant à l’Université de Lille 1
Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq
Numéro de téléphone: +333 59 35 86 40

Le 6 nov. 2017 à 01:42, Martin Dias <[hidden email]> a écrit :

Hi all, I'm considering to move Epicea (+Ombu+Hiedra) to github. I read that tonel is the new format and there is Peter's git-migration, which could be the tool to do it. 

Is it good idea to move now?
Is it the right tool?

Bests,
Martín

Reply | Threaded
Open this post in threaded view
|

Re: Moving from mc to tonel?

tinchodias
Julien, I dodn't know it but anyway my issue was just in the step smalltalkhub -> filtree using the migration tool.

Peter, yes it's public. This is the code I used:

----------------------------------------

migration := GitMigration on: 'MartinDias/Epicea'.
migration cacheAllVersions.
migration authors: {
'md' -> #('Martin Dias' '<[hidden email]>').
'MD' -> #('Martin Dias' '<[hidden email]>').
'MartinDias' -> #('Martin Dias' '<[hidden email]>').
'PabloTesone' -> #('Pablo Tesone' '<[hidden email]>').
'PavelKrivanek' -> #('Pavel Krivanek' '<[hidden email]>').
'DamienCassou' -> #('Damien Cassou' '<[hidden email]>').
'DenisKudryashov' -> #('Denis Kudryashov' '<[hidden email]>').
'GustavoSantos' -> #('Gustavo Santos' '<[hidden email]>').
'ThomasHeniart' -> #('Thomas Heniart' '<[hidden email]>').
'LucasGodoy' -> #('Lucas Godoy' '<[hidden email]>').
'StephaneDucasse' -> #('Stephane Ducasse' '<[hidden email]>').
'VincentBlondeau' -> #('Vincent Blondeau' '<[hidden email]>').
'AlbertoBacchelli' -> #('Alberto Bacchelli' '<[hidden email]>').
'SkipLentz' -> #('Skip Lentz' '<[hidden email]>').
'TheIntegrator' -> #('The Integrator' '<[hidden email]>').
}.l
migration
fastImportCodeToDirectory: 'src'
initialCommit: '18a30d8d'
to: './epicea-fast-import.txt'

----------------------

Martín

On Thu, Nov 30, 2017 at 2:14 PM, Julien <[hidden email]> wrote:
If you use Traits and want to have Epicea loadable from GitHub on Pharo 6.1, it will not work.

There is an issue when you try to load a Trait in Tonel format in the image.

---
Julien Delplanque
Doctorant à l’Université de Lille 1
Numéro de téléphone: +333 59 35 86 40

Le 6 nov. 2017 à 01:42, Martin Dias <[hidden email]> a écrit :

Hi all, I'm considering to move Epicea (+Ombu+Hiedra) to github. I read that tonel is the new format and there is Peter's git-migration, which could be the tool to do it. 

Is it good idea to move now?
Is it the right tool?

Bests,
Martín


Reply | Threaded
Open this post in threaded view
|

Re: Moving from mc to tonel?

tinchodias
I suspect it's related to the large number of commits in my repo. I made some tweaks and succeeded to create the fast-import file. But I get:

fatal: Unsupported command: .
fast-import: dumping crash report to .git/fast_import_crash_10301

Do you recognize this error? 
I will check my changes tweaking the git-migration tool to see if I modified some behavior my mistake...

cheers,
Martín

On Fri, Dec 1, 2017 at 1:42 PM, Martin Dias <[hidden email]> wrote:
Julien, I dodn't know it but anyway my issue was just in the step smalltalkhub -> filtree using the migration tool.

Peter, yes it's public. This is the code I used:

----------------------------------------

migration := GitMigration on: 'MartinDias/Epicea'.
migration cacheAllVersions.
migration authors: {
'md' -> #('Martin Dias' '<[hidden email]>').
'MD' -> #('Martin Dias' '<[hidden email]>').
'MartinDias' -> #('Martin Dias' '<[hidden email]>').
'PabloTesone' -> #('Pablo Tesone' '<[hidden email]>').
'PavelKrivanek' -> #('Pavel Krivanek' '<[hidden email]>').
'DamienCassou' -> #('Damien Cassou' '<[hidden email]>').
'DenisKudryashov' -> #('Denis Kudryashov' '<[hidden email]>').
'GustavoSantos' -> #('Gustavo Santos' '<[hidden email]>').
'ThomasHeniart' -> #('Thomas Heniart' '<[hidden email]>').
'LucasGodoy' -> #('Lucas Godoy' '<[hidden email]>').
'StephaneDucasse' -> #('Stephane Ducasse' '<[hidden email]>').
'VincentBlondeau' -> #('Vincent Blondeau' '<[hidden email]>').
'AlbertoBacchelli' -> #('Alberto Bacchelli' '<[hidden email]>').
'SkipLentz' -> #('Skip Lentz' '<[hidden email]>').
'TheIntegrator' -> #('The Integrator' '<[hidden email]>').
}.l
migration
fastImportCodeToDirectory: 'src'
initialCommit: '18a30d8d'
to: './epicea-fast-import.txt'

----------------------

Martín

On Thu, Nov 30, 2017 at 2:14 PM, Julien <[hidden email]> wrote:
If you use Traits and want to have Epicea loadable from GitHub on Pharo 6.1, it will not work.

There is an issue when you try to load a Trait in Tonel format in the image.

---
Julien Delplanque
Doctorant à l’Université de Lille 1
Numéro de téléphone: +333 59 35 86 40

Le 6 nov. 2017 à 01:42, Martin Dias <[hidden email]> a écrit :

Hi all, I'm considering to move Epicea (+Ombu+Hiedra) to github. I read that tonel is the new format and there is Peter's git-migration, which could be the tool to do it. 

Is it good idea to move now?
Is it the right tool?

Bests,
Martín



Reply | Threaded
Open this post in threaded view
|

Re: Moving from mc to tonel?

alistairgrant
On 5 December 2017 at 03:41, Martin Dias <[hidden email]> wrote:
> I suspect it's related to the large number of commits in my repo. I made
> some tweaks and succeeded to create the fast-import file. But I get:
>
> fatal: Unsupported command: .
> fast-import: dumping crash report to .git/fast_import_crash_10301
>
> Do you recognize this error?
> I will check my changes tweaking the git-migration tool to see if I modified
> some behavior my mistake...

I had the same error just last night.

In my case, it turned out to be a non-UTF8 encoded character in one of
the commit messages.

I tracked it down by looking at the crash report and searching for a
nearby command.  I've deleted the crash reports now, but I think it
was the number associated with a mark command that got me near the
problem character in the fast-import file.

I also modified the code to halt whenever it found a non-UTF8
character.  I'm sure there are better ways to do this, but:


GitMigrationCommitInfo>>inlineDataFor: aString

    | theString |
    theString := aString.
    "Ensure the message has valid UTF-8 encoding (this will raise an
error if it doesn't)"
    [ (ZnCharacterEncoder newForEncoding: 'utf8') decodeBytes: aString
asByteArray ]
        on: Error
        do: [ :ex | self halt: 'Illegal string encoding'.
            theString := aString select: [ :each | each asciiValue
between: 32 and: 128 ] ].
    ^ 'data ' , theString size asString , String cr , (theString
ifEmpty: [ '' ] ifNotEmpty: [ theString , String cr ])



Cheers,
Alistair

Reply | Threaded
Open this post in threaded view
|

Re: Moving from mc to tonel?

Sven Van Caekenberghe-2


> On 5 Dec 2017, at 08:34, Alistair Grant <[hidden email]> wrote:
>
> On 5 December 2017 at 03:41, Martin Dias <[hidden email]> wrote:
>> I suspect it's related to the large number of commits in my repo. I made
>> some tweaks and succeeded to create the fast-import file. But I get:
>>
>> fatal: Unsupported command: .
>> fast-import: dumping crash report to .git/fast_import_crash_10301
>>
>> Do you recognize this error?
>> I will check my changes tweaking the git-migration tool to see if I modified
>> some behavior my mistake...
>
> I had the same error just last night.
>
> In my case, it turned out to be a non-UTF8 encoded character in one of
> the commit messages.
>
> I tracked it down by looking at the crash report and searching for a
> nearby command.  I've deleted the crash reports now, but I think it
> was the number associated with a mark command that got me near the
> problem character in the fast-import file.
>
> I also modified the code to halt whenever it found a non-UTF8
> character.  I'm sure there are better ways to do this, but:
>
>
> GitMigrationCommitInfo>>inlineDataFor: aString
>
>    | theString |
>    theString := aString.
>    "Ensure the message has valid UTF-8 encoding (this will raise an
> error if it doesn't)"
>    [ (ZnCharacterEncoder newForEncoding: 'utf8') decodeBytes: aString
> asByteArray ]
>        on: Error
>        do: [ :ex | self halt: 'Illegal string encoding'.
>            theString := aString select: [ :each | each asciiValue
> between: 32 and: 128 ] ].
>    ^ 'data ' , theString size asString , String cr , (theString
> ifEmpty: [ '' ] ifNotEmpty: [ theString , String cr ])

There is also ByteArray>>#utf8Decoded (as well as String>>#utf8Encoded), both using the newer ZnCharacterEncoders. So you could write it shorter as:

  aString asByteArray utf8Decoded

Instead of Error you could use the more intention revealing ZnCharacterEncodingError.

Apart from that, and I known you did not create this mess, encoding a String into a String, although it can be done, is so wrong. You encode a String (a collection of Characters, of Unicode code points) into a ByteArray and you decode a ByteArray into a String.

Sending #asByteArray to a String in almost always wrong, as is sending #asString to a ByteArray. These are implicit conversions (null conversions, like ZnNullEncoder) that only work for pure ASCII or Latin1 (iso-8859-1), but not for the much richer set of Characters that Pharo supports. So these will eventually fail, one day, in a country far away.

> Cheers,
> Alistair
>


Reply | Threaded
Open this post in threaded view
|

Re: Moving from mc to tonel?

Peter Uhnak
In reply to this post by tinchodias

I suspect it's related to the large number of commits in my repo. I made some tweaks and succeeded to create the fast-import file. But I get:

All files in a single commit will be held temporarily in MemoryStore, so unless you have GBs of code in _single_ commit it shouldn't be a problem.
There's also a possiblity that Pharo was going too fast and didn't manage to GC it in time or there's a memory leak.
 

fatal: Unsupported command: .
fast-import: dumping crash report to .git/fast_import_crash_10301

Do you recognize this error? 


Please look into the mentioned file, it should give you a context for the error (and then you can look into the import file around the same context).

Peter

Reply | Threaded
Open this post in threaded view
|

Re: Moving from mc to tonel?

Peter Uhnak
In reply to this post by Sven Van Caekenberghe-2
In my case, it turned out to be a non-UTF8 encoded character in one of the commit messages.

I've ran into this problem in a sister project (tonel-migration), and do not have a proper resolution yet. I was forcing everything to be unicode, so I need a better way to read and write encoded strings. :<

On Tue, Dec 5, 2017 at 8:56 AM, Sven Van Caekenberghe <[hidden email]> wrote:


> On 5 Dec 2017, at 08:34, Alistair Grant <[hidden email]> wrote:
>
> On 5 December 2017 at 03:41, Martin Dias <[hidden email]> wrote:
>> I suspect it's related to the large number of commits in my repo. I made
>> some tweaks and succeeded to create the fast-import file. But I get:
>>
>> fatal: Unsupported command: .
>> fast-import: dumping crash report to .git/fast_import_crash_10301
>>
>> Do you recognize this error?
>> I will check my changes tweaking the git-migration tool to see if I modified
>> some behavior my mistake...
>
> I had the same error just last night.
>
> In my case, it turned out to be a non-UTF8 encoded character in one of
> the commit messages.
>
> I tracked it down by looking at the crash report and searching for a
> nearby command.  I've deleted the crash reports now, but I think it
> was the number associated with a mark command that got me near the
> problem character in the fast-import file.
>
> I also modified the code to halt whenever it found a non-UTF8
> character.  I'm sure there are better ways to do this, but:
>
>
> GitMigrationCommitInfo>>inlineDataFor: aString
>
>    | theString |
>    theString := aString.
>    "Ensure the message has valid UTF-8 encoding (this will raise an
> error if it doesn't)"
>    [ (ZnCharacterEncoder newForEncoding: 'utf8') decodeBytes: aString
> asByteArray ]
>        on: Error
>        do: [ :ex | self halt: 'Illegal string encoding'.
>            theString := aString select: [ :each | each asciiValue
> between: 32 and: 128 ] ].
>    ^ 'data ' , theString size asString , String cr , (theString
> ifEmpty: [ '' ] ifNotEmpty: [ theString , String cr ])

There is also ByteArray>>#utf8Decoded (as well as String>>#utf8Encoded), both using the newer ZnCharacterEncoders. So you could write it shorter as:

  aString asByteArray utf8Decoded

Instead of Error you could use the more intention revealing ZnCharacterEncodingError.

Apart from that, and I known you did not create this mess, encoding a String into a String, although it can be done, is so wrong. You encode a String (a collection of Characters, of Unicode code points) into a ByteArray and you decode a ByteArray into a String.

Sending #asByteArray to a String in almost always wrong, as is sending #asString to a ByteArray. These are implicit conversions (null conversions, like ZnNullEncoder) that only work for pure ASCII or Latin1 (iso-8859-1), but not for the much richer set of Characters that Pharo supports. So these will eventually fail, one day, in a country far away.

> Cheers,
> Alistair
>



Reply | Threaded
Open this post in threaded view
|

Re: Moving from mc to tonel?

Sven Van Caekenberghe-2


> On 5 Dec 2017, at 08:59, Peter Uhnák <[hidden email]> wrote:
>
> > In my case, it turned out to be a non-UTF8 encoded character in one of the commit messages.
>
> I've ran into this problem in a sister project (tonel-migration), and do not have a proper resolution yet. I was forcing everything to be unicode, so I need a better way to read and write encoded strings. :<

Maybe ZnCharacterEncoder class>>#detectEncoding: can help, although it is far from perfect.

> On Tue, Dec 5, 2017 at 8:56 AM, Sven Van Caekenberghe <[hidden email]> wrote:
>
>
> > On 5 Dec 2017, at 08:34, Alistair Grant <[hidden email]> wrote:
> >
> > On 5 December 2017 at 03:41, Martin Dias <[hidden email]> wrote:
> >> I suspect it's related to the large number of commits in my repo. I made
> >> some tweaks and succeeded to create the fast-import file. But I get:
> >>
> >> fatal: Unsupported command: .
> >> fast-import: dumping crash report to .git/fast_import_crash_10301
> >>
> >> Do you recognize this error?
> >> I will check my changes tweaking the git-migration tool to see if I modified
> >> some behavior my mistake...
> >
> > I had the same error just last night.
> >
> > In my case, it turned out to be a non-UTF8 encoded character in one of
> > the commit messages.
> >
> > I tracked it down by looking at the crash report and searching for a
> > nearby command.  I've deleted the crash reports now, but I think it
> > was the number associated with a mark command that got me near the
> > problem character in the fast-import file.
> >
> > I also modified the code to halt whenever it found a non-UTF8
> > character.  I'm sure there are better ways to do this, but:
> >
> >
> > GitMigrationCommitInfo>>inlineDataFor: aString
> >
> >    | theString |
> >    theString := aString.
> >    "Ensure the message has valid UTF-8 encoding (this will raise an
> > error if it doesn't)"
> >    [ (ZnCharacterEncoder newForEncoding: 'utf8') decodeBytes: aString
> > asByteArray ]
> >        on: Error
> >        do: [ :ex | self halt: 'Illegal string encoding'.
> >            theString := aString select: [ :each | each asciiValue
> > between: 32 and: 128 ] ].
> >    ^ 'data ' , theString size asString , String cr , (theString
> > ifEmpty: [ '' ] ifNotEmpty: [ theString , String cr ])
>
> There is also ByteArray>>#utf8Decoded (as well as String>>#utf8Encoded), both using the newer ZnCharacterEncoders. So you could write it shorter as:
>
>   aString asByteArray utf8Decoded
>
> Instead of Error you could use the more intention revealing ZnCharacterEncodingError.
>
> Apart from that, and I known you did not create this mess, encoding a String into a String, although it can be done, is so wrong. You encode a String (a collection of Characters, of Unicode code points) into a ByteArray and you decode a ByteArray into a String.
>
> Sending #asByteArray to a String in almost always wrong, as is sending #asString to a ByteArray. These are implicit conversions (null conversions, like ZnNullEncoder) that only work for pure ASCII or Latin1 (iso-8859-1), but not for the much richer set of Characters that Pharo supports. So these will eventually fail, one day, in a country far away.
>
> > Cheers,
> > Alistair
> >
>
>
>


Reply | Threaded
Open this post in threaded view
|

Re: Moving from mc to tonel?

alistairgrant
On Tue, Dec 05, 2017 at 08:56:53AM +0100, Sven Van Caekenberghe wrote:

>
> > On 5 Dec 2017, at 08:34, Alistair Grant <[hidden email]> wrote:
> >
> > On 5 December 2017 at 03:41, Martin Dias <[hidden email]> wrote:
> >> I suspect it's related to the large number of commits in my repo. I made
> >> some tweaks and succeeded to create the fast-import file. But I get:
> >>
> >> fatal: Unsupported command: .
> >> fast-import: dumping crash report to .git/fast_import_crash_10301
> >>
> >> Do you recognize this error?
> >> I will check my changes tweaking the git-migration tool to see if I modified
> >> some behavior my mistake...
> >
> > I had the same error just last night.
> >
> > In my case, it turned out to be a non-UTF8 encoded character in one of
> > the commit messages.
> >
> > I tracked it down by looking at the crash report and searching for a
> > nearby command.  I've deleted the crash reports now, but I think it
> > was the number associated with a mark command that got me near the
> > problem character in the fast-import file.
> >
> > I also modified the code to halt whenever it found a non-UTF8
> > character.  I'm sure there are better ways to do this, but:
> >
> >
> > GitMigrationCommitInfo>>inlineDataFor: aString
> >
> >    | theString |
> >    theString := aString.
> >    "Ensure the message has valid UTF-8 encoding (this will raise an
> > error if it doesn't)"
> >    [ (ZnCharacterEncoder newForEncoding: 'utf8') decodeBytes: aString
> > asByteArray ]
> >        on: Error
> >        do: [ :ex | self halt: 'Illegal string encoding'.
> >            theString := aString select: [ :each | each asciiValue
> > between: 32 and: 128 ] ].
> >    ^ 'data ' , theString size asString , String cr , (theString
> > ifEmpty: [ '' ] ifNotEmpty: [ theString , String cr ])
>
> There is also ByteArray>>#utf8Decoded (as well as
> String>>#utf8Encoded), both using the newer ZnCharacterEncoders. So
> you could write it shorter as:
>
>   aString asByteArray utf8Decoded
>
> Instead of Error you could use the more intention revealing ZnCharacterEncodingError.

I agree, this was a hack to get me past the immediate issue and see if
I could get the migration to work.


> Apart from that, and I known you did not create this mess, encoding a
> String into a String, although it can be done, is so wrong. You encode
> a String (a collection of Characters, of Unicode code points) into a
> ByteArray and you decode a ByteArray into a String.
>
> Sending #asByteArray to a String in almost always wrong, as is sending
> #asString to a ByteArray. These are implicit conversions (null
> conversions, like ZnNullEncoder) that only work for pure ASCII or
> Latin1 (iso-8859-1), but not for the much richer set of Characters
> that Pharo supports. So these will eventually fail, one day, in a
> country far away.

I hate having to deal with string encoding, and know as little as
possible. :-)



On Tue, Dec 05, 2017 at 08:59:34AM +0100, Peter Uhn??k wrote:
> > In my case, it turned out to be a non-UTF8 encoded character in one of the
> commit messages.
>
> I've ran into this problem in a sister project (tonel-migration), and do not
> have a proper resolution yet. I was forcing everything to be unicode, so I need
> a better way to read and write encoded strings. :<

I think this is always going to be a tricky problem since the file may
simply be corrupt, and the resolution won't always be the same.  In my
case I was happy to simply remove the offending characters.

I do think it would be nice if the migration tool at least stopped and
warned the user that the input is invalid.

Would you be open to a PR that incorporates Sven's changes suggested
above?



On Tue, Dec 05, 2017 at 09:05:24AM +0100, Sven Van Caekenberghe wrote:
>
> Maybe ZnCharacterEncoder class>>#detectEncoding: can help, although it
> is far from perfect.

In this particular case I think it is better to try decoding with UTF8,
since #detectEncoding: isn't completely reliable and we just want to
know if it isn't valid UTF8.


Cheers,
Alistair

Reply | Threaded
Open this post in threaded view
|

Re: Moving from mc to tonel?

Peter Uhnak
Would you be open to a PR that incorporates Sven's changes suggested above?

Certainly. One thing to keep in mind is that I had to modify MemoryStore/MemoryHandle so it can process unicode at all, but it also forces everything to be unicode.

I hate having to deal with string encoding, and know as little as possible. :-)

Apparently me taking this approach resulted in the current situation. :-D

Peter

On Tue, Dec 5, 2017 at 10:27 AM, Alistair Grant <[hidden email]> wrote:
On Tue, Dec 05, 2017 at 08:56:53AM +0100, Sven Van Caekenberghe wrote:
>
> > On 5 Dec 2017, at 08:34, Alistair Grant <[hidden email]> wrote:
> >
> > On 5 December 2017 at 03:41, Martin Dias <[hidden email]> wrote:
> >> I suspect it's related to the large number of commits in my repo. I made
> >> some tweaks and succeeded to create the fast-import file. But I get:
> >>
> >> fatal: Unsupported command: .
> >> fast-import: dumping crash report to .git/fast_import_crash_10301
> >>
> >> Do you recognize this error?
> >> I will check my changes tweaking the git-migration tool to see if I modified
> >> some behavior my mistake...
> >
> > I had the same error just last night.
> >
> > In my case, it turned out to be a non-UTF8 encoded character in one of
> > the commit messages.
> >
> > I tracked it down by looking at the crash report and searching for a
> > nearby command.  I've deleted the crash reports now, but I think it
> > was the number associated with a mark command that got me near the
> > problem character in the fast-import file.
> >
> > I also modified the code to halt whenever it found a non-UTF8
> > character.  I'm sure there are better ways to do this, but:
> >
> >
> > GitMigrationCommitInfo>>inlineDataFor: aString
> >
> >    | theString |
> >    theString := aString.
> >    "Ensure the message has valid UTF-8 encoding (this will raise an
> > error if it doesn't)"
> >    [ (ZnCharacterEncoder newForEncoding: 'utf8') decodeBytes: aString
> > asByteArray ]
> >        on: Error
> >        do: [ :ex | self halt: 'Illegal string encoding'.
> >            theString := aString select: [ :each | each asciiValue
> > between: 32 and: 128 ] ].
> >    ^ 'data ' , theString size asString , String cr , (theString
> > ifEmpty: [ '' ] ifNotEmpty: [ theString , String cr ])
>
> There is also ByteArray>>#utf8Decoded (as well as
> String>>#utf8Encoded), both using the newer ZnCharacterEncoders. So
> you could write it shorter as:
>
>   aString asByteArray utf8Decoded
>
> Instead of Error you could use the more intention revealing ZnCharacterEncodingError.

I agree, this was a hack to get me past the immediate issue and see if
I could get the migration to work.


> Apart from that, and I known you did not create this mess, encoding a
> String into a String, although it can be done, is so wrong. You encode
> a String (a collection of Characters, of Unicode code points) into a
> ByteArray and you decode a ByteArray into a String.
>
> Sending #asByteArray to a String in almost always wrong, as is sending
> #asString to a ByteArray. These are implicit conversions (null
> conversions, like ZnNullEncoder) that only work for pure ASCII or
> Latin1 (iso-8859-1), but not for the much richer set of Characters
> that Pharo supports. So these will eventually fail, one day, in a
> country far away.

I hate having to deal with string encoding, and know as little as
possible. :-)



On Tue, Dec 05, 2017 at 08:59:34AM +0100, Peter Uhn??k wrote:
> > In my case, it turned out to be a non-UTF8 encoded character in one of the
> commit messages.
>
> I've ran into this problem in a sister project (tonel-migration), and do not
> have a proper resolution yet. I was forcing everything to be unicode, so I need
> a better way to read and write encoded strings. :<

I think this is always going to be a tricky problem since the file may
simply be corrupt, and the resolution won't always be the same.  In my
case I was happy to simply remove the offending characters.

I do think it would be nice if the migration tool at least stopped and
warned the user that the input is invalid.

Would you be open to a PR that incorporates Sven's changes suggested
above?



On Tue, Dec 05, 2017 at 09:05:24AM +0100, Sven Van Caekenberghe wrote:
>
> Maybe ZnCharacterEncoder class>>#detectEncoding: can help, although it
> is far from perfect.

In this particular case I think it is better to try decoding with UTF8,
since #detectEncoding: isn't completely reliable and we just want to
know if it isn't valid UTF8.


Cheers,
Alistair


Reply | Threaded
Open this post in threaded view
|

Re: Moving from mc to tonel?

alistairgrant
Hi Peter,

On 5 December 2017 at 11:55, Peter Uhnák <[hidden email]> wrote:
>> Would you be open to a PR that incorporates Sven's changes suggested
>> above?
>
> Certainly. One thing to keep in mind is that I had to modify
> MemoryStore/MemoryHandle so it can process unicode at all, but it also
> forces everything to be unicode.

This is probably a good thing.  The fast-import documentation says
that it is strict about character encoding and only accepts UTF-8.

I'll let you know when the PRs are ready.

Thanks,
Alistair


>> I hate having to deal with string encoding, and know as little as
>> possible. :-)
>
> Apparently me taking this approach resulted in the current situation. :-D
>
> Peter
>
> On Tue, Dec 5, 2017 at 10:27 AM, Alistair Grant <[hidden email]>
> wrote:
>>
>> On Tue, Dec 05, 2017 at 08:56:53AM +0100, Sven Van Caekenberghe wrote:
>> >
>> > > On 5 Dec 2017, at 08:34, Alistair Grant <[hidden email]> wrote:
>> > >
>> > > On 5 December 2017 at 03:41, Martin Dias <[hidden email]> wrote:
>> > >> I suspect it's related to the large number of commits in my repo. I
>> > >> made
>> > >> some tweaks and succeeded to create the fast-import file. But I get:
>> > >>
>> > >> fatal: Unsupported command: .
>> > >> fast-import: dumping crash report to .git/fast_import_crash_10301
>> > >>
>> > >> Do you recognize this error?
>> > >> I will check my changes tweaking the git-migration tool to see if I
>> > >> modified
>> > >> some behavior my mistake...
>> > >
>> > > I had the same error just last night.
>> > >
>> > > In my case, it turned out to be a non-UTF8 encoded character in one of
>> > > the commit messages.
>> > >
>> > > I tracked it down by looking at the crash report and searching for a
>> > > nearby command.  I've deleted the crash reports now, but I think it
>> > > was the number associated with a mark command that got me near the
>> > > problem character in the fast-import file.
>> > >
>> > > I also modified the code to halt whenever it found a non-UTF8
>> > > character.  I'm sure there are better ways to do this, but:
>> > >
>> > >
>> > > GitMigrationCommitInfo>>inlineDataFor: aString
>> > >
>> > >    | theString |
>> > >    theString := aString.
>> > >    "Ensure the message has valid UTF-8 encoding (this will raise an
>> > > error if it doesn't)"
>> > >    [ (ZnCharacterEncoder newForEncoding: 'utf8') decodeBytes: aString
>> > > asByteArray ]
>> > >        on: Error
>> > >        do: [ :ex | self halt: 'Illegal string encoding'.
>> > >            theString := aString select: [ :each | each asciiValue
>> > > between: 32 and: 128 ] ].
>> > >    ^ 'data ' , theString size asString , String cr , (theString
>> > > ifEmpty: [ '' ] ifNotEmpty: [ theString , String cr ])
>> >
>> > There is also ByteArray>>#utf8Decoded (as well as
>> > String>>#utf8Encoded), both using the newer ZnCharacterEncoders. So
>> > you could write it shorter as:
>> >
>> >   aString asByteArray utf8Decoded
>> >
>> > Instead of Error you could use the more intention revealing
>> > ZnCharacterEncodingError.
>>
>> I agree, this was a hack to get me past the immediate issue and see if
>> I could get the migration to work.
>>
>>
>> > Apart from that, and I known you did not create this mess, encoding a
>> > String into a String, although it can be done, is so wrong. You encode
>> > a String (a collection of Characters, of Unicode code points) into a
>> > ByteArray and you decode a ByteArray into a String.
>> >
>> > Sending #asByteArray to a String in almost always wrong, as is sending
>> > #asString to a ByteArray. These are implicit conversions (null
>> > conversions, like ZnNullEncoder) that only work for pure ASCII or
>> > Latin1 (iso-8859-1), but not for the much richer set of Characters
>> > that Pharo supports. So these will eventually fail, one day, in a
>> > country far away.
>>
>> I hate having to deal with string encoding, and know as little as
>> possible. :-)
>>
>>
>>
>> On Tue, Dec 05, 2017 at 08:59:34AM +0100, Peter Uhn??k wrote:
>> > > In my case, it turned out to be a non-UTF8 encoded character in one of
>> > > the
>> > commit messages.
>> >
>> > I've ran into this problem in a sister project (tonel-migration), and do
>> > not
>> > have a proper resolution yet. I was forcing everything to be unicode, so
>> > I need
>> > a better way to read and write encoded strings. :<
>>
>> I think this is always going to be a tricky problem since the file may
>> simply be corrupt, and the resolution won't always be the same.  In my
>> case I was happy to simply remove the offending characters.
>>
>> I do think it would be nice if the migration tool at least stopped and
>> warned the user that the input is invalid.
>>
>> Would you be open to a PR that incorporates Sven's changes suggested
>> above?
>>
>>
>>
>> On Tue, Dec 05, 2017 at 09:05:24AM +0100, Sven Van Caekenberghe wrote:
>> >
>> > Maybe ZnCharacterEncoder class>>#detectEncoding: can help, although it
>> > is far from perfect.
>>
>> In this particular case I think it is better to try decoding with UTF8,
>> since #detectEncoding: isn't completely reliable and we just want to
>> know if it isn't valid UTF8.
>>
>>
>> Cheers,
>> Alistair
>>
>

Reply | Threaded
Open this post in threaded view
|

Re: Moving from mc to tonel?

Stephan Eggermont-3
In reply to this post by Peter Uhnak
On 05-12-17 08:59, Peter Uhnák wrote:
>  > In my case, it turned out to be a non-UTF8 encoded character in one
> of the commit messages.
>
> I've ran into this problem in a sister project (tonel-migration), and do
> not have a proper resolution yet. I was forcing everything to be
> unicode, so I need a better way to read and write encoded strings. :<

To be exact, exactly none of the older commits will be UTF8 encoded. For
most it doesn't matter as they are ASCII, but if we want to have a
change of converting older french or german code (or japanese), we need
support for what was done with WideString. That probably needs a look in
the squeak mailing list archives.

Stephan


Reply | Threaded
Open this post in threaded view
|

Re: Moving from mc to tonel?

Nicolas Cellier


2017-12-13 11:09 GMT+01:00 stephan <[hidden email]>:
On 05-12-17 08:59, Peter Uhnák wrote:
 > In my case, it turned out to be a non-UTF8 encoded character in one of the commit messages.

I've ran into this problem in a sister project (tonel-migration), and do not have a proper resolution yet. I was forcing everything to be unicode, so I need a better way to read and write encoded strings. :<

To be exact, exactly none of the older commits will be UTF8 encoded. For most it doesn't matter as they are ASCII, but if we want to have a change of converting older french or german code (or japanese), we need support for what was done with WideString. That probably needs a look in the squeak mailing list archives.

Stephan


The thread would be"MC should really write snaphsot/source.st in UTF8" end of may 2013.
I've made the switch in Squeak short after (june 2013).
I've propsed a fix for Pharo (issues 10801 and 10811) that has been integrated in 3.0 end of june 2013.

So the .mcz created before this might be problematic...

As I've just reported here, I think there are problems in tonel
(like if it was reading and writing with different encoding...)


Reply | Threaded
Open this post in threaded view
|

Re: Moving from mc to tonel?

Henrik Sperre Johansen
In reply to this post by Stephan Eggermont-3
Stephan Eggermont-3 wrote

> On 05-12-17 08:59, Peter Uhnák wrote:
>>  > In my case, it turned out to be a non-UTF8 encoded character in one
>> of the commit messages.
>>
>> I've ran into this problem in a sister project (tonel-migration), and do
>> not have a proper resolution yet. I was forcing everything to be
>> unicode, so I need a better way to read and write encoded strings. :<
>
> To be exact, exactly none of the older commits will be UTF8 encoded. For
> most it doesn't matter as they are ASCII, but if we want to have a
> change of converting older french or german code (or japanese), we need
> support for what was done with WideString. That probably needs a look in
> the squeak mailing list archives.
>
> Stephan

The mcz reader used to import the .bin file (which contained correctly
serialized WideStrings), only falling back to reading the .st file if .bin
was not present, has this changed?

Or do these tools explicitly ignore the .bin file and try to read the .st
file directly?
If so, the MCDataStream class used to read .bin format still seems to be in
the image...

One could also create a tool to check/convert all mcz in a repo as a
preprocess;
if .bin contents decode as WideString,
check that .st starts with utf8 BOM,
if not, convert.

Cheers,
Henry



--
Sent from: http://forum.world.st/Pharo-Smalltalk-Developers-f1294837.html

Reply | Threaded
Open this post in threaded view
|

Re: Moving from mc to tonel?

Stephane Ducasse-3
It would be great to be able to sanitize the files.
We should get a test about tonel misbehavior.

Stef

On Thu, Dec 14, 2017 at 3:11 PM, Henrik Sperre Johansen
<[hidden email]> wrote:

> Stephan Eggermont-3 wrote
>> On 05-12-17 08:59, Peter Uhnák wrote:
>>>  > In my case, it turned out to be a non-UTF8 encoded character in one
>>> of the commit messages.
>>>
>>> I've ran into this problem in a sister project (tonel-migration), and do
>>> not have a proper resolution yet. I was forcing everything to be
>>> unicode, so I need a better way to read and write encoded strings. :<
>>
>> To be exact, exactly none of the older commits will be UTF8 encoded. For
>> most it doesn't matter as they are ASCII, but if we want to have a
>> change of converting older french or german code (or japanese), we need
>> support for what was done with WideString. That probably needs a look in
>> the squeak mailing list archives.
>>
>> Stephan
>
> The mcz reader used to import the .bin file (which contained correctly
> serialized WideStrings), only falling back to reading the .st file if .bin
> was not present, has this changed?
>
> Or do these tools explicitly ignore the .bin file and try to read the .st
> file directly?
> If so, the MCDataStream class used to read .bin format still seems to be in
> the image...
>
> One could also create a tool to check/convert all mcz in a repo as a
> preprocess;
> if .bin contents decode as WideString,
> check that .st starts with utf8 BOM,
> if not, convert.
>
> Cheers,
> Henry
>
>
>
> --
> Sent from: http://forum.world.st/Pharo-Smalltalk-Developers-f1294837.html
>

Reply | Threaded
Open this post in threaded view
|

Re: Moving from mc to tonel?

tinchodias
Hi, any news or experiences on the migration tools? I'd try again on next days.
Martín

On Sat, Dec 16, 2017 at 5:06 AM, Stephane Ducasse <[hidden email]> wrote:
It would be great to be able to sanitize the files.
We should get a test about tonel misbehavior.

Stef

On Thu, Dec 14, 2017 at 3:11 PM, Henrik Sperre Johansen
<[hidden email]> wrote:
> Stephan Eggermont-3 wrote
>> On 05-12-17 08:59, Peter Uhnák wrote:
>>>  > In my case, it turned out to be a non-UTF8 encoded character in one
>>> of the commit messages.
>>>
>>> I've ran into this problem in a sister project (tonel-migration), and do
>>> not have a proper resolution yet. I was forcing everything to be
>>> unicode, so I need a better way to read and write encoded strings. :<
>>
>> To be exact, exactly none of the older commits will be UTF8 encoded. For
>> most it doesn't matter as they are ASCII, but if we want to have a
>> change of converting older french or german code (or japanese), we need
>> support for what was done with WideString. That probably needs a look in
>> the squeak mailing list archives.
>>
>> Stephan
>
> The mcz reader used to import the .bin file (which contained correctly
> serialized WideStrings), only falling back to reading the .st file if .bin
> was not present, has this changed?
>
> Or do these tools explicitly ignore the .bin file and try to read the .st
> file directly?
> If so, the MCDataStream class used to read .bin format still seems to be in
> the image...
>
> One could also create a tool to check/convert all mcz in a repo as a
> preprocess;
> if .bin contents decode as WideString,
> check that .st starts with utf8 BOM,
> if not, convert.
>
> Cheers,
> Henry
>
>
>
> --
> Sent from: http://forum.world.st/Pharo-Smalltalk-Developers-f1294837.html
>


12