Reproducible VM crash on Win32 with callbacks

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

Reproducible VM crash on Win32 with callbacks

Guillermo Polito
 
Hi all,

With Pablo we have been tracking a bug on win32 that produces a segmentation fault on callback return. We can reproduce it 100% certainly when running the Alien qsort example both in latest pharo and squeak versions.

After some debugging, it would seem that the thunkEntry function is over-optimized in 32 bits, corrupting the (C) stack. This was particularly boring because compiling the VM in debug mode was taking the bug away :-). We have cornered the bug and checked that callbacks do work ok if we disable optimizations just for the thunkEntry function like this:

long
__attribute__((optimize("O0"))) thunkEntry(void *thunkp, sqIntptr_t *stackp)

The thing is that latest mingw which we use for compiling the windows VM even in travis, now comes with gcc 7.4.0 which has a lot more of optimizations than before. Just having O1 also produces the same error.

We have tried disabling some particular optimizations like fno-combine-stack-adjustments but with no result so far.

The strange thing is that other callbacks like the ones coming from libgit work ok.

Has somebody taken a look into this too?
How would you suggest that we move on with this?
From our side, we think that using a pragma to disable optimizations for thunkEntry in the case of win32 looks okeyish at least to make the bug go away.

Cheers,
Guille & Pablo
Reply | Threaded
Open this post in threaded view
|

Re: Reproducible VM crash on Win32 with callbacks

Eliot Miranda-2
 
Hi Guille,

On Tue, Jan 15, 2019 at 8:45 AM Guillermo Polito <[hidden email]> wrote:
 
Hi all,

With Pablo we have been tracking a bug on win32 that produces a segmentation fault on callback return. We can reproduce it 100% certainly when running the Alien qsort example both in latest pharo and squeak versions.

After some debugging, it would seem that the thunkEntry function is over-optimized in 32 bits, corrupting the (C) stack. This was particularly boring because compiling the VM in debug mode was taking the bug away :-). We have cornered the bug and checked that callbacks do work ok if we disable optimizations just for the thunkEntry function like this:

long
__attribute__((optimize("O0"))) thunkEntry(void *thunkp, sqIntptr_t *stackp) 

The thing is that latest mingw which we use for compiling the windows VM even in travis, now comes with gcc 7.4.0 which has a lot more of optimizations than before. Just having O1 also produces the same error.

We have tried disabling some particular optimizations like fno-combine-stack-adjustments but with no result so far.

The strange thing is that other callbacks like the ones coming from libgit work ok.

Has somebody taken a look into this too?
How would you suggest that we move on with this?

Before adding the pragma to the source also look at whether using the volatile keyword on variables in thunkProcess fixes the issue; for example 

    volatile VMCallbackContext vmcc;
    volatile VMCallbackContext *previousCallbackContext;
    volatile int flags, returnType;

.  The other thing to do is to generate the machine code for thinkProcess with gcc 7.x and an older version that does not crash and compare to try and find out what specific optimization is causing the crash.

Finally, if you do find you have to use the pragma, please write the fix as

long __attribute__((optimize("O0")))
thunkEntry(void *thunkp, sqIntptr_t *stackp)

to keep the definition starting on a new line, which helps when using command-line tools to look for definitions outside of an ide.
 
From our side, we think that using a pragma to disable optimizations for thunkEntry in the case of win32 looks okeyish at least to make the bug go away.

Yes, but I expect it is actually that the volatile keyword has not been used (a mistake of mine).  Here's a relevant stack overflow answer:

And if volatile does fix the issue, please apply it to the other thinkEntry implementations.

Cheers,
Guille & Pablo

Cheers!
_,,,^..^,,,_
best, Eliot
Reply | Threaded
Open this post in threaded view
|

Re: Reproducible VM crash on Win32 with callbacks

Eliot Miranda-2
 
Hi Guille, Hi Pablo,

    I misspoke...

On Tue, Jan 15, 2019 at 9:00 AM Eliot Miranda <[hidden email]> wrote:
Hi Guille,

On Tue, Jan 15, 2019 at 8:45 AM Guillermo Polito <[hidden email]> wrote:
 
Hi all,

With Pablo we have been tracking a bug on win32 that produces a segmentation fault on callback return. We can reproduce it 100% certainly when running the Alien qsort example both in latest pharo and squeak versions.

After some debugging, it would seem that the thunkEntry function is over-optimized in 32 bits, corrupting the (C) stack. This was particularly boring because compiling the VM in debug mode was taking the bug away :-). We have cornered the bug and checked that callbacks do work ok if we disable optimizations just for the thunkEntry function like this:

long
__attribute__((optimize("O0"))) thunkEntry(void *thunkp, sqIntptr_t *stackp) 

The thing is that latest mingw which we use for compiling the windows VM even in travis, now comes with gcc 7.4.0 which has a lot more of optimizations than before. Just having O1 also produces the same error.

We have tried disabling some particular optimizations like fno-combine-stack-adjustments but with no result so far.

The strange thing is that other callbacks like the ones coming from libgit work ok.

Has somebody taken a look into this too?
How would you suggest that we move on with this?

Before adding the pragma to the source also look at whether using the volatile keyword on variables in thunkProcess fixes the issue; for example 

    volatile VMCallbackContext vmcc;
    volatile VMCallbackContext *previousCallbackContext;
    volatile int flags, returnType;

.  The other thing to do is to generate the machine code for thinkProcess with gcc 7.x and an older version that does not crash and compare to try and find out what specific optimization is causing the crash.

I meant generate the assembly with -S, e.g. -O1 -S.  You can also compare _O0 -S with -O1 -S for gcc 7.x, and generate -O1 -S with and without the volatile keyword and see what differences that makes.

Finally, if you do find you have to use the pragma, please write the fix as

long __attribute__((optimize("O0")))
thunkEntry(void *thunkp, sqIntptr_t *stackp)

to keep the definition starting on a new line, which helps when using command-line tools to look for definitions outside of an ide.
 
From our side, we think that using a pragma to disable optimizations for thunkEntry in the case of win32 looks okeyish at least to make the bug go away.

Yes, but I expect it is actually that the volatile keyword has not been used (a mistake of mine).  Here's a relevant stack overflow answer:

And if volatile does fix the issue, please apply it to the other thinkEntry implementations.

Cheers,
Guille & Pablo

Cheers!
 
_,,,^..^,,,_
best, Eliot
Reply | Threaded
Open this post in threaded view
|

Re: Reproducible VM crash on Win32 with callbacks

Guillermo Polito
 
Hi Eliot!

Thanks for the quick answer :)

On Tue, Jan 15, 2019 at 6:04 PM Eliot Miranda <[hidden email]> wrote:
 
Hi Guille, Hi Pablo,

    I misspoke...

On Tue, Jan 15, 2019 at 9:00 AM Eliot Miranda <[hidden email]> wrote:
Hi Guille,

On Tue, Jan 15, 2019 at 8:45 AM Guillermo Polito <[hidden email]> wrote:
 
Hi all,

With Pablo we have been tracking a bug on win32 that produces a segmentation fault on callback return. We can reproduce it 100% certainly when running the Alien qsort example both in latest pharo and squeak versions.

After some debugging, it would seem that the thunkEntry function is over-optimized in 32 bits, corrupting the (C) stack. This was particularly boring because compiling the VM in debug mode was taking the bug away :-). We have cornered the bug and checked that callbacks do work ok if we disable optimizations just for the thunkEntry function like this:

long
__attribute__((optimize("O0"))) thunkEntry(void *thunkp, sqIntptr_t *stackp) 

The thing is that latest mingw which we use for compiling the windows VM even in travis, now comes with gcc 7.4.0 which has a lot more of optimizations than before. Just having O1 also produces the same error.

We have tried disabling some particular optimizations like fno-combine-stack-adjustments but with no result so far.

The strange thing is that other callbacks like the ones coming from libgit work ok.

Has somebody taken a look into this too?
How would you suggest that we move on with this?

Before adding the pragma to the source also look at whether using the volatile keyword on variables in thunkProcess fixes the issue; for example 

    volatile VMCallbackContext vmcc;
    volatile VMCallbackContext *previousCallbackContext;
    volatile int flags, returnType;

Ok I'm trying this now. At least I can launch compilation and do something else [https://xkcd.com/303/] while it compiles :)
 

.  The other thing to do is to generate the machine code for thinkProcess with gcc 7.x and an older version that does not crash and compare to try and find out what specific optimization is causing the crash.

I meant generate the assembly with -S, e.g. -O1 -S.

Sure, this is what we understood ^^.
 
You can also compare _O0 -S with -O1 -S for gcc 7.x, and generate -O1 -S with and without the volatile keyword and see what differences that makes.

I'll do this this afternoon If I have some time...


Finally, if you do find you have to use the pragma, please write the fix as

long __attribute__((optimize("O0")))
thunkEntry(void *thunkp, sqIntptr_t *stackp)

to keep the definition starting on a new line, which helps when using command-line tools to look for definitions outside of an ide.

Sure! no problem!

Tx Again!
 
 
From our side, we think that using a pragma to disable optimizations for thunkEntry in the case of win32 looks okeyish at least to make the bug go away.

Yes, but I expect it is actually that the volatile keyword has not been used (a mistake of mine).  Here's a relevant stack overflow answer:

And if volatile does fix the issue, please apply it to the other thinkEntry implementations.

Cheers,
Guille & Pablo

Cheers!
 
_,,,^..^,,,_
best, Eliot


--

   

Guille Polito

Research Engineer

Centre de Recherche en Informatique, Signal et Automatique de Lille

CRIStAL - UMR 9189

French National Center for Scientific Research - http://www.cnrs.fr


Web: http://guillep.github.io

Phone: +33 06 52 70 66 13

Reply | Threaded
Open this post in threaded view
|

Re: Reproducible VM crash on Win32 with callbacks

Nicolas Cellier
 
And if it fails with new versions of gcc, it would also be intersting to test with clang.
For Win64 I had to use clang...

Le mer. 16 janv. 2019 à 09:59, Guillermo Polito <[hidden email]> a écrit :
 
Hi Eliot!

Thanks for the quick answer :)

On Tue, Jan 15, 2019 at 6:04 PM Eliot Miranda <[hidden email]> wrote:
 
Hi Guille, Hi Pablo,

    I misspoke...

On Tue, Jan 15, 2019 at 9:00 AM Eliot Miranda <[hidden email]> wrote:
Hi Guille,

On Tue, Jan 15, 2019 at 8:45 AM Guillermo Polito <[hidden email]> wrote:
 
Hi all,

With Pablo we have been tracking a bug on win32 that produces a segmentation fault on callback return. We can reproduce it 100% certainly when running the Alien qsort example both in latest pharo and squeak versions.

After some debugging, it would seem that the thunkEntry function is over-optimized in 32 bits, corrupting the (C) stack. This was particularly boring because compiling the VM in debug mode was taking the bug away :-). We have cornered the bug and checked that callbacks do work ok if we disable optimizations just for the thunkEntry function like this:

long
__attribute__((optimize("O0"))) thunkEntry(void *thunkp, sqIntptr_t *stackp) 

The thing is that latest mingw which we use for compiling the windows VM even in travis, now comes with gcc 7.4.0 which has a lot more of optimizations than before. Just having O1 also produces the same error.

We have tried disabling some particular optimizations like fno-combine-stack-adjustments but with no result so far.

The strange thing is that other callbacks like the ones coming from libgit work ok.

Has somebody taken a look into this too?
How would you suggest that we move on with this?

Before adding the pragma to the source also look at whether using the volatile keyword on variables in thunkProcess fixes the issue; for example 

    volatile VMCallbackContext vmcc;
    volatile VMCallbackContext *previousCallbackContext;
    volatile int flags, returnType;

Ok I'm trying this now. At least I can launch compilation and do something else [https://xkcd.com/303/] while it compiles :)
 

.  The other thing to do is to generate the machine code for thinkProcess with gcc 7.x and an older version that does not crash and compare to try and find out what specific optimization is causing the crash.

I meant generate the assembly with -S, e.g. -O1 -S.

Sure, this is what we understood ^^.
 
You can also compare _O0 -S with -O1 -S for gcc 7.x, and generate -O1 -S with and without the volatile keyword and see what differences that makes.

I'll do this this afternoon If I have some time...


Finally, if you do find you have to use the pragma, please write the fix as

long __attribute__((optimize("O0")))
thunkEntry(void *thunkp, sqIntptr_t *stackp)

to keep the definition starting on a new line, which helps when using command-line tools to look for definitions outside of an ide.

Sure! no problem!

Tx Again!
 
 
From our side, we think that using a pragma to disable optimizations for thunkEntry in the case of win32 looks okeyish at least to make the bug go away.

Yes, but I expect it is actually that the volatile keyword has not been used (a mistake of mine).  Here's a relevant stack overflow answer:

And if volatile does fix the issue, please apply it to the other thinkEntry implementations.

Cheers,
Guille & Pablo

Cheers!
 
_,,,^..^,,,_
best, Eliot


--

   

Guille Polito

Research Engineer

Centre de Recherche en Informatique, Signal et Automatique de Lille

CRIStAL - UMR 9189

French National Center for Scientific Research - http://www.cnrs.fr


Web: http://guillep.github.io

Phone: +33 06 52 70 66 13

Reply | Threaded
Open this post in threaded view
|

Re: Reproducible VM crash on Win32 with callbacks

Guillermo Polito
 
Hi all,

I've tested using the volatile keyword in all variables in the function and the problem is still there.
So while I get some time to look at the asm outputs I've issued a PR with the discussed solution in here: https://github.com/OpenSmalltalk/opensmalltalk-vm/pull/353

@Nicolas, I was able to compile and debug win64 almost out of the box. However, I could not make the visual studio code debugging extension work for the 32bit version.
Where you compiling on clang with just the makefiles or you have used visual studio for that?
Also, could you share your visual studio project files? I'd like to try debugging with it :)

Tx all,
Guille


On Wed, Jan 16, 2019 at 10:18 AM Nicolas Cellier <[hidden email]> wrote:
 
And if it fails with new versions of gcc, it would also be intersting to test with clang.
For Win64 I had to use clang...

Le mer. 16 janv. 2019 à 09:59, Guillermo Polito <[hidden email]> a écrit :
 
Hi Eliot!

Thanks for the quick answer :)

On Tue, Jan 15, 2019 at 6:04 PM Eliot Miranda <[hidden email]> wrote:
 
Hi Guille, Hi Pablo,

    I misspoke...

On Tue, Jan 15, 2019 at 9:00 AM Eliot Miranda <[hidden email]> wrote:
Hi Guille,

On Tue, Jan 15, 2019 at 8:45 AM Guillermo Polito <[hidden email]> wrote:
 
Hi all,

With Pablo we have been tracking a bug on win32 that produces a segmentation fault on callback return. We can reproduce it 100% certainly when running the Alien qsort example both in latest pharo and squeak versions.

After some debugging, it would seem that the thunkEntry function is over-optimized in 32 bits, corrupting the (C) stack. This was particularly boring because compiling the VM in debug mode was taking the bug away :-). We have cornered the bug and checked that callbacks do work ok if we disable optimizations just for the thunkEntry function like this:

long
__attribute__((optimize("O0"))) thunkEntry(void *thunkp, sqIntptr_t *stackp) 

The thing is that latest mingw which we use for compiling the windows VM even in travis, now comes with gcc 7.4.0 which has a lot more of optimizations than before. Just having O1 also produces the same error.

We have tried disabling some particular optimizations like fno-combine-stack-adjustments but with no result so far.

The strange thing is that other callbacks like the ones coming from libgit work ok.

Has somebody taken a look into this too?
How would you suggest that we move on with this?

Before adding the pragma to the source also look at whether using the volatile keyword on variables in thunkProcess fixes the issue; for example 

    volatile VMCallbackContext vmcc;
    volatile VMCallbackContext *previousCallbackContext;
    volatile int flags, returnType;

Ok I'm trying this now. At least I can launch compilation and do something else [https://xkcd.com/303/] while it compiles :)
 

.  The other thing to do is to generate the machine code for thinkProcess with gcc 7.x and an older version that does not crash and compare to try and find out what specific optimization is causing the crash.

I meant generate the assembly with -S, e.g. -O1 -S.

Sure, this is what we understood ^^.
 
You can also compare _O0 -S with -O1 -S for gcc 7.x, and generate -O1 -S with and without the volatile keyword and see what differences that makes.

I'll do this this afternoon If I have some time...


Finally, if you do find you have to use the pragma, please write the fix as

long __attribute__((optimize("O0")))
thunkEntry(void *thunkp, sqIntptr_t *stackp)

to keep the definition starting on a new line, which helps when using command-line tools to look for definitions outside of an ide.

Sure! no problem!

Tx Again!
 
 
From our side, we think that using a pragma to disable optimizations for thunkEntry in the case of win32 looks okeyish at least to make the bug go away.

Yes, but I expect it is actually that the volatile keyword has not been used (a mistake of mine).  Here's a relevant stack overflow answer:

And if volatile does fix the issue, please apply it to the other thinkEntry implementations.

Cheers,
Guille & Pablo

Cheers!
 
_,,,^..^,,,_
best, Eliot


--

   

Guille Polito

Research Engineer

Centre de Recherche en Informatique, Signal et Automatique de Lille

CRIStAL - UMR 9189

French National Center for Scientific Research - http://www.cnrs.fr


Web: http://guillep.github.io

Phone: +33 06 52 70 66 13



--

   

Guille Polito

Research Engineer

Centre de Recherche en Informatique, Signal et Automatique de Lille

CRIStAL - UMR 9189

French National Center for Scientific Research - http://www.cnrs.fr


Web: http://guillep.github.io

Phone: +33 06 52 70 66 13

Reply | Threaded
Open this post in threaded view
|

Re: Reproducible VM crash on Win32 with callbacks

Eliot Miranda-2
 
Hi Guille,


On Jan 16, 2019, at 3:21 AM, Guillermo Polito <[hidden email]> wrote:

Hi all,

I've tested using the volatile keyword in all variables in the function and the problem is still there.

Then there is a compiler bug with GCC 7.4, and we should report it.  I’ll try and take a look.  How do you produce a build configuration that includes GCC 7.4?  I’ll need to do so to take a look.

So while I get some time to look at the asm outputs I've issued a PR with the discussed solution in here: https://github.com/OpenSmalltalk/opensmalltalk-vm/pull/353

@Nicolas, I was able to compile and debug win64 almost out of the box. However, I could not make the visual studio code debugging extension work for the 32bit version.
Where you compiling on clang with just the makefiles or you have used visual studio for that?
Also, could you share your visual studio project files? I'd like to try debugging with it :)

Tx all,
Guille


On Wed, Jan 16, 2019 at 10:18 AM Nicolas Cellier <[hidden email]> wrote:
 
And if it fails with new versions of gcc, it would also be intersting to test with clang.
For Win64 I had to use clang...

Le mer. 16 janv. 2019 à 09:59, Guillermo Polito <[hidden email]> a écrit :
 
Hi Eliot!

Thanks for the quick answer :)

On Tue, Jan 15, 2019 at 6:04 PM Eliot Miranda <[hidden email]> wrote:
 
Hi Guille, Hi Pablo,

    I misspoke...

On Tue, Jan 15, 2019 at 9:00 AM Eliot Miranda <[hidden email]> wrote:
Hi Guille,

On Tue, Jan 15, 2019 at 8:45 AM Guillermo Polito <[hidden email]> wrote:
 
Hi all,

With Pablo we have been tracking a bug on win32 that produces a segmentation fault on callback return. We can reproduce it 100% certainly when running the Alien qsort example both in latest pharo and squeak versions.

After some debugging, it would seem that the thunkEntry function is over-optimized in 32 bits, corrupting the (C) stack. This was particularly boring because compiling the VM in debug mode was taking the bug away :-). We have cornered the bug and checked that callbacks do work ok if we disable optimizations just for the thunkEntry function like this:

long
__attribute__((optimize("O0"))) thunkEntry(void *thunkp, sqIntptr_t *stackp) 

The thing is that latest mingw which we use for compiling the windows VM even in travis, now comes with gcc 7.4.0 which has a lot more of optimizations than before. Just having O1 also produces the same error.

We have tried disabling some particular optimizations like fno-combine-stack-adjustments but with no result so far.

The strange thing is that other callbacks like the ones coming from libgit work ok.

Has somebody taken a look into this too?
How would you suggest that we move on with this?

Before adding the pragma to the source also look at whether using the volatile keyword on variables in thunkProcess fixes the issue; for example 

    volatile VMCallbackContext vmcc;
    volatile VMCallbackContext *previousCallbackContext;
    volatile int flags, returnType;

Ok I'm trying this now. At least I can launch compilation and do something else [https://xkcd.com/303/] while it compiles :)
 

.  The other thing to do is to generate the machine code for thinkProcess with gcc 7.x and an older version that does not crash and compare to try and find out what specific optimization is causing the crash.

I meant generate the assembly with -S, e.g. -O1 -S.

Sure, this is what we understood ^^.
 
You can also compare _O0 -S with -O1 -S for gcc 7.x, and generate -O1 -S with and without the volatile keyword and see what differences that makes.

I'll do this this afternoon If I have some time...


Finally, if you do find you have to use the pragma, please write the fix as

long __attribute__((optimize("O0")))
thunkEntry(void *thunkp, sqIntptr_t *stackp)

to keep the definition starting on a new line, which helps when using command-line tools to look for definitions outside of an ide.

Sure! no problem!

Tx Again!
 
 
From our side, we think that using a pragma to disable optimizations for thunkEntry in the case of win32 looks okeyish at least to make the bug go away.

Yes, but I expect it is actually that the volatile keyword has not been used (a mistake of mine).  Here's a relevant stack overflow answer:

And if volatile does fix the issue, please apply it to the other thinkEntry implementations.

Cheers,
Guille & Pablo

Cheers!
 
_,,,^..^,,,_
best, Eliot


--

   

Guille Polito

Research Engineer

Centre de Recherche en Informatique, Signal et Automatique de Lille

CRIStAL - UMR 9189

French National Center for Scientific Research - http://www.cnrs.fr


Web: http://guillep.github.io

Phone: +33 06 52 70 66 13



--

   

Guille Polito

Research Engineer

Centre de Recherche en Informatique, Signal et Automatique de Lille

CRIStAL - UMR 9189

French National Center for Scientific Research - http://www.cnrs.fr


Web: http://guillep.github.io

Phone: +33 06 52 70 66 13

Reply | Threaded
Open this post in threaded view
|

Re: Reproducible VM crash on Win32 with callbacks

Eliot Miranda-2
In reply to this post by Guillermo Polito
 
Hi Guille,

On Jan 16, 2019, at 3:21 AM, Guillermo Polito <[hidden email]> wrote:

Hi all,

I've tested using the volatile keyword in all variables in the function and the problem is still there.
So while I get some time to look at the asm outputs I've issued a PR with the discussed solution in here: https://github.com/OpenSmalltalk/opensmalltalk-vm/pull/353

@Nicolas, I was able to compile and debug win64 almost out of the box. However, I could not make the visual studio code debugging extension work for the 32bit version.

I wonder whether it suffers from the same flaw as Cygwin.  IME one can only debug the 32-bit vm  using the 32-bit version if gdb which means that one is stuck using 32-bit Cygwin yo develop the 32-bit vm.  And of course the reverse is true; one can only debug the 64-bit version on 64-bit Cygwin.

Where you compiling on clang with just the makefiles or you have used visual studio for that?
Also, could you share your visual studio project files? I'd like to try debugging with it :)

Tx all,
Guille


On Wed, Jan 16, 2019 at 10:18 AM Nicolas Cellier <[hidden email]> wrote:
 
And if it fails with new versions of gcc, it would also be intersting to test with clang.
For Win64 I had to use clang...

Le mer. 16 janv. 2019 à 09:59, Guillermo Polito <[hidden email]> a écrit :
 
Hi Eliot!

Thanks for the quick answer :)

On Tue, Jan 15, 2019 at 6:04 PM Eliot Miranda <[hidden email]> wrote:
 
Hi Guille, Hi Pablo,

    I misspoke...

On Tue, Jan 15, 2019 at 9:00 AM Eliot Miranda <[hidden email]> wrote:
Hi Guille,

On Tue, Jan 15, 2019 at 8:45 AM Guillermo Polito <[hidden email]> wrote:
 
Hi all,

With Pablo we have been tracking a bug on win32 that produces a segmentation fault on callback return. We can reproduce it 100% certainly when running the Alien qsort example both in latest pharo and squeak versions.

After some debugging, it would seem that the thunkEntry function is over-optimized in 32 bits, corrupting the (C) stack. This was particularly boring because compiling the VM in debug mode was taking the bug away :-). We have cornered the bug and checked that callbacks do work ok if we disable optimizations just for the thunkEntry function like this:

long
__attribute__((optimize("O0"))) thunkEntry(void *thunkp, sqIntptr_t *stackp) 

The thing is that latest mingw which we use for compiling the windows VM even in travis, now comes with gcc 7.4.0 which has a lot more of optimizations than before. Just having O1 also produces the same error.

We have tried disabling some particular optimizations like fno-combine-stack-adjustments but with no result so far.

The strange thing is that other callbacks like the ones coming from libgit work ok.

Has somebody taken a look into this too?
How would you suggest that we move on with this?

Before adding the pragma to the source also look at whether using the volatile keyword on variables in thunkProcess fixes the issue; for example 

    volatile VMCallbackContext vmcc;
    volatile VMCallbackContext *previousCallbackContext;
    volatile int flags, returnType;

Ok I'm trying this now. At least I can launch compilation and do something else [https://xkcd.com/303/] while it compiles :)
 

.  The other thing to do is to generate the machine code for thinkProcess with gcc 7.x and an older version that does not crash and compare to try and find out what specific optimization is causing the crash.

I meant generate the assembly with -S, e.g. -O1 -S.

Sure, this is what we understood ^^.
 
You can also compare _O0 -S with -O1 -S for gcc 7.x, and generate -O1 -S with and without the volatile keyword and see what differences that makes.

I'll do this this afternoon If I have some time...


Finally, if you do find you have to use the pragma, please write the fix as

long __attribute__((optimize("O0")))
thunkEntry(void *thunkp, sqIntptr_t *stackp)

to keep the definition starting on a new line, which helps when using command-line tools to look for definitions outside of an ide.

Sure! no problem!

Tx Again!
 
 
From our side, we think that using a pragma to disable optimizations for thunkEntry in the case of win32 looks okeyish at least to make the bug go away.

Yes, but I expect it is actually that the volatile keyword has not been used (a mistake of mine).  Here's a relevant stack overflow answer:

And if volatile does fix the issue, please apply it to the other thinkEntry implementations.

Cheers,
Guille & Pablo

Cheers!
 
_,,,^..^,,,_
best, Eliot


--

   

Guille Polito

Research Engineer

Centre de Recherche en Informatique, Signal et Automatique de Lille

CRIStAL - UMR 9189

French National Center for Scientific Research - http://www.cnrs.fr


Web: http://guillep.github.io

Phone: +33 06 52 70 66 13



--

   

Guille Polito

Research Engineer

Centre de Recherche en Informatique, Signal et Automatique de Lille

CRIStAL - UMR 9189

French National Center for Scientific Research - http://www.cnrs.fr


Web: http://guillep.github.io

Phone: +33 06 52 70 66 13

Reply | Threaded
Open this post in threaded view
|

Re: Reproducible VM crash on Win32 with callbacks

Nicolas Cellier
In reply to this post by Guillermo Polito
 


Le mer. 16 janv. 2019 à 12:21, Guillermo Polito <[hidden email]> a écrit :
 
Hi all,

I've tested using the volatile keyword in all variables in the function and the problem is still there.
So while I get some time to look at the asm outputs I've issued a PR with the discussed solution in here: https://github.com/OpenSmalltalk/opensmalltalk-vm/pull/353

@Nicolas, I was able to compile and debug win64 almost out of the box. However, I could not make the visual studio code debugging extension work for the 32bit version.
Where you compiling on clang with just the makefiles or you have used visual studio for that?
Also, could you share your visual studio project files? I'd like to try debugging with it :)

Tx all,
Guille


Hi Guile,
Sure, I can give you the visual studio project files, just unzip the build.msvc under opensmalltalk-vm root dir.
But they are not of great value, because they are configured for MSVC compiler., 64bits and Squeak (less dependencies than Pharo...)
And despite the entry being marked as fixed, MSVC still suffer from this preprocessor bug
I cannot test MSVC code generation yet, just use the IDE for navigating in code, and reviewing compiler warnings so far.
Also, I have not maintained the 32bit version, maybe I have an obsolete one in the branch that I used for the bug report. No time to dig...

The bug can be easily produced with minheadless I presume, so isn't using CMake for generating a VS or VSC project an option?
I did this old style cmd for 64bits MSVC backend:

set PATH=%PATH%C:\Program Files\CMake\Bin;
set PATH=%PATH%C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\MSBuild\15.0\Bin;
cmake ../../../.. -G "Visual Studio 15 Win64" -DCMAKE_BUILD_TYPE="debug" -DPHARO_BRANDING=On -DSPUR_OBJECT_MODEL=On -DCOG_JIT=On

But I'm quite sure that we can just use the VSC UI to import the CMake list like shown here:

I've not played much with VSC, maybe it's time to inquire...



build.msvc.zip (99K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Reproducible VM crash on Win32 with callbacks

Guillermo Polito
In reply to this post by Eliot Miranda-2
 
Hi Eliot,

On Wed, Jan 16, 2019 at 4:53 PM Eliot Miranda <[hidden email]> wrote:
 
Hi Guille,


On Jan 16, 2019, at 3:21 AM, Guillermo Polito <[hidden email]> wrote:

Hi all,

I've tested using the volatile keyword in all variables in the function and the problem is still there.

Then there is a compiler bug with GCC 7.4, and we should report it.  I’ll try and take a look.  How do you produce a build configuration that includes GCC 7.4?  I’ll need to do so to take a look.

Hi Eliot,

Well, I've just setup a normal environment on windows 10 with cygwin, installed mingw-gcc through cygwin, exactly the same as the CI does.
Once I've setup my environment, I just go to the build*/pharo.cog.spur directory and ./mvm -f
I've not modified any makefile whatsoever, I'm 100% sure I'm using the default values.

Also, I'm running that from a (oh-my-)zsh bash installed on top of cygwin, but I don't think that makes a difference, it's just for my convenience :)...
 

So while I get some time to look at the asm outputs I've issued a PR with the discussed solution in here: https://github.com/OpenSmalltalk/opensmalltalk-vm/pull/353

@Nicolas, I was able to compile and debug win64 almost out of the box. However, I could not make the visual studio code debugging extension work for the 32bit version.
Where you compiling on clang with just the makefiles or you have used visual studio for that?
Also, could you share your visual studio project files? I'd like to try debugging with it :)

Tx all,
Guille


On Wed, Jan 16, 2019 at 10:18 AM Nicolas Cellier <[hidden email]> wrote:
 
And if it fails with new versions of gcc, it would also be intersting to test with clang.
For Win64 I had to use clang...

Le mer. 16 janv. 2019 à 09:59, Guillermo Polito <[hidden email]> a écrit :
 
Hi Eliot!

Thanks for the quick answer :)

On Tue, Jan 15, 2019 at 6:04 PM Eliot Miranda <[hidden email]> wrote:
 
Hi Guille, Hi Pablo,

    I misspoke...

On Tue, Jan 15, 2019 at 9:00 AM Eliot Miranda <[hidden email]> wrote:
Hi Guille,

On Tue, Jan 15, 2019 at 8:45 AM Guillermo Polito <[hidden email]> wrote:
 
Hi all,

With Pablo we have been tracking a bug on win32 that produces a segmentation fault on callback return. We can reproduce it 100% certainly when running the Alien qsort example both in latest pharo and squeak versions.

After some debugging, it would seem that the thunkEntry function is over-optimized in 32 bits, corrupting the (C) stack. This was particularly boring because compiling the VM in debug mode was taking the bug away :-). We have cornered the bug and checked that callbacks do work ok if we disable optimizations just for the thunkEntry function like this:

long
__attribute__((optimize("O0"))) thunkEntry(void *thunkp, sqIntptr_t *stackp) 

The thing is that latest mingw which we use for compiling the windows VM even in travis, now comes with gcc 7.4.0 which has a lot more of optimizations than before. Just having O1 also produces the same error.

We have tried disabling some particular optimizations like fno-combine-stack-adjustments but with no result so far.

The strange thing is that other callbacks like the ones coming from libgit work ok.

Has somebody taken a look into this too?
How would you suggest that we move on with this?

Before adding the pragma to the source also look at whether using the volatile keyword on variables in thunkProcess fixes the issue; for example 

    volatile VMCallbackContext vmcc;
    volatile VMCallbackContext *previousCallbackContext;
    volatile int flags, returnType;

Ok I'm trying this now. At least I can launch compilation and do something else [https://xkcd.com/303/] while it compiles :)
 

.  The other thing to do is to generate the machine code for thinkProcess with gcc 7.x and an older version that does not crash and compare to try and find out what specific optimization is causing the crash.

I meant generate the assembly with -S, e.g. -O1 -S.

Sure, this is what we understood ^^.
 
You can also compare _O0 -S with -O1 -S for gcc 7.x, and generate -O1 -S with and without the volatile keyword and see what differences that makes.

I'll do this this afternoon If I have some time...


Finally, if you do find you have to use the pragma, please write the fix as

long __attribute__((optimize("O0")))
thunkEntry(void *thunkp, sqIntptr_t *stackp)

to keep the definition starting on a new line, which helps when using command-line tools to look for definitions outside of an ide.

Sure! no problem!

Tx Again!
 
 
From our side, we think that using a pragma to disable optimizations for thunkEntry in the case of win32 looks okeyish at least to make the bug go away.

Yes, but I expect it is actually that the volatile keyword has not been used (a mistake of mine).  Here's a relevant stack overflow answer:

And if volatile does fix the issue, please apply it to the other thinkEntry implementations.

Cheers,
Guille & Pablo

Cheers!
 
_,,,^..^,,,_
best, Eliot


--

   

Guille Polito

Research Engineer

Centre de Recherche en Informatique, Signal et Automatique de Lille

CRIStAL - UMR 9189

French National Center for Scientific Research - http://www.cnrs.fr


Web: http://guillep.github.io

Phone: +33 06 52 70 66 13



--

   

Guille Polito

Research Engineer

Centre de Recherche en Informatique, Signal et Automatique de Lille

CRIStAL - UMR 9189

French National Center for Scientific Research - http://www.cnrs.fr


Web: http://guillep.github.io

Phone: +33 06 52 70 66 13



--

   

Guille Polito

Research Engineer

Centre de Recherche en Informatique, Signal et Automatique de Lille

CRIStAL - UMR 9189

French National Center for Scientific Research - http://www.cnrs.fr


Web: http://guillep.github.io

Phone: +33 06 52 70 66 13