Debugging GCC code generation

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

Debugging GCC code generation

tesonep@gmail.com
Hi,
    this mail is related to Pharo because it is knowledge I found
debugging the build of the VM, but the rest is to document it and
perhaps someone will found it interesting (also I couldn't find it
easily using Google). Sorry for the long mail!

The problem
==========

The following code does not produce good code in 8.3 when using optimizations:

long __attribute__ ((noinline)) myFunc(long i, int index){
   long v;
   long x = i >> 3;

   v = x;
   return ((int*)(&v))[index];
}

#include <stdio.h>

int main(){

    long i;
    int x;

    scanf("%ld", &i);
    scanf("%d", &x);

    printf("%ld",myFunc(i,x));
}

Basically, with -02, it generates the following code:

myFunc:
     movslq %esi, %rsi
     movslq -8(%rsp,%rsi,4), %rax
     ret

And with -01 it generates the following code:

myFunc:
     sarq $3, %rdi
     movq %rdi, -8(%rsp)
     movslq %esi, %rsi
     movslq -8(%rsp,%rsi,4), %rax
     ret

As you can see, the more optimized version is losing the bit shift (or
any other operation).
To detect the problem it was not easy, basically, I have used the
Pharo Tests to detect the failing function and then to understand the
generation of code in GCC, we need to debug its generation.

The first snippet is generated using:

gcc -O2 prb.c -S -o prb.s -Wall

and the second using:

gcc -O1 prb.c -S -o prb.s -Wall

The GCC pipeline has different steps, I have centered my self in the
tree optimizations.
GCC dumps each of the intermediate states of the compilation, working
with C like trees. For example to get the optimized version of the
program, before generating the Assembler we have to do:

gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-optimized=/dev/stdout

This will generate:

__attribute__((noinline))
myFunc (long int i, int index)
{
  long int v;
  long unsigned int _1;
  long unsigned int _2;
  int * _3;
  int _4;
  long int _8;

  <bb 2> [local count: 1073741825]:
  _1 = (long unsigned int) index_7(D);
  _2 = _1 * 4;
  _3 = &v + _2;
  _4 = *_3;
  _8 = (long int) _4;
  v ={v} {CLOBBER};
  return _8;

}

This is the optimized SSA (static single assign) version of the
function (https://gcc.gnu.org/onlinedocs/gccint/SSA.html)
As you can see in this version the code is already optimized, and our
operations not correctly generated. We expect to see something like:

__attribute__((noinline))
myFunc (long int i, int index)
{
  long int x;
  long int v;
  long unsigned int _1;
  long unsigned int _2;
  int * _3;
  int _4;
  long int _10;

  <bb 2> [local count: 1073741825]:
  x_6 = i_5(D) >> 3;
  v = x_6;
  _1 = (long unsigned int) index_9(D);
  _2 = _1 * 4;
  _3 = &v + _2;
  _4 = *_3;
  _10 = (long int) _4;
  v ={v} {CLOBBER};
  return _10;

}

In the first SSA version, we are lacking the shift operation:

  x_6 = i_5(D) >> 3;
  v = x_6;

So, we need to see in which of the optimizations and transformations
our code is lost:
To see all the steps we should execute:

gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-all

This will generate a lot, really a lot, of files in the same directory.
They have the name: prb.c.xxx.yyyy
Where xxx is the number of the step, and yyyy is the name of the step.
Each of the files contains the result of applying the changes, so what
I have done is looking in binary search where my code was lost.

I have found that the problem was in the first dead code elimination
step (prb.c.041t.cddce1)
GCC does not like the crap code that we are writing, as we are copying
a stack variable and then accessing directly to the memory:

v = x;
return ((int*)(&v))[index];

So, basically, it was assuming that we were only using v and not x, so
all the code modifying x is described (this optimization is called
"dead store code deletion"). Basically tries to remove all the code
that is affecting stack (temporary) variables that are never used.

I am not sure if this is a bug in GCC, so I will report it. But I have
fixed the problem writing the code in a proper way.

Sorry again for the long mail, I wanted to store the knowledge and
propagate it before I eventually will flush it. I like these things,
but I am not debugging the GCC optimization pipeline all the days.

--
Pablo Tesone.
[hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Debugging GCC code generation

Aliaksei Syrel
Hi Pablo,

Wow! Thank you for the detective story :)

Do I understand correctly that the original code causes undefined behavior and therefore can be changed (or even removed) by the compiler?
(because it returns something that is referencing memory on the stack)

Please keep posting similar things in future! It is very educative :)

Cheers,
Alex


On Wed, 11 Dec 2019 at 17:35, [hidden email] <[hidden email]> wrote:
Hi,
    this mail is related to Pharo because it is knowledge I found
debugging the build of the VM, but the rest is to document it and
perhaps someone will found it interesting (also I couldn't find it
easily using Google). Sorry for the long mail!

The problem
==========

The following code does not produce good code in 8.3 when using optimizations:

long __attribute__ ((noinline)) myFunc(long i, int index){
   long v;
   long x = i >> 3;

   v = x;
   return ((int*)(&v))[index];
}

#include <stdio.h>

int main(){

    long i;
    int x;

    scanf("%ld", &i);
    scanf("%d", &x);

    printf("%ld",myFunc(i,x));
}

Basically, with -02, it generates the following code:

myFunc:
     movslq %esi, %rsi
     movslq -8(%rsp,%rsi,4), %rax
     ret

And with -01 it generates the following code:

myFunc:
     sarq $3, %rdi
     movq %rdi, -8(%rsp)
     movslq %esi, %rsi
     movslq -8(%rsp,%rsi,4), %rax
     ret

As you can see, the more optimized version is losing the bit shift (or
any other operation).
To detect the problem it was not easy, basically, I have used the
Pharo Tests to detect the failing function and then to understand the
generation of code in GCC, we need to debug its generation.

The first snippet is generated using:

gcc -O2 prb.c -S -o prb.s -Wall

and the second using:

gcc -O1 prb.c -S -o prb.s -Wall

The GCC pipeline has different steps, I have centered my self in the
tree optimizations.
GCC dumps each of the intermediate states of the compilation, working
with C like trees. For example to get the optimized version of the
program, before generating the Assembler we have to do:

gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-optimized=/dev/stdout

This will generate:

__attribute__((noinline))
myFunc (long int i, int index)
{
  long int v;
  long unsigned int _1;
  long unsigned int _2;
  int * _3;
  int _4;
  long int _8;

  <bb 2> [local count: 1073741825]:
  _1 = (long unsigned int) index_7(D);
  _2 = _1 * 4;
  _3 = &v + _2;
  _4 = *_3;
  _8 = (long int) _4;
  v ={v} {CLOBBER};
  return _8;

}

This is the optimized SSA (static single assign) version of the
function (https://gcc.gnu.org/onlinedocs/gccint/SSA.html)
As you can see in this version the code is already optimized, and our
operations not correctly generated. We expect to see something like:

__attribute__((noinline))
myFunc (long int i, int index)
{
  long int x;
  long int v;
  long unsigned int _1;
  long unsigned int _2;
  int * _3;
  int _4;
  long int _10;

  <bb 2> [local count: 1073741825]:
  x_6 = i_5(D) >> 3;
  v = x_6;
  _1 = (long unsigned int) index_9(D);
  _2 = _1 * 4;
  _3 = &v + _2;
  _4 = *_3;
  _10 = (long int) _4;
  v ={v} {CLOBBER};
  return _10;

}

In the first SSA version, we are lacking the shift operation:

  x_6 = i_5(D) >> 3;
  v = x_6;

So, we need to see in which of the optimizations and transformations
our code is lost:
To see all the steps we should execute:

gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-all

This will generate a lot, really a lot, of files in the same directory.
They have the name: prb.c.xxx.yyyy
Where xxx is the number of the step, and yyyy is the name of the step.
Each of the files contains the result of applying the changes, so what
I have done is looking in binary search where my code was lost.

I have found that the problem was in the first dead code elimination
step (prb.c.041t.cddce1)
GCC does not like the crap code that we are writing, as we are copying
a stack variable and then accessing directly to the memory:

v = x;
return ((int*)(&v))[index];

So, basically, it was assuming that we were only using v and not x, so
all the code modifying x is described (this optimization is called
"dead store code deletion"). Basically tries to remove all the code
that is affecting stack (temporary) variables that are never used.

I am not sure if this is a bug in GCC, so I will report it. But I have
fixed the problem writing the code in a proper way.

Sorry again for the long mail, I wanted to store the knowledge and
propagate it before I eventually will flush it. I like these things,
but I am not debugging the GCC optimization pipeline all the days.

--
Pablo Tesone.
[hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Debugging GCC code generation

tesonep@gmail.com
Hi Aliaksei,
      to me it looks like a bug of GCC optimization. Basically, it is
assuming that the x variable is used but never read or its value is
never used. Also it assumes the same of the i variable, as we are only
accessing indirectly to the memory where it locates (the code is even
assuming that the variable exists, but it can be optimize out as in
this scenario). Even though, the original C code is valid C code, we
are not helping the compiler writing code like that. So I have
rewritten the code in a way that does not use indirect memory access
to the stack space.

One thing more that makes me think is a bug, if you use an int
constant as the index and not a parameter, the error does not occur
(the code is not badly optimized) and there is a warning about the
not-so-great access to the stack.

On Wed, Dec 11, 2019 at 6:01 PM Aliaksei Syrel <[hidden email]> wrote:

>
> Hi Pablo,
>
> Wow! Thank you for the detective story :)
>
> Do I understand correctly that the original code causes undefined behavior and therefore can be changed (or even removed) by the compiler?
> (because it returns something that is referencing memory on the stack)
>
> Please keep posting similar things in future! It is very educative :)
>
> Cheers,
> Alex
>
>
> On Wed, 11 Dec 2019 at 17:35, [hidden email] <[hidden email]> wrote:
>>
>> Hi,
>>     this mail is related to Pharo because it is knowledge I found
>> debugging the build of the VM, but the rest is to document it and
>> perhaps someone will found it interesting (also I couldn't find it
>> easily using Google). Sorry for the long mail!
>>
>> The problem
>> ==========
>>
>> The following code does not produce good code in 8.3 when using optimizations:
>>
>> long __attribute__ ((noinline)) myFunc(long i, int index){
>>    long v;
>>    long x = i >> 3;
>>
>>    v = x;
>>    return ((int*)(&v))[index];
>> }
>>
>> #include <stdio.h>
>>
>> int main(){
>>
>>     long i;
>>     int x;
>>
>>     scanf("%ld", &i);
>>     scanf("%d", &x);
>>
>>     printf("%ld",myFunc(i,x));
>> }
>>
>> Basically, with -02, it generates the following code:
>>
>> myFunc:
>>      movslq %esi, %rsi
>>      movslq -8(%rsp,%rsi,4), %rax
>>      ret
>>
>> And with -01 it generates the following code:
>>
>> myFunc:
>>      sarq $3, %rdi
>>      movq %rdi, -8(%rsp)
>>      movslq %esi, %rsi
>>      movslq -8(%rsp,%rsi,4), %rax
>>      ret
>>
>> As you can see, the more optimized version is losing the bit shift (or
>> any other operation).
>> To detect the problem it was not easy, basically, I have used the
>> Pharo Tests to detect the failing function and then to understand the
>> generation of code in GCC, we need to debug its generation.
>>
>> The first snippet is generated using:
>>
>> gcc -O2 prb.c -S -o prb.s -Wall
>>
>> and the second using:
>>
>> gcc -O1 prb.c -S -o prb.s -Wall
>>
>> The GCC pipeline has different steps, I have centered my self in the
>> tree optimizations.
>> GCC dumps each of the intermediate states of the compilation, working
>> with C like trees. For example to get the optimized version of the
>> program, before generating the Assembler we have to do:
>>
>> gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-optimized=/dev/stdout
>>
>> This will generate:
>>
>> __attribute__((noinline))
>> myFunc (long int i, int index)
>> {
>>   long int v;
>>   long unsigned int _1;
>>   long unsigned int _2;
>>   int * _3;
>>   int _4;
>>   long int _8;
>>
>>   <bb 2> [local count: 1073741825]:
>>   _1 = (long unsigned int) index_7(D);
>>   _2 = _1 * 4;
>>   _3 = &v + _2;
>>   _4 = *_3;
>>   _8 = (long int) _4;
>>   v ={v} {CLOBBER};
>>   return _8;
>>
>> }
>>
>> This is the optimized SSA (static single assign) version of the
>> function (https://gcc.gnu.org/onlinedocs/gccint/SSA.html)
>> As you can see in this version the code is already optimized, and our
>> operations not correctly generated. We expect to see something like:
>>
>> __attribute__((noinline))
>> myFunc (long int i, int index)
>> {
>>   long int x;
>>   long int v;
>>   long unsigned int _1;
>>   long unsigned int _2;
>>   int * _3;
>>   int _4;
>>   long int _10;
>>
>>   <bb 2> [local count: 1073741825]:
>>   x_6 = i_5(D) >> 3;
>>   v = x_6;
>>   _1 = (long unsigned int) index_9(D);
>>   _2 = _1 * 4;
>>   _3 = &v + _2;
>>   _4 = *_3;
>>   _10 = (long int) _4;
>>   v ={v} {CLOBBER};
>>   return _10;
>>
>> }
>>
>> In the first SSA version, we are lacking the shift operation:
>>
>>   x_6 = i_5(D) >> 3;
>>   v = x_6;
>>
>> So, we need to see in which of the optimizations and transformations
>> our code is lost:
>> To see all the steps we should execute:
>>
>> gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-all
>>
>> This will generate a lot, really a lot, of files in the same directory.
>> They have the name: prb.c.xxx.yyyy
>> Where xxx is the number of the step, and yyyy is the name of the step.
>> Each of the files contains the result of applying the changes, so what
>> I have done is looking in binary search where my code was lost.
>>
>> I have found that the problem was in the first dead code elimination
>> step (prb.c.041t.cddce1)
>> GCC does not like the crap code that we are writing, as we are copying
>> a stack variable and then accessing directly to the memory:
>>
>> v = x;
>> return ((int*)(&v))[index];
>>
>> So, basically, it was assuming that we were only using v and not x, so
>> all the code modifying x is described (this optimization is called
>> "dead store code deletion"). Basically tries to remove all the code
>> that is affecting stack (temporary) variables that are never used.
>>
>> I am not sure if this is a bug in GCC, so I will report it. But I have
>> fixed the problem writing the code in a proper way.
>>
>> Sorry again for the long mail, I wanted to store the knowledge and
>> propagate it before I eventually will flush it. I like these things,
>> but I am not debugging the GCC optimization pipeline all the days.
>>
>> --
>> Pablo Tesone.
>> [hidden email]
>>


--
Pablo Tesone.
[hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Debugging GCC code generation

Nicolas Cellier
In reply to this post by Aliaksei Syrel
Hi Pablo,
pointer aliasing is considered UB indeed
and accessing a pointer outside of allocated bounds also is UB

the recommended way to perform a between one pointer type and another (reinterpret_cast) is to use memcpy,
so it should be:

    long v;
    int i[sizeof(long)/sizeof(int)];

    memcpy( i , &v, sizeof(v) );
    return i[index];

Normally, the compiler -O2 is smart enough to not allocate memory for i[2] and can perform all ops thru registers


Le mer. 11 déc. 2019 à 18:01, Aliaksei Syrel <[hidden email]> a écrit :
Hi Pablo,

Wow! Thank you for the detective story :)

Do I understand correctly that the original code causes undefined behavior and therefore can be changed (or even removed) by the compiler?
(because it returns something that is referencing memory on the stack)

Please keep posting similar things in future! It is very educative :)

Cheers,
Alex


On Wed, 11 Dec 2019 at 17:35, [hidden email] <[hidden email]> wrote:
Hi,
    this mail is related to Pharo because it is knowledge I found
debugging the build of the VM, but the rest is to document it and
perhaps someone will found it interesting (also I couldn't find it
easily using Google). Sorry for the long mail!

The problem
==========

The following code does not produce good code in 8.3 when using optimizations:

long __attribute__ ((noinline)) myFunc(long i, int index){
   long v;
   long x = i >> 3;

   v = x;
   return ((int*)(&v))[index];
}

#include <stdio.h>

int main(){

    long i;
    int x;

    scanf("%ld", &i);
    scanf("%d", &x);

    printf("%ld",myFunc(i,x));
}

Basically, with -02, it generates the following code:

myFunc:
     movslq %esi, %rsi
     movslq -8(%rsp,%rsi,4), %rax
     ret

And with -01 it generates the following code:

myFunc:
     sarq $3, %rdi
     movq %rdi, -8(%rsp)
     movslq %esi, %rsi
     movslq -8(%rsp,%rsi,4), %rax
     ret

As you can see, the more optimized version is losing the bit shift (or
any other operation).
To detect the problem it was not easy, basically, I have used the
Pharo Tests to detect the failing function and then to understand the
generation of code in GCC, we need to debug its generation.

The first snippet is generated using:

gcc -O2 prb.c -S -o prb.s -Wall

and the second using:

gcc -O1 prb.c -S -o prb.s -Wall

The GCC pipeline has different steps, I have centered my self in the
tree optimizations.
GCC dumps each of the intermediate states of the compilation, working
with C like trees. For example to get the optimized version of the
program, before generating the Assembler we have to do:

gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-optimized=/dev/stdout

This will generate:

__attribute__((noinline))
myFunc (long int i, int index)
{
  long int v;
  long unsigned int _1;
  long unsigned int _2;
  int * _3;
  int _4;
  long int _8;

  <bb 2> [local count: 1073741825]:
  _1 = (long unsigned int) index_7(D);
  _2 = _1 * 4;
  _3 = &v + _2;
  _4 = *_3;
  _8 = (long int) _4;
  v ={v} {CLOBBER};
  return _8;

}

This is the optimized SSA (static single assign) version of the
function (https://gcc.gnu.org/onlinedocs/gccint/SSA.html)
As you can see in this version the code is already optimized, and our
operations not correctly generated. We expect to see something like:

__attribute__((noinline))
myFunc (long int i, int index)
{
  long int x;
  long int v;
  long unsigned int _1;
  long unsigned int _2;
  int * _3;
  int _4;
  long int _10;

  <bb 2> [local count: 1073741825]:
  x_6 = i_5(D) >> 3;
  v = x_6;
  _1 = (long unsigned int) index_9(D);
  _2 = _1 * 4;
  _3 = &v + _2;
  _4 = *_3;
  _10 = (long int) _4;
  v ={v} {CLOBBER};
  return _10;

}

In the first SSA version, we are lacking the shift operation:

  x_6 = i_5(D) >> 3;
  v = x_6;

So, we need to see in which of the optimizations and transformations
our code is lost:
To see all the steps we should execute:

gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-all

This will generate a lot, really a lot, of files in the same directory.
They have the name: prb.c.xxx.yyyy
Where xxx is the number of the step, and yyyy is the name of the step.
Each of the files contains the result of applying the changes, so what
I have done is looking in binary search where my code was lost.

I have found that the problem was in the first dead code elimination
step (prb.c.041t.cddce1)
GCC does not like the crap code that we are writing, as we are copying
a stack variable and then accessing directly to the memory:

v = x;
return ((int*)(&v))[index];

So, basically, it was assuming that we were only using v and not x, so
all the code modifying x is described (this optimization is called
"dead store code deletion"). Basically tries to remove all the code
that is affecting stack (temporary) variables that are never used.

I am not sure if this is a bug in GCC, so I will report it. But I have
fixed the problem writing the code in a proper way.

Sorry again for the long mail, I wanted to store the knowledge and
propagate it before I eventually will flush it. I like these things,
but I am not debugging the GCC optimization pipeline all the days.

--
Pablo Tesone.
[hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Debugging GCC code generation

Nicolas Cellier
In reply to this post by tesonep@gmail.com
Hi Pablo (again),
no, not a bug.

The problem is in the source code. The compiler has the right to presume that your code is exempt of UB, because you cannot depend on UB (obviously).
So it can eliminate all code which corresponds to UB.

The compiler has the right to assume that a pointer to an int cannot point to a long (UB).
So modifying a long cannot have any sort of impact on the content of the int pointer.
So the compiler can decouple both path return int content and assign long.
But assigning the long has no effect, so the code can be suppressed altogether.

Le mer. 11 déc. 2019 à 18:54, [hidden email] <[hidden email]> a écrit :
Hi Aliaksei,
      to me it looks like a bug of GCC optimization. Basically, it is
assuming that the x variable is used but never read or its value is
never used. Also it assumes the same of the i variable, as we are only
accessing indirectly to the memory where it locates (the code is even
assuming that the variable exists, but it can be optimize out as in
this scenario). Even though, the original C code is valid C code, we
are not helping the compiler writing code like that. So I have
rewritten the code in a way that does not use indirect memory access
to the stack space.

One thing more that makes me think is a bug, if you use an int
constant as the index and not a parameter, the error does not occur
(the code is not badly optimized) and there is a warning about the
not-so-great access to the stack.

On Wed, Dec 11, 2019 at 6:01 PM Aliaksei Syrel <[hidden email]> wrote:
>
> Hi Pablo,
>
> Wow! Thank you for the detective story :)
>
> Do I understand correctly that the original code causes undefined behavior and therefore can be changed (or even removed) by the compiler?
> (because it returns something that is referencing memory on the stack)
>
> Please keep posting similar things in future! It is very educative :)
>
> Cheers,
> Alex
>
>
> On Wed, 11 Dec 2019 at 17:35, [hidden email] <[hidden email]> wrote:
>>
>> Hi,
>>     this mail is related to Pharo because it is knowledge I found
>> debugging the build of the VM, but the rest is to document it and
>> perhaps someone will found it interesting (also I couldn't find it
>> easily using Google). Sorry for the long mail!
>>
>> The problem
>> ==========
>>
>> The following code does not produce good code in 8.3 when using optimizations:
>>
>> long __attribute__ ((noinline)) myFunc(long i, int index){
>>    long v;
>>    long x = i >> 3;
>>
>>    v = x;
>>    return ((int*)(&v))[index];
>> }
>>
>> #include <stdio.h>
>>
>> int main(){
>>
>>     long i;
>>     int x;
>>
>>     scanf("%ld", &i);
>>     scanf("%d", &x);
>>
>>     printf("%ld",myFunc(i,x));
>> }
>>
>> Basically, with -02, it generates the following code:
>>
>> myFunc:
>>      movslq %esi, %rsi
>>      movslq -8(%rsp,%rsi,4), %rax
>>      ret
>>
>> And with -01 it generates the following code:
>>
>> myFunc:
>>      sarq $3, %rdi
>>      movq %rdi, -8(%rsp)
>>      movslq %esi, %rsi
>>      movslq -8(%rsp,%rsi,4), %rax
>>      ret
>>
>> As you can see, the more optimized version is losing the bit shift (or
>> any other operation).
>> To detect the problem it was not easy, basically, I have used the
>> Pharo Tests to detect the failing function and then to understand the
>> generation of code in GCC, we need to debug its generation.
>>
>> The first snippet is generated using:
>>
>> gcc -O2 prb.c -S -o prb.s -Wall
>>
>> and the second using:
>>
>> gcc -O1 prb.c -S -o prb.s -Wall
>>
>> The GCC pipeline has different steps, I have centered my self in the
>> tree optimizations.
>> GCC dumps each of the intermediate states of the compilation, working
>> with C like trees. For example to get the optimized version of the
>> program, before generating the Assembler we have to do:
>>
>> gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-optimized=/dev/stdout
>>
>> This will generate:
>>
>> __attribute__((noinline))
>> myFunc (long int i, int index)
>> {
>>   long int v;
>>   long unsigned int _1;
>>   long unsigned int _2;
>>   int * _3;
>>   int _4;
>>   long int _8;
>>
>>   <bb 2> [local count: 1073741825]:
>>   _1 = (long unsigned int) index_7(D);
>>   _2 = _1 * 4;
>>   _3 = &v + _2;
>>   _4 = *_3;
>>   _8 = (long int) _4;
>>   v ={v} {CLOBBER};
>>   return _8;
>>
>> }
>>
>> This is the optimized SSA (static single assign) version of the
>> function (https://gcc.gnu.org/onlinedocs/gccint/SSA.html)
>> As you can see in this version the code is already optimized, and our
>> operations not correctly generated. We expect to see something like:
>>
>> __attribute__((noinline))
>> myFunc (long int i, int index)
>> {
>>   long int x;
>>   long int v;
>>   long unsigned int _1;
>>   long unsigned int _2;
>>   int * _3;
>>   int _4;
>>   long int _10;
>>
>>   <bb 2> [local count: 1073741825]:
>>   x_6 = i_5(D) >> 3;
>>   v = x_6;
>>   _1 = (long unsigned int) index_9(D);
>>   _2 = _1 * 4;
>>   _3 = &v + _2;
>>   _4 = *_3;
>>   _10 = (long int) _4;
>>   v ={v} {CLOBBER};
>>   return _10;
>>
>> }
>>
>> In the first SSA version, we are lacking the shift operation:
>>
>>   x_6 = i_5(D) >> 3;
>>   v = x_6;
>>
>> So, we need to see in which of the optimizations and transformations
>> our code is lost:
>> To see all the steps we should execute:
>>
>> gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-all
>>
>> This will generate a lot, really a lot, of files in the same directory.
>> They have the name: prb.c.xxx.yyyy
>> Where xxx is the number of the step, and yyyy is the name of the step.
>> Each of the files contains the result of applying the changes, so what
>> I have done is looking in binary search where my code was lost.
>>
>> I have found that the problem was in the first dead code elimination
>> step (prb.c.041t.cddce1)
>> GCC does not like the crap code that we are writing, as we are copying
>> a stack variable and then accessing directly to the memory:
>>
>> v = x;
>> return ((int*)(&v))[index];
>>
>> So, basically, it was assuming that we were only using v and not x, so
>> all the code modifying x is described (this optimization is called
>> "dead store code deletion"). Basically tries to remove all the code
>> that is affecting stack (temporary) variables that are never used.
>>
>> I am not sure if this is a bug in GCC, so I will report it. But I have
>> fixed the problem writing the code in a proper way.
>>
>> Sorry again for the long mail, I wanted to store the knowledge and
>> propagate it before I eventually will flush it. I like these things,
>> but I am not debugging the GCC optimization pipeline all the days.
>>
>> --
>> Pablo Tesone.
>> [hidden email]
>>


--
Pablo Tesone.
[hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Debugging GCC code generation

Nicolas Cellier
Of course, when I say "your" code, it's the code you have shown, and probably "our" (VMMaker) code ;)

Le mer. 11 déc. 2019 à 19:05, Nicolas Cellier <[hidden email]> a écrit :
Hi Pablo (again),
no, not a bug.

The problem is in the source code. The compiler has the right to presume that your code is exempt of UB, because you cannot depend on UB (obviously).
So it can eliminate all code which corresponds to UB.

The compiler has the right to assume that a pointer to an int cannot point to a long (UB).
So modifying a long cannot have any sort of impact on the content of the int pointer.
So the compiler can decouple both path return int content and assign long.
But assigning the long has no effect, so the code can be suppressed altogether.

Le mer. 11 déc. 2019 à 18:54, [hidden email] <[hidden email]> a écrit :
Hi Aliaksei,
      to me it looks like a bug of GCC optimization. Basically, it is
assuming that the x variable is used but never read or its value is
never used. Also it assumes the same of the i variable, as we are only
accessing indirectly to the memory where it locates (the code is even
assuming that the variable exists, but it can be optimize out as in
this scenario). Even though, the original C code is valid C code, we
are not helping the compiler writing code like that. So I have
rewritten the code in a way that does not use indirect memory access
to the stack space.

One thing more that makes me think is a bug, if you use an int
constant as the index and not a parameter, the error does not occur
(the code is not badly optimized) and there is a warning about the
not-so-great access to the stack.

On Wed, Dec 11, 2019 at 6:01 PM Aliaksei Syrel <[hidden email]> wrote:
>
> Hi Pablo,
>
> Wow! Thank you for the detective story :)
>
> Do I understand correctly that the original code causes undefined behavior and therefore can be changed (or even removed) by the compiler?
> (because it returns something that is referencing memory on the stack)
>
> Please keep posting similar things in future! It is very educative :)
>
> Cheers,
> Alex
>
>
> On Wed, 11 Dec 2019 at 17:35, [hidden email] <[hidden email]> wrote:
>>
>> Hi,
>>     this mail is related to Pharo because it is knowledge I found
>> debugging the build of the VM, but the rest is to document it and
>> perhaps someone will found it interesting (also I couldn't find it
>> easily using Google). Sorry for the long mail!
>>
>> The problem
>> ==========
>>
>> The following code does not produce good code in 8.3 when using optimizations:
>>
>> long __attribute__ ((noinline)) myFunc(long i, int index){
>>    long v;
>>    long x = i >> 3;
>>
>>    v = x;
>>    return ((int*)(&v))[index];
>> }
>>
>> #include <stdio.h>
>>
>> int main(){
>>
>>     long i;
>>     int x;
>>
>>     scanf("%ld", &i);
>>     scanf("%d", &x);
>>
>>     printf("%ld",myFunc(i,x));
>> }
>>
>> Basically, with -02, it generates the following code:
>>
>> myFunc:
>>      movslq %esi, %rsi
>>      movslq -8(%rsp,%rsi,4), %rax
>>      ret
>>
>> And with -01 it generates the following code:
>>
>> myFunc:
>>      sarq $3, %rdi
>>      movq %rdi, -8(%rsp)
>>      movslq %esi, %rsi
>>      movslq -8(%rsp,%rsi,4), %rax
>>      ret
>>
>> As you can see, the more optimized version is losing the bit shift (or
>> any other operation).
>> To detect the problem it was not easy, basically, I have used the
>> Pharo Tests to detect the failing function and then to understand the
>> generation of code in GCC, we need to debug its generation.
>>
>> The first snippet is generated using:
>>
>> gcc -O2 prb.c -S -o prb.s -Wall
>>
>> and the second using:
>>
>> gcc -O1 prb.c -S -o prb.s -Wall
>>
>> The GCC pipeline has different steps, I have centered my self in the
>> tree optimizations.
>> GCC dumps each of the intermediate states of the compilation, working
>> with C like trees. For example to get the optimized version of the
>> program, before generating the Assembler we have to do:
>>
>> gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-optimized=/dev/stdout
>>
>> This will generate:
>>
>> __attribute__((noinline))
>> myFunc (long int i, int index)
>> {
>>   long int v;
>>   long unsigned int _1;
>>   long unsigned int _2;
>>   int * _3;
>>   int _4;
>>   long int _8;
>>
>>   <bb 2> [local count: 1073741825]:
>>   _1 = (long unsigned int) index_7(D);
>>   _2 = _1 * 4;
>>   _3 = &v + _2;
>>   _4 = *_3;
>>   _8 = (long int) _4;
>>   v ={v} {CLOBBER};
>>   return _8;
>>
>> }
>>
>> This is the optimized SSA (static single assign) version of the
>> function (https://gcc.gnu.org/onlinedocs/gccint/SSA.html)
>> As you can see in this version the code is already optimized, and our
>> operations not correctly generated. We expect to see something like:
>>
>> __attribute__((noinline))
>> myFunc (long int i, int index)
>> {
>>   long int x;
>>   long int v;
>>   long unsigned int _1;
>>   long unsigned int _2;
>>   int * _3;
>>   int _4;
>>   long int _10;
>>
>>   <bb 2> [local count: 1073741825]:
>>   x_6 = i_5(D) >> 3;
>>   v = x_6;
>>   _1 = (long unsigned int) index_9(D);
>>   _2 = _1 * 4;
>>   _3 = &v + _2;
>>   _4 = *_3;
>>   _10 = (long int) _4;
>>   v ={v} {CLOBBER};
>>   return _10;
>>
>> }
>>
>> In the first SSA version, we are lacking the shift operation:
>>
>>   x_6 = i_5(D) >> 3;
>>   v = x_6;
>>
>> So, we need to see in which of the optimizations and transformations
>> our code is lost:
>> To see all the steps we should execute:
>>
>> gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-all
>>
>> This will generate a lot, really a lot, of files in the same directory.
>> They have the name: prb.c.xxx.yyyy
>> Where xxx is the number of the step, and yyyy is the name of the step.
>> Each of the files contains the result of applying the changes, so what
>> I have done is looking in binary search where my code was lost.
>>
>> I have found that the problem was in the first dead code elimination
>> step (prb.c.041t.cddce1)
>> GCC does not like the crap code that we are writing, as we are copying
>> a stack variable and then accessing directly to the memory:
>>
>> v = x;
>> return ((int*)(&v))[index];
>>
>> So, basically, it was assuming that we were only using v and not x, so
>> all the code modifying x is described (this optimization is called
>> "dead store code deletion"). Basically tries to remove all the code
>> that is affecting stack (temporary) variables that are never used.
>>
>> I am not sure if this is a bug in GCC, so I will report it. But I have
>> fixed the problem writing the code in a proper way.
>>
>> Sorry again for the long mail, I wanted to store the knowledge and
>> propagate it before I eventually will flush it. I like these things,
>> but I am not debugging the GCC optimization pipeline all the days.
>>
>> --
>> Pablo Tesone.
>> [hidden email]
>>


--
Pablo Tesone.
[hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Debugging GCC code generation

tesonep@gmail.com
Hi Nicolas,
  thanks for the comment, you are right the problem is the bad
original code. But my opinion is that it just is not reporting the
situation correctly, generating a warning or non-optimizing the code
looks more like a expected behavior. Because as I have said, using a
constant as index in the last statement generates a meaningful warning
and the non-optimizated version of the function.

And again as you said, the only thing to learn about all this is that
we should not write crappy code.

On Wed, Dec 11, 2019 at 7:11 PM Nicolas Cellier
<[hidden email]> wrote:

>
> Of course, when I say "your" code, it's the code you have shown, and probably "our" (VMMaker) code ;)
>
> Le mer. 11 déc. 2019 à 19:05, Nicolas Cellier <[hidden email]> a écrit :
>>
>> Hi Pablo (again),
>> no, not a bug.
>>
>> The problem is in the source code. The compiler has the right to presume that your code is exempt of UB, because you cannot depend on UB (obviously).
>> So it can eliminate all code which corresponds to UB.
>>
>> The compiler has the right to assume that a pointer to an int cannot point to a long (UB).
>> So modifying a long cannot have any sort of impact on the content of the int pointer.
>> So the compiler can decouple both path return int content and assign long.
>> But assigning the long has no effect, so the code can be suppressed altogether.
>>
>> Le mer. 11 déc. 2019 à 18:54, [hidden email] <[hidden email]> a écrit :
>>>
>>> Hi Aliaksei,
>>>       to me it looks like a bug of GCC optimization. Basically, it is
>>> assuming that the x variable is used but never read or its value is
>>> never used. Also it assumes the same of the i variable, as we are only
>>> accessing indirectly to the memory where it locates (the code is even
>>> assuming that the variable exists, but it can be optimize out as in
>>> this scenario). Even though, the original C code is valid C code, we
>>> are not helping the compiler writing code like that. So I have
>>> rewritten the code in a way that does not use indirect memory access
>>> to the stack space.
>>>
>>> One thing more that makes me think is a bug, if you use an int
>>> constant as the index and not a parameter, the error does not occur
>>> (the code is not badly optimized) and there is a warning about the
>>> not-so-great access to the stack.
>>>
>>> On Wed, Dec 11, 2019 at 6:01 PM Aliaksei Syrel <[hidden email]> wrote:
>>> >
>>> > Hi Pablo,
>>> >
>>> > Wow! Thank you for the detective story :)
>>> >
>>> > Do I understand correctly that the original code causes undefined behavior and therefore can be changed (or even removed) by the compiler?
>>> > (because it returns something that is referencing memory on the stack)
>>> >
>>> > Please keep posting similar things in future! It is very educative :)
>>> >
>>> > Cheers,
>>> > Alex
>>> >
>>> >
>>> > On Wed, 11 Dec 2019 at 17:35, [hidden email] <[hidden email]> wrote:
>>> >>
>>> >> Hi,
>>> >>     this mail is related to Pharo because it is knowledge I found
>>> >> debugging the build of the VM, but the rest is to document it and
>>> >> perhaps someone will found it interesting (also I couldn't find it
>>> >> easily using Google). Sorry for the long mail!
>>> >>
>>> >> The problem
>>> >> ==========
>>> >>
>>> >> The following code does not produce good code in 8.3 when using optimizations:
>>> >>
>>> >> long __attribute__ ((noinline)) myFunc(long i, int index){
>>> >>    long v;
>>> >>    long x = i >> 3;
>>> >>
>>> >>    v = x;
>>> >>    return ((int*)(&v))[index];
>>> >> }
>>> >>
>>> >> #include <stdio.h>
>>> >>
>>> >> int main(){
>>> >>
>>> >>     long i;
>>> >>     int x;
>>> >>
>>> >>     scanf("%ld", &i);
>>> >>     scanf("%d", &x);
>>> >>
>>> >>     printf("%ld",myFunc(i,x));
>>> >> }
>>> >>
>>> >> Basically, with -02, it generates the following code:
>>> >>
>>> >> myFunc:
>>> >>      movslq %esi, %rsi
>>> >>      movslq -8(%rsp,%rsi,4), %rax
>>> >>      ret
>>> >>
>>> >> And with -01 it generates the following code:
>>> >>
>>> >> myFunc:
>>> >>      sarq $3, %rdi
>>> >>      movq %rdi, -8(%rsp)
>>> >>      movslq %esi, %rsi
>>> >>      movslq -8(%rsp,%rsi,4), %rax
>>> >>      ret
>>> >>
>>> >> As you can see, the more optimized version is losing the bit shift (or
>>> >> any other operation).
>>> >> To detect the problem it was not easy, basically, I have used the
>>> >> Pharo Tests to detect the failing function and then to understand the
>>> >> generation of code in GCC, we need to debug its generation.
>>> >>
>>> >> The first snippet is generated using:
>>> >>
>>> >> gcc -O2 prb.c -S -o prb.s -Wall
>>> >>
>>> >> and the second using:
>>> >>
>>> >> gcc -O1 prb.c -S -o prb.s -Wall
>>> >>
>>> >> The GCC pipeline has different steps, I have centered my self in the
>>> >> tree optimizations.
>>> >> GCC dumps each of the intermediate states of the compilation, working
>>> >> with C like trees. For example to get the optimized version of the
>>> >> program, before generating the Assembler we have to do:
>>> >>
>>> >> gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-optimized=/dev/stdout
>>> >>
>>> >> This will generate:
>>> >>
>>> >> __attribute__((noinline))
>>> >> myFunc (long int i, int index)
>>> >> {
>>> >>   long int v;
>>> >>   long unsigned int _1;
>>> >>   long unsigned int _2;
>>> >>   int * _3;
>>> >>   int _4;
>>> >>   long int _8;
>>> >>
>>> >>   <bb 2> [local count: 1073741825]:
>>> >>   _1 = (long unsigned int) index_7(D);
>>> >>   _2 = _1 * 4;
>>> >>   _3 = &v + _2;
>>> >>   _4 = *_3;
>>> >>   _8 = (long int) _4;
>>> >>   v ={v} {CLOBBER};
>>> >>   return _8;
>>> >>
>>> >> }
>>> >>
>>> >> This is the optimized SSA (static single assign) version of the
>>> >> function (https://gcc.gnu.org/onlinedocs/gccint/SSA.html)
>>> >> As you can see in this version the code is already optimized, and our
>>> >> operations not correctly generated. We expect to see something like:
>>> >>
>>> >> __attribute__((noinline))
>>> >> myFunc (long int i, int index)
>>> >> {
>>> >>   long int x;
>>> >>   long int v;
>>> >>   long unsigned int _1;
>>> >>   long unsigned int _2;
>>> >>   int * _3;
>>> >>   int _4;
>>> >>   long int _10;
>>> >>
>>> >>   <bb 2> [local count: 1073741825]:
>>> >>   x_6 = i_5(D) >> 3;
>>> >>   v = x_6;
>>> >>   _1 = (long unsigned int) index_9(D);
>>> >>   _2 = _1 * 4;
>>> >>   _3 = &v + _2;
>>> >>   _4 = *_3;
>>> >>   _10 = (long int) _4;
>>> >>   v ={v} {CLOBBER};
>>> >>   return _10;
>>> >>
>>> >> }
>>> >>
>>> >> In the first SSA version, we are lacking the shift operation:
>>> >>
>>> >>   x_6 = i_5(D) >> 3;
>>> >>   v = x_6;
>>> >>
>>> >> So, we need to see in which of the optimizations and transformations
>>> >> our code is lost:
>>> >> To see all the steps we should execute:
>>> >>
>>> >> gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-all
>>> >>
>>> >> This will generate a lot, really a lot, of files in the same directory.
>>> >> They have the name: prb.c.xxx.yyyy
>>> >> Where xxx is the number of the step, and yyyy is the name of the step.
>>> >> Each of the files contains the result of applying the changes, so what
>>> >> I have done is looking in binary search where my code was lost.
>>> >>
>>> >> I have found that the problem was in the first dead code elimination
>>> >> step (prb.c.041t.cddce1)
>>> >> GCC does not like the crap code that we are writing, as we are copying
>>> >> a stack variable and then accessing directly to the memory:
>>> >>
>>> >> v = x;
>>> >> return ((int*)(&v))[index];
>>> >>
>>> >> So, basically, it was assuming that we were only using v and not x, so
>>> >> all the code modifying x is described (this optimization is called
>>> >> "dead store code deletion"). Basically tries to remove all the code
>>> >> that is affecting stack (temporary) variables that are never used.
>>> >>
>>> >> I am not sure if this is a bug in GCC, so I will report it. But I have
>>> >> fixed the problem writing the code in a proper way.
>>> >>
>>> >> Sorry again for the long mail, I wanted to store the knowledge and
>>> >> propagate it before I eventually will flush it. I like these things,
>>> >> but I am not debugging the GCC optimization pipeline all the days.
>>> >>
>>> >> --
>>> >> Pablo Tesone.
>>> >> [hidden email]
>>> >>
>>>
>>>
>>> --
>>> Pablo Tesone.
>>> [hidden email]
>>>


--
Pablo Tesone.
[hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Debugging GCC code generation

Nicolas Cellier
Yes,
But we have to replace natural crappy code (split a long in 2 ints) that was once legal, by an even more crappy code (memcpy), so all in all, it's a crappy art.

Le mer. 11 déc. 2019 à 19:30, [hidden email] <[hidden email]> a écrit :
Hi Nicolas,
  thanks for the comment, you are right the problem is the bad
original code. But my opinion is that it just is not reporting the
situation correctly, generating a warning or non-optimizing the code
looks more like a expected behavior. Because as I have said, using a
constant as index in the last statement generates a meaningful warning
and the non-optimizated version of the function.

And again as you said, the only thing to learn about all this is that
we should not write crappy code.

On Wed, Dec 11, 2019 at 7:11 PM Nicolas Cellier
<[hidden email]> wrote:
>
> Of course, when I say "your" code, it's the code you have shown, and probably "our" (VMMaker) code ;)
>
> Le mer. 11 déc. 2019 à 19:05, Nicolas Cellier <[hidden email]> a écrit :
>>
>> Hi Pablo (again),
>> no, not a bug.
>>
>> The problem is in the source code. The compiler has the right to presume that your code is exempt of UB, because you cannot depend on UB (obviously).
>> So it can eliminate all code which corresponds to UB.
>>
>> The compiler has the right to assume that a pointer to an int cannot point to a long (UB).
>> So modifying a long cannot have any sort of impact on the content of the int pointer.
>> So the compiler can decouple both path return int content and assign long.
>> But assigning the long has no effect, so the code can be suppressed altogether.
>>
>> Le mer. 11 déc. 2019 à 18:54, [hidden email] <[hidden email]> a écrit :
>>>
>>> Hi Aliaksei,
>>>       to me it looks like a bug of GCC optimization. Basically, it is
>>> assuming that the x variable is used but never read or its value is
>>> never used. Also it assumes the same of the i variable, as we are only
>>> accessing indirectly to the memory where it locates (the code is even
>>> assuming that the variable exists, but it can be optimize out as in
>>> this scenario). Even though, the original C code is valid C code, we
>>> are not helping the compiler writing code like that. So I have
>>> rewritten the code in a way that does not use indirect memory access
>>> to the stack space.
>>>
>>> One thing more that makes me think is a bug, if you use an int
>>> constant as the index and not a parameter, the error does not occur
>>> (the code is not badly optimized) and there is a warning about the
>>> not-so-great access to the stack.
>>>
>>> On Wed, Dec 11, 2019 at 6:01 PM Aliaksei Syrel <[hidden email]> wrote:
>>> >
>>> > Hi Pablo,
>>> >
>>> > Wow! Thank you for the detective story :)
>>> >
>>> > Do I understand correctly that the original code causes undefined behavior and therefore can be changed (or even removed) by the compiler?
>>> > (because it returns something that is referencing memory on the stack)
>>> >
>>> > Please keep posting similar things in future! It is very educative :)
>>> >
>>> > Cheers,
>>> > Alex
>>> >
>>> >
>>> > On Wed, 11 Dec 2019 at 17:35, [hidden email] <[hidden email]> wrote:
>>> >>
>>> >> Hi,
>>> >>     this mail is related to Pharo because it is knowledge I found
>>> >> debugging the build of the VM, but the rest is to document it and
>>> >> perhaps someone will found it interesting (also I couldn't find it
>>> >> easily using Google). Sorry for the long mail!
>>> >>
>>> >> The problem
>>> >> ==========
>>> >>
>>> >> The following code does not produce good code in 8.3 when using optimizations:
>>> >>
>>> >> long __attribute__ ((noinline)) myFunc(long i, int index){
>>> >>    long v;
>>> >>    long x = i >> 3;
>>> >>
>>> >>    v = x;
>>> >>    return ((int*)(&v))[index];
>>> >> }
>>> >>
>>> >> #include <stdio.h>
>>> >>
>>> >> int main(){
>>> >>
>>> >>     long i;
>>> >>     int x;
>>> >>
>>> >>     scanf("%ld", &i);
>>> >>     scanf("%d", &x);
>>> >>
>>> >>     printf("%ld",myFunc(i,x));
>>> >> }
>>> >>
>>> >> Basically, with -02, it generates the following code:
>>> >>
>>> >> myFunc:
>>> >>      movslq %esi, %rsi
>>> >>      movslq -8(%rsp,%rsi,4), %rax
>>> >>      ret
>>> >>
>>> >> And with -01 it generates the following code:
>>> >>
>>> >> myFunc:
>>> >>      sarq $3, %rdi
>>> >>      movq %rdi, -8(%rsp)
>>> >>      movslq %esi, %rsi
>>> >>      movslq -8(%rsp,%rsi,4), %rax
>>> >>      ret
>>> >>
>>> >> As you can see, the more optimized version is losing the bit shift (or
>>> >> any other operation).
>>> >> To detect the problem it was not easy, basically, I have used the
>>> >> Pharo Tests to detect the failing function and then to understand the
>>> >> generation of code in GCC, we need to debug its generation.
>>> >>
>>> >> The first snippet is generated using:
>>> >>
>>> >> gcc -O2 prb.c -S -o prb.s -Wall
>>> >>
>>> >> and the second using:
>>> >>
>>> >> gcc -O1 prb.c -S -o prb.s -Wall
>>> >>
>>> >> The GCC pipeline has different steps, I have centered my self in the
>>> >> tree optimizations.
>>> >> GCC dumps each of the intermediate states of the compilation, working
>>> >> with C like trees. For example to get the optimized version of the
>>> >> program, before generating the Assembler we have to do:
>>> >>
>>> >> gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-optimized=/dev/stdout
>>> >>
>>> >> This will generate:
>>> >>
>>> >> __attribute__((noinline))
>>> >> myFunc (long int i, int index)
>>> >> {
>>> >>   long int v;
>>> >>   long unsigned int _1;
>>> >>   long unsigned int _2;
>>> >>   int * _3;
>>> >>   int _4;
>>> >>   long int _8;
>>> >>
>>> >>   <bb 2> [local count: 1073741825]:
>>> >>   _1 = (long unsigned int) index_7(D);
>>> >>   _2 = _1 * 4;
>>> >>   _3 = &v + _2;
>>> >>   _4 = *_3;
>>> >>   _8 = (long int) _4;
>>> >>   v ={v} {CLOBBER};
>>> >>   return _8;
>>> >>
>>> >> }
>>> >>
>>> >> This is the optimized SSA (static single assign) version of the
>>> >> function (https://gcc.gnu.org/onlinedocs/gccint/SSA.html)
>>> >> As you can see in this version the code is already optimized, and our
>>> >> operations not correctly generated. We expect to see something like:
>>> >>
>>> >> __attribute__((noinline))
>>> >> myFunc (long int i, int index)
>>> >> {
>>> >>   long int x;
>>> >>   long int v;
>>> >>   long unsigned int _1;
>>> >>   long unsigned int _2;
>>> >>   int * _3;
>>> >>   int _4;
>>> >>   long int _10;
>>> >>
>>> >>   <bb 2> [local count: 1073741825]:
>>> >>   x_6 = i_5(D) >> 3;
>>> >>   v = x_6;
>>> >>   _1 = (long unsigned int) index_9(D);
>>> >>   _2 = _1 * 4;
>>> >>   _3 = &v + _2;
>>> >>   _4 = *_3;
>>> >>   _10 = (long int) _4;
>>> >>   v ={v} {CLOBBER};
>>> >>   return _10;
>>> >>
>>> >> }
>>> >>
>>> >> In the first SSA version, we are lacking the shift operation:
>>> >>
>>> >>   x_6 = i_5(D) >> 3;
>>> >>   v = x_6;
>>> >>
>>> >> So, we need to see in which of the optimizations and transformations
>>> >> our code is lost:
>>> >> To see all the steps we should execute:
>>> >>
>>> >> gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-all
>>> >>
>>> >> This will generate a lot, really a lot, of files in the same directory.
>>> >> They have the name: prb.c.xxx.yyyy
>>> >> Where xxx is the number of the step, and yyyy is the name of the step.
>>> >> Each of the files contains the result of applying the changes, so what
>>> >> I have done is looking in binary search where my code was lost.
>>> >>
>>> >> I have found that the problem was in the first dead code elimination
>>> >> step (prb.c.041t.cddce1)
>>> >> GCC does not like the crap code that we are writing, as we are copying
>>> >> a stack variable and then accessing directly to the memory:
>>> >>
>>> >> v = x;
>>> >> return ((int*)(&v))[index];
>>> >>
>>> >> So, basically, it was assuming that we were only using v and not x, so
>>> >> all the code modifying x is described (this optimization is called
>>> >> "dead store code deletion"). Basically tries to remove all the code
>>> >> that is affecting stack (temporary) variables that are never used.
>>> >>
>>> >> I am not sure if this is a bug in GCC, so I will report it. But I have
>>> >> fixed the problem writing the code in a proper way.
>>> >>
>>> >> Sorry again for the long mail, I wanted to store the knowledge and
>>> >> propagate it before I eventually will flush it. I like these things,
>>> >> but I am not debugging the GCC optimization pipeline all the days.
>>> >>
>>> >> --
>>> >> Pablo Tesone.
>>> >> [hidden email]
>>> >>
>>>
>>>
>>> --
>>> Pablo Tesone.
>>> [hidden email]
>>>


--
Pablo Tesone.
[hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Debugging GCC code generation

Eliot Miranda-2


On Wed, Dec 11, 2019 at 12:03 PM Nicolas Cellier <[hidden email]> wrote:
Yes,
But we have to replace natural crappy code (split a long in 2 ints) that was once legal, by an even more crappy code (memcpy), so all in all, it's a crappy art.

:-)  Indeed.  Personally I liked it when C was a portable assembler.  That's what it's fit for and that's what it should be good at.  Trying to pretend it's Pascal is, um, pretentious?
 

Le mer. 11 déc. 2019 à 19:30, [hidden email] <[hidden email]> a écrit :
Hi Nicolas,
  thanks for the comment, you are right the problem is the bad
original code. But my opinion is that it just is not reporting the
situation correctly, generating a warning or non-optimizing the code
looks more like a expected behavior. Because as I have said, using a
constant as index in the last statement generates a meaningful warning
and the non-optimizated version of the function.

And again as you said, the only thing to learn about all this is that
we should not write crappy code.

On Wed, Dec 11, 2019 at 7:11 PM Nicolas Cellier
<[hidden email]> wrote:
>
> Of course, when I say "your" code, it's the code you have shown, and probably "our" (VMMaker) code ;)
>
> Le mer. 11 déc. 2019 à 19:05, Nicolas Cellier <[hidden email]> a écrit :
>>
>> Hi Pablo (again),
>> no, not a bug.
>>
>> The problem is in the source code. The compiler has the right to presume that your code is exempt of UB, because you cannot depend on UB (obviously).
>> So it can eliminate all code which corresponds to UB.
>>
>> The compiler has the right to assume that a pointer to an int cannot point to a long (UB).
>> So modifying a long cannot have any sort of impact on the content of the int pointer.
>> So the compiler can decouple both path return int content and assign long.
>> But assigning the long has no effect, so the code can be suppressed altogether.
>>
>> Le mer. 11 déc. 2019 à 18:54, [hidden email] <[hidden email]> a écrit :
>>>
>>> Hi Aliaksei,
>>>       to me it looks like a bug of GCC optimization. Basically, it is
>>> assuming that the x variable is used but never read or its value is
>>> never used. Also it assumes the same of the i variable, as we are only
>>> accessing indirectly to the memory where it locates (the code is even
>>> assuming that the variable exists, but it can be optimize out as in
>>> this scenario). Even though, the original C code is valid C code, we
>>> are not helping the compiler writing code like that. So I have
>>> rewritten the code in a way that does not use indirect memory access
>>> to the stack space.
>>>
>>> One thing more that makes me think is a bug, if you use an int
>>> constant as the index and not a parameter, the error does not occur
>>> (the code is not badly optimized) and there is a warning about the
>>> not-so-great access to the stack.
>>>
>>> On Wed, Dec 11, 2019 at 6:01 PM Aliaksei Syrel <[hidden email]> wrote:
>>> >
>>> > Hi Pablo,
>>> >
>>> > Wow! Thank you for the detective story :)
>>> >
>>> > Do I understand correctly that the original code causes undefined behavior and therefore can be changed (or even removed) by the compiler?
>>> > (because it returns something that is referencing memory on the stack)
>>> >
>>> > Please keep posting similar things in future! It is very educative :)
>>> >
>>> > Cheers,
>>> > Alex
>>> >
>>> >
>>> > On Wed, 11 Dec 2019 at 17:35, [hidden email] <[hidden email]> wrote:
>>> >>
>>> >> Hi,
>>> >>     this mail is related to Pharo because it is knowledge I found
>>> >> debugging the build of the VM, but the rest is to document it and
>>> >> perhaps someone will found it interesting (also I couldn't find it
>>> >> easily using Google). Sorry for the long mail!
>>> >>
>>> >> The problem
>>> >> ==========
>>> >>
>>> >> The following code does not produce good code in 8.3 when using optimizations:
>>> >>
>>> >> long __attribute__ ((noinline)) myFunc(long i, int index){
>>> >>    long v;
>>> >>    long x = i >> 3;
>>> >>
>>> >>    v = x;
>>> >>    return ((int*)(&v))[index];
>>> >> }
>>> >>
>>> >> #include <stdio.h>
>>> >>
>>> >> int main(){
>>> >>
>>> >>     long i;
>>> >>     int x;
>>> >>
>>> >>     scanf("%ld", &i);
>>> >>     scanf("%d", &x);
>>> >>
>>> >>     printf("%ld",myFunc(i,x));
>>> >> }
>>> >>
>>> >> Basically, with -02, it generates the following code:
>>> >>
>>> >> myFunc:
>>> >>      movslq %esi, %rsi
>>> >>      movslq -8(%rsp,%rsi,4), %rax
>>> >>      ret
>>> >>
>>> >> And with -01 it generates the following code:
>>> >>
>>> >> myFunc:
>>> >>      sarq $3, %rdi
>>> >>      movq %rdi, -8(%rsp)
>>> >>      movslq %esi, %rsi
>>> >>      movslq -8(%rsp,%rsi,4), %rax
>>> >>      ret
>>> >>
>>> >> As you can see, the more optimized version is losing the bit shift (or
>>> >> any other operation).
>>> >> To detect the problem it was not easy, basically, I have used the
>>> >> Pharo Tests to detect the failing function and then to understand the
>>> >> generation of code in GCC, we need to debug its generation.
>>> >>
>>> >> The first snippet is generated using:
>>> >>
>>> >> gcc -O2 prb.c -S -o prb.s -Wall
>>> >>
>>> >> and the second using:
>>> >>
>>> >> gcc -O1 prb.c -S -o prb.s -Wall
>>> >>
>>> >> The GCC pipeline has different steps, I have centered my self in the
>>> >> tree optimizations.
>>> >> GCC dumps each of the intermediate states of the compilation, working
>>> >> with C like trees. For example to get the optimized version of the
>>> >> program, before generating the Assembler we have to do:
>>> >>
>>> >> gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-optimized=/dev/stdout
>>> >>
>>> >> This will generate:
>>> >>
>>> >> __attribute__((noinline))
>>> >> myFunc (long int i, int index)
>>> >> {
>>> >>   long int v;
>>> >>   long unsigned int _1;
>>> >>   long unsigned int _2;
>>> >>   int * _3;
>>> >>   int _4;
>>> >>   long int _8;
>>> >>
>>> >>   <bb 2> [local count: 1073741825]:
>>> >>   _1 = (long unsigned int) index_7(D);
>>> >>   _2 = _1 * 4;
>>> >>   _3 = &v + _2;
>>> >>   _4 = *_3;
>>> >>   _8 = (long int) _4;
>>> >>   v ={v} {CLOBBER};
>>> >>   return _8;
>>> >>
>>> >> }
>>> >>
>>> >> This is the optimized SSA (static single assign) version of the
>>> >> function (https://gcc.gnu.org/onlinedocs/gccint/SSA.html)
>>> >> As you can see in this version the code is already optimized, and our
>>> >> operations not correctly generated. We expect to see something like:
>>> >>
>>> >> __attribute__((noinline))
>>> >> myFunc (long int i, int index)
>>> >> {
>>> >>   long int x;
>>> >>   long int v;
>>> >>   long unsigned int _1;
>>> >>   long unsigned int _2;
>>> >>   int * _3;
>>> >>   int _4;
>>> >>   long int _10;
>>> >>
>>> >>   <bb 2> [local count: 1073741825]:
>>> >>   x_6 = i_5(D) >> 3;
>>> >>   v = x_6;
>>> >>   _1 = (long unsigned int) index_9(D);
>>> >>   _2 = _1 * 4;
>>> >>   _3 = &v + _2;
>>> >>   _4 = *_3;
>>> >>   _10 = (long int) _4;
>>> >>   v ={v} {CLOBBER};
>>> >>   return _10;
>>> >>
>>> >> }
>>> >>
>>> >> In the first SSA version, we are lacking the shift operation:
>>> >>
>>> >>   x_6 = i_5(D) >> 3;
>>> >>   v = x_6;
>>> >>
>>> >> So, we need to see in which of the optimizations and transformations
>>> >> our code is lost:
>>> >> To see all the steps we should execute:
>>> >>
>>> >> gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-all
>>> >>
>>> >> This will generate a lot, really a lot, of files in the same directory.
>>> >> They have the name: prb.c.xxx.yyyy
>>> >> Where xxx is the number of the step, and yyyy is the name of the step.
>>> >> Each of the files contains the result of applying the changes, so what
>>> >> I have done is looking in binary search where my code was lost.
>>> >>
>>> >> I have found that the problem was in the first dead code elimination
>>> >> step (prb.c.041t.cddce1)
>>> >> GCC does not like the crap code that we are writing, as we are copying
>>> >> a stack variable and then accessing directly to the memory:
>>> >>
>>> >> v = x;
>>> >> return ((int*)(&v))[index];
>>> >>
>>> >> So, basically, it was assuming that we were only using v and not x, so
>>> >> all the code modifying x is described (this optimization is called
>>> >> "dead store code deletion"). Basically tries to remove all the code
>>> >> that is affecting stack (temporary) variables that are never used.
>>> >>
>>> >> I am not sure if this is a bug in GCC, so I will report it. But I have
>>> >> fixed the problem writing the code in a proper way.
>>> >>
>>> >> Sorry again for the long mail, I wanted to store the knowledge and
>>> >> propagate it before I eventually will flush it. I like these things,
>>> >> but I am not debugging the GCC optimization pipeline all the days.
>>> >>
>>> >> --
>>> >> Pablo Tesone.
>>> >> [hidden email]
>>> >>
>>>
>>>
>>> --
>>> Pablo Tesone.
>>> [hidden email]
>>>


--
Pablo Tesone.
[hidden email]



--
_,,,^..^,,,_
best, Eliot
Reply | Threaded
Open this post in threaded view
|

Re: Debugging GCC code generation

Nicolas Cellier
It's the tribute to pay for optimization.
With unbounded pointer and pointer aliasing, the C compiler is in the worse world for optimizing.
As long as a FORTRAN compiler will beat a C compiler in generated code efficiency, the C guys will have to amend the standard ;)
So we have plenty of new rules (C89 is 30 years ago, not exactly as new as we pretend), the restrict keyword, etc...

Le mer. 11 déc. 2019 à 21:14, Eliot Miranda <[hidden email]> a écrit :


On Wed, Dec 11, 2019 at 12:03 PM Nicolas Cellier <[hidden email]> wrote:
Yes,
But we have to replace natural crappy code (split a long in 2 ints) that was once legal, by an even more crappy code (memcpy), so all in all, it's a crappy art.

:-)  Indeed.  Personally I liked it when C was a portable assembler.  That's what it's fit for and that's what it should be good at.  Trying to pretend it's Pascal is, um, pretentious?
 

Le mer. 11 déc. 2019 à 19:30, [hidden email] <[hidden email]> a écrit :
Hi Nicolas,
  thanks for the comment, you are right the problem is the bad
original code. But my opinion is that it just is not reporting the
situation correctly, generating a warning or non-optimizing the code
looks more like a expected behavior. Because as I have said, using a
constant as index in the last statement generates a meaningful warning
and the non-optimizated version of the function.

And again as you said, the only thing to learn about all this is that
we should not write crappy code.

On Wed, Dec 11, 2019 at 7:11 PM Nicolas Cellier
<[hidden email]> wrote:
>
> Of course, when I say "your" code, it's the code you have shown, and probably "our" (VMMaker) code ;)
>
> Le mer. 11 déc. 2019 à 19:05, Nicolas Cellier <[hidden email]> a écrit :
>>
>> Hi Pablo (again),
>> no, not a bug.
>>
>> The problem is in the source code. The compiler has the right to presume that your code is exempt of UB, because you cannot depend on UB (obviously).
>> So it can eliminate all code which corresponds to UB.
>>
>> The compiler has the right to assume that a pointer to an int cannot point to a long (UB).
>> So modifying a long cannot have any sort of impact on the content of the int pointer.
>> So the compiler can decouple both path return int content and assign long.
>> But assigning the long has no effect, so the code can be suppressed altogether.
>>
>> Le mer. 11 déc. 2019 à 18:54, [hidden email] <[hidden email]> a écrit :
>>>
>>> Hi Aliaksei,
>>>       to me it looks like a bug of GCC optimization. Basically, it is
>>> assuming that the x variable is used but never read or its value is
>>> never used. Also it assumes the same of the i variable, as we are only
>>> accessing indirectly to the memory where it locates (the code is even
>>> assuming that the variable exists, but it can be optimize out as in
>>> this scenario). Even though, the original C code is valid C code, we
>>> are not helping the compiler writing code like that. So I have
>>> rewritten the code in a way that does not use indirect memory access
>>> to the stack space.
>>>
>>> One thing more that makes me think is a bug, if you use an int
>>> constant as the index and not a parameter, the error does not occur
>>> (the code is not badly optimized) and there is a warning about the
>>> not-so-great access to the stack.
>>>
>>> On Wed, Dec 11, 2019 at 6:01 PM Aliaksei Syrel <[hidden email]> wrote:
>>> >
>>> > Hi Pablo,
>>> >
>>> > Wow! Thank you for the detective story :)
>>> >
>>> > Do I understand correctly that the original code causes undefined behavior and therefore can be changed (or even removed) by the compiler?
>>> > (because it returns something that is referencing memory on the stack)
>>> >
>>> > Please keep posting similar things in future! It is very educative :)
>>> >
>>> > Cheers,
>>> > Alex
>>> >
>>> >
>>> > On Wed, 11 Dec 2019 at 17:35, [hidden email] <[hidden email]> wrote:
>>> >>
>>> >> Hi,
>>> >>     this mail is related to Pharo because it is knowledge I found
>>> >> debugging the build of the VM, but the rest is to document it and
>>> >> perhaps someone will found it interesting (also I couldn't find it
>>> >> easily using Google). Sorry for the long mail!
>>> >>
>>> >> The problem
>>> >> ==========
>>> >>
>>> >> The following code does not produce good code in 8.3 when using optimizations:
>>> >>
>>> >> long __attribute__ ((noinline)) myFunc(long i, int index){
>>> >>    long v;
>>> >>    long x = i >> 3;
>>> >>
>>> >>    v = x;
>>> >>    return ((int*)(&v))[index];
>>> >> }
>>> >>
>>> >> #include <stdio.h>
>>> >>
>>> >> int main(){
>>> >>
>>> >>     long i;
>>> >>     int x;
>>> >>
>>> >>     scanf("%ld", &i);
>>> >>     scanf("%d", &x);
>>> >>
>>> >>     printf("%ld",myFunc(i,x));
>>> >> }
>>> >>
>>> >> Basically, with -02, it generates the following code:
>>> >>
>>> >> myFunc:
>>> >>      movslq %esi, %rsi
>>> >>      movslq -8(%rsp,%rsi,4), %rax
>>> >>      ret
>>> >>
>>> >> And with -01 it generates the following code:
>>> >>
>>> >> myFunc:
>>> >>      sarq $3, %rdi
>>> >>      movq %rdi, -8(%rsp)
>>> >>      movslq %esi, %rsi
>>> >>      movslq -8(%rsp,%rsi,4), %rax
>>> >>      ret
>>> >>
>>> >> As you can see, the more optimized version is losing the bit shift (or
>>> >> any other operation).
>>> >> To detect the problem it was not easy, basically, I have used the
>>> >> Pharo Tests to detect the failing function and then to understand the
>>> >> generation of code in GCC, we need to debug its generation.
>>> >>
>>> >> The first snippet is generated using:
>>> >>
>>> >> gcc -O2 prb.c -S -o prb.s -Wall
>>> >>
>>> >> and the second using:
>>> >>
>>> >> gcc -O1 prb.c -S -o prb.s -Wall
>>> >>
>>> >> The GCC pipeline has different steps, I have centered my self in the
>>> >> tree optimizations.
>>> >> GCC dumps each of the intermediate states of the compilation, working
>>> >> with C like trees. For example to get the optimized version of the
>>> >> program, before generating the Assembler we have to do:
>>> >>
>>> >> gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-optimized=/dev/stdout
>>> >>
>>> >> This will generate:
>>> >>
>>> >> __attribute__((noinline))
>>> >> myFunc (long int i, int index)
>>> >> {
>>> >>   long int v;
>>> >>   long unsigned int _1;
>>> >>   long unsigned int _2;
>>> >>   int * _3;
>>> >>   int _4;
>>> >>   long int _8;
>>> >>
>>> >>   <bb 2> [local count: 1073741825]:
>>> >>   _1 = (long unsigned int) index_7(D);
>>> >>   _2 = _1 * 4;
>>> >>   _3 = &v + _2;
>>> >>   _4 = *_3;
>>> >>   _8 = (long int) _4;
>>> >>   v ={v} {CLOBBER};
>>> >>   return _8;
>>> >>
>>> >> }
>>> >>
>>> >> This is the optimized SSA (static single assign) version of the
>>> >> function (https://gcc.gnu.org/onlinedocs/gccint/SSA.html)
>>> >> As you can see in this version the code is already optimized, and our
>>> >> operations not correctly generated. We expect to see something like:
>>> >>
>>> >> __attribute__((noinline))
>>> >> myFunc (long int i, int index)
>>> >> {
>>> >>   long int x;
>>> >>   long int v;
>>> >>   long unsigned int _1;
>>> >>   long unsigned int _2;
>>> >>   int * _3;
>>> >>   int _4;
>>> >>   long int _10;
>>> >>
>>> >>   <bb 2> [local count: 1073741825]:
>>> >>   x_6 = i_5(D) >> 3;
>>> >>   v = x_6;
>>> >>   _1 = (long unsigned int) index_9(D);
>>> >>   _2 = _1 * 4;
>>> >>   _3 = &v + _2;
>>> >>   _4 = *_3;
>>> >>   _10 = (long int) _4;
>>> >>   v ={v} {CLOBBER};
>>> >>   return _10;
>>> >>
>>> >> }
>>> >>
>>> >> In the first SSA version, we are lacking the shift operation:
>>> >>
>>> >>   x_6 = i_5(D) >> 3;
>>> >>   v = x_6;
>>> >>
>>> >> So, we need to see in which of the optimizations and transformations
>>> >> our code is lost:
>>> >> To see all the steps we should execute:
>>> >>
>>> >> gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-all
>>> >>
>>> >> This will generate a lot, really a lot, of files in the same directory.
>>> >> They have the name: prb.c.xxx.yyyy
>>> >> Where xxx is the number of the step, and yyyy is the name of the step.
>>> >> Each of the files contains the result of applying the changes, so what
>>> >> I have done is looking in binary search where my code was lost.
>>> >>
>>> >> I have found that the problem was in the first dead code elimination
>>> >> step (prb.c.041t.cddce1)
>>> >> GCC does not like the crap code that we are writing, as we are copying
>>> >> a stack variable and then accessing directly to the memory:
>>> >>
>>> >> v = x;
>>> >> return ((int*)(&v))[index];
>>> >>
>>> >> So, basically, it was assuming that we were only using v and not x, so
>>> >> all the code modifying x is described (this optimization is called
>>> >> "dead store code deletion"). Basically tries to remove all the code
>>> >> that is affecting stack (temporary) variables that are never used.
>>> >>
>>> >> I am not sure if this is a bug in GCC, so I will report it. But I have
>>> >> fixed the problem writing the code in a proper way.
>>> >>
>>> >> Sorry again for the long mail, I wanted to store the knowledge and
>>> >> propagate it before I eventually will flush it. I like these things,
>>> >> but I am not debugging the GCC optimization pipeline all the days.
>>> >>
>>> >> --
>>> >> Pablo Tesone.
>>> >> [hidden email]
>>> >>
>>>
>>>
>>> --
>>> Pablo Tesone.
>>> [hidden email]
>>>


--
Pablo Tesone.
[hidden email]



--
_,,,^..^,,,_
best, Eliot