Re: [Pharo-dev] Debugging GCC code generation

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: [Pharo-dev] Debugging GCC code generation

Eliot Miranda-2
 
Hi Pablo,

On Wed, Dec 11, 2019 at 11:33 AM [hidden email] <[hidden email]> wrote:
Hi Nicolas,
  thanks for the comment, you are right the problem is the bad
original code. But my opinion is that it just is not reporting the
situation correctly, generating a warning or non-optimizing the code
looks more like a expected behavior. Because as I have said, using a
constant as index in the last statement generates a meaningful warning
and the non-optimizated version of the function.

And again as you said, the only thing to learn about all this is that
we should not write crappy code.

Hang on.  "crappy code" is too strong.  Back in tre ANSI days it was not undefined behavior, and in fact I think this only because undefined behavior because C compilers were faced with lots of code where variables were typed int but were being used on 64-bit machines.  Had the C world used long as its default type then there would have been no need to make casting the address of a long into the address of an int undefined behavior.  So this isn't crappy code (if you think of what it means in memory on a 64-bit little endnan machine it is perfectly sensible).  It /is/ code that is inappropriate given C99.

So yes, we have to bow to the C gods and generate C99 compliant code, but what we were doing here wasn't crappy, it was merely code that became undefined behavior so that C compilers could generate more efficient code for 64-bit systems in the presence of code that contains lots of integer variables declared as int.  So let's just say that we have to generate C99 compliant code.  You'll notice that the Smalltalk code in the original does exactly what you suggest Pablo.

fetchLong32: fieldIndex ofFloatObject: oop
"index by word size, and return a pointer as long as the word size"
| bits |
(self isImmediateFloat: oop) ifFalse:
[^self fetchLong32: fieldIndex ofObject: oop].
bits := self smallFloatBitsOf: oop.
^self
cCode: [(self cCoerceSimple: (self addressOf: bits) to: #'int *') at: fieldIndex]
inSmalltalk:
[self flag: #endian.
fieldIndex = 0
ifTrue: [bits bitAnd: 16rFFFFFFFF]
ifFalse: [bits >> 32]]


On Wed, Dec 11, 2019 at 7:11 PM Nicolas Cellier
<[hidden email]> wrote:
>
> Of course, when I say "your" code, it's the code you have shown, and probably "our" (VMMaker) code ;)
>
> Le mer. 11 déc. 2019 à 19:05, Nicolas Cellier <[hidden email]> a écrit :
>>
>> Hi Pablo (again),
>> no, not a bug.
>>
>> The problem is in the source code. The compiler has the right to presume that your code is exempt of UB, because you cannot depend on UB (obviously).
>> So it can eliminate all code which corresponds to UB.
>>
>> The compiler has the right to assume that a pointer to an int cannot point to a long (UB).
>> So modifying a long cannot have any sort of impact on the content of the int pointer.
>> So the compiler can decouple both path return int content and assign long.
>> But assigning the long has no effect, so the code can be suppressed altogether.
>>
>> Le mer. 11 déc. 2019 à 18:54, [hidden email] <[hidden email]> a écrit :
>>>
>>> Hi Aliaksei,
>>>       to me it looks like a bug of GCC optimization. Basically, it is
>>> assuming that the x variable is used but never read or its value is
>>> never used. Also it assumes the same of the i variable, as we are only
>>> accessing indirectly to the memory where it locates (the code is even
>>> assuming that the variable exists, but it can be optimize out as in
>>> this scenario). Even though, the original C code is valid C code, we
>>> are not helping the compiler writing code like that. So I have
>>> rewritten the code in a way that does not use indirect memory access
>>> to the stack space.
>>>
>>> One thing more that makes me think is a bug, if you use an int
>>> constant as the index and not a parameter, the error does not occur
>>> (the code is not badly optimized) and there is a warning about the
>>> not-so-great access to the stack.
>>>
>>> On Wed, Dec 11, 2019 at 6:01 PM Aliaksei Syrel <[hidden email]> wrote:
>>> >
>>> > Hi Pablo,
>>> >
>>> > Wow! Thank you for the detective story :)
>>> >
>>> > Do I understand correctly that the original code causes undefined behavior and therefore can be changed (or even removed) by the compiler?
>>> > (because it returns something that is referencing memory on the stack)
>>> >
>>> > Please keep posting similar things in future! It is very educative :)
>>> >
>>> > Cheers,
>>> > Alex
>>> >
>>> >
>>> > On Wed, 11 Dec 2019 at 17:35, [hidden email] <[hidden email]> wrote:
>>> >>
>>> >> Hi,
>>> >>     this mail is related to Pharo because it is knowledge I found
>>> >> debugging the build of the VM, but the rest is to document it and
>>> >> perhaps someone will found it interesting (also I couldn't find it
>>> >> easily using Google). Sorry for the long mail!
>>> >>
>>> >> The problem
>>> >> ==========
>>> >>
>>> >> The following code does not produce good code in 8.3 when using optimizations:
>>> >>
>>> >> long __attribute__ ((noinline)) myFunc(long i, int index){
>>> >>    long v;
>>> >>    long x = i >> 3;
>>> >>
>>> >>    v = x;
>>> >>    return ((int*)(&v))[index];
>>> >> }
>>> >>
>>> >> #include <stdio.h>
>>> >>
>>> >> int main(){
>>> >>
>>> >>     long i;
>>> >>     int x;
>>> >>
>>> >>     scanf("%ld", &i);
>>> >>     scanf("%d", &x);
>>> >>
>>> >>     printf("%ld",myFunc(i,x));
>>> >> }
>>> >>
>>> >> Basically, with -02, it generates the following code:
>>> >>
>>> >> myFunc:
>>> >>      movslq %esi, %rsi
>>> >>      movslq -8(%rsp,%rsi,4), %rax
>>> >>      ret
>>> >>
>>> >> And with -01 it generates the following code:
>>> >>
>>> >> myFunc:
>>> >>      sarq $3, %rdi
>>> >>      movq %rdi, -8(%rsp)
>>> >>      movslq %esi, %rsi
>>> >>      movslq -8(%rsp,%rsi,4), %rax
>>> >>      ret
>>> >>
>>> >> As you can see, the more optimized version is losing the bit shift (or
>>> >> any other operation).
>>> >> To detect the problem it was not easy, basically, I have used the
>>> >> Pharo Tests to detect the failing function and then to understand the
>>> >> generation of code in GCC, we need to debug its generation.
>>> >>
>>> >> The first snippet is generated using:
>>> >>
>>> >> gcc -O2 prb.c -S -o prb.s -Wall
>>> >>
>>> >> and the second using:
>>> >>
>>> >> gcc -O1 prb.c -S -o prb.s -Wall
>>> >>
>>> >> The GCC pipeline has different steps, I have centered my self in the
>>> >> tree optimizations.
>>> >> GCC dumps each of the intermediate states of the compilation, working
>>> >> with C like trees. For example to get the optimized version of the
>>> >> program, before generating the Assembler we have to do:
>>> >>
>>> >> gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-optimized=/dev/stdout
>>> >>
>>> >> This will generate:
>>> >>
>>> >> __attribute__((noinline))
>>> >> myFunc (long int i, int index)
>>> >> {
>>> >>   long int v;
>>> >>   long unsigned int _1;
>>> >>   long unsigned int _2;
>>> >>   int * _3;
>>> >>   int _4;
>>> >>   long int _8;
>>> >>
>>> >>   <bb 2> [local count: 1073741825]:
>>> >>   _1 = (long unsigned int) index_7(D);
>>> >>   _2 = _1 * 4;
>>> >>   _3 = &v + _2;
>>> >>   _4 = *_3;
>>> >>   _8 = (long int) _4;
>>> >>   v ={v} {CLOBBER};
>>> >>   return _8;
>>> >>
>>> >> }
>>> >>
>>> >> This is the optimized SSA (static single assign) version of the
>>> >> function (https://gcc.gnu.org/onlinedocs/gccint/SSA.html)
>>> >> As you can see in this version the code is already optimized, and our
>>> >> operations not correctly generated. We expect to see something like:
>>> >>
>>> >> __attribute__((noinline))
>>> >> myFunc (long int i, int index)
>>> >> {
>>> >>   long int x;
>>> >>   long int v;
>>> >>   long unsigned int _1;
>>> >>   long unsigned int _2;
>>> >>   int * _3;
>>> >>   int _4;
>>> >>   long int _10;
>>> >>
>>> >>   <bb 2> [local count: 1073741825]:
>>> >>   x_6 = i_5(D) >> 3;
>>> >>   v = x_6;
>>> >>   _1 = (long unsigned int) index_9(D);
>>> >>   _2 = _1 * 4;
>>> >>   _3 = &v + _2;
>>> >>   _4 = *_3;
>>> >>   _10 = (long int) _4;
>>> >>   v ={v} {CLOBBER};
>>> >>   return _10;
>>> >>
>>> >> }
>>> >>
>>> >> In the first SSA version, we are lacking the shift operation:
>>> >>
>>> >>   x_6 = i_5(D) >> 3;
>>> >>   v = x_6;
>>> >>
>>> >> So, we need to see in which of the optimizations and transformations
>>> >> our code is lost:
>>> >> To see all the steps we should execute:
>>> >>
>>> >> gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-all
>>> >>
>>> >> This will generate a lot, really a lot, of files in the same directory.
>>> >> They have the name: prb.c.xxx.yyyy
>>> >> Where xxx is the number of the step, and yyyy is the name of the step.
>>> >> Each of the files contains the result of applying the changes, so what
>>> >> I have done is looking in binary search where my code was lost.
>>> >>
>>> >> I have found that the problem was in the first dead code elimination
>>> >> step (prb.c.041t.cddce1)
>>> >> GCC does not like the crap code that we are writing, as we are copying
>>> >> a stack variable and then accessing directly to the memory:
>>> >>
>>> >> v = x;
>>> >> return ((int*)(&v))[index];
>>> >>
>>> >> So, basically, it was assuming that we were only using v and not x, so
>>> >> all the code modifying x is described (this optimization is called
>>> >> "dead store code deletion"). Basically tries to remove all the code
>>> >> that is affecting stack (temporary) variables that are never used.
>>> >>
>>> >> I am not sure if this is a bug in GCC, so I will report it. But I have
>>> >> fixed the problem writing the code in a proper way.
>>> >>
>>> >> Sorry again for the long mail, I wanted to store the knowledge and
>>> >> propagate it before I eventually will flush it. I like these things,
>>> >> but I am not debugging the GCC optimization pipeline all the days.
>>> >>
>>> >> --
>>> >> Pablo Tesone.
>>> >> [hidden email]
>>> >>
>>>
>>>
>>> --
>>> Pablo Tesone.
>>> [hidden email]
>>>


--
Pablo Tesone.
[hidden email]



--
_,,,^..^,,,_
best, Eliot
Reply | Threaded
Open this post in threaded view
|

Re: [Pharo-dev] Debugging GCC code generation

Ben Coman
 


On Thu, 12 Dec 2019 at 04:03, Eliot Miranda <[hidden email]> wrote:
 
Hi Pablo,

On Wed, Dec 11, 2019 at 11:33 AM [hidden email] <[hidden email]> wrote:
Hi Nicolas,
  thanks for the comment, you are right the problem is the bad
original code. But my opinion is that it just is not reporting the
situation correctly, generating a warning or non-optimizing the code
looks more like a expected behavior. Because as I have said, using a
constant as index in the last statement generates a meaningful warning
and the non-optimizated version of the function.

And again as you said, the only thing to learn about all this is that
we should not write crappy code.

Hang on.  "crappy code" is too strong.  Back in tre ANSI days it was not undefined behavior, and in fact I think this only because undefined behavior because C compilers were faced with lots of code where variables were typed int but were being used on 64-bit machines.  Had the C world used long as its default type then there would have been no need to make casting the address of a long into the address of an int undefined behavior.  So this isn't crappy code (if you think of what it means in memory on a 64-bit little endnan machine it is perfectly sensible).  It /is/ code that is inappropriate given C99.

So yes, we have to bow to the C gods and generate C99 compliant code, but what we were doing here wasn't crappy, it was merely code that became undefined behavior so that C compilers could generate more efficient code for 64-bit systems in the presence of code that contains lots of integer variables declared as int.  So let's just say that we have to generate C99 compliant code.  You'll notice that the Smalltalk code in the original does exactly what you suggest Pablo.

So maybe interesting to see what happens if C89 is specified...

Possibly useful sanitizer...

cheers -ben