linux build stability

Eliot Miranda-2
Hi All,

     you may already know that there have been strange stability problems with the Cog VM on Linux.  Problems with the heartbeat appear to depend on the specific compilation: one compilation of a given source produces an executable that crashes, another compilation of the same source produces one that doesn't.  Recent testing at Teleplace showed that an effect attributed to a presumed compiler bug (specifically, a high optimization level for the heartbeat causing a crash) was not repeatable.  So today, while building new production VMs for Teleplace, I decided to do three parallel Linux builds and see whether all produced the same results.  While some macros used in the source are date-dependent (use of __DATE__), AFAIA nothing apart from version.c/version.o depends on the time, and Linux objects contain no timestamps or current directory paths.  So, provided different compilations of the same source are done on the same day, the results should be bit-identical.  In my experiment this turns out not to be the case, which is more than a little alarming.

What I'm seeing is different results after duplicating unixbuild/bld to unixbuild/bldb and unixbuild/bldc, doing identical configures and makes in each of the three directories, and then comparing the resulting objects.  I see this on a bare-metal laptop with local sources running CERN SLC5 and on a Parallels VM running CentOS 5.3 (both derived from RHEL).  I'm using gcc 4.1.2.  Here's a script that shows example differences:

bld$ for f in *.o vm/*.o; do echo $f;cmp $f ../bldb/$f; cmp $f ../bldc/$f; done
disabledPlugins.o
disabledPlugins.o ../bldb/disabledPlugins.o differ: byte 200, line 4
disabledPlugins.o ../bldc/disabledPlugins.o differ: byte 200, line 4
version.o
version.o ../bldb/version.o differ: byte 166, line 3
version.o ../bldc/version.o differ: byte 166, line 3
vm/aio.o
vm/cogit.o
vm/debug.o
vm/gcc3x-cointerp.o
vm/osExports.o
vm/sqExternalSemaphores.o
vm/sqHeapMap.o
vm/sqLinuxHeartbeat.o
vm/sqLinuxWatchdog.o
vm/sqLinuxWatchdog.o ../bldb/vm/sqLinuxWatchdog.o differ: byte 33, line 1
vm/sqLinuxWatchdog.o ../bldc/vm/sqLinuxWatchdog.o differ: byte 33, line 1
vm/sqNamedPrims.o
vm/sqNamedPrims.o ../bldb/vm/sqNamedPrims.o differ: byte 6346, line 30
vm/sqNamedPrims.o ../bldc/vm/sqNamedPrims.o differ: byte 6346, line 30
vm/sqTicker.o
vm/sqUnixCharConv.o
vm/sqUnixExternalPrims.o
vm/sqUnixMain.o
vm/sqUnixMain.o ../bldb/vm/sqUnixMain.o differ: byte 31415, line 170
vm/sqUnixMain.o ../bldc/vm/sqUnixMain.o differ: byte 31414, line 170
vm/sqUnixMemory.o
vm/sqUnixThreads.o
vm/sqUnixVMProfile.o
vm/sqVirtualMachine.o
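
cmp stops at the first differing byte, so the listing above does not show how large each difference is. A variation that counts the differing bytes might help separate a small stamp-like difference from wholesale different output. This is only a sketch: the function names count_diff_bytes and report_diffs are made up for illustration, and it assumes the same bld/bldb/bldc layout as above.

```shell
# count_diff_bytes A B: print the number of byte positions that differ.
# cmp -l prints one line per differing byte; cmp exits non-zero when the
# files differ, so its status is deliberately ignored here.
count_diff_bytes() {
    { cmp -l "$1" "$2" 2>/dev/null || true; } | wc -l | tr -d ' '
}

# report_diffs OTHER_DIR: run from one build directory and compare every
# object file against the file of the same name under OTHER_DIR.
report_diffs() {
    other=$1
    for f in *.o vm/*.o; do
        [ -f "$f" ] || continue          # skip unmatched glob patterns
        if ! cmp -s "$f" "$other/$f"; then
            echo "$f: $(count_diff_bytes "$f" "$other/$f") bytes differ"
        fi
    done
}

# Example (not run here): cd unixbuild/bld && report_diffs ../bldb
```

Note that if two objects have different lengths, the count only covers the common prefix, so a shifted file can report a very large number for what is really a one-byte insertion.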

Using objdump --disassemble I can see, for example, that sqLinuxWatchdog.o and sqUnixMain.o differ only in the symbol table, not in the executable code.  So perhaps this is not meaningful, and merely noise.  But it is rather worrying that even a simple file like disabledPlugins.c produces different objects in different runs:

bld$ cat disabledPlugins.c
/* this should be in a header file, but it isn't.  ho hum. */
typedef struct {
  char *pluginName;
  char *primitiveName;
  void *primitiveAddress;
} sqExport;
sqExport vm_display_Quartz_exports[] = { 0, 0, 0 };
sqExport vm_display_custom_exports[] = { 0, 0, 0 };
sqExport vm_display_fbdev_exports[] = { 0, 0, 0 };
sqExport vm_sound_MacOSX_exports[] = { 0, 0, 0 };
sqExport vm_sound_NAS_exports[] = { 0, 0, 0 };
sqExport vm_sound_OSS_exports[] = { 0, 0, 0 };
sqExport vm_sound_Sun_exports[] = { 0, 0, 0 };
sqExport vm_sound_custom_exports[] = { 0, 0, 0 };
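
The objdump --disassemble comparison mentioned above can be scripted so that it runs over every object. The sketch below assumes objdump is on the PATH; same_code is a hypothetical helper name. It drops the one header line in which objdump echoes the file name, since that line always differs between build directories.

```shell
# same_code A.o B.o: succeed if the two objects disassemble identically.
# The "file format" header line carries the path, so it is filtered out
# before comparing.
same_code() {
    a=$(objdump --disassemble "$1" | grep -v 'file format')
    b=$(objdump --disassemble "$2" | grep -v 'file format')
    [ "$a" = "$b" ]
}

# Example (not run here), from unixbuild/bld:
#   for f in *.o vm/*.o; do
#       same_code "$f" "../bldb/$f" || echo "$f: code differs"
#   done
```

An object such as disabledPlugins.o, which holds only data, has nothing to disassemble, so a match here says nothing about its data sections or symbol table.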


I wonder
- do you see the same effect?
- does this happen with gcc versions other than 4.1.2?
- does it happen on non-RHEL-derived distros?
- is this a meaningful signal or just harmless noise?
- what am I doing wrong?
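
One inexpensive check on the "harmless noise" question is to test the assumption that no directory paths end up in the objects, since the three build directories have different names. The sketch below is a guess at a useful check rather than a diagnosis; embeds_string is a hypothetical helper name.

```shell
# embeds_string FILE TEXT: succeed if FILE contains TEXT as raw bytes.
# -a treats the binary file as text, -F matches TEXT literally,
# -q suppresses output so that only the exit status is used.
embeds_string() {
    grep -a -F -q -- "$2" "$1"
}

# Example (not run here), from unixbuild/bld:
#   for f in *.o vm/*.o; do
#       embeds_string "$f" "$PWD" && echo "$f embeds $PWD"
#   done
```

If any differing object does embed its build directory, that would account for a difference between bld and the two copies without any change in the generated code.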

Clearly I need to look more carefully, but I thought I'd ask y'all in order to understand and hopefully solve the build instabilities as swiftly as possible.

If you do want to try and reproduce this, simply duplicate the build directory (unixbuild/bld in the Cog VM source) twice and do three separate configures and makes, one in each of the build directories, each from the same source code.  Then run some variation of the script above to compare the object files so produced.
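
The duplication step can be scripted roughly as follows. This is a sketch: rebuild_copies is a made-up helper, and the build command is passed in as arguments because the exact configure invocation depends on your setup.

```shell
# rebuild_copies SRC CMD [ARGS...]: copy SRC to SRCb and SRCc, then run
# the same command inside each of the three directories.
rebuild_copies() {
    src=$1
    shift
    for suffix in b c; do
        if [ -e "$src$suffix" ]; then
            echo "$src$suffix already exists" >&2
            return 1
        fi
        cp -R "$src" "$src$suffix" || return 1
    done
    for dir in "$src" "${src}b" "${src}c"; do
        # The subshell keeps the cd from leaking into the caller.
        ( cd "$dir" && "$@" ) || return 1
    done
}

# Example (not run here); substitute your own configure step:
#   rebuild_copies unixbuild/bld make
```

Copying before building matters: the copies must start from the same unbuilt state as the original, otherwise stale objects could mask or create differences.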

best
Eliot

Re: linux build stability

Andres Valloud-4
In case it helps, I've seen GCC produce different code for function A
when I change function B.  This can happen even if function B is not in
the (reasonably close) execution path of function A.  However, the
differences I've seen are along the lines of, e.g., using %r10 instead of
%r9, or %r13+3 instead of %r12+4 (where the pointers basically point to
the same desired data).  Since both outputs are legal, the differences
can be attributed to the internal state of the compiler, which perhaps is
not totally deterministic.  As long as the output is valid, one should
not be able to complain about this variability.  Still, these changes do
expose other bugs.  So, if you see these kinds of differences correlated
with "the VM crashes" or "the VM appears to work", then I'd suspect ABI
violations with regard to the registers involved.

On 2/1/11 22:17, Eliot Miranda wrote:

> [...]