My development repository occasionally gets into a state where the session
dies and GemTools tells me "Please report to your GemStone Administrator: Fatal Internal Error, 5015, 15621, IO Error" when running certain Smalltalk expressions or doing a search for strings in source. Same in Topaz. This persists across restarting the stone, I need to rebuild the extent from scratch. The only strange thing I did is "WAFastCGIAdaptor startOn: 8383" which I then needed to interrupt by pressing ALT-. Thereafter any Smalltalk, even just evaluating '1' would cause a stacktrace which I hope to reproduce here when it happens again. Logfile produced by using GEM_HALT_ON_ERROR = 4126 attached. _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass gemnetobject10663revelations.finworks.biz.log (31K) Download Attachment |
Pieter, Thanks for the report ... I've passed this along ... I'm out of the office for the day and most folks are out with their kids for "spring break" ... I'll get back to you with more information when I have more information ...
Dale On Thu, Mar 27, 2014 at 7:00 AM, Pieter Nagel <[hidden email]> wrote: My development repository occasionally gets into a state where the session _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
In reply to this post by Pieter Nagel-3
Pieter, In order to get to the bottom of this please run a page audit and object audit on the extent producing the IO errors and send us any error output. after that we may need a copy of the extent which is producing the errors, along with tranlogs. Dale
On Thu, Mar 27, 2014 at 7:00 AM, Pieter Nagel <[hidden email]> wrote: My development repository occasionally gets into a state where the session _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
Dale
Pageaudit found errors, log attached. Object audit fails to run because it hits the same Invalid pageKind error that the page audit already reported. Not sure what else I can do to run an object audit, since the documented way to recover from the page audit is to restore from backup, at which point the repository is in its pre-error state anyway. _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass pageaudit.log (14K) Download Attachment |
These fatal errors are still going on. so far it affects only development
and staging repositories, so we can work around it, but we're growing increasingly nervous that it might hit production as well. To recap: repositories are becoming randomly corrupted into a state where any meaningfull Smalltalk evaluation causes topaz to dump errors like: RepRead Error hdr.pageId 0 != pageId 14635----------------------------------------------------- GemStone: Error Fatal Please report to your GemStone Administrator: Fatal Internal Error, 5015, 14635, IO Error Error Category: 231169 [Internal Fatal] Number: 4126 Arg Count: 3 Context : 20 exception : 20 Arg 1: 40122 Arg 2: 117082 Arg 3: 20 These bugs seem to be random. Rebuilding a clean development extent will fail one time and pass the next even though there are no changes to the code that is being filed in on top a virgin exten0.dbf in the two situations. Sometimes reboots inbetween seem to be needed for the problem to go away. Any advice for getting to the bottom of this? _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
Pieter, start with a fresh extent0.seaside.dbf and before starting stone , add STN_TRAN_FULL_LOGGING=true; STN_TRAN_LOG_DEBUG_LEVEL=3; to the stone config file . Then after repository gets corrupted, run pageaudit and if errors are similar , send us the extent and the tranlogs for analysis (if possible). Otherwise we will want you to run some tranlog searches for us ... On Mon, Jul 7, 2014 at 1:47 AM, <[hidden email]> wrote: These fatal errors are still going on. so far it affects only development _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
> Pieter,
> > start with a fresh extent0.seaside.dbf and > before starting stone , add > STN_TRAN_FULL_LOGGING=true; > STN_TRAN_LOG_DEBUG_LEVEL=3; > to the stone config file . > Then after repository gets corrupted, run pageaudit and > if errors are similar , send us the extent and the tranlogs > for analysis (if possible). Otherwise we will want you to > run some tranlog searches for us ... Will do. But note that we are using /dev/null as STN_TRAN_LOG_DIRECTORIES, so there will not be any tranlogs for us to send/search through. I sometimes wonder if using /dev/null in itself may be behind this bug? Do you recommend taking pains that STN_TRAN_LOG_DEBUG_LEVEL is *not* set on production repositories? _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
Pieter, Recommendations from engineering: You need to write tranlogs to actual file system if we are to have tranlogs to debug. Suggest reducing to STN_TRAN_LOG_DEBUG_LEVEL=1; if performance with STN_TRAN_LOG_DEBUG_LEVEL=3; is a problem. For production we recommend STN_TRAN_FULL_LOGGING=true; STN_TRAN_LOG_DEBUG_LEVEL=0; Suggest tranlogs written to separate file system from extents to reduce performance impact of writing them. You should probably start thinking of upgrading to 3.1.0.6 or 3.2.1 there are no specific bugfixes for the issue you are seeing, but when hitting bugs, it is always prudent to be running with the latest version. Dale On Tue, Jul 8, 2014 at 6:14 AM, <[hidden email]> wrote:
_______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
Just an update:
So far it seems as if the bug goes away when we roll out the STN_TRAN_LOG_DEBUG_LEVEL setting - hard to say for sure though with a Heisenbug. As part of setting STN_TRAN_LOG_DEBUG_LEVEL, we also have to switch tranlogs from /dev/null to real directories. So I strongly suspect that tranlogging to /dev/null is the root cause of these RepRead errors. I won't be surprised if the code paths around that are not as well tested - I think we here are a bit unusual in the GemStone world in spending so much time in that "mode" which other people only use occasionally to speed up bulk imports. _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
It really seems this bug only occurs when tranlogging to /dev/null.
It's now two weeks later after switching tranlogs to real directories, and this bug still has not resurfaced. Used to happen a handful of times a week. But on development test-suites etc. run about 3 times faster when tranlogging to /dev/null. Is there other way to get debug information on this while still tranlogging to /dev/null? _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
Setting GEM_HALT_ON_ERROR in the session's config file to the number of the error you want more diagnostic data on will cause a stack trace to be printed out to the gem log. That would be a start... ------------------------------------------------------------------------ Bill Erickson GemTalk Systems Engineering 15220 NW Greenbrier Parkway #240, Beaverton OR 97006 ------------------------------------------------------------------------ On Mon, Jul 28, 2014 at 2:28 AM, <[hidden email]> wrote: It really seems this bug only occurs when tranlogging to /dev/null. _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
In reply to this post by Pieter Nagel-3
Pieter, Despite your observations that /dev/null seems to be implicated in the page faults, we don't really think that it is the root cause. We have had some bugs that we have fixed in 3.1.0.6 and 3.2.1 that might address the a problem where timing issues with stopping and restarting stones could cause corruption. If you could send us the stone logs for the old sessions when you saw the corruption we might be able to see a familiar pattern ...
If it is this problem, then /dev/null tranlogs could make things a bit worse... Dale On Mon, Jul 28, 2014 at 2:28 AM, <[hidden email]> wrote: It really seems this bug only occurs when tranlogging to /dev/null. _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
Dale,
We're going to upgrade to 3.1.0.6 or 3.2.1 first, and get back to you if the bug persists. But note that this bug only ever occured to stones that had never been shut down at all. In other words, to a pristine extent0 that was fired up, used to restore or had our code loaded to it, and had not yet the first shutdown of its lifetime. Once a stone was broken, not even stopstone could log in to it. Conversely, any stone that could be shutdown and restarted would thereafter be totally immune to this bug, in my experience. So I'm not sure if the shutdown code is involved here at all. _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
Pieter, You should be glad to know that based on your feedback and experience we decided to turn on /dev/null tranlogs in some of our stress tests and have indeed found some problems that could lead to the errors that you are seeing ... no schedule for when the bugfixes will show up ...
BTW the fact that you never stopped/restarted your stone kinda sealed the deal on implicating /dev/null tranlogs:) So thanks for your patience and help in isolating the problem ... Dale On Tue, Jul 29, 2014 at 6:30 AM, <[hidden email]> wrote: Dale, _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
Dale,
I see from the release notes that 3.2.2 fixes this. Thanks for all the help. > Pieter, > > You should be glad to know that based on your feedback and experience we > decided to turn on /dev/null tranlogs in some of our stress tests and have > indeed found some problems that could lead to the errors that you are > seeing ... no schedule for when the bugfixes will show up ... > > BTW the fact that you never stopped/restarted your stone kinda sealed the > deal on implicating /dev/null tranlogs:) > > So thanks for your patience and help in isolating the problem ... > > Dale _______________________________________________ Glass mailing list [hidden email] http://lists.gemtalksystems.com/mailman/listinfo/glass |
Free forum by Nabble | Edit this page |