Hi,
a strange Postgres mistery currently drives me mad. I've never experienced such an absurd anomaly. I lost the entire day trying to find the cause, to no avail. Hopefully someone can point me to something helpful. Literally overnight, without changing any code, queries submitted via PostgresSQLEXDI that worked yesterday do no longer work now. The only thing I did was I restarted the postgres server this morning (Linux). However, it is always started by the same script (via init.d), so the config shouldn't have changed. The server also works fine with all other applications. As the stack below implies, a SocketAccessor waits infinitely for an answer to arrive from the server after the query has successfully been submitted (the server log confirms a successful execution, along with the execution time of the query). Now for the fun: This happens with /particular/ queries and tables only (reliably), while most others work fine. Example 1: After inserting four tuples into a particular table (simple INSERT INTO ... VALUES ...), it always jams. Example 2: A simple "SELECT * FROM ... ORDER BY" jams (not a single row returned). Even more fun: - The same queries work on a local database without problems! - The same queries work fine with another tool (pgAdmin3), so it can't be a server or hardware (disk) issue. - The PostgreSQLEXDI parcel is the same as yesterday - No errors are reported (neither local, nor at the server side) - I'm on a Mac, but this also happens on a Windows VM (image unchanged since days) - No, I didn't change any database related code - No, I didn't tamper with the VM sources What the heck? Is someone sniffing packets off my wire? What can I do to narrow down the issue? It must be something really creepy ... I'm using VW 7.5 and the PostgreSQLEXDI that came with it. Any hint is appreciated, as I'm really desperate. Many thanks in advance! Andre -- User Interrupt Semaphore>>wait SocketAccessor(IOAccessor)>>readWaitWithTimeoutMs: SpSocket>>waitForReadDataUpToMs: OSkSocketReadStream>>fillBufferFromSocket OSkSocketReadStream>>read:into: OSkSocketReadStream>>readFromBuffer: OSkSocketReadStream>>next: ByteValue(FixedSizeValue)>>size:readFrom:for: ByteValue class(FixedSizeValue class)>>size:readFrom:for: PostgreSQLMessage class>>readFrom:for: PostgreSQLMessageReadStream>>nextInTheContextOf: PostgreSQLMessageReadStream>>next PostgreSQLQueryResult(PostgreSQLRequestResult)>>readResult PostgreSQLQueryResult(PostgreSQLRequestResult)>>for: PostgreSQLQueryResult class(PostgreSQLRequestResult class)>>for: PostgreSQLQueryResultList>>for: PostgreSQLQueryResultList class>>for: PostgreSQLQuery>>readResults PostgreSQLQuery>>execute optimized [] in PostgreSQLEXDISession>>executeExternal BlockClosure>>on:do: PostgreSQLEXDISession>>executeExternal optimized [] in ExternalDatabaseSession>>execute BlockClosure>>ifCurtailed: PostgreSQLEXDISession(ExternalDatabaseSession)>>execute _______________________________________________ vwnc mailing list [hidden email] http://lists.cs.uiuc.edu/mailman/listinfo/vwnc |
When you do this, does the server confirm the number of rows
returned as well? If it thinks it's done the query, then it seems like
there are only a few options. The bits aren't ever getting to the
Smalltalk process, or the wrong bits arrive. The bits get to the
Smalltalk process, and are somehow eaten before they get to Smalltalk
code. The bits arrive in Smalltalk and are misinterpreted. The last one
seems easiest to deal with. So, when the socket hangs, how much data has
it already read, and what does it say? You might also try the queries
using a current Postgresql driver and/or a current image/VM, and if that
works then you at least know the problem is fixed, and it's just a matter
of finding out where.
At 05:33 PM 2009-11-26, Andre Schnoor wrote: Hi, --
Alan Knight [|], Engineering Manager, Cincom Smalltalk
_______________________________________________ vwnc mailing list [hidden email] http://lists.cs.uiuc.edu/mailman/listinfo/vwnc |
Andre, I would not be surprised if sockets
misbehaved every so often with a 7.5 configuration. I'd suggest
using at least a 7.6c VM, preferably with a 7.7 image (or at least using the
support resolution that addresses potential socket issues when multiple
Smalltalk processes use the same socket --- the more thorough fix
for this problem is included in 7.7). The 7.7 VM/image socket
implementation is significantly better than that of 7.6 and earlier
releases, and especially so on Windows.
Andres. From: [hidden email] [mailto:[hidden email]] On Behalf Of Alan Knight Sent: Thursday, November 26, 2009 3:56 PM To: Andre Schnoor; VWNC NC Subject: Re: [vwnc] Absolutely weird PostgreSQLEXDI mistery At 05:33 PM 2009-11-26, Andre Schnoor wrote: Hi, --
Alan Knight [|], Engineering Manager, Cincom Smalltalk
_______________________________________________ vwnc mailing list [hidden email] http://lists.cs.uiuc.edu/mailman/listinfo/vwnc |
I doubt that that would lead to it failing consistently on
the same tables.
At 09:13 PM 2009-11-26, Valloud, Andres wrote: Content-class: urn:content-classes:messageAndre, I would not be surprised if sockets misbehaved every so often with a 7.5 configuration. I'd suggest using at least a 7.6c VM, preferably with a 7.7 image (or at least using the support resolution that addresses potential socket issues when multiple Smalltalk processes use the same socket --- the more thorough fix for this problem is included in 7.7). The 7.7 VM/image socket implementation is significantly better than that of 7.6 and earlier releases, and especially so on Windows. Andres. From: [hidden email] [[hidden email]] On Behalf Of Alan Knight Sent: Thursday, November 26, 2009 3:56 PM To: Andre Schnoor; VWNC NC Subject: Re: [vwnc] Absolutely weird PostgreSQLEXDI mistery When you do this, does the server confirm the number of rows returned as well? If it thinks it's done the query, then it seems like there are only a few options. The bits aren't ever getting to the Smalltalk process, or the wrong bits arrive. The bits get to the Smalltalk process, and are somehow eaten before they get to Smalltalk code. The bits arrive in Smalltalk and are misinterpreted. The last one seems easiest to deal with. So, when the socket hangs, how much data has it already read, and what does it say? You might also try the queries using a current Postgresql driver and/or a current image/VM, and if that works then you at least know the problem is fixed, and it's just a matter of finding out where. At 05:33 PM 2009-11-26, Andre Schnoor wrote: Hi, -- Alan Knight [|], Engineering Manager, Cincom Smalltalk [hidden email] [hidden email] http://www.cincom.com/smalltalk _______________________________________________ vwnc mailing list [hidden email] http://lists.cs.uiuc.edu/mailman/listinfo/vwnc --
Alan Knight [|], Engineering Manager, Cincom Smalltalk
_______________________________________________ vwnc mailing list [hidden email] http://lists.cs.uiuc.edu/mailman/listinfo/vwnc |
Um, I hate to stick my nose in where it might not be wanted, but did you rebuild your image Andre, that is, did you start with a fresh image and reconstruct your application by loading the code and trying your tests?
David L. Caster ~~ On Nov 26, 2009, at 6:28 PM, Alan Knight wrote:
_______________________________________________ vwnc mailing list [hidden email] http://lists.cs.uiuc.edu/mailman/listinfo/vwnc |
In reply to this post by Alan Knight-2
Yes, the same table failure does seem odd. I'd
be assuming, however, that whatever it is that is causing the problem is
not exposing the socket bugs we fixed recently. For example, let's assume
that some table index got damaged and now the db has to do linear search over
the whole table. This takes time, and because it does, some socket
misbehavior that would otherwise not be exercised is exposed. My concern
is that, if something like that is going on, then proper socket behavior might
allow proper Store behavior, even if it is slow, which would be better than
outright failure to work. Maybe it's a far fetched assumption, but the
possibility concerns me in light of the socket problems that existed in 7.5
and, thus, socket misbehavior is something I'd like to rule out
if I was debugging the problem. From: Alan Knight [mailto:[hidden email]] Sent: Thursday, November 26, 2009 6:28 PM To: Valloud, Andres; VWNC NC Subject: Re: [vwnc] Absolutely weird PostgreSQLEXDI mistery At 09:13 PM 2009-11-26, Valloud, Andres wrote: Content-class: urn:content-classes:messageAndre, I would not be surprised if sockets misbehaved every so often with a 7.5 configuration. I'd suggest using at least a 7.6c VM, preferably with a 7.7 image (or at least using the support resolution that addresses potential socket issues when multiple Smalltalk processes use the same socket --- the more thorough fix for this problem is included in 7.7). The 7.7 VM/image socket implementation is significantly better than that of 7.6 and earlier releases, and especially so on Windows. Andres. From: [hidden email] [[hidden email]] On Behalf Of Alan Knight Sent: Thursday, November 26, 2009 3:56 PM To: Andre Schnoor; VWNC NC Subject: Re: [vwnc] Absolutely weird PostgreSQLEXDI mistery When you do this, does the server confirm the number of rows returned as well? If it thinks it's done the query, then it seems like there are only a few options. The bits aren't ever getting to the Smalltalk process, or the wrong bits arrive. The bits get to the Smalltalk process, and are somehow eaten before they get to Smalltalk code. The bits arrive in Smalltalk and are misinterpreted. The last one seems easiest to deal with. So, when the socket hangs, how much data has it already read, and what does it say? You might also try the queries using a current Postgresql driver and/or a current image/VM, and if that works then you at least know the problem is fixed, and it's just a matter of finding out where. At 05:33 PM 2009-11-26, Andre Schnoor wrote: Hi, -- Alan Knight [|], Engineering Manager, Cincom Smalltalk [hidden email] [hidden email] http://www.cincom.com/smalltalk _______________________________________________ vwnc mailing list [hidden email] http://lists.cs.uiuc.edu/mailman/listinfo/vwnc --
Alan Knight [|], Engineering Manager, Cincom Smalltalk
_______________________________________________ vwnc mailing list [hidden email] http://lists.cs.uiuc.edu/mailman/listinfo/vwnc |
In reply to this post by Andre Schnoor
Andre Schnoor wrote:
> What the heck? Is someone sniffing packets off my wire? What can I do > to narrow down the issue? It must be something really creepy ... > > I'm using VW 7.5 and the PostgreSQLEXDI that came with it. Any hint is > appreciated, as I'm really desperate. Things to try: 1. Increase the debug level of your postgres server. I think it can be done with an interrupt signal, without restarting the server. You'll then know for sure whether the server thinks that it has processed the SQL, and sent the result, by looking a the server log. 2. Use a packet sniffer to see what bytes have been sent from the postgres server to the smalltalk VM. It helps to know the postgres frontend/backend protocol, but it may give some clue. 3. Does your app. have multiple threads/processes that read off the same socket - i.e. more than one Process is using the same database connection. If so, then it could be that one thread has consumed some acknowledgment bytes that are part of the frontend/backend protocol. 4. Since this started with a server restart, are you sure the hardware is okay. Maybe there's bad RAM or bad disk blocks. Good luck. -- Yanni _______________________________________________ vwnc mailing list [hidden email] http://lists.cs.uiuc.edu/mailman/listinfo/vwnc |
Thanks everyone for your suggestions. I much appreciate your help.
It looks this is a server-side problem, for the most part at least. After a full vacuum and reindex of all tables, the issue became less penetrant and eventually went away (?). Hm. I don't like issues that go away without knowing why. The thing I still wonder about is why the same queries worked fine on other clients. Actually this excludes a server-side problem. One thing I forgot to mention is that "ps -ef" showed the postmaster was running since 2011(!), which was the actual reason to restart it in the first place. This clock anomaly might have had some influence on postgres meta data. Otherwise hardware seems to be ok (SMART), although it's up for five years now. SMART shows many seek errors and hardware ECC events, but no failures or sector relocations yet, so at least data integrity is ok. I also observed that on an already established and used connection, jams occured less likely and rather randomly. I don't now if its normal that the postmaster spawned multiple workers in idle state, listing the same client IP. Actually it should reuse them? @Yanni: > 1. Increase the debug level of your postgres server. I did that. It showed the query statements and execution times, but no indication whether the answer was actually sent. > 2. Use a packet sniffer Very good idea. If it reoccurs, I'll try that. > 3. ... multiple threads/processes that read off the same socket ... In theory yes, however I use a RecursionLock to prevent this. > For example, let's assume that some table index got damaged and now > the db has to do linear search over the whole table. This takes > time, and because it does, some socket misbehavior that would > otherwise not be exercised is exposed. @Andres: This would support the assumption of a combined server/client issue. Both the connection and the server were very slow yesterday. Unfortunately, I can't move to 7.7 yet, as this is way to expensive at this time. This would hold up my business for weeks, if not months. I would rather copy selected code only in both VM sources and image. > ... did you rebuild your image Andre ... ? @David: Sure, multiple times. Thanks again, Andre _______________________________________________ vwnc mailing list [hidden email] http://lists.cs.uiuc.edu/mailman/listinfo/vwnc |
Hi Andre!
Andre Schnoor wrote: > It looks this is a server-side problem, for the most part at least. > After a full vacuum and reindex of all tables, the issue became > less penetrant and eventually went away (?). Hm. I don't like > issues that go away without knowing why. The thing I still wonder > about is why the same queries worked fine on other clients. > Actually this excludes a server-side problem. The already given suggestion to sniff the traffic with something like wireshark might help. Comparing the traffic for the same request from the failing connection to a working connection might give a clue, if the server sends different responses. > One thing I forgot to mention is that "ps -ef" showed the > postmaster was running since 2011(!), which was the actual reason > to restart it in the first place. This clock anomaly might have had > some influence on postgres meta data. Otherwise hardware seems to > be ok (SMART), although it's up for five years now. SMART shows > many seek errors and hardware ECC events, but no failures or sector > relocations yet, so at least data integrity is ok. Well having these events is (IMHO) a sign of looming trouble, even if no errors were recorded yet. Did you upgrade the postgres software? Since you mentioned vacuuming and reindexing: at least vacuuming should be done automatically by a version 8 or above server (see autovacuum). If you are still using v7 or earlier server my advice is "upgrade asap". IIRC postgres offers also some way to check your database, but i forgot about the details. > I also observed that on an already established and used connection, > jams occured less likely and rather randomly. I don't now if its > normal that the postmaster spawned multiple workers in idle state, > listing the same client IP. Actually it should reuse them? Each user connected via store has an idle process running, my "ps" output shows the idle connections and the user using it (this is postgres v8.1). Regards Jan _______________________________________________ vwnc mailing list [hidden email] http://lists.cs.uiuc.edu/mailman/listinfo/vwnc |
In reply to this post by Andre Schnoor
One possible reason for a difference between behaviour from
Smalltalk and other clients is that the Postgresql client is using an
older protocol which is deprecated. It continues to work, but it's
possible that there are edge cases that expose problems that don't show
up for things using the newer protocol.
At 04:55 AM 2009-11-27, Andre Schnoor wrote: Thanks everyone for your suggestions. I much appreciate your help. --
Alan Knight [|], Engineering Manager, Cincom Smalltalk
_______________________________________________ vwnc mailing list [hidden email] http://lists.cs.uiuc.edu/mailman/listinfo/vwnc |
Old, yes. Deprecated?
http://www.postgresql.org/docs/8.4/static/protocol.html Does not seem to be. IIRC the focus of the newer versions of the protocol was performance, not robustness. Nonetheless it would be nice to move up the newer protocol version. Any sponsors? :-) All the best, Bruce 2009/11/27 Alan Knight <[hidden email]>: > One possible reason for a difference between behaviour from Smalltalk and > other clients is that the Postgresql client is using an older protocol which > is deprecated. It continues to work, but it's possible that there are edge > cases that expose problems that don't show up for things using the newer > protocol. > > At 04:55 AM 2009-11-27, Andre Schnoor wrote: > > Thanks everyone for your suggestions. I much appreciate your help. > > It looks this is a server-side problem, for the most part at least. > After a full vacuum and reindex of all tables, the issue became less > penetrant and eventually went away (?). Hm. I don't like issues that > go away without knowing why. The thing I still wonder about is why the > same queries worked fine on other clients. Actually this excludes a > server-side problem. > > One thing I forgot to mention is that "ps -ef" showed the postmaster > was running since 2011(!), which was the actual reason to restart it > in the first place. This clock anomaly might have had some influence > on postgres meta data. Otherwise hardware seems to be ok (SMART), > although it's up for five years now. SMART shows many seek errors and > hardware ECC events, but no failures or sector relocations yet, so at > least data integrity is ok. > > I also observed that on an already established and used connection, > jams occured less likely and rather randomly. I don't now if its > normal that the postmaster spawned multiple workers in idle state, > listing the same client IP. Actually it should reuse them? > > > @Yanni: > >> 1. Increase the debug level of your postgres server. > > I did that. It showed the query statements and execution times, but no > indication whether the answer was actually sent. > >> 2. Use a packet sniffer > > Very good idea. If it reoccurs, I'll try that. > >> 3. ... multiple threads/processes that read off the same socket ... > > In theory yes, however I use a RecursionLock to prevent this. > >> For example, let's assume that some table index got damaged and now >> the db has to do linear search over the whole table. This takes >> time, and because it does, some socket misbehavior that would >> otherwise not be exercised is exposed. > > @Andres: This would support the assumption of a combined server/client > issue. Both the connection and the server were very slow yesterday. > > Unfortunately, I can't move to 7.7 yet, as this is way to expensive at > this time. This would hold up my business for weeks, if not months. I > would rather copy selected code only in both VM sources and image. > >> ... did you rebuild your image Andre ... ? > > @David: Sure, multiple times. > > Thanks again, > Andre > > > > > _______________________________________________ > vwnc mailing list > [hidden email] > http://lists.cs.uiuc.edu/mailman/listinfo/vwnc > > -- > Alan Knight [|], Engineering Manager, Cincom Smalltalk > [hidden email] > [hidden email] > http://www.cincom.com/smalltalk > _______________________________________________ > vwnc mailing list > [hidden email] > http://lists.cs.uiuc.edu/mailman/listinfo/vwnc > > -- Make the most of your skills - with OpenSkills http://www.openskills.org/ _______________________________________________ vwnc mailing list [hidden email] http://lists.cs.uiuc.edu/mailman/listinfo/vwnc |
Free forum by Nabble | Edit this page |