Scaling Seaside apps (was: About SToR)

Re: Re: Scaling Seaside apps (was: About SToR)

Richard Huxton
Rick Flower wrote:

>
> I plopped a note over on the Glorp mailing list about your Cursor
> comment (I hope you didn't mind) and got the following reply from Alan
> Knight about what happens with Cursors & Glorp (he wanted me to post
> this since he wasn't able to post directly) :
>
> ==========================================================================
> If you can post, you might mention that Glorp actually does everything
> internally in terms of cursors. If you want the result set returned only
> part at a time, you can set the query collectionType: to
> GlorpCursoredStream, which gives you a stream on the results. However,
> that will then depend on the underlying driver's behaviour. I know that
> in VW, I've seen complaints the Postgresql driver doesn't do cursors
> very effectively - it gets all the results before returning anything.
> Other drivers, however, certainly do do cursors.
> ==========================================================================

Hmm - I just don't believe that. OK, for certain queries you'll have to
fetch everything (if you want a sorted list, you can't get the first
item until you're sure it's the smallest value). However, if you ask for
a cursor then you get a cursor. It even plans queries differently on the
grounds that with a cursor you'll want the first row asap.
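
At the SQL level, asking for a cursor explicitly looks like the sketch
below. DECLARE/FETCH/CLOSE are standard Postgresql; the #executeSQL:
selector and the orders table are only stand-ins for whatever
raw-statement call and schema you actually have.

   | rows |
   connection executeSQL: 'BEGIN'.
   connection executeSQL: 'DECLARE big_read CURSOR FOR SELECT * FROM orders'.
   rows := connection executeSQL: 'FETCH FORWARD 100 FROM big_read'.
   "process rows here; repeat the FETCH until it answers no more rows"
   connection executeSQL: 'CLOSE big_read'.
   connection executeSQL: 'COMMIT'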

If anyone is interested in exploring Glorp and its performance with
PostgreSQL then the pgsql-performance mailing list is a good place to
start. I'm there much of the time as are a lot of people smarter than me.

People will want to see:
1. The problem query
2. Relevant parts of the database schema
3. Output of EXPLAIN ANALYZE for the problem query
4. Assurance that VACUUM/ANALYSE are being run (see the sketch after this list)
5. Possibly details of hardware & what configuration changes you've made.
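
For items 3 and 4, something like this gathers the evidence (same
caveat: #executeSQL: is a stand-in for your driver's raw-statement
call, and the query text is a placeholder):

   | plan |
   connection executeSQL: 'VACUUM ANALYSE'.
   plan := connection executeSQL: 'EXPLAIN ANALYZE SELECT ...'.
   Transcript show: plan printString; cr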

--
   Richard Huxton
   Archonet Ltd

Re: Re: Scaling Seaside apps (was: About SToR)

Jeremy Shute
As I said, I don't think the cursors are going to help the whole 10-12x issue.  What I am doing is essentially very write-heavy.

I read the table for some context.  "Dear database, what was the state of the object last time I retrieved it from the network?"  The cursor would help here, as right now I'm seeking each object one-at-a-time.  A cursor would eliminate the need to construct this query in the middle of the loop, and would keep my place in the B-tree (instead of descending from the root every time).
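
Going by Alan's description, the cursored read would look something like the sketch below.  Only #collectionType: and GlorpCursoredStream come from his post; Person, session, and #restoreStateFrom: are illustrative names, and readManyOf: is the usual Glorp query constructor.

   | query stream |
   query := Query readManyOf: Person.
   query collectionType: GlorpCursoredStream.
   stream := session execute: query.
   "one traversal, pulling rows as needed, instead of one query per object"
   [stream atEnd] whileFalse: [self restoreStateFrom: stream next]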

But, looking at the process browser, the messages in the GLORP-operating thread seem to relate to the assembly of queries.  The cursor would eliminate the READ queries, but the WRITE queries are still necessary, requiring assembly one-at-a-time, unless GLORP is smart enough to use a global prepared statement underneath for insertions.  I see no reason why it shouldn't be -- insertions are very homogeneous operations, unlike WHERE clauses.  Perhaps a cursor WOULD solve the problem?

Really, I need to profile the message sends to see exactly where the CPU time is going, but one thing is for sure: the CPU is pegged.

Jeremy


Re: Scaling Seaside apps (was: About SToR)

Alan Knight-2
In reply to this post by Chris Muller
OK, I've joined this list now. I'll respond to a couple of comments in various messages.

I note that as far as Postgresql goes, my experience is with the driver that Bruce Badger wrote in VW.

In response to my proxied post about Glorp doing everything internally in terms of cursors, Richard Huxton dev at archonet.com wrote:
---------
Hmm - I just don't believe that. OK, for certain queries you'll have to
fetch everything (if you want a sorted list, you can't get the first
item until you're sure it's the smallest value). However, if you ask for
a cursor then you get a cursor. It even plans queries differently on the
grounds that with a cursor you'll want the first row asap.
------------

I think that was a lack of clear terms on my part. When I said Glorp uses cursors internally, what I meant is that when Glorp gets results back, it expects to get them as a stream, and to get the results one by one using the driver-level API. If you asked for the results in an array, well, it will get them all from the stream and build the array. If you asked for them in a GlorpCursoredStream, it will essentially expose that streaming API to you. As I understand it, it is common for databases to calculate results lazily, so if you fetch those results one by one, the database will do only enough work to send an initial set of results. Explicitly telling it you want a cursor will give the database a better hint as to what you want, but usually if you ask for a billion rows at once it will get the idea that for its own internal purposes it might want to defer calculation. So I don't really know if for a large query the bottleneck is postgresql doing all the calculations up front, or that the postgresql protocol does not have a mechanism for passing only part of the results back before the remainder has been calculated, or if it's just the driver at the Smalltalk level ignoring such protocol and getting everything.

Glorp doesn't do anything in terms of actually declaring named cursors and issuing fetch statements.

---------------------
In a different part of the thread, Jeremy Shute shutej at crazilocks.com wrote:


But, looking at the process browser, the messages in the GLORP-operating
thread seem to relate to the assembly of queries.  The cursor would
eliminate the READ queries, but the WRITE queries are still necessary,
requiring assembly one-at-a-time, unless GLORP is smart enough to use a
global prepared statement underneath for insertions.  I see no reason why it
shouldn't be -- insertions are very homogeneous operations, unlike WHERE
clauses.  Perhaps a cursor WOULD solve the problem?
-----------------

Glorp does group insertions together. This can give some very large performance improvements, e.g. if it can use Oracle array binding over a high-latency connection. However, for Postgresql there isn't any directly analogous mechanism. The mechanisms I saw available were a) just concatenating the inserts together in a single large statement with semicolons separating them. This helps with network traffic, but doesn't really tell the database it's doing many similar inserts. b) The bulk load mechanism. This seemed very much aimed at loading data from files, and looked like it might be possible to use more directly, but quite awkward. I haven't done this.

So right now on Postgresql (and others) Glorp will just concatenate the inserts together (up to a limit). One thing to watch out for from a performance perspective is to turn binding off for Postgresql. At least in the VW driver, binding was implemented by taking the queries, splitting them up into strings, and then concatenating the string representation of the bound values into the query. If you have a query with a very long string (e.g. because it's a large number of inserts being done all at once) then this becomes a huge performance bottleneck, and doesn't even offer any of the benefits of binding in terms of not needing to convert back and forth between string representations.

Glorp will also keep a cache of prepared statements and re-use them if the hooks are in place, but I wasn't aware of how to do that for Postgresql. I suspect it's less important in such a situation, in that you'd have to have binding done better in order to effectively reuse prepared statements. I also have a feeling, although I haven't measured, that Postgresql is more tolerant of dynamic SQL for performance than, say, Oracle, which really really wants you to use prepared statements.
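
For what it's worth, the server-side half of those hooks does exist in Postgresql SQL as PREPARE/EXECUTE; it's the driver plumbing that seems to be missing. A sketch, again with a hypothetical #executeSQL: raw-statement call and an illustrative node table:

   connection executeSQL:
      'PREPARE ins (integer, timestamp) AS INSERT INTO node VALUES ($1, $2)'.
   connection executeSQL: 'EXECUTE ins (1, now())'.
   connection executeSQL: 'EXECUTE ins (2, now())'
   "versus what Glorp sends today: 'INSERT ...; INSERT ...; INSERT ...'"
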
-------------------------------------------
Other messages mentioned bypassing caches, and someone else mentioned a "report query" which doesn't create objects or put things into cache, but just returns raw data.

I haven't gotten around to writing a NoCacheCachePolicy, but it would be quite an easy implementation. But be aware that then you lose object identity, which means relationships can get confused, and if you're writing, Glorp may not properly know whether to insert or update (it bases that on knowing whether or not it read the object). If you only want raw data, then you can do that using retrieve:, e.g.
   aQuery retrieve: [:each | each primaryKey].
   aQuery retrieve: [:each | each parentNode timestamp].

which would then give you back, per row, two simple values, a primary key and the timestamp of the parent, and put nothing into cache.

--
Alan Knight [|], Cincom Smalltalk Development
[hidden email]
[hidden email]
http://www.cincom.com/smalltalk

"The Static Typing Philosophy: Make it fast. Make it right. Make it run." - Niall Ross


Re: Scaling Seaside apps (was: About SToR)

Yanni Chiu
Alan Knight wrote:
> I note that as far as Postgresql goes, my experience is with the
> driver that Bruce Badger wrote in VW.

Presently, both drivers (Bruce's in VW and my Squeak one) are based
on the postgres v2 protocol.

> ...  or that the postgresql protocol does
> not have a mechanism for passing only part of the results back before
> the remainder has been calculated, or if it's just the driver at the
> Smalltalk level ignoring such protocol and getting everything.

The protocol returns a row at a time until there are no more
(indicated by a CompletedResponse message, instead of another Row
message). It seems both Bruce and I chose to implement in such a way
that a result set is only available after the last row has been
received.

I suppose a better choice (for the future) would be to return a
result set stream as soon as the first row is retrieved. Then add
additional protocol to return the entire result set by pulling in
the whole stream contents.
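
In code terms, the shape might be something like this (all names
hypothetical):

   | result firstRow rest |
   result := connection executeQuery: 'SELECT ...'.
   firstRow := result next.   "answers as soon as the first Row message arrives"
   rest := result upToEnd.    "added protocol: pull in the rest of the stream"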

> Glorp doesn't do anything in terms of actually declaring named
> cursors and issuing fetch statements.

That's what I figured, from a quick glance. Thanks for confirming it.

> Glorp will also keep a cache of prepared statements and re-use them
> if the hooks are in place, but I wasn't aware of how to do that for
> Postgresql.

I think the new "extended query" feature of the postgres v3 protocol
is needed for that.

> I haven't gotten around to writing a NoCacheCachePolicy, but it would
> be quite an easy implementation. But be aware that then you lose
> object identity, which means relationships can get confused, and if
> you're writing, Glorp may not properly know whether to insert or
> update (it bases that on knowing whether or not it read the object).

I don't think a "report query" optimization is equivalent to a
no-cache policy. The report query is no longer trying to maintain
any kind of O-R mapping concept.

--
Yanni Chiu


Re: Re: Scaling Seaside apps (was: About SToR)

Alan Knight-2
At 11:28 AM 8/5/2006, Yanni Chiu wrote:
>I suppose a better choice (for the future) would be to return a
>result set stream as soon as the first row is retrieved. Then add
>additional protocol to return the entire result set by pulling in
>the whole stream contents.

Yes, I'd think that's the easiest way to approach it.

>I don't think a "report query" optimization is equivalent to a
>no-cache policy. The report query is no longer trying to maintain
>any kind of O-R mapping concept.

No, they're not at all equivalent, but both are possible approaches to avoid caching huge amounts of data. The report query does maintain a certain amount of O/R mapping. For example, the things to be retrieved can still be described in object terms. And you would need to use a report query for things that don't necessarily correspond to single objects, but are still related to the object model. For example, aggregate queries like the sum of the open orders for all customers located in a particular city.


--
Alan Knight [|], Cincom Smalltalk Development
[hidden email]
[hidden email]
http://www.cincom.com/smalltalk

"The Static Typing Philosophy: Make it fast. Make it right. Make it run." - Niall Ross


Re: Re: Scaling Seaside apps (was: About SToR)

Colin Putney
In reply to this post by Yanni Chiu

On Aug 5, 2006, at 11:28 AM, Yanni Chiu wrote:

> The protocol returns a row at a time until there are no more (indicated
> by a CompletedResponse message, instead of another Row message). It  
> seems both Bruce and I chose to implement in such a way that a  
> result set is only available after the last row has been received.
>
> I suppose a better choice (for the future) would be to return a
> result set stream as soon as the first row is retrieved. Then add
> additional protocol to return the entire result set by pulling in
> the whole stream contents.

Yeah, that's handy alright. The Squeak MySQL driver does this. The
implementation in the MySQL driver has a significant flaw though, in
that it basically relies on the client reading rows from the stream
to cause the driver to read bytes from the socket. This leads to
problems when the client fails to read all the rows that were
returned. If reading is curtailed, because there is an error, the
user cancels the action, or whatever, the row bytes are left in the
socket buffer and the driver gets confused the next time it sends a
query and receives unexpected results.

That's not to say that a result set stream is a bad idea; it just has
to be implemented so that the driver maintains control over reading
from the socket and can keep a consistent state when the result
stream isn't read completely.
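
One discipline that helps, assuming the result set knows how to drain
itself (compare Will Harford's #flush later in the thread), is to make
the cleanup unconditional; #processRowsFrom: is just an illustrative
client-side selector:

   "Always leave the socket in sync, even on error or early exit."
   [self processRowsFrom: resultSet]
      ensure: [resultSet flush]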

Just a heads up - the problem can be hard to track down and has  
caused me lots of grief in the past.

Colin

Re: Scaling Seaside apps (was: About SToR)

Yanni Chiu
Colin Putney wrote:
> That's not to say that a result set stream is a bad idea; it just has
> to be implemented so that the driver maintains control over reading
> from the socket and can keep a consistent state when the result
> stream isn't read completely.

Thanks for the heads up. The v2 protocol has a CancelRequest message
which the driver can send. So, in theory, with the state machine
design, the handshaking should still work right.


Re: Re: Scaling Seaside apps (was: About SToR)

William Harford
In reply to this post by Colin Putney

On Aug 5, 2006, at 3:25 PM, Colin Putney wrote:

>
> Yeah, that's handy alright. The Squeak MySQL driver does this. The
> implementation in the MySQL driver has a significant flaw though, in
> that it basically relies on the client reading rows from the stream
> to cause the driver to read bytes from the socket. This leads to
> problems when the client fails to read all the rows that were
> returned. If reading is curtailed, because there is an error, the
> user cancels the action, or whatever, the row bytes are left in the
> socket buffer and the driver gets confused the next time it sends a
> query and receives unexpected results.
>

My version of the MySQL driver has JdmResultSet>>#flush.

I am not sure whether I added it or it came stock. I seem to remember
it coming from a version of the driver, but the version history tells
me I added it.

I have changed the drivers a bit to work with MySQL 4.1.8 (had some
problems with dates) and to handle transactions.

The changes are minor but can be found at
http://squeaksource.com/IOSPersistent


flush
        "Drain any rows the client never read, so the socket stream stays
        in sync for the next query. Assumes #next answers a Boolean,
        JDBC-style, turning false once the rows run out."
        self hasNext ifFalse: [^self].
        [self next] whileTrue: []


hasNext
        "Answer whether the current row still holds data."
        ^self currentRow isEmpty not



Will



Re: Re: Scaling Seaside apps (was: About SToR)

Jeremy Shute
In reply to this post by Alan Knight-2
On 8/5/06, Alan Knight <[hidden email]> wrote:
> OK, I've joined this list now. I'll respond to a couple of comments in various messages.

Cool!  :-)


 
> ---------------------
> In a different part of the thread, Jeremy Shute shutej at crazilocks.com wrote:
>
> But, looking at the process browser, the messages in the GLORP-operating
> thread seem to relate to the assembly of queries.  The cursor would
> eliminate the READ queries, but the WRITE queries are still necessary,
> requiring assembly one-at-a-time, unless GLORP is smart enough to use a
> global prepared statement underneath for insertions.  I see no reason why it
> shouldn't be -- insertions are very homogeneous operations, unlike WHERE
> clauses.  Perhaps a cursor WOULD solve the problem?
> -----------------
>
> Glorp does group insertions together. This can give some very large performance improvements, e.g. if it can use Oracle array binding over a high-latency connection. However, for Postgresql there isn't any directly analogous mechanism. The mechanisms I saw available were a) just concatenating the inserts together in a single large statement with semicolons separating them. This helps with network traffic, but doesn't really tell the database it's doing many similar inserts. b) The bulk load mechanism. This seemed very much aimed at loading data from files, and looked like it might be possible to use more directly, but quite awkward. I haven't done this.

Yes, neither option seems to be very fun.  I wonder if there's a way to hack up the internal Postgres mechanisms to talk more directly to Squeak, like an INSERT INTO with a subselect clause, and provide a veneer where Squeak offers the contents of the subselect over the wire?  It would be a monstrous, nasty, warty thing, but at least it could take advantage of the homogeneous nature of the data.  I know some people have written code using remote tables; if you could trick Postgres into thinking the running Squeak image was a remote table WITHOUT hacking a plugin, you could probably get more expedient bulk inserts...



> So right now on Postgresql (and others) Glorp will just concatenate the inserts together (up to a limit). One thing to watch out for from a performance perspective is to turn binding off for Postgresql. At least in the VW driver, binding was implemented by taking the queries, splitting them up into strings, and then concatenating the string representation of the bound values into the query. If you have a query with a very long string (e.g. because it's a large number of inserts being done all at once) then this becomes a huge performance bottleneck, and doesn't even offer any of the benefits of binding in terms of not needing to convert back and forth between string representations.

Are you saying turning binding off will help my GLORP performance?

I'm a little dense - I read this and I think you're saying "binding" is tokenizing the query, hunting for '?', and performing string concatenation if you're issuing a prepared statement.  Also, does the driver do piecemeal ',' concatenation, or StringBuffer-like appends?
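
For reference, the difference in Squeak terms: repeated #, copies the whole string built so far on every append (quadratic), while a WriteStream appends into a growing buffer, which is the StringBuffer-like behaviour.

   | parts slow ws |
   parts := (1 to: 1000) collect: [:i | i printString].
   "piecemeal concatenation: each , copies everything built so far"
   slow := String new.
   parts do: [:each | slow := slow , each , ', '].
   "StringBuffer-like: append in place, take the contents once at the end"
   ws := WriteStream on: String new.
   parts do: [:each | ws nextPutAll: each; nextPutAll: ', '].
   ws contents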



> Glorp will also keep a cache of prepared statements and re-use them if the hooks are in place, but I wasn't aware of how to do that for Postgresql.

This sentence is what's confusing me about the above...  It makes it sound like GLORP does not use prepared statements on Postgres, in which case, my understanding of turning off "binding" must be askew...  If you're not issuing prepared statements, why would turning off "binding" help?



> I suspect it's less important in such a situation, in that you'd have to have binding done better in order to effectively reuse prepared statements. I also have a feeling, although I haven't measured, that Postgresql is more tolerant of dynamic SQL for performance than, say, Oracle, which really really wants you to use prepared statements.

Still, I think of the prepared statement as a way to construct a partial abstract syntax tree on the server side, once, and plug in missing sections.  In turn, this means you ship a handle to this server-side object instead of a much larger stream of characters (or boxed symbols and strings, I don't know how the driver works), which has to be parsed into an abstract syntax tree at the server for every inserted row.  Even on localhost over a pipe, a prepared statement must be preferable for any RDBMS, right?  (As a side note, do you know WHY the difference on Oracle is so dramatic -- are they doing less optimization in the driver than Postgres and therefore shipping more data?)



Jeremy
