Performance Issue with Runtime Image

Performance Issue with Runtime Image

steve geringer-4
Hi,

I am working on a server type system and am getting
an interesting performance problem.

When running in the Development Image, I can get up
to 122 transactions per second.

However, the same code in a packaged image runs
only 15-19 transactions per second.

(In both cases a transaction consists of reading a string
from a file and adding it to a process-protected queue.)
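
(For illustration, a rough Java analogue of one such transaction -- the real
code is Dolphin Smalltalk, and all names here are invented:)

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Rough analogue of one "transaction": read a string from a file
    // and add it to a thread-safe ("process protected") queue.
    public class TransactionSketch {
        static final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        public static void main(String[] args) throws IOException, InterruptedException {
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    queue.put(line);    // one "transaction" per line
                }
            }
        }
    }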

Does anyone know why a packaged image would run so much slower?

Note: in both cases I am driving the server
with test data from Dolphin.   (So in the second case
I have the packaged image and the development image
running on the same machine.)

Does anyone have any ideas about what to look for,
or has anyone had a similar situation?


Thanks,
Steve Geringer


======================================================

Steve Geringer, Managing Partner
SCG Associates, LLC
http://www.SCGLabs.com
http://www.TradePerformance.com



Re: Performance Issue with Runtime Image

Schwab,Wilhelm K
Steve,

> Does anyone know why a packaged image would run so much slower?
>
> Note: in both cases I am driving the server
> with test data from Dolphin.   (So in the second case
> I have the packaged image and the development image
> running on the same machine.)

If I am understanding you, you are comparing one image talking to itself
with an image talking to a separate executable, with the latter running
slower.  That makes sense because of the context switching that would
need to occur.

Fair?

Have a good one,

Bill


--
Wilhelm K. Schwab, Ph.D.
[hidden email]



Re: Performance Issue with Runtime Image

Chris Uppal-3
In reply to this post by steve geringer-4
Steve,

> I am working on a server type system and am getting
> an interesting performance problem.
>
> When running in the Development Image, I can get up
> to 122 transactions per second.
>
> However, the same code in a packaged image runs
> only 15-19 transactions per second.

I can't suggest an obvious reason, but a few questions:

What happens if you run the two cases in otherwise /identical/ set-ups ?

What's the bottleneck for the slow case ?  Is the CPU maxed-out ?  If not then
it suggests that the slow version is managing to network-limit itself somehow.
It may be worth using a tool like Ethereal (www.ethereal.com). No one should
ever do any kind of networking without Ethereal, or an equivalent, IMO -- going
without just wastes /so/ much time.

How is logging handled ?  It may be that you are logging (if you are at all) in
a way that takes a fast path in a dev image, but a slow path when deployed.

Does 'DBGView' (from www.sysinternals.com) show anything odd ?

BTW, one time when I saw a huge difference between two apparently
almost-identical networking apps (an order of magnitude difference in
performance -- as you are seeing), it eventually turned out to be due to the
overhead of reverse DNS lookups in the logging code.  In one case the lookup
was hitting the local cache, in the other case it was having to go to a DNS
server every time.  You /might/ be seeing something similar.  These days I
never log hostnames, only IP addresses.
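
(The same trap exists in Java, for what it's worth: getHostName() can trigger
a reverse-DNS lookup on every call, while getHostAddress() never does.  A
minimal illustration, names invented:)

    import java.net.InetAddress;
    import java.net.Socket;

    // When logging the peer of an accepted socket:
    class PeerLogging {
        static String peerFor(Socket s) {
            InetAddress a = s.getInetAddress();
            // a.getHostName() may block on a reverse-DNS lookup whenever
            // the local cache misses -- exactly the overhead described above.
            return a.getHostAddress();   // cheap: no lookup, just the IP text
        }
    }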

BTW2 [I meant to reply to your earlier post, but never got around to it ;-(  ]
The networking code has changed between D5 and D6, and I would expect to see a
performance difference (but have no idea how much).  In D5 it's all handled
using the asynchronous networking APIs, so that each (lightweight) Dolphin
Process dispatches asynchronous networking notifications distributed via the
normal Windows event loop -- which imposes some overhead.  D6 adds a new
networking package which uses the (faster) synchronous APIs, but avoids
blocking the entire image by using "overlapped" calls.  That will have the
effect that each Dolphin Process that is handling networking stuff will have an
OS-level thread bound to it (dedicated to its sole use, as I understand it).
If your server works by forking off a new Dolphin Process for each connection
(which is a sensible architecture using the old Sockets stuff), then you may
want to consider whether the new sockets are suitable if you expect lots of
simultaneous connections, since (I think) that will result in a correspondingly
huge number of OS threads (which is not a good thing for performance).

    -- chris



Re: Performance Issue with Runtime Image

Schwab,Wilhelm K
Chris,

> BTW2 [I meant to reply to your earlier post, but never got around to it ;-(  ]
> The networking code has changed between D5 and D6, and I would expect to see a
> performance difference (but have no idea how much).  In D5 it's all handled
> using the asynchronous networking APIs, so that each (lightweight) Dolphin
> Process dispatches asynchronous networking notifications distributed via the
> normal Windows event loop -- which imposes some overhead.  D6 adds a new
> networking package which uses the (faster) synchronous APIs,

Have a look at this:

     http://www.cs.wustl.edu/~schmidt/PDF/PDCP.pdf

IMHO, adding overlapped sockets is great; removing asynchronous ones
would be bad.



> but avoids
> blocking the entire image by using "overlapped" calls.

Asynchronous sockets should not necessarily lead to a blocked image, if
they work.  The reality is that either they are not robust, or perhaps
Dolphin didn't call them quite correctly (?), so overlapping connect
and (IIRC) some DNS-related operations is perhaps necessary.



>  That will have the
> effect that each Dolphin Process that is handling networking stuff will have an
> OS-level thread bound to it (dedicated to its sole use, as I understand it).
> If your server works by forking off a new Dolphin Process for each connection
> (which is a sensible architecture using the old Sockets stuff), then you may
> want to consider whether the new sockets are suitable if you expect lots of
> simultaneous connections, since (I think) that will result in a correspondingly
> huge number of OS threads (which is not a good thing for performance).

It can vary with many factors.  The best approach would be to provide a
well-factored interface to both types of sockets, allowing us to choose the
best approach for any given situation.  In fact, the choice could even
be left to end users at the developer's discretion.
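
(A minimal Java-flavoured sketch of the kind of seam meant here; all names
are invented, and this is only the shape of the idea:)

    import java.io.IOException;
    import java.net.Socket;

    // One narrow interface; each socket flavour becomes an
    // interchangeable strategy behind it.
    interface Transport {
        Socket connect(String host, int port) throws IOException;
    }

    class PlainTransport implements Transport {
        public Socket connect(String host, int port) throws IOException {
            return new Socket(host, port);
        }
    }
    // A second implementation (overlapped, asynchronous, ...) would plug
    // in behind the same interface, so the choice can be made per
    // deployment -- or even exposed to end users.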

Have a good one,

Bill


--
Wilhelm K. Schwab, Ph.D.
[hidden email]



Re: Performance Issue with Runtime Image

Chris Uppal-3
In reply to this post by steve geringer-4
steve,

> I am working on a server type system and am getting
> an interesting performance problem.

I've been running a few tests on D5 and D6, you might be interested in the
results.

The test was a simple server that accepts a connection and forks a new Dolphin
Process which reads 512 bytes, sends the same 512 bytes back to the client,
closes the connection, then dies.  Not particularly realistic ;-).   The client
(written in Java to keep the test clean, and to ensure that the client wasn't a
bottleneck) just sat in a tight loop, opening a connection, sending 512 bytes,
reading 512 bytes back, and then closing the connection.  That's highly
unrealistic in one important way -- the server is never asked to process more
than one request concurrently (I don't have enough machines to set up a
realistic test with multiple clients, so I didn't even try).
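
(For concreteness, an illustrative reconstruction of that client loop -- not
the exact code, and the host/port arguments are assumptions:)

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.Socket;

    // Tight loop: connect, send 512 bytes, read 512 bytes back, close.
    public class EchoClient {
        public static void main(String[] args) throws Exception {
            byte[] payload = new byte[512];
            byte[] reply = new byte[512];
            while (true) {
                try (Socket s = new Socket(args[0], Integer.parseInt(args[1]))) {
                    OutputStream out = s.getOutputStream();
                    out.write(payload);
                    out.flush();
                    InputStream in = s.getInputStream();
                    int got = 0;
                    while (got < reply.length) {
                        int n = in.read(reply, got, reply.length - got);
                        if (n < 0) break;            // server closed early
                        got += n;
                    }
                }
            }
        }
    }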

First off, I've been unable to duplicate the difference you saw between a
deployed app and running in the image -- /except/ in the case where the client
and server were running on the same machine.  In that case the deployed version
(in my case) was about half the speed of the same code running in the IDE.  I
have no idea at all why that should be, but it doesn't reproduce (for me) when
the client and server are talking across my network.

I saw one other difference between deployed and IDE execution.  On one of the
three machines I tried (a Win2K box) running the server in D5 IDE worked fine,
but when I tried the deployed version, it consistently failed after running for
a little while -- the crash dump showed that an accept() operation failed, but
Windows reported no error (the error code was zero).  That didn't happen at all
when running in the IDE.  I have no idea what the problem is.   FWIW, on a
WinXP Pro box, I saw the same problem, but only very occasionally, in both the
IDE and deployed.  On another (significantly slower) Win2K box, the problem
didn't occur at all in either configuration.

I've also tried the same tests with Dolphin 6 beta 1.   Using the old Sockets
package, I saw no significant differences in speed between D5 and D6, but the
deployed server worked perfectly on all three machines.  I don't know if that's
because of changes to the Sockets implementation, changes to the VM and/or
event loop, or was even just luck...

Switching to the new Sockets implementation (a trivial change), produced a very
significant increase in throughput -- about 2 or 3 times higher connections per
second on all three machines (about 600 in the best case -- running the server
on a 1.5 GHz laptop and the client on a cheap 2.4 GHz consumer-grade box).
However, remember that my test setup was such that the server wasn't seeing a
lot of concurrent requests.  If the 600 requests per second had come from 600
clients all sending one per second, then the server would have been running an
infeasibly high number of OS-threads.  A realistic server implementation, based
on the new sockets, and intended to take reasonably high loads, would probably
have to be coded to restrict the number of concurrent connections in some way.
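
(One simple way to impose such a restriction, sketched in Java; the cap of 64
and the port number are invented values to be tuned by testing:)

    import java.net.ServerSocket;
    import java.net.Socket;
    import java.util.concurrent.Semaphore;

    // Cap the number of in-flight connections (and hence threads)
    // with a counting semaphore around accept().
    public class BoundedServer {
        static final Semaphore slots = new Semaphore(64);

        public static void main(String[] args) throws Exception {
            try (ServerSocket server = new ServerSocket(8086)) {
                while (true) {
                    slots.acquire();                 // wait if at the cap
                    Socket s = server.accept();
                    new Thread(() -> {
                        try (Socket c = s) {
                            handle(c);               // application work
                        } catch (Exception e) {
                            // log and drop this connection
                        } finally {
                            slots.release();
                        }
                    }).start();
                }
            }
        }

        static void handle(Socket s) throws Exception { /* echo, enqueue, ... */ }
    }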

        -- chris



Re: Performance Issue with Runtime Image

Chris Uppal-3
In reply to this post by Schwab,Wilhelm K
Bill,

> Have a look at this:
>
>      http://www.cs.wustl.edu/~schmidt/PDF/PDCP.pdf
>
> IMHO, adding overlapped sockets is great; removing asynchronous ones
> would be bad.

That's an interesting paper.  I've wanted to play with ACE (which JAWS is built
on) for a while.

I think we are talking about different things, though.  The current Sockets
implementation is asynchronous, true, but only in the sense that reads, etc,
are non-blocking.  Everything is done through the Windows event queue which
adds quite a lot of overhead, and (apparently) some unreliability (see my other
post in this thread today for details).  I agree that using an OS-thread per
connection is an architecture that doesn't scale to heavily loaded servers, and
so -- in a way -- the current Sockets implementation follows the "correct"
architecture for that situation.  But I don't think the actual implementation
is particularly suitable.

I /think/ that the new Sockets implementation is well suited to client-side
programs, and seems to add a lot of performance (or rather, get rid of a lot of
unnecessary overhead) compared to the current implementation.  Which is great.
But I don't /think/ that either implementation is ideal for high-performance
servers.  At least not in the "obvious" approach of giving one Dolphin Process
to each connection.  I suspect that to get the best performance out of Dolphin
as a server, you'd have to drop down to the low-level asynchronous networking
APIs (the new overlapped call implementation might help).  But in practice I
can't see why one would attempt to use Dolphin to create a Very High
Performance server.  For more moderate ambitions, I'd expect the new Sockets
stuff (plus some code to limit the number of concurrent connections) would
perform adequately in many cases.

    -- chris



Re: Performance Issue with Runtime Image

Jochen Riekhof-6
In reply to this post by Chris Uppal-3
Hi Chris...

> That's highly
> unrealistic in one important way -- the server is never asked to process more
> than one request concurrently (I don't have enough machines to set up a
> realistic test with multiple clients, so I didn't even try).

I think you can easily try by starting several of your Java apps
simultaneously. No need to use a network; you can use loopback
(localhost) for all tests. This also has the benefit of loading the server
application as heavily as possible, by mostly eliminating network-speed
issues. Typically CPU load is not big even when several clients
and one server run on the same system.
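
(E.g. a few lines of Java; the loopback address, port, and payload size are
assumptions:)

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.Socket;

    // Drive the server with N concurrent client loops over loopback.
    public class MultiClient {
        public static void main(String[] args) {
            int n = Integer.parseInt(args[0]);
            for (int i = 0; i < n; i++) {
                new Thread(MultiClient::clientLoop).start();
            }
        }

        static void clientLoop() {
            byte[] buf = new byte[512];
            while (true) {
                try (Socket s = new Socket("127.0.0.1", 8086)) {
                    s.getOutputStream().write(buf);
                    InputStream in = s.getInputStream();
                    int got = 0, n;
                    while (got < buf.length
                            && (n = in.read(buf, got, buf.length - got)) >= 0) {
                        got += n;
                    }
                } catch (IOException e) {
                    return;                          // stop this client on error
                }
            }
        }
    }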


Ciao

...Jochen



Re: Performance Issue with Runtime Image

steve geringer-4
In reply to this post by Schwab,Wilhelm K
Bill,

I am not sure if there is a context-switching issue or not.
Both the development and runtime images are being fed through HTTP
using the Apache web server.  So conceivably both images
are interrupted by context switches.

Thanks,
Steve


Bill Schwab wrote:

> Steve,
>
>> Does anyone know why a packaged image would run so much slower?
>>
>> Note: in both cases I am driving the server
>> with test data from Dolphin.   (So in the second case
>> I have the packaged image and the development image
>> running on the same machine.)
>
>
> If I am understanding you, you are comparing one image talking to itself
> with an image talking to a separate executable, with the latter running
> slower.  That makes sense because of the context switching that would
> need to occur.
>
> Fair?
>
> Have a good one,
>
> Bill
>
>



Re: Performance Issue with Runtime Image

steve geringer-4
In reply to this post by Chris Uppal-3
Chris,

First, many thanks for your efforts.

see comments below...

Chris Uppal wrote:

> steve,
>
>
>>I am working on a server type system and am getting
>>an interesting performance problem.
>
>
> I've been running a few tests on D5 and D6, you might be interested in the
> results.
>
> The test was a simple server that accepts a connection and forks a new Dolphin
> Process which reads 512 bytes, sends the same 512 bytes back to the client,
> closes the connection, then dies.  Not particularly realistic ;-).   The client

Not that unrealistic.  This is similar to what I need.  I have
created a buffered work queue; when a request comes in, all I do is
read it in, add it to the queue, and go back to waiting for input.
(Another thread waits for work in the queue, then processes it.)

> (written in Java to keep the test clean, and to ensure that the client wasn't a
> bottleneck) just sat in a tight loop, opening a connection, sending 512 bytes,
> reading 512 bytes back, and then closing the connection.  That's highly
> unrealistic in one important way -- the server is never asked to process more
> than one request concurrently (I don't have enough machines to set up a
> realistic test with multiple clients, so I didn't even try).

I see what you mean, my test has a similar weakness (but may suffice for
the current requirements of the project).


>
> First off, I've been unable to duplicate the difference you saw between a
> deployed app and running in the image -- /except/ in the case where the client
> and server were running on the same machine.  In that case the deployed version
> (in my case) was about half the speed of the same code running in the IDE.  I
> have no idea at all why that should be, but it doesn't reproduce (for me) when
> the client and server are talking across the my network.

One theory I had was that there may be some waiting involved for the Dolphin
shared DLLs.  (I have tried driving the test from a VBScript file
and the performance is still faster in development... but the difference
is much less dramatic.)


>
> I saw one other difference between deployed and IDE execution.  On one of the
> three machines I tried (a Win2K box) running the server in D5 IDE worked fine,
> but when I tried the deployed version, it consistently failed after running for
> a little while -- the crash dump showed that an accept() operation failed, but
> Windows reported no error (the error code was zero).  That didn't happen at all
> when running in the IDE.  I have no idea what the problem is.   FWIW, on a
> WinXP Pro box, I saw the same problem, but only very occasionally, in both the
> IDE and deployed.  On another (significantly slower) Win2K box, the problem
> didn't occur at all in either configuration.

That is scary... my client sure would be mad if I delivered that.


>
> I've also tried the same tests with Dolphin 6 beta 1.   Using the old Sockets
> package, I saw no significant differences in speed between D5 and D6, but the
> deployed server worked perfectly on all three machines.  I don't know if that's
> because of changes to the Sockets implementation, changes to the VM and/or
> event loop, or was even just luck...
>
> Switching to the new Sockets implementation (a trivial change), produced a very
> significant increase in throughput -- about 2 or 3 times higher connections per
> second on all three machines (about 600 in the best case -- running the server
> on a 1.5 GHz laptop and the client on a cheap 2.4 GHz consumer-grade box).

That is fantastic!  This sounds like what we need.


> However, remember that my test setup was such that the server wasn't seeing a
> lot of concurrent requests.  If the 600 requests per second had come from 600
> clients all sending one per second, then the server would have been running an
> infeasibly high number of OS-threads.  A realistic server implementation, based

Would they be OS threads or Dolphin Processes?  Also, there is a way
to pool the threads to make this more efficient.

> on the new sockets, and intended to take reasonably high loads, would probably
> have to be coded to restrict the number of concurrent connections in some way.
>
>         -- chris
>
>
>


Thanks again for your investigation... Chris, where do you work?



Re: Performance Issue with Runtime Image

Chris Uppal-3
In reply to this post by Jochen Riekhof-6
Jochen,

> > That's highly
> > unrealistic in one important way -- the server is never asked to
> > process more than one request concurrently (I don't have enough
> > machines to set up a realistic test with multiple clients, so I didn't
> > even try).
>
> I think you can easily try by starting several of your Java apps
> simultaneously.

Or, even easier, I can program the client to use several (OS) threads.  The
problem with that is that then there is no /real/ asynchronous activity, as
there would be if the same overall density of requests were issued from the
same number of "real" machines.  I don't know what effect that would have on
performance -- not even a guess as to which direction it would affect
performance in, let alone a feeling for how big a difference it would make.
There doesn't seem to be a lot of point in running a test when I don't know how
to interpret the results.  So I didn't ;-)    (Though I admit that if I'd had,
say, 6 real machines then I would have been willing to "pad" the load by
running several threads on each -- which is perhaps rather inconsistent of
me...)


> No need to use a network, you can use loopback
> (localhost) for all tests. This has also the benefit to load the server
> application in the maximum possible way by mostly eliminating network
> speed issues.

It depends on what you want to measure.  If the situation is such that the
clients take very little CPU power, and the server is mostly limited by (say)
database access, then that might be true.  Especially if you are taking the
network overheads as unavoidable "givens", and attempting to optimise the rest
of the server implementation (such as working out what to index in the DB).

That doesn't apply in this case, though, since what I was interested in was the
size of the networking overhead itself -- specifically when used via Dolphin's
current and upcoming sockets implementations.

In point of fact, the CPU load of the client is significant here.  The D5
sockets implementation fails to max out the network (or my 100 Mbit network,
anyway) and the server is CPU-limited rather than limited by network speed[*].
The new implementation appears to remove enough overhead to allow my simple
test server to run fast enough that it's limited by the network (primarily
connection setup/teardown latency), and CPU load does not reach much over 50%.
But that isn't the case with D5, so running /any/ other processing on the
server machine "steals" performance from the server, and so invalidates the
benchmark.

([*] but remember that my test server is very simple, a real server would
undoubtedly "do" more than just copy the data in and out, in which case it
might be perfectly legitimate for it to be CPU-limited.)

    -- chris



Re: Performance Issue with Runtime Image

rush
In reply to this post by Chris Uppal-3
"Chris Uppal" <[hidden email]> wrote in message
news:42e0d00f$0$38040$[hidden email]...
> Switching to the new Sockets implementation (a trivial change), produced a
very
> significant increase in throughput -- about 2 or 3 times higher
connections per
> second on all three machines (about 600 in the best case -- running the
server
> on a 1.5 GHz laptop and the client on a cheap 2.4 GHz consumer-grade box).
> However, remember that my test setup was such that the server wasn't
seeing a
> lot of concurrent requests.  If the 600 requests per second had come from
600
> clients all sending one per second, then the server would have been
running an
> infeasibly high number of OS-threads.  A realistic server implementation,
based
> on the new sockets, and intended to take reasonably high loads, would
probably
> have to be coded to restrict the number of concurrent connections in some
way.

I would say 600 simultaneous clients in reasonably designed
protocols and apps is a jolly high load. For instance, one reasonably high
traffic web site (ca. 150,000 pages daily) usually experiences 10-20
simultaneous clients. And I guess socket processing is only a small
fraction of the time a web server uses.

Also, one thing in your benchmark that seems most different from reality is
that most reasonably designed protocols keep connections open for a longer
time, since connection establishment/tear-down is a relatively expensive
operation. One notable exception is HTTP 1.0, but nowadays most of the
clients and servers support HTTP 1.1, which uses persistent connections.

rush
--
http://www.templatetamer.com/
http://www.folderscavenger.com/



Re: Performance Issue with Runtime Image

Chris Uppal-3
In reply to this post by steve geringer-4
steve,

> > The test was a simple server that accepts a connection and forks a new
> > Dolphin Process which reads 512 bytes, sends the same 512 bytes back to
> > the client, closes the connection, then dies.  Not particularly
> > realistic ;-).   The client
>
> Not that unrealistic.  This is similar to what I need.  I have
> created a buffered work queue; when a request comes in, all I do is
> read it in, add it to the queue, and go back to waiting for input.
> (Another thread waits for work in the queue, then processes it.)

Ah, yes.  Actually that's quite a different architecture, so my comments have
maybe been a bit irrelevant.

If I understand you correctly, that means that you will have one Dolphin
Process dedicated to reading in requests, and that it will finish reading each
request before even accepting the next incoming connection.  Similarly you have
one worker Process, which will take a request off the queue, process it, and
send the answer back, before going on to consider the next request.  That does
keep the number of Processes under strict control ;-) but you may find that it
limits the degree of concurrency you can support.  You may not see any point in
allowing more than one request to be /processed/ at once, but a slow client can
effectively block any other client from making a connection, and then will also
prevent any other request from being processed while your worker Process
attempts to transmit the reply back to it.

Still, that may not be an issue in practice, and anyway it sounds easy for you
to change your architecture should it turn out to be necessary.

I'd be tempted to use a small pool of Processes reading incoming requests, a
similar pool of Processes sending back replies, and a single worker Process
that does the actual thinking.  In D5, the I/O Processes would be implemented
entirely as lightweight "threads" inside the Dolphin VM, and would be invisible
to the Windows kernel.  The (pseudo-) parallelism between them would be
implemented entirely by Dolphin using the (slow) asynchronous network API.  In
D6 if you use the new Sockets package (and if my understanding is correct),
that would end up with a Windows thread permanently allocated to each of the
Processes in the IO pool.  Since that pool is of fixed size, so will be the
number of threads -- so there's no problem with the thread number growing out
of control.  You would have to do some testing to discover the optimum pool
size (if you think it's worth it at all, that is).
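
(A Java-flavoured sketch of that shape -- fixed pools for I/O, one worker
doing the actual thinking; the pool sizes and all names are invented:)

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;

    // Small fixed reader pool -> single worker -> small fixed writer pool.
    public class PooledServer {
        static final BlockingQueue<String> work = new LinkedBlockingQueue<>();
        static final ExecutorService readers = Executors.newFixedThreadPool(4);
        static final ExecutorService writers = Executors.newFixedThreadPool(4);

        public static void main(String[] args) {
            // The single worker takes requests off the queue, processes
            // them, and hands replies to the writer pool.
            new Thread(() -> {
                while (true) {
                    try {
                        String request = work.take();
                        String answer = process(request);
                        writers.submit(() -> send(answer));
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }).start();
            // Each reader task would accept a connection, read the
            // request, and do work.put(request); since the pools are of
            // fixed size, the number of OS threads stays bounded.
        }

        static String process(String request) { return request; }   // placeholder
        static void send(String answer) { /* write reply to client */ }
    }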


> > I saw one other difference between deployed and IDE execution.  On one
> > of the three machines I tried (a Win2K box) running the server in D5
> > IDE worked fine, but when I tried the deployed version, it consistently
> > failed after running for a little while -- the crash dump showed that
> > an accept() operation failed, but Windows reported no error (the error
> > code was zero).  That didn't happen at all when running in the IDE.  I
> > have no idea what the problem is.   FWIW, on a WinXP Pro box, I saw the
> > same problem, but only very occasionally, in both the IDE and deployed.
> > On another (significantly slower) Win2K box, the problem didn't occur
> > at all in either configuration.
>
> That is scary... my client sure would be mad if I delivered that.

I've investigated that a bit more.  It turns out that the condition occurs
occasionally in all the D5 configurations I've tried (not sure about the
slowest machine -- it may be timing related).  The good news is that it appears
to be easily fixable.  It seems that Dolphin is sometimes deciding (for
whatever reason -- maybe Windows is lying to it) that there's an incoming
connection to be processed when there isn't one really.  Dolphin attempts to
#accept it, and Windows (obviously) can't supply a corresponding socket, so an
error is triggered.  If I put an error handler around #accept that traps
SocketError and ignores it if the #errorCode is zero, then everything seems to
work properly.  The client program is unaffected so presumably the server is
not missing incoming connections.

There's a discussion thread dating back to 2004-5-22 between Yar Hwee Boon and
Bill Dargel, entitled "subclassResponsibility (#onAsychSocketAccept) error when
unit testing sockets", that sounds as if it might be relevant here too.  I'm
not certain, though -- if I were seeing the same scenario as they describe then
I don't think my simple fix would work.


> > However, remember that my test setup was such that the server wasn't
> > seeing a lot of concurrent requests.  If the 600 requests per second
> > had come from 600 clients all sending one per second, then the server
> > would have been running an infeasibly high number of OS-threads.  A
> > realistic server implementation, based
>
> Would they be OS threads or Dolphin Processes?  Also, there is a way
> to pool the threads to make this more efficient.

They'd be OS threads /and/ Dolphin Processes since (I think) the new VM
dedicates a Windows thread to each Process that issues an overlapped call.
However, that would only be a (potential) problem with the simple architecture I
was thinking of, where each request is handled in its own Process; the
architecture you are considering (and extensions of it) doesn't suffer from the
same problem.


> Thanks again for your investigation... Chris, where do you work?

You're welcome -- I get curious about things and can't (don't want to!) resist
the temptation to investigate ;-)

Actually, I don't work anywhere at the moment -- I've been taking some time out
to rest and pursue my own projects -- which is why I have time for random
investigations.

Unfortunately for me, the time is fast approaching when I'll have to start
earning my keep again ;-(  So (I hope no one minds a small advertisement), if
anyone out there is looking to hire a British Smalltalk/Java/C++ programmer,
then I'd like to hear from you.

    -- chris



Re: Performance Issue with Runtime Image

Chris Uppal-3
In reply to this post by rush
rush wrote:

> [me]
> > However, remember that my test
> > setup was such that the server wasn't seeing a lot of concurrent
> > requests.  If the 600 requests per second had come from 600 clients all
> > sending one per second, then the server would have been running an
> > infeasibly high number of OS-threads.  A realistic server
> > implementation, based on the new sockets, and intended to take
> > reasonably high loads, would probably have to be coded to restrict the
> > number of concurrent connections in some way.
>
> I would say 600 simultaneous clients in reasonably designed
> protocols and apps is a jolly high load. For instance, one reasonably high
> traffic web site (ca. 150,000 pages daily) usually experiences 10-20
> simultaneous clients. And I guess socket processing is only a small
> fraction of the time a web server uses.

Oh, sure.  I was only using the 600 figure as an example.

However there is an important point here.  If the architecture is such that the
number of threads in use is proportional to the number of ongoing requests,
then you /have/ to limit the number of active connections.  If not then (unless
the server is way over-specified for the load), random variations in the number
of connected clients will mean that at times it has many connections.  When
that happens it will slow down more than linearly (because of using too many
threads).  As a result, existing requests won't be serviced as quickly, but new
requests will still be coming in at the same (on average) rate.  That will
further increase the number of concurrent requests, leading to even more
threads, even worse response time, and so even /more/ active connections...  If
you aren't lucky you may see the server crash.


> Also, one thing in your benchmark that seems most different from reality
> is that most reasonably designed protocols keep connections open for a
> longer time, since connection establishment/tear-down is a relatively
> expensive operation. One notable exception is HTTP 1.0, but nowadays most
> of the clients and servers support HTTP 1.1, which uses persistent
> connections.

Very true; but I'd add that it really depends on the nature of the data, not on the
protocol.  If the data naturally comes in unrelated request/response pairs
(which may or may not be true of Steve's application) then the only sensible
thing is to use a "one-shot" protocol.

As an aside: I'm not totally convinced by HTTP 1.1.  I can see that the
connection re-use should help reduce set-up/tear-down overhead, but I'm not
completely sold on the idea that it's worth it in practice (given client-side
caching, etc).  Of course, I don't have experience of running a web-server, and
I don't have any other source of believable figures.  But still, HTTP 1.1 adds a
/considerable/ extra burden of complexity (not so much to clients/servers, as
to proxies, firewalls, etc).  That complexity manifests in the usual array of
bugs, incompatibilities, and -- unfortunately but inevitably -- security holes
which HTTP 1.0 simply doesn't suffer from (or at least, not nearly as much).  So
it'd be interesting to know how much HTTP 1.1 actually reduces the load on
servers, or how much faster browsers can pull stuff down off the web.  If I
understand correctly (I may not), then I think Mozilla have only recently --
in the last few months -- turned on pipelining in their browser(s).  If that's
true then it seems they don't think the performance gains are all that
important either.

But still, that's only speculation; and not all that well informed, either....

    -- chris