Time to responds varies very much (performance problems)

GLASS mailing list

Time to responds varies very much (performance problems)

This is a typical "any idea" question :-)

I'm now doing heavy performance tests and I notice a strange effect: the response time for a query varies widely, and I have no idea how to find the cause of the performance problems.

I have a system of 8 topaz processes answering HTTP requests (2-core license on a 4-core CPU with 8 GB RAM). The load tests run at around 50 transactions/second. Normally a specific query is answered within 1-2 ms, but when this query has not been executed for some time, the response time increases, and I have seen response times of up to 12000 ms. The system is located on an SSD and is configured with a 2 GB cache. The system has lots of transactions (commits and aborts). If I have no transactions, the system answers the query within 1-2 ms. So I assume that this association is thrown out of the shared cache (even though a 2 GB cache is fairly large), but how can I prove this?

Any further ideas about statmonitor and/or how to interpret its results?

Marten

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

GLASS mailing list

Re: Time to responds varies very much (performance problems)

Keeping in mind that I am not an expert in interpreting the system's statistics, I noticed that the time is indeed spent reloading pages to get access to all the data needed for the computation (PageIOCount, PageLocateCount, PageReads).

In one case I sort around 300000 addresses, and this takes at least 3 seconds. When lots of data has to be loaded from disk, it goes up to more than 20 seconds (and these are only early results), and this is on a system without load. I think on a system under heavy load this will go even higher.

Marten

(In reply to the message from Marten Feldtmann via Glass <[hidden email]> of 27 November 2017, 21:49.)

GLASS mailing list

Re: Time to responds varies very much (performance problems)

Marten, when you write "sort 300000 addresses", that is a good indicator that you may benefit from indexes on your collection(s). I think the Programming Guide has an entire chapter on indexes.

(In reply to the message from Marten Feldtmann via Glass <[hidden email]> of Nov 29, 2017, 09:00.)

GLASS mailing list

Re: Time to responds varies very much (performance problems)

Well, yes, I could benefit from that. But considering that I can sort that number of items in 3 seconds on my machine when nothing else is happening, while the time goes up dramatically when I work on other parts of the database and then come back, I do not assume that an index would help that much!?

Marten

(In reply to the message from Richard Sargent <[hidden email]> of 29 November 2017, 18:27.)

GLASS mailing list

Re: Time to responds varies very much (performance problems)

On Wed, Nov 29, 2017 at 4:48 PM, Marten Feldtmann via Glass <[hidden email]> wrote:

Well, yes, I could benefit from that. But considering that I can sort that number of items in 3 seconds on my machine when nothing else is happening, while the time goes up dramatically when I work on other parts of the database and then come back, I do not assume that an index would help that much!?

I think it would. The index is NOT only for sorting but also for avoiding page faults on the (complete?) underlying objects. Say you have 300k objects and you want to sort by a single string. As far as I understand, when using indexes you do NOT need to fault in each of those 300k objects, because you can access the index structures directly (you can avoid fetching the objects completely). So I think that page faulting the index structures should be much cheaper than having to page fault ALL the underlying objects.
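
The idea can be illustrated with a small sketch in plain Python. This is only an analogy and not GemStone's index API: `NameIndex` and its methods are hypothetical names, and the integer ids stand in for objects that would otherwise have to be faulted in from disk pages. The index keeps (key, id) pairs in sorted order, so a sorted traversal reads only the compact index and never touches the objects themselves.

```python
import bisect


class NameIndex:
    """Hypothetical index: keeps (key, object_id) pairs in key order."""

    def __init__(self):
        self._entries = []  # always sorted by (key, object_id)

    def add(self, key, object_id):
        # insort keeps the list sorted, so no full sort is needed later.
        bisect.insort(self._entries, (key, object_id))

    def ids_in_key_order(self):
        # Only the index entries are read; the objects are never loaded.
        return [object_id for _, object_id in self._entries]


index = NameIndex()
index.add("Zed", 1)
index.add("Amy", 2)
index.add("Bob", 3)
ordered_ids = index.ids_in_key_order()
```

The cost of keeping the entries sorted is paid once per insert, instead of loading and sorting every object on each query.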

Does that make sense?

Mariano
http://marianopeck.wordpress.com

GLASS mailing list

Re: Time to responds varies very much (performance problems)

I tried a new license with a 4 GB cache, but that did not help at all. The extent is around 7 GB in size. I noticed that the topaz processes are very often in the "D" state, which means that they are waiting on I/O. When I execute that statement very often, the responding topaz process does not need to go into "D" and I get the full expected speed. This is becoming an interesting learning experience.
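
One way to watch for this on Linux is to read the state field from `/proc/<pid>/stat`. The following is a minimal sketch, not a tool from this thread: the function names are hypothetical, and the default filter on the command name `topaz` is an assumption about how the processes appear in the process table.

```python
import os


def parse_state(stat_line: str) -> str:
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the LAST closing parenthesis.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def processes_in_state(state: str = "D", name_filter: str = "topaz"):
    """List (pid, command) pairs for matching processes currently in `state`."""
    matches = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(f"/proc/{entry}/stat", errors="replace") as f:
                line = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        comm = line[line.index("(") + 1 : line.rindex(")")]
        if name_filter in comm and parse_state(line) == state:
            matches.append((int(entry), comm))
    return matches
```

Calling `processes_in_state()` repeatedly during a load test gives a rough picture of how often the processes are blocked in uninterruptible sleep, which usually indicates waiting on disk I/O.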

Marten

GLASS mailing list

Re: Time to responds varies very much (performance problems)

Marten,

It really seems like you are running into disk i/o issues ... a detailed review of your statmon files should provide a pretty clear portrait of exactly where your bottleneck(s) are occurring ... Unfortunately I don't really have the spare time to do a detailed analysis of your statmon files to pinpoint the bottlenecks ...

Assuming that disk i/o is the issue ... I would say that it is worth trying an 8GB cache to see if your problems are resolved --- at least disk i/o should be eliminated as a culprit, but there are always additional layers to the performance issues -- swap space, machine memory, number of disk partitions, etc.

Speaking of disk partitions, I have seen performance issues that were resolved by making sure that the tranlogs and extents are on separate disk partitions (even if those disk partitions are virtual partitions) .... the underlying issue has to do with the fact that Linux prioritizes disk writes over disk reads, so on a system that is doing commits at a fast pace the disk reads that load pages from disk into the SPC will be delayed --- and this phenomenon can be significant. A large SPC (at least as large as the DB) should fix the problem, but simply putting the tranlogs and extents on separate partitions can also address the problem...

Another trick that may work is to increase the TOC for your Gems (you can have a TOC that is larger than your SPC) ... once the working set of objects has been faulted into a gem, there is no need to hit disk again to refresh the working set (except to refresh those objects changed by other transactions) ... so the effectiveness of this technique will be a function of how often the objects in your working set are changed by other transactions ... the downside to this approach is that it can be RAM hungry --- as I said there are no magic bullets and each approach has its downsides ...

Dale

(In reply to the message from Marten Feldtmann via Glass of 11/29/17, 1:12 PM.)

GLASS mailing list

Re: Time to responds varies very much (performance problems)

Well, it turned out that the virtual machine the database was running on was ONE of four virtual machines, all running GemStone/S, all on an HDD-based server system, and the image only had 8 GB. So nearly all the requirements we had set for our installations were violated (16 GB, SSD, dedicated root server), and the general question I had to answer was: "With MySQL this configuration would work without problems, so why does your database need so much power?"

After putting the database back on a 4-year-old laptop with four cores and an SSD, everything is OK.

We rolled out the software before Christmas, and a few days earlier we added our first Java-based application, which connects to our database via our API and the automatically generated Java model/API. Due to a little bug in that program, the 2-core license had to show that it was able to handle 300 application transactions/second, which I think is pretty good. Both cores were running at 100% for over an hour before we noticed this bug.

One of the most difficult problems in the last stage of the rollout was getting the load balancing with Apache2 right, so that the API calls are delivered to different topaz processes. The normal approach, "lbmethod=byrequests", did not work very well, but "lbmethod=bybusyness" was much better.
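
The difference between the two strategies can be sketched in a few lines of Python. This is a simplified model, not Apache's implementation: the function names and the worker names `gem1` and `gem2` are hypothetical, and plain rotation stands in for request counting with equal worker weights.

```python
def pick_round_robin(workers: list, request_number: int) -> str:
    """Pick workers in strict rotation, ignoring how busy they are."""
    return workers[request_number % len(workers)]


def pick_by_busyness(in_flight: dict) -> str:
    """Pick the worker with the fewest requests currently in flight."""
    # min() returns the first minimal key, so ties go to the worker
    # that was added to the dict first.
    return min(in_flight, key=in_flight.get)


# gem1 is still busy with one long-running call, gem2 is idle.
in_flight = {"gem1": 1, "gem2": 0}
rotation_choice = pick_round_robin(["gem1", "gem2"], 2)
busyness_choice = pick_by_busyness(in_flight)
```

Here rotation sends the third request (index 2) to `gem1` even though it is still occupied, while the busyness rule sends it to the idle `gem2`. That matters most when a few calls take far longer than the rest.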

The general problem here is the long-running API calls (e.g. 10 minutes); those calls make scheduling very difficult. We rewrote the logic of our system to handle background jobs, so the API no longer does the work itself but registers a background job that does the work later.
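
The pattern can be sketched roughly as follows. The post does not say how the jobs are stored or executed, so the in-memory queue, the single worker thread, and the names `register_job` and `results` are assumptions made for illustration only.

```python
import queue
import threading
import uuid

jobs = queue.Queue()   # pending work, filled by the API side
results = {}           # job id -> (status, value)


def register_job(func, *args):
    """API side: record the work and return a job id immediately."""
    job_id = str(uuid.uuid4())
    results[job_id] = ("pending", None)
    jobs.put((job_id, func, args))
    return job_id


def worker():
    """Background side: run queued jobs one at a time."""
    while True:
        job_id, func, args = jobs.get()
        try:
            results[job_id] = ("done", func(*args))
        except Exception as exc:
            results[job_id] = ("failed", repr(exc))
        finally:
            jobs.task_done()


# Daemon thread so it does not keep the program alive on exit.
threading.Thread(target=worker, daemon=True).start()
```

An API handler would call `register_job(some_function, arguments)` and return the job id at once; the client can then ask for `results[job_id]` later. The request itself stays short, which keeps the load balancer's scheduling predictable.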

Just some information ...

Marten

GLASS mailing list

Re: Time to responds varies very much (performance problems)

That's great news Marten!

Thanks for the update.

(In reply to the message from Marten Feldtmann via Glass <[hidden email]> of Dec 23, 2017, 03:26.)
