Periodically test DB connections (#727)

Wojciech Kapcia (Tigase) opened 9 years ago

- Periodically check connections.
- Check connections only if/when they have been idle for a specific time, to avoid imposing additional load on the system when the check is not needed.
- Make the feature configurable (on by default, interval 60 minutes).

It looks like some databases close connections that have been idle for a long time. Instead of changing the database configuration, we could introduce a mechanism that periodically checks the DB connection pool and, on detecting a broken connection, reconnects to the DB (and possibly checks the remaining connections).

Right now we only check a connection right before we want to use it, and only if it has been idle for a longer time, which may result in user connections occasionally timing out if re-establishing the connection takes longer than usual.

Think of it as a watchdog for DB pool.
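
A minimal sketch of the watchdog idea, for discussion only. It assumes a hypothetical `PooledConnection` interface and `PoolSlot`/`IdleWatchdog` classes; none of these names come from Tigase. A real `isValid()` could delegate to `java.sql.Connection.isValid(int)`. Only connections idle for at least the threshold are probed, and broken ones are replaced.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical abstraction over a pooled DB connection (not a Tigase type).
interface PooledConnection {
    boolean isValid();
}

// One pool slot: the connection plus the time it was last used or checked.
final class PoolSlot {
    PooledConnection connection;
    Instant lastUsed;

    PoolSlot(PooledConnection connection, Instant lastUsed) {
        this.connection = connection;
        this.lastUsed = lastUsed;
    }
}

// Not thread-safe; a real pool would need synchronization around the slots.
final class IdleWatchdog {
    private final List<PoolSlot> slots = new ArrayList<>();
    private final Duration idleThreshold;
    private final Supplier<PooledConnection> reconnect;

    IdleWatchdog(Duration idleThreshold, Supplier<PooledConnection> reconnect) {
        this.idleThreshold = idleThreshold;
        this.reconnect = reconnect;
    }

    void add(PoolSlot slot) {
        slots.add(slot);
    }

    // Probes only slots idle for at least idleThreshold; returns how many were replaced.
    int checkOnce(Instant now) {
        int replaced = 0;
        for (PoolSlot slot : slots) {
            Duration idle = Duration.between(slot.lastUsed, now);
            if (idle.compareTo(idleThreshold) < 0) {
                continue; // recently used, so no extra load is generated
            }
            if (!slot.connection.isValid()) {
                slot.connection = reconnect.get();
                replaced++;
            }
            slot.lastUsed = now; // the probe itself counts as activity
        }
        return replaced;
    }
}
```

Calling `checkOnce(Instant.now())` from a scheduled task would give the periodic behaviour; under normal load most slots are skipped because they were used recently.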

RFD

(Please check linked issue for more details)

Related
- Database Connectivity (#675) Open

Activities

Artur Hefczyc commented 9 years ago

To be honest I have mixed feelings about this. When does this really happen? Only when the whole installation is more or less idle. Very low activity results in DB connections timing out. Additionally, I am not convinced that this is the real root of the problem. The referenced ticket mentions login taking 3 minutes. Reconnecting to the DB after a connection was lost due to an inactivity timeout should be instantaneous. So I imagine the login time when a DB connection needs to be reopened may be up to maybe 3 seconds, not 3 minutes.

My understanding is that this 3-minute delay is caused by the fact that we may have a DB connection closed on the DB end while it still appears open on the Tigase end. Tigase attempts to execute a query on the broken connection and it takes 3 minutes or maybe longer for the OS to signal back that the connection is broken.

If this is the case, then the best solution would be to make sure that the DB query timeouts we have on the Tigase side are working correctly. If we do not get results from a DB query within a specified timeframe, we consider the query failed and should either signal the failure back or attempt to reopen the connection and run the query again.
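
The timeout-then-retry behaviour described above could look roughly like the sketch below. `TimedQueryRunner` and its parameters are hypothetical names used for illustration, not Tigase's DB API: the query is bounded by a timeout, and on a timeout or failure the connection is reopened and the query is run once more.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: bounds a query by a timeout and retries once after reconnecting.
final class TimedQueryRunner {
    // Daemon threads so a stuck query never blocks JVM shutdown.
    private final ExecutorService executor = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "query-runner");
        t.setDaemon(true);
        return t;
    });

    <T> T run(Callable<T> query, Runnable reopenConnection, long timeoutMillis) throws Exception {
        try {
            return attempt(query, timeoutMillis);
        } catch (TimeoutException | ExecutionException firstFailure) {
            // Treat the connection as broken: reopen it and retry exactly once.
            reopenConnection.run();
            return attempt(query, timeoutMillis);
        }
    }

    private <T> T attempt(Callable<T> query, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        Future<T> future = executor.submit(query);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck attempt
            throw e;
        }
    }
}
```

If the second attempt also fails, the exception propagates to the caller, which corresponds to "signal the failure back". With plain JDBC, `java.sql.Statement.setQueryTimeout(int)` is another way to bound a single statement.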

Please let me know what you think.
Wojciech Kapcia (Tigase) commented 9 years ago

OK. Yes and no :-)

Yes - the root problem is that DB closes the connection and then due to operating system settings it's detected with the delay. We already tweaked the system settings (because by default those are 2h) and now the disconnection is detected quicker, but it still has the delay and there is the time to re-establish the connection.

No, keeping idle DB connections open is not necessarily a good idea (it uses resources and may impact performance!).

Now back to the system at hand - they expect actually relatively idle tigase-db connectivity - once the 'client' is provisioned and authenticated it would only exchange information with other clients/external components (save for the broken connection once in a while yielding reconnection and re-login), thus chances of running into problems with closed db connection are higher.

You mentioned that we should depend on the timeout in Tigase, but this 15s timeout to detect a broken connection may be too long (from the company/client perspective - for example, in this particular case they have a 10s connection-login-timeout, so failing to authenticate within this window would be considered a 'failure'). Now imagine this is some critical system and they have to wait 15s for data - this may be far from perfect where instant availability is crucial.

We are trying to get into IoT, where we would have a very similar scenario of devices in an always-connected state, which would yield a similar problem - rare access to the repository causing frequent delays when reconnecting the device.

The suggestion, while not perfect, would greatly improve the situation in this context.

I'm aware, that ideally MySQL closing connection should do it and communicate it properly (without dangling ESTABLISHED connection - this way Tigase would know perfectly the state - I've tried researching the issue but haven't come up with relevant solution, it's possibly related to #4145).

Also another thing to consider, while basic UserRepositories have the timeout set, PubSub for example doesn't (but it ~~should be~~ is addressed in 7.2.x).

The suggestion came from the simple observation that we ran into the "is the socket already closed?" problem yet another time and, as experience shows, while it should be detected quickly, it often isn't (in part due to TCP design, which should offer reliability and resilience). If we relied only on the state that the OS reports, we wouldn't have XMPP ping implemented, nor many clients asking about detecting broken connections - it's quite similar, but from a slightly different perspective.

(btw. those recent submissions of ideas are mainly for discussion, as it's better to discuss something and arrive at the conclusion that there are better options than to avoid suggestions altogether)
Artur Hefczyc commented 9 years ago
Ok, let's do it then. More comments inline below:

Wojciech Kapcia wrote:

OK. Yes and no :-)

Yes - the root problem is that DB closes the connection and then due to operating system settings it's detected with the delay. We already tweaked the system settings (because by default those are 2h) and now the disconnection is detected quicker, but it still has the delay and there is the time to re-establish the connection.

I understand why this happens and I believe tweaking OS settings is not the right approach. This is because we start to rely on something we have no control over. Different OSes, distributions and even versions may have different settings. Even if we somehow enforce proper OS settings during installation time, there is no guarantee that they will be preserved after OS update at later time.

That's why we have all sorts of timeouts and pings inside Tigase to handle the problem within Tigase.

No, keeping idle DB connections open is not necessarily a good idea (it uses resources and may impact performance!).

Now back to the system at hand - they expect actually relatively idle tigase-db connectivity - once the 'client' is provisioned and authenticated it would only exchange information with other clients/external components (save for the broken connection once in a while yielding reconnection and re-login), thus chances of running into problems with closed db connection are higher.

Ok, this makes sense and I agree.

You mentioned that we should depend on the timeout in Tigase, but this 15s timeout to detect a broken connection may be too long (from the company/client perspective - for example, in this particular case they have a 10s connection-login-timeout, so failing to authenticate within this window would be considered a 'failure'). Now imagine this is some critical system and they have to wait 15s for data - this may be far from perfect where instant availability is crucial.

There are a few remarks to this:

The timeout described above happens only during the first request, once DB connection is reestablished, all subsequent calls should be fast.

All our timeouts and in general, workflow, error handling was implemented with a typical XMPP use-case scenario and I think it works quite well.

Also, I still believe this problem manifests itself only on their development system which is really idle. Almost no connections, no usage, no traffic. On a real production system they expect to have 10k devices connected initially. There will be some DB traffic all the time. Devices will be disconnecting, reconnecting and other traffic will be there too. I expect for 10k devices there will be enough load to keep the DB connections active.

If I am not correct in my point above, then their setup is really a special use-case and we should do something about this. Testing DB connections is one thing to do, maybe we should also tweak Tigase internal timeouts for DB queries, for user authentication and others...

We are trying to get into IoT, where we would have a very similar scenario of devices in an always-connected state, which would yield a similar problem - rare access to the repository causing frequent delays when reconnecting the device. The suggestion, while not perfect, would greatly improve the situation in this context.

This really depends on a use-case but I agree there are some specific deployments on which the same problem may happen.

If we implement the DB connection testing in such a way that it is only run when a connection has been idle for a specific period of time, it should not cause any resource/performance problems. When the system runs with normal load, there will be no DB tests; when it is underloaded, there is no load anyway, so running a DB check won't harm.

I'm aware, that ideally MySQL closing connection should do it and communicate it properly (without dangling ESTABLISHED connection - this way Tigase would know perfectly the state - I've tried researching the issue but haven't come up with relevant solution, it's possibly related to #4145).

My guess is that this most likely happens because on their setup MySQL is behind some router, proxy or firewall, and that's why the connection closure is not correctly propagated between the DB and Tigase. But again, there is nothing we can do about this and we should expect it to happen. Actually it does/did happen many times on other installations.

Also another thing to consider, while basic UserRepositories have the timeout set, PubSub for example doesn't (but it ~~should be~~ is addressed in 7.2.x).

That's, probably, because PubSub does not use our DB API where all the mechanisms are included. So, I guess the same problem will be with testing DB connections. If we implement it within our DB API/Framework, only components which use it will benefit. We should make sure all our code uses the same way to access resources (DB and others).
Wojciech Kapcia (Tigase) commented 9 years ago

Artur Hefczyc wrote:

Ok, let's do it then. More comments inline below:

I've assigned it for 7.2.0 and set the due date. I've updated the description with the main points.

%kobit - couple of comments below.

Wojciech Kapcia wrote:

OK. Yes and no :-)

Yes - the root problem is that DB closes the connection and then due to operating system settings it's detected with the delay. We already tweaked the system settings (because by default those are 2h) and now the disconnection is detected quicker, but it still has the delay and there is the time to re-establish the connection.

I understand why this happens and I believe tweaking OS settings is not the right approach. This is because we start to rely on something we have no control over. Different OSes, distributions and even versions may have different settings. Even if we somehow enforce proper OS settings during installation time, there is no guarantee that they will be preserved after OS update at later time.

That's why we have all sorts of timeouts and pings inside Tigase to handle the problem within Tigase.

Agreed, and this was one of the reasons for this suggestion - increasing timeout after MySQL would kill the connection ad infinitum didn't make sense and having periodic check/refresh in Tigase fit the bill.

You mentioned that we should depend on the timeout in Tigase, but this 15s timeout to detect a broken connection may be too long (from the company/client perspective - for example, in this particular case they have a 10s connection-login-timeout, so failing to authenticate within this window would be considered a 'failure'). Now imagine this is some critical system and they have to wait 15s for data - this may be far from perfect where instant availability is crucial.

There are a few remarks to this:

The timeout described above happens only during the first request, once DB connection is reestablished, all subsequent calls should be fast.

This is correct, with a small catch - we are using a connection pool, so the hash of a subsequent JID could be different, resulting in the selection of another connection (and we now use a connection pool size based on CPU count).
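
To illustrate the catch, here is a small sketch of hash-based connection selection. The `HashedPool` class and the exact hashing are assumptions made for illustration, not the actual Tigase pool code; the point is only that two different JIDs can map to two different pooled connections, so a fresh reconnect on one slot does not help a request that lands on another stale slot.

```java
// Hypothetical pool that picks a connection slot from the hash of the JID.
final class HashedPool {
    // Strings stand in for real connection objects.
    private final String[] connections;

    HashedPool(int size) {
        connections = new String[size];
        for (int i = 0; i < size; i++) {
            connections[i] = "conn-" + i;
        }
    }

    // Pool sized from the CPU count, as mentioned in the comment above.
    static HashedPool forCpuCount() {
        return new HashedPool(Runtime.getRuntime().availableProcessors());
    }

    // floorMod keeps the index non-negative even for negative hash codes.
    int indexFor(String jid) {
        return Math.floorMod(jid.hashCode(), connections.length);
    }

    String select(String jid) {
        return connections[indexFor(jid)];
    }
}
```

The same JID always maps to the same slot, but each idle slot can pay the reconnect delay separately on its first use.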

All our timeouts and in general, workflow, error handling was implemented with a typical XMPP use-case scenario and I think it works quite well.

Yes, but recently there has been a shift in XMPP from user-oriented to client/machine-oriented, where the typical scenarios don't apply that much (fewer connections/reconnections, fewer or no presence broadcasts or roster usage).

Even in typical user-centric solutions, the not-so-recent move to mobile-first means that other XMPP features may have higher priority (but in this case frequent reconnections are actually more common)

Also, I still believe this problem manifests itself only on their development system which is really idle. Almost no connections, no usage, no traffic. On a real production system they expect to have 10k devices connected initially. There will be some DB traffic all the time. Devices will be disconnecting, reconnecting and other traffic will be there too. I expect for 10k devices there will be enough load to keep the DB connections active.

For that I asked for estimates: in 2016 they expect to have roughly 250 devices, and around 40k devices in subsequent years. Yes - this should generate enough traffic (and they state that they expect at most 1 device disconnection per day; however, it may happen that a device stays online for days/months), but instead of relying on "should be enough" I think it's better to take active and pre-emptive steps that will help with the situation and won't rely that much on external conditions (adjusting MySQL settings or OS settings)

If I am not correct in my point above, then their setup is really a special use-case and we should do something about this. Testing DB connections is one thing to do, maybe we should also tweak Tigase internal timeouts for DB queries, for user authentication and others...

The setup is not that special (an 8h MySQL timeout is the default); it's the use-case/requirements that are special (little DB access, low delays)

Also another thing to consider, while basic UserRepositories have the timeout set, PubSub for example doesn't (but it ~~should be~~ is addressed in 7.2.x).

That's, probably, because PubSub does not use our DB API where all the mechanisms are included. So, I guess the same problem will be with testing DB connections. If we implement it within our DB API/Framework, only components which use it will benefit. We should make sure all our code uses the same way to access resources (DB and others).

%andrzej.wojcik made a lot of unification in this aspect in the 7.2.0 so this should/will be handled correctly.
Artur Hefczyc commented 9 years ago

Thank you for additional comments, I agree to all of them. :-)
Artur Hefczyc commented 8 years ago

Wojciech, looks like you are overloaded with work for version 7.2.0. Do you think Andrzej could take this over?
Andrzej Wójcik (Tigase) commented 8 years ago
I've implemented this feature by creating a ScheduledExecutor in DataSourceBean (the bean handling all data source bean instances); each DataSourceMDConfigBean registers its own task with a custom watchdog frequency. This way we can have multiple data sources, each with a different watchdog frequency.

I've set the default to 1 hour, and it is possible to override this value by:

setting the data source watchdog-frequency parameter to the proper value, i.e. here we set the test data source watchdog to 30 minutes

dataSource {
    default () {
        uri = '....'
    }
    'test' () {
        uri = '...'
        'watchdog-frequency' = 'PT30M'
    }
}

changing default watchdog frequency for all data sources to 15 minutes:

dataSource {
    default () {
        uri = '...'
    }
    'watchdog-frequency' = 'PT15M'
}
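
For readers unfamiliar with the approach, a rough sketch of per-data-source scheduling follows. `WatchdogScheduler` is a hypothetical class written for illustration and is not the DataSourceBean code; it assumes the `PT30M`/`PT15M` values are ISO-8601 durations, which `java.time.Duration.parse` accepts, and it simplifies the global default to a constant of 1 hour.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical scheduler: one shared executor, one watchdog task per data source.
final class WatchdogScheduler {
    static final Duration DEFAULT_FREQUENCY = Duration.ofHours(1);

    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "db-watchdog");
                t.setDaemon(true);
                return t;
            });
    private final Map<String, ScheduledFuture<?>> tasks = new ConcurrentHashMap<>();

    // A null value means "not configured", so the default applies.
    static Duration parseFrequency(String value) {
        return value == null ? DEFAULT_FREQUENCY : Duration.parse(value);
    }

    // Registers (or replaces) the watchdog task for the named data source.
    void register(String dataSourceName, String frequency, Runnable check) {
        long periodMillis = parseFrequency(frequency).toMillis();
        ScheduledFuture<?> previous = tasks.put(dataSourceName,
                executor.scheduleAtFixedRate(check, periodMillis, periodMillis, TimeUnit.MILLISECONDS));
        if (previous != null) {
            previous.cancel(false);
        }
    }

    void shutdown() {
        executor.shutdownNow();
    }
}
```

With this shape, `register("test", "PT30M", check)` would run `check` every 30 minutes for the `test` data source, while sources without their own value fall back to the default.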

I decided not to implement the watchdog mechanism for Mongo, as its driver has an internal connection pool and custom keep-alive mechanisms. If needed, we can easily implement it by providing an implementation of a single method.
Wojciech Kapcia (Tigase) commented 8 years ago

Works ok. Dan - thanks for adding description to the documentation.

Type	New Feature
Priority	Normal
Assignee	Wojciech Kapcia (Tigase)
RedmineID	4687
Version	tigase-server-8.0.0
Estimation	0
Spent time	0

Reference

tigase/_server/server-core#727
