wojciech.kapcia@tigase.net opened 8 years ago
|
|
To be honest I have mixed feelings about this. When does this really happen? Only when the whole installation is more or less idle; very low activity results in DB connections timing out. Additionally, I am not convinced that this is the real root of the problem. The referenced ticket mentions a login taking 3 minutes, but reconnecting to the DB after a connection was lost to an inactivity timeout should be nearly instantaneous. So I imagine a login that requires reopening a DB connection might take up to maybe 3 seconds, not 3 minutes. My understanding is that the 3-minute delay is caused by the connection being closed on the DB end while it still appears open on the Tigase end. Tigase attempts to execute a query on the broken connection, and it takes 3 minutes or longer for the OS to signal back that the connection is broken. If this is the case, then the best solution would be to make sure that the DB query timeouts we have on the Tigase side are working correctly. If we do not get results from a DB query within a specified timeframe, we should consider the query failed and either signal back the failure or attempt to reopen the connection and run the query again. Please let me know what you think. |
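The query-timeout idea above can be sketched as follows. JDBC already offers `Statement.setQueryTimeout(seconds)` for this; the snippet below shows the same fail-fast behaviour in plain Java with a `Future`, so it runs without a database. All names here (`runWithTimeout`, the fake queries) are illustrative, not Tigase APIs.

```java
import java.util.concurrent.*;

// Sketch: run the DB query with an explicit timeout on the application side,
// so a dead connection fails fast instead of blocking until the OS finally
// reports the broken socket.
public class QueryTimeout {

    static <T> T runWithTimeout(Callable<T> query, long timeoutMs) throws Exception {
        ExecutorService ex = Executors.newSingleThreadExecutor();
        try {
            Future<T> f = ex.submit(query);
            try {
                return f.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                f.cancel(true); // give up; the caller may reopen the connection and retry
                throw new Exception("query timed out after " + timeoutMs + " ms");
            }
        } finally {
            ex.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A fast "query" succeeds...
        System.out.println(runWithTimeout(() -> "row", 500));
        // ...while a hung one fails within the timeout instead of after minutes.
        try {
            runWithTimeout(() -> { Thread.sleep(5_000); return "never"; }, 200);
        } catch (Exception e) {
            System.out.println("failed fast: " + e.getMessage());
        }
    }
}
```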
|
OK. Yes and no :-) Yes - the root problem is that the DB closes the connection and then, due to operating system settings, this is detected with a delay. We already tweaked the system settings (by default those are 2h) and now the disconnection is detected quicker, but there is still the delay plus the time needed to re-establish the connection. No - keeping idle DB connections open is not necessarily a good idea (it uses resources and may impact performance!).
Now back to the system at hand - they actually expect relatively idle tigase-db connectivity: once the 'client' is provisioned and authenticated, it would only exchange information with other clients/external components (save for a broken connection once in a while, yielding a reconnection and re-login), thus the chances of running into a closed DB connection are higher. You mentioned that we should depend on the timeout in Tigase, but this 15s timeout for detecting a broken connection may be too long from the company/client perspective - for example, in this particular case they have a 10s connection-login timeout, so failing to authenticate within that window counts as a 'failure'. Now imagine this is some critical system where instant availability is crucial, and they have to wait 15s for data - this is far from perfect. We are also trying to get into IoT, where we would have a very similar scenario: devices in an always-connected state, with rare repository access causing frequent delays when reconnecting a device. The suggestion, while not perfect, would greatly improve the situation in this context. I'm aware that ideally MySQL, when closing the connection, should communicate it properly (without leaving anything dangling). Another thing to consider: while the basic UserRepositories have the timeout set, PubSub for example doesn't.
The suggestion came from the simple observation that we ran into the 'is the socket already closed?' problem yet another time, and as experience shows - while it should be detected quickly, it often isn't (in part due to TCP design, which is supposed to offer reliability and resilience). If we only relied on the state that the OS reports, we wouldn't have XMPP ping implemented, nor so many clients asking about detecting broken connections - it's quite a similar problem, just seen from a slightly different perspective. (btw. those recent submissions of ideas are mainly for discussion; it's better to discuss something and arrive at the conclusion that there are better options than to avoid suggestions altogether) |
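For reference, the "system settings" tweaked above are, on Linux, the TCP keepalive sysctls; the 2-hour default mentioned corresponds to `net.ipv4.tcp_keepalive_time = 7200`. A fragment like the following (values purely illustrative) shortens detection of silently dropped connections, though - as argued later in the thread - relying on OS settings remains fragile, and keepalive probes are only sent on sockets that enable `SO_KEEPALIVE`.

```
# /etc/sysctl.d/99-tcp-keepalive.conf -- illustrative values, not a recommendation
net.ipv4.tcp_keepalive_time = 120     # idle seconds before the first probe (kernel default: 7200)
net.ipv4.tcp_keepalive_intvl = 15     # seconds between probes (kernel default: 75)
net.ipv4.tcp_keepalive_probes = 5     # failed probes before the connection is declared dead (kernel default: 9)
```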
|
Ok, let's do it then. More comments inline below: Wojciech Kapcia wrote:
I understand why this happens, and I believe tweaking OS settings is not the right approach, because we start to rely on something we have no control over. Different OSes, distributions and even versions may have different settings. Even if we somehow enforce proper OS settings at installation time, there is no guarantee they will be preserved after a later OS update. That's why we have all sorts of timeouts and pings inside Tigase: to handle the problem within Tigase.
Ok, this makes sense and I agree.
There are a few remarks to this:
This really depends on the use-case, but I agree there are some specific deployments on which the same problem may happen. If we implement the DB connection testing in such a way that it is only run when a connection has been idle for a specific period of time, it should not cause any resource/performance problems. When the system runs under normal load there will be no DB tests; when it is underloaded, there is no load anyway, so running a DB check won't hurt.
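A minimal sketch of that idle-only policy (illustrative code, not Tigase's actual implementation): the watchdog sweeps the pool and probes only connections idle longer than a threshold, so a normally loaded system sees no extra checks. The `PooledConn` stand-in and its liveness probe are assumptions; real JDBC code could probe with `Connection.isValid(timeout)`.

```java
import java.util.*;
import java.util.function.*;

// Idle-only watchdog sweep: recently used connections are skipped entirely,
// idle ones get a liveness probe; broken idle ones would be reopened.
public class IdleWatchdog {
    static final long IDLE_THRESHOLD_MS = 3_600_000L; // 1h, matching the default discussed later

    // Stand-in for a pooled DB connection: last-use timestamp plus a liveness probe.
    static class PooledConn {
        final long lastUsedAtMs;
        final BooleanSupplier alive;
        PooledConn(long lastUsedAtMs, BooleanSupplier alive) {
            this.lastUsedAtMs = lastUsedAtMs;
            this.alive = alive;
        }
    }

    // Returns how many idle connections were found broken (and would be reopened).
    static int sweep(List<PooledConn> pool, long nowMs) {
        int broken = 0;
        for (PooledConn c : pool) {
            if (nowMs - c.lastUsedAtMs < IDLE_THRESHOLD_MS)
                continue;                 // recently used: leave it alone, no extra load
            if (!c.alive.getAsBoolean())
                broken++;                 // real code would reconnect here
        }
        return broken;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        List<PooledConn> pool = List.of(
            new PooledConn(now - 5_000, () -> true),        // busy, skipped
            new PooledConn(now - 7_200_000L, () -> true),   // idle but healthy
            new PooledConn(now - 7_200_000L, () -> false)); // idle and dead
        System.out.println("broken idle connections: " + sweep(pool, now));
    }
}
```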
My guess is that this most likely happens because on their setup MySQL is behind some router, proxy or firewall, and that's why the connection closure is not correctly propagated between the DB and Tigase. But again, there is nothing we can do about this and we should expect it to happen. It actually does/did happen many times on other installations.
That's probably because PubSub does not use our DB API, where all these mechanisms are included. So I guess the same problem will apply to testing DB connections: if we implement it within our DB API/framework, only components which use it will benefit. We should make sure all our code uses the same way of accessing resources (DB and others). |
|
Artur Hefczyc wrote:
I've assigned it to 7.2.0 and set the due date. I've updated the description with the main points. %kobit - a couple of comments below.
Agreed, and this was one of the reasons for this suggestion - increasing the timeout ad infinitum every time MySQL killed the connection didn't make sense, and having a periodic check/refresh in Tigase fit the bill.
This is correct, with a small catch - we are using a connection pool, so the hash of a subsequent JID could be different, resulting in a different connection being selected (and we now size the connection pool based on the CPU count).
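The catch above can be illustrated with a tiny sketch (hypothetical helper, not Tigase code): hash-based selection is stable for a given JID only while the pool size is fixed, so a CPU-count-based pool size can map the same JID to different connections on different machines.

```java
// Hash-based selection of a pooled connection by bare JID. The chosen index
// depends on the pool size, so changing the size remaps JIDs to connections.
public class PoolSelect {
    static int selectConnection(String bareJid, int poolSize) {
        // Mask the sign bit instead of Math.abs, which is unsafe for Integer.MIN_VALUE.
        return (bareJid.hashCode() & 0x7fffffff) % poolSize;
    }

    public static void main(String[] args) {
        System.out.println(selectConnection("juliet@example.com", 4));
        System.out.println(selectConnection("juliet@example.com", 8)); // may differ from the above
    }
}
```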
Yes, but recently there has been a shift in XMPP from user-oriented to client/machine-oriented usage, where the typical scenarios don't apply as much (fewer connections/reconnections, little or no presence broadcasting or roster usage). Even in typical user-centric solutions, the not-so-recent move to mobile-first means other XMPP features may have higher priority (though in that case reconnections are actually more common).
For that I asked for estimates: for 2016 they expect roughly 250 devices, and around 40k devices in subsequent years. Yes - this should generate enough traffic (and they state that they expect at most 1 device disconnection per day, although a device may stay online for days or months), but instead of relying on "should be enough" I think it's better to take active, pre-emptive steps that will help with the situation and won't rely so much on external conditions (adjusting MySQL or OS settings).
The setup is not that special (the 8h MySQL timeout is the default); it's the use-case/requirements that are special (little DB access, low acceptable delays).
%andrzej.wojcik did a lot of unification in this area in 7.2.0, so this should/will be handled correctly. |
|
Thank you for additional comments, I agree to all of them. :-) |
|
Wojciech, looks like you are overloaded with work for version 7.2.0. Do you think Andrzej could take this over? |
|
I've implemented this feature by creating a watchdog mechanism. I've set the default interval to 1 hour, and it is possible to override this value in the configuration.
I decided not to implement the watchdog mechanism for MongoDB, as its driver has an internal connection pool and custom keep-alive mechanisms. If it is needed, we can easily implement it by providing an implementation for a single method. |
|
Works OK. Dan - thanks for adding the description to the documentation. |
|
Referenced from commit 1 year ago
|
Type: New Feature
Priority: Normal
Assignee:
RedmineID: 4687
Version: tigase-server-8.0.0
Estimation: 16h
Spent time: 26h 15m
|
Periodically check connections:
- check connections only if/when they have been idle for a specific time, to avoid imposing additional load on the system when the check is not needed;
- make the feature configurable (on by default, interval 60 minutes).
It looks like some databases close connections that have been idle for a long time. Instead of changing the database configuration, we could introduce a mechanism which periodically checks the DB connection pool and, upon detecting a broken connection, reconnects to the DB (and possibly checks the remaining connections).
Right now we only check a connection right before we want to use it, after it has been idle for a longer time, which may result in occasionally timing out a user connection if re-establishing the connection takes longer than usual.
Think of it as a watchdog for the DB pool.
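A sketch of the check-before-use behaviour described above (illustrative names; a real JDBC version might probe with `Connection.isValid(timeout)`): the validation and any reconnect happen on the user's request path, which is exactly the latency the proposed watchdog is meant to move off that path.

```java
import java.util.function.BooleanSupplier;

// Check-before-use: a connection idle longer than a threshold is validated
// right before the query and replaced when the check fails; the user request
// pays for the validation and any reconnect.
public class CheckBeforeUse {
    static final long IDLE_THRESHOLD_MS = 3_600_000L; // 1h idle before validating

    static boolean mustValidate(long lastUsedAtMs, long nowMs) {
        return nowMs - lastUsedAtMs >= IDLE_THRESHOLD_MS;
    }

    // Use a healthy connection directly; reconnect first when an idle one is dead.
    static String execute(long lastUsedAtMs, long nowMs, BooleanSupplier alive) {
        if (mustValidate(lastUsedAtMs, nowMs) && !alive.getAsBoolean())
            return "reconnected, then ran query"; // user waited for the reconnect
        return "ran query";
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println(execute(now - 1_000, now, () -> true));        // busy: no check
        System.out.println(execute(now - 7_200_000L, now, () -> false));  // idle and dead
    }
}
```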
RFD
(Please check linked issue for more details)