Unknown opened 2 years ago
---
We already set this property: https://github.com/tigase/tigase-utils/blob/f681e263b95013886a983c9b0706c2c391cfd2ba/src/main/java/tigase/util/dns/DNSResolverDefault.java#L170-L170 Your instance should reconnect correctly after 2 minutes then.
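(For reference, a minimal sketch of how such a JVM-wide DNS TTL is set programmatically; the 120-second value is an assumption based on the "2 minutes" above, not a copy of the Tigase source.)

```java
import java.security.Security;

public class DnsCacheTtl {
    public static void main(String[] args) {
        // networkaddress.cache.ttl controls how long (in seconds) the JVM
        // caches successful DNS lookups; it must be set before the first
        // lookup happens, otherwise the default policy stays in effect.
        // The 120-second value here is an assumption.
        Security.setProperty("networkaddress.cache.ttl", "120");

        // Negative lookups (NXDOMAIN) have their own TTL setting.
        Security.setProperty("networkaddress.cache.negative.ttl", "10");
    }
}
```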
---
Thank you for the feedback; unfortunately, from my tests it seems that it does not work. I have tried several times, and even after 10-20 minutes Tigase does not reconnect. In the code comment I see "we are caching ourselves": what exactly does this mean? Is there some other cache level somewhere else? Anyway, it could be a good idea to move the value of
---
There is a cache for IP and SRV records (https://github.com/tigase/tigase-utils/blob/88bc1e649e281e3eeece1abc7acc32647a93a398/src/main/java/tigase/util/dns/DNSResolverDefault.java#L46-L46)
I've made a small change so that it should respect changes from
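(For illustration, a minimal sketch of a TTL-aware resolver cache; the class and field names are hypothetical, not the actual DNSResolverDefault internals.)

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical TTL-aware cache for resolved addresses.
public class TtlDnsCache {
    private record Entry(InetAddress[] addresses, long expiresAt) {}

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public TtlDnsCache(long ttlSeconds) {
        this.ttlMillis = ttlSeconds * 1000;
    }

    public InetAddress[] resolve(String host) throws UnknownHostException {
        Entry e = cache.get(host);
        if (e != null && e.expiresAt > System.currentTimeMillis()) {
            return e.addresses; // still fresh, serve from cache
        }
        // Expired or missing: resolve again so DNS changes are picked up.
        InetAddress[] fresh = InetAddress.getAllByName(host);
        cache.put(host, new Entry(fresh, System.currentTimeMillis() + ttlMillis));
        return fresh;
    }
}
```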
---
Ok, but this should not interact with / cache the dataSource URI, right?
Ok, thanks; have you ever done a successful failover test with AWS Aurora MySQL? Tomorrow I will try a newer version to see whether 8.2 or 8.3 has the same problem.
---
I have done a test upgrading to 8.2.2, but I have the same problem. I have also tried to add
---
Correct, the internal cache of the DNS resolver is only used in the XMPP layer. DNS resolution for data sources is done via the JDBC driver.
Yes, but it was a couple of years back (when we changed the JVM DNS caching setting) and it was working correctly. Regarding your current issue: it seems that, for some reason, during failover Tigase gets disconnected from the database but manages to reconnect to the old instance that was made the slave/read-only instance. Which RDS endpoint address do you use (the cluster one)? Does it point to the writer instance after performing the failover? Which type of failover do you perform?
---
I'm triggering the failover from the console. Please note that I'm using an AWS Aurora cluster with MySQL 5.7 compatibility (and not AWS RDS Multi-AZ, which has a different failover strategy). In this scenario I just select the writer instance of the cluster and click "failover"; the cluster has just one writer and one replica. Yes, I'm using the cluster endpoint and not the instance one. If I restart Tigase everything works fine, so it means that after a restart, without changing the configuration, it just resolves the new address.

I do not think that Tigase gets disconnected during the failover; I guess there is something cached, and Tigase is, for some reason, still using the old connection to the host resolved before the failover. In the test config I use

Is it possible that it is trying to use another connection from the pool, already bound to the old MySQL instance but not used yet? If I remember correctly, Tigase creates all the connections of the DB pool at startup, so I guess that during the failover all the connections just remain attached to the old instance, since Aurora switches the DNS and, I think, doesn't close active connections. Maybe the only solution is to handle that exception explicitly in Tigase and re-create all the connections of the DB pool.

The manual failover is useful to change the instance type with limited downtime, and it is also used by AWS during MySQL version upgrades.
---
Indeed, if there is no disconnection, then the existing database connection pool will keep all its connections. If you open a MySQL connection from the shell and do the failover, does the connection stay open? Can you then operate on it, or (as it seems) are you still connected to the same instance after it has been made read-only (just like in the Tigase case)?
---
I have made the test: during the failover the mysql client does not trigger any error, but then if I run a query this is the result:
I have added to the test
after the failover
so it seems that the failover actually closes the TCP connection. I have then run another test with this script:
this is the output:
As in the comment, it seems that shortly after the failover the DNS still points to the old primary writer, allowing a connection, but in read-only mode; it lasts for some seconds. So I have done this test to confirm it:
and, as expected, this is the problem:
With an update query I can trigger the "The MySQL server is running with the --read-only" error; after 5 seconds it comes back to the correct host. So this is what is happening: Tigase connects immediately to the old writer instance in read-only mode and stays stuck there forever. I guess you must handle that exception explicitly.
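(A sketch of the kind of probe described above, assuming a hypothetical cluster endpoint and credentials and the MySQL Connector/J driver on the classpath; it resolves the endpoint and checks Aurora's read-only flag in a loop, making the DNS flip and the read-only window visible.)

```java
import java.net.InetAddress;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FailoverProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and credentials; replace with your own.
        String host = "mycluster.cluster-xxxx.eu-west-1.rds.amazonaws.com";
        String url = "jdbc:mysql://" + host + ":3306/test";

        while (true) {
            String ip = InetAddress.getByName(host).getHostAddress();
            try (Connection c = DriverManager.getConnection(url, "user", "pass");
                 Statement st = c.createStatement();
                 ResultSet rs = st.executeQuery("SELECT @@innodb_read_only")) {
                rs.next();
                // On Aurora, readers report innodb_read_only = 1, so this
                // shows whether the freshly resolved IP is the writer.
                System.out.println(ip + " read_only=" + rs.getInt(1));
            } catch (Exception e) {
                System.out.println(ip + " connect failed: " + e.getMessage());
            }
            Thread.sleep(1000); // poll every second around the failover
        }
    }
}
```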
---
I have found the same problem described here: https://proxysql.com/blog/failover-comparison-in-aurora-mysql-2-10-0-using-proxysql-vs-auroras-cluster-endpoint/
---
So, they ran into the same issue, but reading your previous comment:
I wonder if this isn't an issue with RDS itself. I mean: if they do the failover, they should coordinate the disconnection, the DNS update, and making the instance available again. Currently they disconnect, make the (now reader) instance available under the cluster endpoint, and then update the DNS. They should switch the last two steps. Have you considered raising the issue with AWS tech support?
---
Yes, it seems a bug on the AWS side. I will try to open a case, even though I know it will take a long time and there is no guarantee that they will fix it. Do you think you are going to handle it and add a workaround on your side?
---
I filed a ticket for it (ref: server-1354) but it's not on our immediate agenda and doesn't have an ETA.
---
Thank you, I will update here if I get some answers from AWS.
---
Anyway, I think this should be quite an important topic on your side too. If I send a message through Tigase to a user that is offline, and the DB is in that state, the message gets lost without any error being returned to the sender; it would be better if Tigase just shut down completely.
---
Hmm, most database exceptions (and this is just an SQLException in the end) should yield an error being returned. Could you share the stack trace/error log from the case resulting in message loss?
---
This is the exception:
I do not receive any error on the XMPP submit. I guess that Tigase is async: it accepts the message that needs to be delivered and sends an ack to the sender, then it finds that the recipient is offline and tries to store the message in the MsgRepository. If the recipient is online, delivery works even when MySQL is not working; but if the recipient is offline, this exception is triggered without any error for the sender.
---
Tigase in general operates in an async manner (it's based on message passing, so in a way it's event based). In this case an error should be returned; we will add it in the future (ref: server-1355)
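(For illustration only, a hypothetical sketch of the kind of handling discussed here; the types below are made-up stand-ins, not Tigase's actual offline-storage API. On a storage failure the stanza is bounced back as an error instead of being silently dropped.)

```java
import java.sql.SQLException;

// Hypothetical stand-ins for illustration only.
interface MsgStore {
    void store(String recipient, String stanza) throws SQLException;
}

interface ErrorSink {
    void returnError(String sender, String condition, String text);
}

class OfflineDelivery {
    private final MsgStore store;
    private final ErrorSink errors;

    OfflineDelivery(MsgStore store, ErrorSink errors) {
        this.store = store;
        this.errors = errors;
    }

    void deliverOffline(String sender, String recipient, String stanza) {
        try {
            store.store(recipient, stanza);
        } catch (SQLException e) {
            // Instead of losing the message silently, return an error
            // stanza to the sender (e.g. <internal-server-error/>).
            errors.returnError(sender, "internal-server-error",
                    "Could not store offline message: " + e.getMessage());
        }
    }
}
```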
---
I'm having a big argument with AWS support about this bug; basically they say:
linking me this article: https://repost.aws/knowledge-center/aurora-mysql-db-cluser-read-only-error Anyway, now I have another doubt: why, at some point, does Tigase not start resolving the correct DNS entry? Even if after the outage it connects to the wrong instance and starts issuing the errors described at the beginning:
after the

Is it possible that there are other caches further than
---
I have also found this article: https://aws.amazon.com/blogs/database/improve-application-availability-on-amazon-aurora/ which explains exactly this problem and is written by AWS developers. As described in the article: "the client kept trying to connect to the old writer, including a couple of seconds after that writer has already been demoted to a reader." It is an issue related to AWS Aurora, and the solution is to decrease the DNS TTL, use RDS Proxy, or maybe use the AWS MySQL drivers. Anyway, as written in the previous comment, I do not understand why Tigase, after the
---
Unfortunately not. This exception is not a connection exception but merely an execution exception, so Tigase successfully connected (TCP connection) to the instance (albeit the wrong one) and keeps the connection. So unless the connection is terminated one way or another (by a restart in this case), it will keep those connections in the pool. Adding handling of this situation and recreating the connection pool would resolve it. But I'd argue that it's still a bug on AWS's side, and they should update the failover procedure so that the moment it happens, the endpoint already points to the correct instance. EDIT: sorry, missed your other reply.
Tigase does not start to (re-)connect to the old DNS address: when the failover happens and the connection is terminated, Tigase reconnects to the database at once (thus, within the 5s window), which results in the old DNS entry. After that there is no DNS resolution, as there is no reconnection attempt. We use the official MySQL JDBC driver, though AWS seems to suggest using the (forked) MariaDB JDBC driver, which supposedly has support for this use case.
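(For what it's worth, a minimal sketch of connecting through the MariaDB driver's Aurora failover mode, which exists in the 2.x Connector/J series; the endpoint and credentials are placeholders.)

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class AuroraDriverExample {
    public static void main(String[] args) throws Exception {
        // The "aurora" failover mode makes the MariaDB 2.x driver discover
        // the cluster topology and re-route to the current writer after a
        // failover. The endpoint and credentials below are placeholders.
        String url = "jdbc:mariadb:aurora://"
                + "mycluster.cluster-xxxx.eu-west-1.rds.amazonaws.com:3306/tigasedb";
        try (Connection c = DriverManager.getConnection(url, "user", "pass")) {
            System.out.println("connected, read-only=" + c.isReadOnly());
        }
    }
}
```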
---
Got it.
I have written a comment on their blog https://aws.amazon.com/blogs/database/improve-application-availability-on-amazon-aurora/; upvote, or maybe comment yourself if you like.
---
The AWS developers have given an official response in the comments of https://aws.amazon.com/blogs/database/improve-application-availability-on-amazon-aurora/ TL;DR: they are not going to make any changes, so it must be handled on the Tigase side.
---
Thanks for the update. I'm not really thrilled by the AWS response, and not sure if it's exactly accurate... They could simply avoid making the instance available again after the failover before updating the DNS; they even highlight it in their example:
They could simply wait until the DNS update and then bring both instances up... Or they could even use their RDS Aurora proxy internally, which would solve the issue and increase HA... At any rate, we will probably add handling for this use case, but there is no ETA.
---
It is exactly what I commented on: wait until the DNS update and then bring both instances up. But I guess that by doing it this way they would introduce a longer downtime for those who use smart drivers; I think that is the reason why they have done it this way. For sure, they could use the RDS Aurora proxy internally, but I guess they prefer that customers pay for it :). I do not think they will change it in the future; as pointed out in the comments, the only solutions to this problem are:
So, without modifying Tigase, the only external solution is to use RDS Proxy.
---
As I said, we will probably add support for that, but in your case you can use the suggested JDBC driver to mitigate the issue: either by swapping the one from the tarball distribution package or by modifying the Docker image (that will require your own build, though).
---
Ok, thanks for the feedback, I will try to swap the JDBC driver.
---
I have tried to swap the driver but I get:
from AWS:
How can I override database.driverClassName from the startup command line or config.tdsl?
---
Unfortunately, it turns out we do have the driver classes hardcoded, so the swapping won't work. I filed a ticket to remove this in the next version.
---
Hi, I'm trying to fix this problem using AWS RDS Proxy; it seems to work well. The only problem I have is that the proxy forcibly closes all idle connections after 24 hours
and I get some errors in the tigase-console.log like
---
The validation query is not limited to only the auth repository; it's used in DataSource altogether, see:

I'm not familiar with
---
I have tried to add in the config in
and in the
but in the logs, grepping for
---
I have read the code of the MySQL driver and unfortunately there is no way to manage the retry of the query directly from the driver; there is the

I have also explored the Failover and Load Balancing features, using the same host (of the proxy) multiple times, but this does not work either: the driver re-connects correctly to the "other" host for the next query, but it still throws the exception back to the client via the method dealWithInvocationException. This method is used for both the Failover and the Load Balancing configurations.

I hope there is a way in Tigase to fix the watchdog so it keeps the connection alive.
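(As a rough illustration of what such a watchdog amounts to, a generic sketch, not Tigase's implementation: periodically run a trivial query on every pooled connection so the proxy never sees them as idle.)

```java
import java.sql.Connection;
import java.sql.Statement;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Generic keep-alive sketch; "pool" stands in for whatever structure
// holds the open connections.
class ConnectionWatchdog {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    void start(List<Connection> pool, long periodMinutes) {
        scheduler.scheduleAtFixedRate(() -> {
            for (Connection c : pool) {
                try (Statement st = c.createStatement()) {
                    st.execute("SELECT 1"); // touch every connection
                } catch (Exception e) {
                    // A failed ping means the connection is dead and
                    // should be replaced in the pool.
                    System.err.println("connection stale: " + e.getMessage());
                }
            }
        }, periodMinutes, periodMinutes, TimeUnit.MINUTES);
    }
}
```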
---
Have you configured it in
---
Ok thanks, I see it now in the logs, but it still doesn't work; can it be related to the pool of connections? Does the checkConnectivity test check that just one connection is working, or that all the connections in the pool are working? I do not understand what happens on the RDS Proxy side. At the moment the Tigase configuration for the DB is:
From the doc
What I have seen is that after 24 hours of using the proxy I get the errors in the tigase-console.log coming from the JDBC driver of
I guess that the proxy closes all the idle connections and for some reason Tigase tries to use one of them. But if the watchdog always runs on maybe only the same or the first connection of the pool, it does not prevent the problem, correct? What do you think? Is it possible to run the watchdog on all the connections of the pool?
---
It checks all the connections, see

Could you try setting the watchdog to 4M (considering the RDS Proxy 5-minute mark)?
---
Good point! I will try it and let you know. Via JMX I use "tigase.stats:type=StatisticsProvider getAllStats 0" to get all the stats; is there a way to know how many of the connections in the pools are actually used? I do not see it in the stats list.
---
What do you mean by "used"? There is a pool of connections, and its size is the only metric. There are no detailed statistics regarding the usage of particular connections, as the assumption is that they should be used equally.
---
With "used" I mean how many connections are running a query so I can know how many are "free" and how many are "used" |
---
We don't keep statistics about that. You can get this info from MySQL and its processlist.
---
Unfortunately it didn't work; Tigase still logs errors exactly after 24 hours, with some queries failing with:
---
Do you get watchdog entries in the logs (i.e.
---
No, I do not see an entry for each connection of the pool. With dataSource -> default ->
with dataSource -> default ->
The config is:
---
To try to understand the problem, I have left an instance of the cluster running with 10 connections and without access from clients, logging all the traffic on port 3306 with:
Then I have extracted the streams of each of the 10 connections with tshark:
In this way I have the exact log of what happens on each MySQL connection, taken from the wire. What I found is that on the idle connections of the pool, where there is no MySQL traffic, the watchdog is working, but it seems to use double the value configured in watchdog-frequency; from the traffic I see the
Instead, in the stream of the connection where there is MySQL traffic, I see 1:
2:
3:
---
Using

Do you have any ETA for the new version that supports driver swapping, so I can use the AWS JDBC driver?
---
Bad news from AWS:
so at the moment Tigase does not support AWS Aurora with HA in any way; there is no way to make it work correctly. Could you please re-think the priority that you have assigned to ticket server-1358? An alternative fix could be to refresh the connections, as described in the doc linked for Aurora Serverless v1:
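(A rough sketch of the "refresh the connections" idea; this is generic code, not Tigase's pool, and the 23-hour maximum lifetime is an assumption chosen to stay under the proxy's 24-hour cut-off.)

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Generic sketch: recycle a connection before the proxy's 24-hour
// idle limit is reached. The 23-hour lifetime is an assumption.
class RecyclingConnection {
    private final String url, user, pass;
    private final long maxLifetimeMs = 23L * 60 * 60 * 1000;
    private Connection conn;
    private long openedAt;

    RecyclingConnection(String url, String user, String pass) {
        this.url = url;
        this.user = user;
        this.pass = pass;
    }

    synchronized Connection get() throws SQLException {
        long now = System.currentTimeMillis();
        if (conn == null || conn.isClosed() || now - openedAt > maxLifetimeMs) {
            if (conn != null) {
                try { conn.close(); } catch (SQLException ignore) {}
            }
            conn = DriverManager.getConnection(url, user, pass); // fresh connection
            openedAt = now;
        }
        return conn;
    }
}
```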
---
@davidemarrone could you share the output of the query:
---
on the slave - RO:
on the master - RW:
---
@davidemarrone I added handling of this case (checking whether the database is read-only using the JDBC API, with an additional workaround for AWS RDS, as they set a different variable than the one that the MySQL driver checks…). It's in the current
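(A minimal sketch of that kind of check; which variables Tigase actually inspects is an assumption here. It combines Connection.isReadOnly() from the JDBC API with a direct look at Aurora's innodb_read_only flag, which Aurora sets on reader instances.)

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class ReadOnlyCheck {
    // Returns true if the connection landed on a read-only instance.
    static boolean isEffectivelyReadOnly(Connection c) throws SQLException {
        if (c.isReadOnly()) {          // what the JDBC API reports
            return true;
        }
        // Aurora marks readers with innodb_read_only; checking it
        // directly works around drivers that look at other variables.
        try (Statement st = c.createStatement();
             ResultSet rs = st.executeQuery("SELECT @@innodb_read_only")) {
            return rs.next() && rs.getInt(1) == 1;
        }
    }
}
```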
---
Ok, thank you |
---
wojciech.kapcia@tigase.net added "Related" tigase/_server/tigase-utils#26 4 months ago
wojciech.kapcia@tigase.net referenced from other issue 4 months ago
wojciech.kapcia@tigase.net added "Related" #1354 3 months ago
wojciech.kapcia@tigase.net added "Related" #1355 3 months ago
wojciech.kapcia@tigase.net added "Related" #1358 3 months ago
---
OK, so there were a couple of issues that were discovered along the way, and some were indeed released in 8.4:
The other two were not released yet:
---
Type: Question

---
Problem with DNS cache

I was testing the AWS Aurora MySQL failover procedure with Tigase. Reading the documentation, I found https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.MultiAZSingleStandby.html; scrolling to "Setting the JVM TTL for DNS name lookups", it explains that it is important to set a TTL for the Java DNS cache, otherwise the failover does not work properly.

I was testing it on an instance of Tigase, and during my test I found that Tigase does not auto-recover after triggering a failover manually; in the logs there is:

This means that it is still using the old DNS endpoint and is not switching to the new one; I have left it for over 10 minutes, always printing the same message. Restarting the server resolves the problem, but I need automatic recovery.

I have changed on the system, in /etc/java-11-openjdk/security/java.security, the value for networkaddress.cache.ttl as suggested in the AWS doc, but I have the same results. Is there any other way to set this parameter? Do you know why Tigase does not consider the system configuration?

System info: