Projects tigase _server server-core Issues #1101
Enabling TLS1.3 causes s2s connections to fail (#1101)
wojciech.kapcia@tigase.net opened 5 years ago
Due Date
2021-02-09

No matter if other party supports or not TLS1.3, if we add it to "enabled protocols", Tigase will fail to establish connection to the remote server, and remote server will log this error:

20:09:27.205 [info] Closing inbound s2s connection tigase.im -> wojtek-local.tigase.eu: TLS failed: SSL_do_handshake failed: error:1408F119:SSL routines:ssl3_get_record:decryption failed or bad record mac
20:09:27.205 [warning] (tls|<0.521.0>) Failed to secure inbound s2s connection: TLS failed: SSL_do_handshake failed: error:1408F119:SSL routines:ssl3_get_record:decryption failed or bad record mac

Using same Java version and connecting using Java's SSLSocket instead of Tigase's network layer works fine,

Andrzej Wójcik (Tigase) commented 4 years ago

The issue is now fixed (or it looks like fixed) and tests from S2SConnManTest are passing.

The cause of the issue was the location where eventHandler.handshakeCompleted(this) was called in the JcaTLSWrapper in wrap() method.

Before TLS 1.3 for client connections (outgoing connections) handshake could be completed only on unwrap() method, so that was not causing any issues.

On server connections, it was not causing any issues as servers are waiting for requests from clients, so they are not sending data just after the handshake is completed.

Only in TLS 1.3 in client connections, it is possible to complete a handshake in wrap method (but server has to be TLS 1.3 compatible as well).

In this situation, we were calling eventHandler.handshakeCompleted(this) which was trying (on the same thread) to send <stream:stream/> and because we had TLS handshake data prepared (ready to be sent) but not queued yes, we queued TLS encryped <stream:stream/> before TLS handshake data - which caused issue.

Invalid flow:

  1. Starting TLS 1.3
  2. Begining handshake and finishing it (TLS 1.3 has support for a 1-trip handshake)
  3. Storing handshake data in a local variable
  4. Encrypting <stream:stream/>
  5. Enqueuing encrypted <stream:stream/> in the socket output buffer
  6. Enqueuing handshake data in the socket output buffer

The issue is with inverted steps 5 & 6. We should first enqueue handshake data and then encrypted <stream:stream/>.

I've solved it with a minor change to the workflow and (I hope) a small change in the API which should be backward compatible (added default method to the existing interface).

I've marked this method as deprecated as it should be replaced by proper handling of wrap result in TLSIO instead of relying on the private state of a new variable in JcaTLSWrapper informing it that notification should be sent. That would require API changes, so I've left it for 9.0.0.

@wojtek Please review those changes. I've not enabled TLS 1.3 but I think we could do that by default if we confirm that this fix is working properly.

wojciech.kapcia@tigase.net commented 4 years ago

As per our discussion: it seems that the fix doesn't work all that well (as it turned out after deployment to tigase.im).

Actions for now:

  1. I reverted both commits (with the fix and enabling tls1.3 by default) to have master branch in working state (so I would be able to update tigase.im)
  2. Possible work with enabling tls1.3 should go to the separate branch and got merged to the master only once we make sure that it's properly working with other server (i.e. running the unit tests, and possibly improving them as well)
  3. If there are additional requirements (i.e. "works with JVM build in the version at least x.y.z" please do include those details)
Andrzej Wójcik (Tigase) commented 4 years ago

Sorry for the trouble caused by recent changes to add support for TLS 1.3.

I've reviewed the code more deeply and after some debugging, I've found out that my changes were correct. The issue was actually caused by tlsHandshakeCompleted being called twice (SSLEngine result was still reporting handshake completed even that it was completed before).

Before my previous fix, the first call to tlsHandshakeCompleted was sending <stream:stream> before TLS data and due to that it was considered garbage and dropped on TLS lower than 1.3. On the newer version of TLS that was causing issues and was fix by my previous fix (fixed reordering of data).

However, due to tlsHandshakeCompleted being called twice, we ended up in <stream:stream> being sent twice causing confusion for some servers. After I've added protection (I've made sure that we will call tlsHandshakeCompleted only once), the issue is now gone and streams started to behave correctly.

After confirming that those changes work for domains for which they failed before (I've deployed a new build on my local installation with S2S and confirmed that I was able to query disco items and info from problematic domains), I've looked into test cases which were prepared for this issue.

I've changed them, to make them assume the actual behavior of the S2S connection:

  1. SSL certificate exists in the key store and is sent during handshake as a client certificate
  2. TLS handshake has to be completed successfully
  3. SASL-EXTERNAL may succeed and then dialback cannot be stated and the connection would be established
  4. SASL-EXTERNAL may fail and then dialback will kick in
  5. Dialback may succeed and connection will be authenticated or it may fail (most likely) and the connection would be terminated
  6. Dialback may be stuck at connecting if server decided not to answer (some timeout?) and we should consider test passed.

Note: We are testing TLS here (mostly 1.3 but just any TLS) and not full S2S connectivity (as we are testing it only 1-way), due to that I've focused tests on establishing TLS connection and verifying that data is exchanged over a secure connection properly (ie. we are sending stream-open, server responds with features, we are sending SASL EXTERNAL/DIALBACK, the server responds, etc.).

While working on that, I've found out that SASL-EXTERNAL is also tried if the local certificate is not trusted (ie. self-signed or expired). I've added a check, which will skip SASL-EXTERNAL if the local certificate is not valid. That should improve connectivity time (reduced number of roundtrips). Also, some servers may close connection if SASL-EXTERNAL fails, without trying dialback (that is allowed by the XEP).

With that in place, I was able to make S2S tests pass without any issues.

I've also encountered another issue with SaslExternal.canAddSaslToFeatures() method which based on checking LOCAL_CERT_CHECK_RESULT that was not set for outgoing connections. I've changed that, so it is now check for outgoing and for incoming connections.

All the tests and my installation used for testing were running on AdoptOpenJDK 11.0.10 (lastest stable LTS).

All changes are checked in tls13 branch of tigase-server repository. I would suggest reviewing my changes and checking if my installation (hi-low.eu) has S2S running correctly and if so, then we could merge that with master branch and deploy it on tigase.org for testing.

@wojtek You have credentials to the account on my installation, so you can use it for checking if S2S works as expected. If needed IBR is enabled as well, so you can create an account as well. If something will be failing for some specific domain I'm able to check it later on with a debugger connected to this installation which would be better than extending logs and restarting our clustered installations.

wojciech.kapcia@tigase.net commented 4 years ago

Thank you for your analysis. I checked the changes and they seem to look ok. I'd like to run some tests after migration to EB and then we could merge it.

wojciech.kapcia@tigase.net commented 4 years ago

I merged the branch. There were some issues due to reverting of the previous changes. I think I got everything.

Tomorrow I'll update our tigase.org instances so we'll be able to verify it.

wojciech.kapcia@tigase.net commented 3 years ago

It seems to be working now.

issue 1 of 1
Type
Bug
Priority
Normal
Assignee
Version
tigase-server-8.2.0
Spent time
26h
Issue Votes (0)
Watchers (0)
Reference
tigase/_server/server-core#1101
Please wait...
Page is in error, reload to recover