AWS CNAME DNS bug
Not too long ago I ran into a weird issue where an EC2 instance would fail to SSH into another one.
The problem
I had an application running in a Docker container on Host Master that would attempt to SSH into Host worker. The SSH connection attempts would eventually time out every time the app ran.
Service architecture
Initial thoughts
The first thing I did was check the security groups used for the worker hosts. After confirming that the worker allowed SSH connections coming from the master, I tried live-debugging the application. I got a shell into the container running on Host Master and attempted to SSH into Host worker using the same library as the app. The first attempt succeeded, but subsequent attempts would just time out.
Troubleshooting
- I confirmed that the security group rules of Host worker allowed SSH connections from Host Master.
- I ensured that the container in Host Master had access to the identity file of the host OS (mounted with -v /home/ec2-user/.ssh:/root/.ssh).
- I manually tried SSHing from the container in Host Master to Host worker to confirm that everything was working.
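The manual check in the last bullet looked roughly like this; the container name and the worker hostname below are placeholders, not details from the original setup:

```shell
# On Host Master: open a shell in the app's container (name is a placeholder).
docker exec -it app sh

# Inside the container: attempt the same connection the app makes, verbosely.
# The key path matches the -v /home/ec2-user/.ssh:/root/.ssh mount above.
ssh -v -i /root/.ssh/id_rsa ec2-user@<worker-public-hostname>
```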
What the service was doing
The service would spawn EC2 instances, wait for them to start up, then SSH into them and deploy a docker-compose file.
Once the EC2 instance was created, it would use the AWS SDK to call the servicediscovery create-service and register-instance actions to associate a CNAME record with the instance.
Doing this caused the public hostname of said instance to resolve to a private IP address when querying AWS’ nameservers.
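The two Cloud Map calls above look roughly like this with the AWS CLI. Every ID, name, and hostname below is an illustrative placeholder; note that Cloud Map only supports CNAME records with the WEIGHTED routing policy, and the instance's hostname goes into the AWS_INSTANCE_CNAME attribute:

```shell
# Create a Cloud Map service that publishes a CNAME record (IDs are placeholders).
aws servicediscovery create-service \
  --name worker \
  --namespace-id ns-xxxxxxxxxxxxxxxx \
  --dns-config "NamespaceId=ns-xxxxxxxxxxxxxxxx,RoutingPolicy=WEIGHTED,DnsRecords=[{Type=CNAME,TTL=60}]"

# Point the service's CNAME at the instance's public hostname.
aws servicediscovery register-instance \
  --service-id srv-xxxxxxxxxxxxxxxx \
  --instance-id i-0123456789abcdef0 \
  --attributes AWS_INSTANCE_CNAME=ec2-203-0-113-10.compute-1.amazonaws.com
```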
Repeatedly running ssh -v from the container showed that the public hostname would resolve to a private IP address after the register-instance API was called.
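One way to confirm the resolution flip without ssh is to query the name directly and classify the answer. The helper below is a crude, IPv4-only sketch of an RFC 1918 range check; the dig line is commented out because it needs the real hostname:

```shell
# is_private: crude RFC 1918 check (sketch; IPv4 private ranges only).
is_private() {
  case "$1" in
    10.*|192.168.*|172.1[6-9].*|172.2[0-9].*|172.3[0-1].*) return 0 ;;
    *) return 1 ;;
  esac
}

# From inside the container, resolve the public hostname and classify it:
# ip="$(dig +short <instance-public-hostname> | head -n1)"
# is_private "$ip" && echo "resolves to a private address"
```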
Solution
I updated the app to use the instance's public IP address instead of its public hostname. I'm in the process of contacting AWS support to figure out whether this is expected behavior from register-instance or not.
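A hedged sketch of fetching the public IP instead of the hostname, either from the instance itself via the instance metadata service or from the deploying service via the same SDK it already uses; the instance ID is a placeholder:

```shell
# From the instance itself: get an IMDSv2 token, then the public IPv4 address.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/public-ipv4

# Or from the deploying service, via the EC2 API:
aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[0].Instances[0].PublicIpAddress' \
  --output text
```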