AWS CNAME DNS bug

Not too long ago I ran into a weird issue where an EC2 instance would fail to SSH into another one.

The problem

I had an application running in a Docker container on Host Master that would attempt to SSH into Host worker. The SSH connection attempts would eventually time out every time the app ran.
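Roughly, the app's connection attempt looked like the sketch below. The actual SSH library, hostname, and key path are stand-ins I'm assuming for illustration; the relevant part is that the connect call would hang until it hit its timeout.

```python
import paramiko

# Hypothetical reconstruction of the app's SSH step; paramiko, the hostname,
# and the key path are placeholders, not the app's actual code.
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())

try:
    client.connect(
        hostname="ec2-203-0-113-10.compute-1.amazonaws.com",  # worker's public hostname
        username="ec2-user",
        key_filename="/root/.ssh/id_rsa",
        timeout=30,  # this is where the attempts would eventually give up
    )
    stdin, stdout, stderr = client.exec_command("docker-compose up -d")
    print(stdout.read().decode())
finally:
    client.close()
```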

Service architecture

[Service architecture diagram]

Initial thoughts

The first thing I did was check the security groups used for the worker hosts. After confirming that the worker allowed SSH connections coming from Master, I tried live-debugging the application: I got a shell into the container running on Host Master and attempted to SSH into Host worker using the same library as the app. The first attempt succeeded, but subsequent attempts would just time out.

Troubleshooting

  • I confirmed that the security group rules of Host worker allowed SSH connections from Host Master (see the sketch after this list).
  • I ensured that the container on Host Master had access to the host OS's SSH identity file, which was mounted into the container (-v /home/ec2-user/.ssh:/root/.ssh).
  • I manually tried SSHing from the container on Host Master to Host worker to confirm that the connection worked.
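The security group check can be scripted with boto3; the sketch below is roughly what I mean by "confirmed the rules", with the group IDs made up:

```python
import boto3

# Placeholder IDs; the goal is just to verify that the worker's security group
# has an inbound rule for port 22 from the master's security group.
WORKER_SG = "sg-0000000000worker"
MASTER_SG = "sg-0000000000master"

ec2 = boto3.client("ec2")
sg = ec2.describe_security_groups(GroupIds=[WORKER_SG])["SecurityGroups"][0]

allows_ssh_from_master = any(
    rule.get("FromPort") == 22
    and rule.get("ToPort") == 22
    and any(pair["GroupId"] == MASTER_SG for pair in rule.get("UserIdGroupPairs", []))
    for rule in sg["IpPermissions"]
)
print("worker allows SSH from master:", allows_ssh_from_master)
```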

What the service was doing

The service would spawn EC2 instances, wait for them to start up, then SSH into them and deploy a docker-compose file. Once an instance was created, the service would use the AWS SDK to call the servicediscovery create-service and register-instance actions to associate a CNAME record with the instance. Doing this caused the instance's public hostname to resolve to a private IP address when querying AWS's nameservers.
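In boto3 terms, those two calls look roughly like the sketch below. The service name, namespace ID, instance ID, and hostname are placeholders; the point is that the service is created with a CNAME record type and the instance is registered with its public hostname as the CNAME target.

```python
import boto3

sd = boto3.client("servicediscovery")

# Create a Cloud Map service whose DNS record type is CNAME (placeholder IDs).
service = sd.create_service(
    Name="worker",
    NamespaceId="ns-exampleexampleexam",
    DnsConfig={
        "RoutingPolicy": "WEIGHTED",  # required when the record type is CNAME
        "DnsRecords": [{"Type": "CNAME", "TTL": 60}],
    },
)

# Point the CNAME at the worker's public hostname.
sd.register_instance(
    ServiceId=service["Service"]["Id"],
    InstanceId="i-0123456789abcdef0",
    Attributes={"AWS_INSTANCE_CNAME": "ec2-203-0-113-10.compute-1.amazonaws.com"},
)
```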

Repeatedly running ssh with verbose logging (ssh -v) from the container showed that the public hostname would start resolving to a private IP address after the register-instance API was called.
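The same symptom can be observed without ssh -v by resolving the hostname directly from the container before and after the register-instance call; the hostname below is a placeholder.

```python
import socket

public_hostname = "ec2-203-0-113-10.compute-1.amazonaws.com"  # placeholder

# After register-instance was called, this returned a private address
# from inside the VPC, matching what the ssh -v output showed.
print(socket.gethostbyname(public_hostname))
```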

Solution

I updated the app to use the instance's public IP address instead of its public hostname. I'm in the process of contacting AWS support to figure out whether this is expected behavior from register-instance.
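A minimal sketch of that change, assuming the app already had the instance ID and used boto3 (both assumptions on my part, with a placeholder instance ID):

```python
import boto3

ec2 = boto3.client("ec2")

# Look up the public IPv4 address and connect to that instead of the
# public DNS name.
resp = ec2.describe_instances(InstanceIds=["i-0123456789abcdef0"])
instance = resp["Reservations"][0]["Instances"][0]
public_ip = instance["PublicIpAddress"]
print(public_ip)
```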