ALB ECS service healthcheck failure

I ran into a issue recently where my AWS ALB health checks kept repeatedly failing on an ECS service.

The problem

I had a nodejs express app running as an ECS service. It appeared to be running fine, but after a couple minutes of being up the ALB would kill it because of repeated health check failures. When the task would eventually be stopped it always displayed the same failure reason.

Task failed ELB health checks in (target-group arn:aws:elasticloadbalancing:us-east-1:746600057361:targetgroup-api-alb-target-group/4b2972d507aa4edb)

I assumed this meant the task’s health check was returning a non 200 response, but in the end that was not the case.

Simplified Service architecture

diagram

Investigation

I looked at the ALB metrics for clues, but I couldn’t see any errors that would indicate a failed health check (i.e 5XX/4XX responses). Eventually I noticed there weren’t any 2XX reponses in the metrics either. That’s when I decided to check the container instance’s security groups (EC2 instance running the ECS service) and noticed that I hadn’t created any rules to allow the ALB to connect to the instance.

I assumed that the ALB health check would be smart enough to tell me that it was failing because it couldn’t even establish a connection to the target host. Apparently I was wrong in assuming that it would catch simple configuration mistakes.