A few months ago, we started occasionally getting 524s[1] from our origin. Not enough to be a major concern, but nonetheless we had no explanation for what caused them, which piqued our curiosity.

High-level architecture diagram

```mermaid
flowchart TD
    A[Client] -->|HTTP Request| B(Cloudflare Reverse Proxy)
    B --> C[AWS NLB]
    C --> D[Istio Ingress Gateway]
    D --> E[Service]
```

What we understood was happening

```mermaid
sequenceDiagram
    Client->>+CloudflareReverseProxy: HTTP Request
    CloudflareReverseProxy->>+IstioIngressGateway: HTTP Request
    IstioIngressGateway->>+CloudflareReverseProxy: Origin Timeout (524)
```

What we figured out after digging

We spoke to AWS support, and they looked at our VPC flow logs. They informed us that there were rejected TCP connections between the nodes used for our Istio IngressGateways and our application servers.
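
If you want to look for these yourself, something like the following sketch pulls rejected TCP flows out of VPC flow logs, assuming the flow logs are delivered to CloudWatch Logs in the default format. The log group name and the CIDR ranges are placeholders, not our real values.

```python
# Hypothetical sketch: querying VPC flow logs in CloudWatch Logs Insights for
# rejected TCP connections between ingress nodes and worker nodes. The log
# group name and CIDR ranges are placeholders.
import time
import boto3

logs = boto3.client("logs")

query = """
fields @timestamp, srcAddr, dstAddr, dstPort, action
| filter action = "REJECT" and protocol = 6
| filter srcAddr like /^10\\.0\\.1\\./ and dstAddr like /^10\\.0\\.2\\./
| sort @timestamp desc
| limit 50
"""

start = logs.start_query(
    logGroupName="/vpc/flow-logs",      # placeholder log group
    startTime=int(time.time()) - 3600,  # last hour
    endTime=int(time.time()),
    queryString=query,
)

# Poll until the query completes, then print the rejected flows.
while True:
    result = logs.get_query_results(queryId=start["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result.get("results", []):
    print({field["field"]: field["value"] for field in row})
```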

With that information, we looked at more traces, which indicated that requests were making their way to the application pods. After a bit of searching through the traces, we had a better understanding of the problem.

```mermaid
sequenceDiagram
    participant Client
    participant CloudflareReverseProxy
    box Purple Ingress Nodes
        participant IstioIngressGateway
    end
    box Blue Worker Nodes
        participant ApplicationServer as ApplicationServer (Pods)
    end
    Client->>+CloudflareReverseProxy: HTTP Request
    CloudflareReverseProxy->>+IstioIngressGateway: HTTP Request
    IstioIngressGateway->>+ApplicationServer: HTTP Request
    ApplicationServer->>+IstioIngressGateway: Failed to connect
    IstioIngressGateway->>+CloudflareReverseProxy: Origin Timeout (524)
```

AWS Stateful Firewall

Here’s a quick overview of how connection tracking works with EC2 security groups.

The Worker nodes' security group allowed connections from the Ingress nodes, but not vice versa. When an instance initiates a request, the response traffic for that connection is allowed back in even without a matching inbound rule, because the connection is tracked. Therefore, responses from the Worker nodes should be able to flow back to the Ingress nodes.
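
To make that one-way relationship concrete, here's a rough boto3 sketch of the kind of setup we're describing; the group IDs and port are placeholders rather than our actual configuration.

```python
# Hypothetical sketch of the one-way security-group relationship described
# above. Group IDs and the port are placeholders.
import boto3

ec2 = boto3.client("ec2")

WORKER_SG = "sg-0aaa111worker"    # Worker nodes' security group (placeholder)
INGRESS_SG = "sg-0bbb222ingress"  # Ingress nodes' security group (placeholder)

# Worker nodes accept TCP from the Ingress nodes' security group...
ec2.authorize_security_group_ingress(
    GroupId=WORKER_SG,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 8080,  # example application port
            "ToPort": 8080,
            "UserIdGroupPairs": [{"GroupId": INGRESS_SG}],
        }
    ],
)

# ...but there is no equivalent rule on the Ingress nodes' group allowing
# traffic from the Workers. Responses only make it back because the security
# group tracks the connection the Ingress node initiated.
```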

However, this connection tracking state doesn't last forever. The TCP timeout for connection tracking defaults to 5 days[2] for most EC2 instance types.

Finding the culprit

If you’re thinking that exceeding the 5-day connection tracking timeout is highly unlikely, then you’re right!

We discovered that this issue was only happening on a specific EC2 instance type, the C8gn. This new instance type had a 5-minute timeout[3] on its stateful firewall. This meant that after 5 minutes of idling, Istio's reused TCP connections would no longer be tracked in the Worker node's stateful firewall, causing connection failures. Istio probably wasn't sending keepalives often enough, which would have kept the connections tracked in the firewall.
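
The usual mitigation is to make sure keepalives fire well inside the tracking timeout (Istio exposes TCP keepalive settings in its DestinationRule connection pool, for example). As a rough illustration of the idea at the socket level on Linux, and not Istio's actual implementation, here's what that looks like; the intervals are arbitrary examples, and the backend address is a placeholder.

```python
# Generic illustration (not Istio's implementation): enabling TCP keepalives on
# a Linux socket so an idle connection sends probes before a 5-minute firewall
# tracking timeout expires. The intervals are arbitrary examples.
import socket

def open_keepalive_connection(host: str, port: int) -> socket.socket:
    sock = socket.create_connection((host, port))

    # Turn keepalives on for this connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Linux-specific tuning: start probing after 2 minutes of idleness,
    # probe every 30 seconds, and give up after 5 failed probes.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 120)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)

    return sock

# Example usage with a placeholder backend address.
# conn = open_keepalive_connection("app.internal", 8080)
```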

Why did we start seeing this issue so often out of the blue?

Because AWS had just released this new instance type, and our Karpenter config was set to use any instance in the compute-optimized family. So Karpenter happily started requesting these new node types.

As my co-worker suggested, this issue is probably hitting other AWS customers. Why AWS decided to switch from a 5-day to a 5-minute timeout without informing customers is another mystery.


  1. Our origins weren’t actually returning 524 HTTP responses. A 524 just means that the Cloudflare proxy timed out waiting for an HTTP response.

  2. This was confirmed by AWS support. It’s not exactly clear from the documentation.

  3. AWS support told us that this detail wasn’t included in any public documentation.