Incorrect usage of Helm hooks combined with a controller deadlock caused our Kubernetes secrets to vanish. The underlying issue had existed for a long time, but normal deployment behavior masked it.

The Infra

We deployed our workloads to EKS with Argo CD and used external-secrets to fetch our secrets.

How not to use Helm hooks

Do you see anything wrong[1] with the ExternalSecret below?

apiVersion: kubernetes-client.io/v1
kind: ExternalSecret
metadata:
  name: aws-secretsmanager
  annotations:
    "helm.sh/hook": pre-upgrade
spec:
  backendType: secretsManager
  roleArn: arn:aws:iam::123412341234:role/let-other-account-access-secrets
  region: us-east-1
  data:
    - key: demo-service/credentials
      name: password
      property: password

This table describes how Helm hooks are mapped to their Argo CD counterparts. The row that matters here: Argo CD treats Helm's pre-upgrade hook as a PreSync hook.
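To make the mapping concrete, here is a sketch of how the metadata above looks from Argo CD's side. It assumes no `helm.sh/hook-delete-policy` is set, so that Argo CD's default `BeforeHookCreation` policy applies. The Argo CD annotations appear only as comments for illustration; they are not part of the chart.

```yaml
metadata:
  name: aws-secretsmanager
  annotations:
    # What the chart actually declares:
    "helm.sh/hook": pre-upgrade
    # Roughly how Argo CD interprets it (illustrative only, not in the chart):
    # argocd.argoproj.io/hook: PreSync
    # argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
```

Under that policy, the previous hook resource is deleted before a new one is created, which is why a long-lived resource marked as a hook gets torn down on every deployment.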

What happens to our secrets during a deployment

pre-sync hook phase

flowchart LR
  A[Argo CD] -->|Delete| B[ExternalSecret]
  B --> D
  D[Garbage-Collector] -->|Delete| C[Secret]

sync phase

flowchart LR
  A[Argo CD] -->|Create| B[ExternalSecret]
  B -->|Create| C[Secret]

This creates a brief window during every deployment where no Kubernetes Secret exists. Since the secret was normally recreated very quickly, we never noticed this issue.

ExternalSecrets deadlock

We were in the process of migrating to external-secrets-operator, since external-secrets was no longer maintained. However, we had many secrets across many projects to migrate, so we still had a few legacy ExternalSecret resources scattered around.

During a deadlock, the external-secrets controller stopped doing any work. As a result, newly created ExternalSecret objects were never reconciled into Kubernetes Secrets. We eventually added a liveness probe and a PromQL alert to detect these deadlocks.

sync phase deadlock

flowchart LR
  A[Argo CD] -->|Create| B[ExternalSecret]
  B x--x Secret

Since the controller was stuck, the underlying secret was never recreated. New pods ended up in a CreateContainerConfigError state because the secret was missing, while old pods kept running until their node (usually a spot instance) was reclaimed.
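The symptom itself can also be alerted on, independently of the controller. Below is a minimal sketch of a Prometheus rule file, assuming kube-state-metrics is scraped by Prometheus. The group name, alert name, `for` duration, and severity label are hypothetical choices, not taken from our setup.

```yaml
groups:
  - name: missing-secrets                # hypothetical group name
    rules:
      - alert: PodStuckOnMissingConfig   # hypothetical alert name
        # kube-state-metrics sets this series to 1 while a container is
        # waiting with the given reason.
        expr: |
          sum by (namespace, pod) (
            kube_pod_container_status_waiting_reason{reason="CreateContainerConfigError"}
          ) > 0
        for: 10m                         # assumed threshold; tune to taste
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} cannot start; a referenced Secret or ConfigMap may be missing"
```

CreateContainerConfigError is not specific to Secrets (a missing ConfigMap triggers it too), so this catches the effect rather than the cause.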

Conclusion

The deadlock revealed that several legacy projects were incorrectly using Helm hooks. We fixed this by updating them to use Argo CD sync waves. Remember: don’t use hooks on long-lived resources.
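As a sketch of what that change can look like, here is the earlier ExternalSecret with the Helm hook annotation replaced by an Argo CD sync-wave annotation. The wave value `"-1"` is an assumption chosen for illustration; any wave lower than that of the workloads consuming the secret would serve the same purpose.

```yaml
apiVersion: kubernetes-client.io/v1
kind: ExternalSecret
metadata:
  name: aws-secretsmanager
  annotations:
    # Replaces "helm.sh/hook": pre-upgrade. The resource is now a regular,
    # long-lived object that is applied early in the sync instead of being
    # deleted and recreated as a hook. The wave number is an assumed value.
    argocd.argoproj.io/sync-wave: "-1"
spec:
  backendType: secretsManager
  roleArn: arn:aws:iam::123412341234:role/let-other-account-access-secrets
  region: us-east-1
  data:
    - key: demo-service/credentials
      name: password
      property: password
```

Because the resource is no longer a hook, a stalled controller can at worst delay updates to the Secret; the existing Secret is not deleted on every deployment.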


  1. Ignore the fact that this resource maps to an archived project.