Incorrect usage of Helm hooks combined with a controller deadlock caused our Kubernetes secrets to vanish. The underlying issue had existed for a long time, but normal deployment behavior masked it.
The Infra
We deployed our workloads to EKS with Argo CD and used external-secrets to fetch our secrets.
How not to use Helm hooks
Do you see anything wrong with the ExternalSecret below?
apiVersion: kubernetes-client.io/v1
kind: ExternalSecret
metadata:
  name: aws-secretsmanager
  annotations:
    "helm.sh/hook": pre-upgrade
spec:
  backendType: secretsManager
  roleArn: arn:aws:iam::123412341234:role/let-other-account-access-secrets
  region: us-east-1
  data:
    - key: demo-service/credentials
      name: password
      property: password
This table describes how Helm hooks are mapped to their Argo CD counterparts. The mapping that matters here is pre-upgrade, which Argo CD treats as a PreSync hook.
What happens to our secrets during a deployment
pre-sync hook phase
sync phase
This creates a brief window during every deployment where no Kubernetes Secret exists. Since the secret was normally recreated very quickly, we never noticed this issue.
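To make the mechanism concrete, here is a sketch of roughly how Argo CD sees the ExternalSecret above once the Helm hook annotation is translated. The PreSync mapping follows Argo CD's Helm hook mapping. The delete policy line is an assumption on our part: as far as we know, Argo CD applies BeforeHookCreation when no policy is set, and it is spelled out here only to show why the resource gets deleted and recreated on every sync.

```yaml
apiVersion: kubernetes-client.io/v1
kind: ExternalSecret
metadata:
  name: aws-secretsmanager
  annotations:
    # Argo CD counterpart of "helm.sh/hook": pre-upgrade
    argocd.argoproj.io/hook: PreSync
    # Assumption: the policy applied when none is specified. The previous
    # hook resource is deleted before the new one is created.
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  backendType: secretsManager
  roleArn: arn:aws:iam::123412341234:role/let-other-account-access-secrets
  region: us-east-1
  data:
    - key: demo-service/credentials
      name: password
      property: password
```

Assuming the generated Kubernetes Secret is owned by the ExternalSecret, deleting the ExternalSecret also removes the Secret, which would explain the window until the controller reconciles the new object.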
ExternalSecrets deadlock
We were in the process of migrating to external-secrets-operator, since external-secrets was
no longer maintained. However, we had many secrets across many projects to migrate, so we still had a few
legacy ExternalSecret resources scattered around.
During a deadlock, the controller would stop doing any work.
As a result, newly created ExternalSecret objects were never reconciled into Kubernetes Secrets.
We eventually added a liveness probe and a PromQL alert to detect the event.
sync phase deadlock
Since the controller was stuck, the underlying secret was never recreated.
New pods would end up in a CreateContainerConfigError state due to missing secrets, while old pods would keep running until their node (usually a spot instance) was reclaimed.
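The symptom itself can also be alerted on, independently of the controller's own health. The rule below is a minimal sketch and not the alert we actually deployed: it assumes prometheus-operator (for the PrometheusRule resource) and kube-state-metrics (for the kube_pod_container_status_waiting_reason metric) are installed, and the rule name, alert name, duration, and severity are illustrative.

```yaml
# Assumes prometheus-operator and kube-state-metrics are running in the cluster.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: missing-secret-symptoms  # hypothetical name
spec:
  groups:
    - name: secrets
      rules:
        - alert: PodStuckOnContainerConfig  # hypothetical alert name
          # Fires for any pod with a container waiting on a config error,
          # which is what a missing Secret reference produces.
          expr: |
            sum by (namespace, pod) (
              kube_pod_container_status_waiting_reason{reason="CreateContainerConfigError"}
            ) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} cannot start, possibly due to a missing Secret"
```

Because old pods keep running, an alert like this only fires once something tries to start a new pod, so it complements a check on the controller rather than replacing it.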
Conclusion
The deadlock revealed that several legacy projects were incorrectly using Helm hooks. We fixed this by updating them to use Argo CD sync waves. Remember: don’t use hooks on long-lived resources.
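As an illustration of the fix, this is roughly what the earlier ExternalSecret looks like with a sync wave instead of a hook. The wave value is an assumption chosen for the example and not taken from our charts. Any value lower than the wave of the workloads that consume the secret should give the same ordering.

```yaml
apiVersion: kubernetes-client.io/v1
kind: ExternalSecret
metadata:
  name: aws-secretsmanager
  annotations:
    # Replaces "helm.sh/hook": pre-upgrade. Lower waves are applied first;
    # "-1" is an illustrative value (resources default to wave 0).
    argocd.argoproj.io/sync-wave: "-1"
spec:
  backendType: secretsManager
  roleArn: arn:aws:iam::123412341234:role/let-other-account-access-secrets
  region: us-east-1
  data:
    - key: demo-service/credentials
      name: password
      property: password
```

With a sync wave, the ExternalSecret is a regular managed resource: it is still applied before the workloads that need it, but it is updated in place rather than deleted and recreated on each sync.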