Mysterious python ImportError in Kubernetes workload
Simple Python errors can be harder to debug when buried beneath many layers of abstraction and tooling.
This is a story about how I ran into a head-scratching deployment bug, related to a Python import error. Troubleshooting turned out to be more challenging than expected even though I consider myself quite proficient in Python.
I created a katacoda style scenario where you can solve this problem on your own. The source repo can be found here.
Table of Contents
Setting the scene
I was deploying a Python docker image to a GKE cluster. The application code looked something like this:
from fastapi import FastAPI
app = FastAPI()
@app.get("/")
async def root():
return {"message": "Hello World"}
The Dockerfile:
FROM python:3.10
WORKDIR /app
RUN useradd --create-home appuser
USER appuser
COPY requirements.txt ./
RUN pip install -r requirements.txt
COPY . .
And the Kubernetes manifest looked something like:
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
spec:
replicas: 1
selector:
matchLabels:
app: my-app
template:
metadata:
labels:
app: my-app
spec:
containers:
- name: app
image: ghcr.io/devonhk/missing-python-import:fastapi
command:
- python
- -m
- uvicorn
- main:app
- --reload
After my CD pipeline applied the manifest, I noticed a Python import error in the logs.
$ kubectl logs -f deployment/app
/usr/local/bin/python: No module named uvicorn
My first thought was to run the container locally to reproduce the error. It should be straightforward as I’m using the same docker image as my deployment environment, right?
docker run --rm -it ghcr.io/devonhk/missing-python-import:fastapi python -m uvicorn main:app --reload
INFO: Will watch for changes in these directories: ['/app']
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO: Started reloader process [1] using WatchFiles
INFO: Started server process [8]
INFO: Waiting for application startup.
INFO: Application startup complete.
Hmm, no error? That doesn’t make sense as I should be using an identical copy of the image used in Kubernetes.
Let’s try to explore a bit more.
docker run --rm -it ghcr.io/devonhk/missing-python-import:fastapi
Python 3.10.6 (main, Aug 3 2022, 10:24:07) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import uvicorn
>>>
Still no error as I can clearly import the module locally…
Bewildered, I compare the sha256 checksums of my local image with the deployment image from the CI logs to find a mismatch.
Everything checks out. The images are identical.
Since I can’t repro the error locally, I need a way to debug my workload running in K8s…
I go for my trusty poor man’s breakpoint.
I edit the podspec with a tail -f /dev/null
which acts as a breakpoint so that I can get a shell into the container.
Here’s the patch file.
spec:
template:
spec:
containers:
- name: app
image: ghcr.io/devonhk/missing-python-import:fastapi
command:
- tail
- -f
- /dev/null
Apply the patch.
kubectl patch deployment app --patch-file patch.yaml
After applying the patch I exec into the container.
kubectl exec -it deployment/app -c app bash
I have no name!@app-74dfbdfbcc-czjdr:/app$
Well, that’s clearly the wrong user.
Let’s try importing the module again.
I have no name!@app-74dfbdfbcc-czjdr:/app$ python
Python 3.10.6 (main, Aug 3 2022, 10:24:07) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import uvicorn
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'uvicorn'
Now we’re getting somewhere.
When debugging an ImportError
it’s always a good idea to look at your sys.path
>>> from pprint import pprint; import sys
>>> pprint(sys.path)
['',
'/usr/local/lib/python310.zip',
'/usr/local/lib/python3.10',
'/usr/local/lib/python3.10/lib-dynload',
'/usr/local/lib/python3.10/site-packages']
>>>
Now let’s compare it with my local container:
docker run --rm -it ghcr.io/devonhk/missing-python-import:fastapi python -c 'import sys, pprint; pprint.pprint(sys.path)'
['',
'/usr/local/lib/python310.zip',
'/usr/local/lib/python3.10',
'/usr/local/lib/python3.10/lib-dynload',
'/home/appuser/.local/lib/python3.10/site-packages',
'/usr/local/lib/python3.10/site-packages']
The path to the user I defined in my Dockerfile is missing, but why? '/home/appuser/.local/lib/python3.10/site-packages'
Upon closer inspection of my deployment manifest, I found something interesting.
kubectl get -o yaml deployment/app
...
securityContext:
allowPrivilegeEscalation: false
privileged: false
readOnlyRootFilesystem: true
runAsNonRoot: true
runAsUser: 1001
...
The securityContext appeared to be overriding my user’s ID.
This explained the I have no name!
error as no user was mapped to ID 1001
. We can also see that appuser
was mapped to user 1000.
I have no name!@app-74dfbdfbcc-czjdr:/app$ grep 1001 /etc/passwd
I have no name!@app-74dfbdfbcc-czjdr:/app$ grep 1000 /etc/passwd
appuser:x:1000:1000::/home/appuser:/bin/sh
I have no name!@app-74dfbdfbcc-czjdr:/app$
This securityContext wasn’t part of my manifest in version control. So the only way it could have ended up there is if something or someone mutated the workload.
I searched for mutating webhooks and discovered the cluster had one to enforce certain security rules for new app workloads.
The workaround, for now, was to have the webhook use runAsUser: 1000
. By default, if you create a new user in a docker image, it should
be id 1000.
Why wasn’t this error detected earlier?
The majority of our workloads at the time were Go programs. Meaning they generally aren’t importing libraries at runtime. So this bug went under the radar for quite some time.
Conclusion
This error is interesting as it’s something that could have only really happened in dynamically typed languages like Python. This kind of error couldn’t occur in Golang for example, as the imports are resolved at build type rather than runtime.
Takeaways
Don’t blindly apply securityContexts because a blog post recommended them as best practice. Take the time to understand what they’re actually doing. This all ties into Linux fundamentals which is a skill that is constantly neglected in the cloud native/devops landscape.