Mysterious python ImportError in Kubernetes workload

August 27, 2022 5 minutes read

python • kubernetes • security

Simple Python errors can be harder to debug when buried beneath many layers of abstraction and tooling.

This is a story about how I ran into a head-scratching deployment bug, related to a Python import error. Troubleshooting turned out to be more challenging than expected even though I consider myself quite proficient in Python.

I created a katacoda style scenario where you can solve this problem on your own. The source repo can be found here.

Table of Contents

Setting the scene

I was deploying a Python docker image to a GKE cluster. The application code looked something like this:

Note this is a simplified version of the code to reproduce the error.

from fastapi import FastAPI

app = FastAPI()

@app.get("/")
async def root():
    return {"message": "Hello World"}

The Dockerfile:

FROM python:3.10

WORKDIR /app
RUN useradd --create-home appuser
USER appuser

COPY requirements.txt ./

RUN pip install -r requirements.txt

COPY . .

And the Kubernetes manifest looked something like:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: ghcr.io/devonhk/missing-python-import:fastapi
          command:
            - python
            - -m
            - uvicorn
            - main:app
            - --reload

After my CD pipeline applied the manifest, I noticed a Python import error in the logs.

$ kubectl logs -f deployment/app
/usr/local/bin/python: No module named uvicorn

My first thought was to run the container locally to reproduce the error. It should be straightforward as I’m using the same docker image as my deployment environment, right?

docker run --rm -it ghcr.io/devonhk/missing-python-import:fastapi python -m uvicorn main:app --reload
INFO:     Will watch for changes in these directories: ['/app']
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO:     Started reloader process [1] using WatchFiles
INFO:     Started server process [8]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Hmm, no error? That doesn’t make sense as I should be using an identical copy of the image used in Kubernetes.

Let’s try to explore a bit more.

docker run --rm -it ghcr.io/devonhk/missing-python-import:fastapi 
Python 3.10.6 (main, Aug  3 2022, 10:24:07) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import uvicorn
>>>

Still no error as I can clearly import the module locally…

Bewildered, I compare the sha256 checksums of my local image with the deployment image from the CI logs to find a mismatch.

Everything checks out. The images are identical.

Since I can’t repro the error locally, I need a way to debug my workload running in K8s…

I go for my trusty poor man’s breakpoint.

I edit the podspec with a tail -f /dev/null which acts as a breakpoint so that I can get a shell into the container.

Here’s the patch file.

spec:
  template:
    spec:
      containers:
        - name: app
          image: ghcr.io/devonhk/missing-python-import:fastapi
          command:
            - tail
            - -f
            - /dev/null

Apply the patch.

kubectl patch deployment app --patch-file patch.yaml

After applying the patch I exec into the container.

kubectl exec -it deployment/app -c app bash
I have no name!@app-74dfbdfbcc-czjdr:/app$

Well, that’s clearly the wrong user.

Let’s try importing the module again.

I have no name!@app-74dfbdfbcc-czjdr:/app$ python
Python 3.10.6 (main, Aug  3 2022, 10:24:07) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import uvicorn
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'uvicorn'

Now we’re getting somewhere.

When debugging an ImportError it’s always a good idea to look at your sys.path

>>> from pprint import pprint; import sys
>>> pprint(sys.path)
['',
 '/usr/local/lib/python310.zip',
 '/usr/local/lib/python3.10',
 '/usr/local/lib/python3.10/lib-dynload',
 '/usr/local/lib/python3.10/site-packages']
>>>

Now let’s compare it with my local container:

 docker run --rm -it ghcr.io/devonhk/missing-python-import:fastapi python -c 'import sys, pprint; pprint.pprint(sys.path)'
['',
 '/usr/local/lib/python310.zip',
 '/usr/local/lib/python3.10',
 '/usr/local/lib/python3.10/lib-dynload',
 '/home/appuser/.local/lib/python3.10/site-packages',
 '/usr/local/lib/python3.10/site-packages']

The path to the user I defined in my Dockerfile is missing, but why? '/home/appuser/.local/lib/python3.10/site-packages'

Upon closer inspection of my deployment manifest, I found something interesting.

kubectl get -o yaml deployment/app
...
securityContext:
    allowPrivilegeEscalation: false
    privileged: false
    readOnlyRootFilesystem: true
    runAsNonRoot: true
    runAsUser: 1001
...

The securityContext appeared to be overriding my user’s ID. This explained the I have no name! error as no user was mapped to ID 1001. We can also see that appuser was mapped to user 1000.

I have no name!@app-74dfbdfbcc-czjdr:/app$ grep 1001 /etc/passwd
I have no name!@app-74dfbdfbcc-czjdr:/app$ grep 1000 /etc/passwd
appuser:x:1000:1000::/home/appuser:/bin/sh
I have no name!@app-74dfbdfbcc-czjdr:/app$

But wait... The securityContext wasn't part of the original deployment manifest. So where did it come from?

This securityContext wasn’t part of my manifest in version control. So the only way it could have ended up there is if something or someone mutated the workload.

I searched for mutating webhooks and discovered the cluster had one to enforce certain security rules for new app workloads.

The workaround, for now, was to have the webhook use runAsUser: 1000. By default, if you create a new user in a docker image, it should be id 1000.

Why wasn’t this error detected earlier?

The majority of our workloads at the time were Go programs. Meaning they generally aren’t importing libraries at runtime. So this bug went under the radar for quite some time.

Conclusion

This error is interesting as it’s something that could have only really happened in dynamically typed languages like Python. This kind of error couldn’t occur in Golang for example, as the imports are resolved at build type rather than runtime.

Takeaways

Don’t blindly apply securityContexts because a blog post recommended them as best practice. Take the time to understand what they’re actually doing. This all ties into Linux fundamentals which is a skill that is constantly neglected in the cloud native/devops landscape.