Fix race condition in AWS CLI cache creation during parallel KubernetesPodOperator auth (#60943) by Vamsi-klu · Pull Request #61935 · apache/airflow

Vamsi-klu · 2026-02-15T04:01:08Z

Summary

Pre-create ~/.aws/cli/cache directory in KubernetesHook.get_conn() to prevent a FileExistsError race condition when multiple KPO tasks authenticate via aws eks get-token concurrently on the same Celery worker
Older botocore versions (<1.40.2) call os.makedirs() without exist_ok=True, causing intermittent task failures before pod creation

Root Cause

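When parallel KubernetesPodOperator tasks invoke exec-based EKS authentication on the same worker, the concurrent aws eks get-token processes race to create ~/.aws/cli/cache. The losing process gets FileExistsError (errno 17) and fails to produce a token. The Kubernetes client's ExecProvider swallows that subprocess error and proceeds with a bad token, so the failure surfaces as a 403 Forbidden from the Kubernetes API, and the task fails before the pod is even created.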

The race is fixed upstream in botocore 1.40.2, but this defensive fix protects users on older versions.
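The failure mode can be reproduced deterministically without any AWS tooling. The sketch below is an illustration, not botocore's code: it assumes only what the description states, namely that the directory is created with os.makedirs() and no exist_ok=True. The helper name `makedirs_without_exist_ok` and the temporary home directory are hypothetical.

```python
import errno
import os
import tempfile
from typing import Optional


def makedirs_without_exist_ok(cache_dir: str) -> Optional[int]:
    """Create cache_dir the unsafe way; return the errno on failure, else None."""
    try:
        os.makedirs(cache_dir)  # no exist_ok=True, so an existing directory is an error
    except FileExistsError as exc:
        return exc.errno
    return None


# Use a throwaway directory instead of the real home directory.
with tempfile.TemporaryDirectory() as fake_home:
    cache_dir = os.path.join(fake_home, ".aws", "cli", "cache")
    # The "winning" process creates the directory first.
    winner_errno = makedirs_without_exist_ok(cache_dir)
    # The "losing" process arrives a moment later and finds it already there.
    loser_errno = makedirs_without_exist_ok(cache_dir)

print(winner_errno, loser_errno == errno.EEXIST)
```

In the real race the two calls come from separate processes, so the failure is intermittent rather than guaranteed; the sequential version above only shows which call raises and with which errno.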

Why this approach (and not something else)

We considered several alternatives before landing on defensive directory pre-creation:

| Approach | Why we rejected it |
| --- | --- |
| Retry on 403 in `generic_api_retry` | 403 is normally a permanent permissions error. Adding it to `TRANSIENT_STATUS_CODES` would mask real auth failures and add retry latency to every legitimate 403. Distinguishing transient exec-auth 403s from real permission denials is not reliably possible: the Kubernetes client's ExecProvider silently swallows the subprocess error and proceeds with a bad token, so the 403 looks identical to a genuine RBAC denial. |
| `threading.Lock` around config loading | The exec plugin (`aws eks get-token`) runs lazily during the first API call, not during `config.load_kube_config()`. A lock around config loading wouldn't prevent the race. Locking around every API call would serialize all K8s operations, which is unacceptable for performance. |
| Parse kubeconfig to detect exec-based auth | Over-engineered for a one-line fix. Would add complexity, fragile YAML parsing, and still need per-tool knowledge of which cache dirs to create. |
| Pin `botocore >= 1.40.2` as a dependency | The Kubernetes provider has no direct dependency on botocore and shouldn't. AWS is just one of many possible exec-based auth backends. |
| Documentation-only (recommend botocore upgrade) | Doesn't help users who can't control their botocore version (e.g., managed Airflow platforms like Astronomer). |

Why pre-creation wins:

It's a single os.makedirs(..., exist_ok=True) call — the exact same fix botocore 1.40.2 applied, just done earlier in the call chain
exist_ok=True is inherently safe for concurrent invocations — no race between our pre-creation and the AWS CLI
Negligible performance overhead (a cheap, idempotent filesystem call)
Zero risk of masking real errors — we don't change retry behavior or error handling
Protects all users regardless of their botocore version
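The concurrency-safety claim for exist_ok=True can be checked with a small experiment. This is a sketch under stated assumptions: threads stand in for the separate worker processes, and the function name `precreate_concurrently`, the worker count, and the temporary home directory are all hypothetical.

```python
import os
import tempfile
import threading
from typing import List


def precreate_concurrently(cache_dir: str, workers: int = 8) -> List[OSError]:
    """Call os.makedirs(..., exist_ok=True) from many threads at once.

    Returns every OSError raised by any thread (expected to be empty).
    """
    errors: List[OSError] = []
    barrier = threading.Barrier(workers)  # release all threads together

    def worker() -> None:
        barrier.wait()
        try:
            os.makedirs(cache_dir, exist_ok=True)
        except OSError as exc:
            errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors


with tempfile.TemporaryDirectory() as fake_home:
    target = os.path.join(fake_home, ".aws", "cli", "cache")
    errors = precreate_concurrently(target)
    created = os.path.isdir(target)

print(errors, created)
```

With exist_ok=True, a caller that loses the race finds the directory already present and returns normally, which is why the pre-creation cannot itself introduce a new race.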

Changes

hooks/kubernetes.py: Added _ensure_exec_plugin_cache_dirs() function called from get_conn() before any kube config loading. Uses os.makedirs(..., exist_ok=True) to pre-create the cache directory.
test_kubernetes.py: 3 new test cases verifying directory creation, idempotency, and integration with get_conn().
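For readers without the diff at hand, the helper described above might look roughly like the following. Only the function name and the use of os.makedirs(..., exist_ok=True) come from the description; the body, the return value, and the `home` parameter (added here so the sketch can run against a temporary directory) are assumptions and may differ from the actual change.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def _ensure_exec_plugin_cache_dirs(home: Optional[str] = None) -> Path:
    """Pre-create the AWS CLI cache directory used by `aws eks get-token`.

    `home` is a hypothetical parameter for testing; by default the current
    user's home directory is used.
    """
    base = Path(home) if home is not None else Path.home()
    cache_dir = base / ".aws" / "cli" / "cache"
    # exist_ok=True makes this safe to call repeatedly and concurrently.
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir


with tempfile.TemporaryDirectory() as fake_home:
    first = _ensure_exec_plugin_cache_dirs(fake_home)
    second = _ensure_exec_plugin_cache_dirs(fake_home)  # idempotent second call
    exists = first.is_dir()

print(exists, first == second)
```

The two consecutive calls mirror the directory-creation and idempotency cases listed for test_kubernetes.py; the get_conn() integration case is not covered by this sketch.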

closes: #60943

Test plan

New unit tests verify directory creation, idempotency, and integration
Manual: Run parallel KPO tasks on same Celery worker with EKS auth and botocore < 1.40.2

Note to users: Upgrading to botocore >= 1.40.2 also resolves this at the source. This fix provides a safety net for environments that cannot upgrade immediately.

Vamsi-klu requested review from hussein-awala, jedcunningham and jscheffl as code owners February 15, 2026 04:01

boring-cyborg bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Feb 15, 2026

Vamsi-klu wants to merge 1 commit into apache:main from Vamsi-klu:fix/aws-cli-cache-race-condition-60943
