Fix race condition in AWS CLI cache creation during parallel KubernetesPodOperator auth (#60943) #61935
Open
Vamsi-klu wants to merge 1 commit into apache:main
…esPodOperator auth (apache#60943)

Pre-create ~/.aws/cli/cache directory in KubernetesHook.get_conn() to prevent a FileExistsError race condition when multiple KPO tasks authenticate via `aws eks get-token` concurrently on the same Celery worker. Older botocore versions (<1.40.2) call os.makedirs() without exist_ok=True, causing intermittent task failures before pod creation.

closes: apache#60943
Summary
- Pre-create the `~/.aws/cli/cache` directory in `KubernetesHook.get_conn()` to prevent a `FileExistsError` race condition when multiple KPO tasks authenticate via `aws eks get-token` concurrently on the same Celery worker
- Older botocore versions (<1.40.2) call `os.makedirs()` without `exist_ok=True`, causing intermittent task failures before pod creation

Root Cause
When parallel KubernetesPodOperator tasks invoke exec-based EKS authentication on the same worker, the AWS CLI races to create `~/.aws/cli/cache`. The losing process gets `FileExistsError` (errno 17), which surfaces as a 403 Forbidden from the Kubernetes API, so the task fails before the pod is even created.

Fixed upstream in botocore 1.40.2, but this defensive fix protects users on older versions.
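The interleaving described above can be replayed deterministically in a single process. This is only an illustrative sketch: it assumes the older code path follows a check-then-create pattern (an `isdir` check followed by `os.makedirs()` without `exist_ok=True`), and `simulate_losing_process` is a hypothetical name, not part of botocore or Airflow.

```python
import errno
import os
import tempfile
from typing import Optional


def simulate_losing_process(cache_dir: str) -> Optional[int]:
    """Replay the race by hand and return the errno seen by the losing caller.

    Hypothetical helper for illustration; the check-then-create shape is an
    assumption about how the pre-1.40.2 code path behaves.
    """
    # The "loser" checks first and sees that the directory does not exist yet.
    loser_saw_dir = os.path.isdir(cache_dir)

    # The "winner" creates the directory in the gap between check and create.
    os.makedirs(cache_dir)

    try:
        if not loser_saw_dir:
            # No exist_ok=True here, so this raises once the winner has run.
            os.makedirs(cache_dir)
    except FileExistsError as exc:
        return exc.errno
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Use a throwaway directory instead of the real home directory.
    race_errno = simulate_losing_process(os.path.join(tmp, ".aws", "cli", "cache"))
```

Here `race_errno` equals `errno.EEXIST`, matching the errno 17 reported in the issue.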
Why this approach (and not something else)
We considered several alternatives before landing on defensive directory pre-creation:
- Retrying 403s via `generic_api_retry` `TRANSIENT_STATUS_CODES` would mask real auth failures and add retry latency to every legitimate 403. Distinguishing transient exec-auth 403s from real permission denials is not reliably possible: the Kubernetes client's ExecProvider silently swallows the subprocess error and proceeds with a bad token, so the 403 looks identical to a genuine RBAC denial.
- `threading.Lock` around config loading: exec-based auth (`aws eks get-token`) runs lazily during the first API call, not during `config.load_kube_config()`. A lock around config loading wouldn't prevent the race. Locking around every API call would serialize all K8s operations, which is unacceptable for performance.
- Requiring `botocore >= 1.40.2` as a dependency

Why pre-creation wins:
- An `os.makedirs(..., exist_ok=True)` call: the exact same fix botocore 1.40.2 applied, just done earlier in the call chain
- `exist_ok=True` is inherently safe for concurrent invocations: no race between our pre-creation and the AWS CLI

Changes
- `hooks/kubernetes.py`: Added `_ensure_exec_plugin_cache_dirs()` function called from `get_conn()` before any kube config loading. Uses `os.makedirs(..., exist_ok=True)` to pre-create the cache directory.
- `test_kubernetes.py`: 3 new test cases verifying directory creation, idempotency, and integration with `get_conn()`.

closes: #60943
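For reviewers who want a feel for the change without opening the diff, here is a minimal sketch of what such a pre-creation helper might look like, plus a concurrent call to show that `exist_ok=True` tolerates parallel invocations. The name `ensure_cache_dir` and its `home` parameter are hypothetical stand-ins introduced for this example; the actual `_ensure_exec_plugin_cache_dirs()` in the PR may differ in signature and error handling.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


def ensure_cache_dir(home: Optional[str] = None) -> str:
    """Create <home>/.aws/cli/cache if it is missing and return its path.

    Hypothetical stand-in for the PR's helper; `home` exists only so the
    example can run against a temporary directory.
    """
    base = home or os.path.expanduser("~")
    cache_dir = os.path.join(base, ".aws", "cli", "cache")
    # exist_ok=True makes this a no-op when another caller got there first.
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir


with tempfile.TemporaryDirectory() as fake_home:
    # Many callers at once, roughly mimicking parallel tasks on one worker.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(lambda _: ensure_cache_dir(fake_home), range(32)))
    created = os.path.isdir(results[0])
    all_same = len(set(results)) == 1
```

All 32 calls return the same path without raising, which is the idempotency the new tests are described as checking. The sketch does not cover cases such as an unwritable home directory.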
Test plan