TempDirectoryManager race condition: cancelled Worker wipes _temp directory used by concurrently spawned new Worker #4357

@allanrogerr

Describe the bug

When the runner receives a new job while a previous Worker process is still running, it cancels the old Worker and immediately spawns a new one. Both Worker processes share the same _temp directory (orgs_<org>_work/_temp). The cancelled Worker's TempDirectoryManager cleanup runs after the new Worker has already created its files under _runner_file_commands in that shared directory, deleting them out from under the active job. This causes the new job to fail with:

Missing file at path: .../_temp/_runner_file_commands/set_output_<uuid>

The root cause is that JobDispatcher spawns the new Worker process immediately upon receiving the new job request — it does not wait for the previous Worker process to fully exit and complete its TempDirectoryManager cleanup. This creates a window (17 seconds in our case) where two Worker PIDs are alive and operating on the same _temp directory.

To Reproduce

This is a race condition that requires two jobs to be dispatched to the same non-ephemeral self-hosted runner in quick succession. The exact sequence:

  1. Runner is executing Job A (a long-running job, e.g. integration tests)
  2. GitHub dispatches Job B to the same runner while Job A is still actively running and being renewed
  3. Runner acknowledges Job B, logs "We are not yet checking the state of jobrequest <Job A ID>... Cancel running worker right away."
  4. Runner sends cancellation to Job A's Worker and immediately spawns Job B's Worker — both PIDs are now alive
  5. Job B's Worker initializes, creates _runner_file_commands/set_output_<uuid> and step_summary_<uuid> files in the shared _temp directory
  6. Job B begins executing its first action step (e.g. actions/checkout@v6)
  7. Job A's Worker finishes its cancellation teardown and runs the TempDirectoryManager cleanup (logged as "Cleaning runner temp folder: <shared _temp path>"), which deletes the entire contents of the _temp directory, including Job B's active file command files
  8. Job B's action step fails because its set_output and step_summary files no longer exist
  9. Job B exits with code 102 (runner infrastructure failure)

In our case, the gap between Job B starting (11:55:19Z) and Job A's cleanup running (11:55:36Z) was 17 seconds, plenty of time for Job B to have created and started using its file command files.
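
The sequence above can be reproduced in miniature without a runner. The Python sketch below is only a simulation: clean_temp and simulate_overlap are hypothetical names, the cleanup is modeled on the "Cleaning runner temp folder" log line rather than on the runner's actual source, and set_output_job-b is a made-up file name.

```python
import shutil
import tempfile
from pathlib import Path


def clean_temp(temp_dir: Path) -> None:
    """Assumed cleanup behavior: delete everything inside the shared temp dir."""
    for entry in temp_dir.iterdir():
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()


def simulate_overlap() -> bool:
    """Return True if Job B's file survives Job A's late cleanup."""
    with tempfile.TemporaryDirectory() as root:
        shared_temp = Path(root) / "_temp"  # stands in for <work>/_temp
        shared_temp.mkdir()

        # Steps 4-5: Job B's Worker starts and creates its file command file.
        file_commands = shared_temp / "_runner_file_commands"
        file_commands.mkdir()
        job_b_output = file_commands / "set_output_job-b"  # hypothetical name
        job_b_output.touch()

        # Step 7: Job A's Worker finishes teardown and cleans the shared dir.
        clean_temp(shared_temp)

        # Step 8: Job B looks for its file again.
        return job_b_output.exists()


if __name__ == "__main__":
    print("Job B file survived:", simulate_overlap())
```

Because both simulated jobs resolve the same directory, the late cleanup always removes Job B's file, regardless of how long the overlap lasts.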

Expected behavior

The runner should ensure the previous Worker process has fully exited (including TempDirectoryManager cleanup) before spawning a new Worker process that uses the same _temp directory. Alternatively, each Worker should use an isolated temp directory scoped to its job ID rather than sharing a single _temp path.
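
A minimal sketch of the first expectation, written in Python rather than in the runner's own code: SerializedDispatcher is a hypothetical class, and terminate() stands in for the cancellation message the real JobDispatcher sends. The only point it illustrates is ordering: the new process is not started until the previous one has exited.

```python
import subprocess
import sys
from typing import Optional, Sequence


class SerializedDispatcher:
    """Hypothetical dispatcher that never lets two workers overlap."""

    def __init__(self) -> None:
        self._current: Optional[subprocess.Popen] = None

    def dispatch(self, worker_args: Sequence[str], exit_timeout: float = 60.0) -> subprocess.Popen:
        previous = self._current
        if previous is not None and previous.poll() is None:
            previous.terminate()  # stand-in for the cancellation message
            try:
                # Block until the old worker has exited, which includes
                # whatever cleanup it performs during teardown.
                previous.wait(timeout=exit_timeout)
            except subprocess.TimeoutExpired:
                # Assumption: force-kill after the timeout. A killed worker
                # cannot finish its cleanup, so a real fix needs a policy here.
                previous.kill()
                previous.wait()
        # Only now is it safe to start a worker that shares the temp directory.
        self._current = subprocess.Popen(list(worker_args))
        return self._current


if __name__ == "__main__":
    dispatcher = SerializedDispatcher()
    job_a = dispatcher.dispatch([sys.executable, "-c", "import time; time.sleep(30)"])
    job_b = dispatcher.dispatch([sys.executable, "-c", "print('job B running')"])
    print("Job A exited before Job B started:", job_a.poll() is not None)
    job_b.wait()
```

The trade-off is latency: in the incident described here, Job B would have started roughly 17 seconds later, in exchange for a temp directory that nobody else is about to clean.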

Runner Version and Platform

  • Runner version: 2.333.1 (latest as of 2026-04-20)
  • OS: Ubuntu 22.04 LTS (running as an LXC VM on a self-hosted node)
  • Architecture: x86_64
  • Runner mode: Non-ephemeral, organization-level self-hosted runner

What's not working?

When two Worker processes overlap on the same runner, the exiting Worker's TempDirectoryManager cleanup deletes the _runner_file_commands directory that the new Worker is actively using, causing the new job to fail with:

Error: Missing file at path: /home/ubuntu/actions-runner/orgs_miniohq_work/_temp/_runner_file_commands/set_output_b0988204-d5c8-4571-861f-7028374312ee

The new job (Job B) exits with code 102. The previous job (Job A) also fails to report completion, receiving HTTP 404 / TaskOrchestrationJobNotFoundException from the run service.

Job Log Output

Job B (the victim job) log output during the checkout step:

2026-04-20T11:55:22.8993364Z ##[group]Run actions/checkout@v6
2026-04-20T11:55:23.0693218Z Syncing repository: miniohq/eos
2026-04-20T11:55:23.0696859Z ##[group]Getting Git version info
2026-04-20T11:55:23.0698918Z Working directory is '/home/ubuntu/actions-runner/orgs_miniohq_work/eos/eos'
2026-04-20T11:55:23.0701302Z [command]/usr/bin/git version
2026-04-20T11:55:23.0702235Z git version 2.43.0
...
(checkout proceeds normally for ~22 seconds, then fails)
...
Error: Missing file at path: /home/ubuntu/actions-runner/orgs_miniohq_work/_temp/_runner_file_commands/set_output_b0988204-d5c8-4571-861f-7028374312ee

Runner and Worker's Diagnostic Logs

Runner Log — Job dispatch overlap (Runner_20260410-225245-utc.log)

Shows Job A (42b445e6) actively renewing, then Job B (b665e3b4) arriving and the runner immediately spawning a new Worker without waiting for the old one to exit:

[2026-04-20 11:55:17Z INFO JobDispatcher] Successfully renew job 42b445e6-82dd-5f8b-a498-a9859d5322d2, job is valid till 4/20/2026 12:04:36 PM
[2026-04-20 11:55:17Z INFO BrokerMessageListener] Acknowledging runner request 'b665e3b4-2377-5230-9563-f043505754b8'.
[2026-04-20 11:55:19Z INFO JobDispatcher] Job request 0 for plan 93b2502b-9a4c-460a-8f49-4ae31685f3a7 job b665e3b4-2377-5230-9563-f043505754b8 received.
[2026-04-20 11:55:19Z ERR  JobDispatcher] We are not yet checking the state of jobrequest 42b445e6-82dd-5f8b-a498-a9859d5322d2 status. Cancel running worker right away.
[2026-04-20 11:55:19Z INFO JobDispatcher] Send job cancellation message to worker for job 42b445e6-82dd-5f8b-a498-a9859d5322d2.
[2026-04-20 11:55:19Z INFO ProcessInvokerWrapper] Starting process:
[2026-04-20 11:55:19Z INFO ProcessInvokerWrapper]   File name: '/home/ubuntu/actions-runner/bin.2.333.1/Runner.Worker'
[2026-04-20 11:55:19Z INFO ProcessInvokerWrapper]   Arguments: 'spawnclient 160 164'
[2026-04-20 11:55:19Z INFO ProcessInvokerWrapper] Process started with process id 1449281, waiting for process exit.
[2026-04-20 11:55:19Z INFO JobDispatcher] Send job request message to worker for job b665e3b4-2377-5230-9563-f043505754b8.

At this point, PID 1431550 (Job A) and PID 1449281 (Job B) are both running simultaneously.

Worker Log — Job A's cleanup wipes shared _temp (Worker_20260420-114836-utc.log)

Job A receives cancellation, tears down, then runs TempDirectoryManager at 11:55:36Z — 17 seconds after Job B's Worker started:

[2026-04-20 11:55:19Z INFO Worker] Cancellation/Shutdown message received.
[2026-04-20 11:55:19Z INFO ProcessInvokerWrapper] Waiting for process exit or 7.5 seconds after SIGINT signal fired.
[2026-04-20 11:55:26Z INFO ProcessInvokerWrapper] Waiting for process exit or 2.5 seconds after SIGTERM signal fired.
[2026-04-20 11:55:31Z INFO ProcessInvokerWrapper] Process Cancellation finished.
[2026-04-20 11:55:36Z INFO TempDirectoryManager] Cleaning runner temp folder: /home/ubuntu/actions-runner/orgs_miniohq_work/_temp
[2026-04-20 11:55:36Z INFO JobRunner] Raising job completed against run service
[2026-04-20 11:55:36Z ERR  GitHubActionsService] POST request to https://run-actions-1-azure-eastus.actions.githubusercontent.com/176/completejob failed. HTTP Status: NotFound
[2026-04-20 11:55:36Z ERR  JobRunner] GitHub.DistributedTask.WebApi.TaskOrchestrationJobNotFoundException: Job not found: 42b445e6-82dd-5f8b-a498-a9859d5322d2. workflow instance not found

Worker Log — Job B fails because its files were deleted (Worker_20260420-115519-utc.log)

Job B initialized _temp at 11:55:20Z and started checkout at 11:55:22Z, but its file command files were wiped at 11:55:36Z by Job A's cleanup:

[2026-04-20 11:55:20Z INFO HostContext] Well known directory 'Temp': '/home/ubuntu/actions-runner/orgs_miniohq_work/_temp'
[2026-04-20 11:55:22Z INFO ProcessInvokerWrapper] Starting process:
[2026-04-20 11:55:22Z INFO ProcessInvokerWrapper]   File name: '/home/ubuntu/actions-runner/externals/node24/bin/node'
[2026-04-20 11:55:22Z INFO ProcessInvokerWrapper]   Arguments: '"/home/ubuntu/actions-runner/orgs_miniohq_work/_actions/actions/checkout/v6/dist/index.js"'
[2026-04-20 11:55:22Z INFO ProcessInvokerWrapper] Process started with process id 1449368, waiting for process exit.
[2026-04-20 11:55:45Z INFO ProcessInvokerWrapper] Finished process 1449368 with exit code 1, and elapsed time 00:00:22.9692284.
[2026-04-20 11:55:45Z INFO CreateStepSummaryCommand] Step Summary file (/home/ubuntu/actions-runner/orgs_miniohq_work/_temp/_runner_file_commands/step_summary_b0988204-d5c8-4571-861f-7028374312ee) does not exist; skipping attachment upload
[2026-04-20 11:55:45Z INFO ExecutionContext] errorMessages: ["Missing file at path: /home/ubuntu/actions-runner/orgs_miniohq_work/_temp/_runner_file_commands/set_output_b0988204-d5c8-4571-861f-7028374312ee"]
[2026-04-20 11:55:47Z INFO JobRunner] Job result after all job steps finish: Failed
[2026-04-20 11:55:49Z INFO TempDirectoryManager] Cleaning runner temp folder: /home/ubuntu/actions-runner/orgs_miniohq_work/_temp
[2026-04-20 11:55:49Z INFO Worker] Job completed.

Runner reports: Worker finished for job b665e3b4... Code: 102

Timeline Summary

Time (UTC)  Event
11:48:36    Job A (PID 1431550) starts: run-tables-tests (spark)
11:55:17    Job A still renewing successfully (valid till 12:04:36)
11:55:17    Job B acknowledged by runner while Job A is active
11:55:19    Runner logs "Cancel running worker right away" and sends cancel to Job A
11:55:19    Job B (PID 1449281) spawned immediately; two PIDs now alive
11:55:20    Job B initializes, uses shared _temp directory
11:55:22    Job B creates set_output_b0988204... and starts checkout
11:55:36    Job A runs TempDirectoryManager, wiping shared _temp including Job B's files
11:55:45    Job B checkout fails: Missing file at path: .../set_output_b0988204...
11:55:49    Job B exits code 102 (Failed)
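
The 17-second window in this timeline can be re-derived from the log lines quoted earlier. The Python sketch below assumes the "[timestamp level component] message" layout seen in those excerpts; first_timestamp and LOG_LINE are hypothetical helper names, and the two lists contain only lines copied from this report.

```python
import re
from datetime import datetime
from typing import Iterable, Optional

# Matches lines such as "[2026-04-20 11:55:36Z INFO TempDirectoryManager] ...".
LOG_LINE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})Z\s+\w+\s+(\w+)\]\s*(.*)$"
)


def first_timestamp(lines: Iterable[str], component: str, needle: str) -> Optional[datetime]:
    """Time of the first line from `component` whose message contains `needle`."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(2) == component and needle in match.group(3):
            return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return None


runner_log = [
    "[2026-04-20 11:55:19Z INFO ProcessInvokerWrapper] Process started with process id 1449281, waiting for process exit.",
]
worker_a_log = [
    "[2026-04-20 11:55:19Z INFO Worker] Cancellation/Shutdown message received.",
    "[2026-04-20 11:55:36Z INFO TempDirectoryManager] Cleaning runner temp folder: /home/ubuntu/actions-runner/orgs_miniohq_work/_temp",
]

# When Job B's Worker was spawned, and when Job A's Worker cleaned _temp.
spawned = first_timestamp(runner_log, "ProcessInvokerWrapper", "Process started with process id 1449281")
cleaned = first_timestamp(worker_a_log, "TempDirectoryManager", "Cleaning runner temp folder")

window = (cleaned - spawned).total_seconds()
print(window)  # 17.0
```

A positive window means the old Worker cleaned _temp after the new Worker had already started, which is the condition that triggers this bug.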

Suggested Fix

Any one of the following:

  1. JobDispatcher should await the previous Worker process exit before spawning the new Worker, OR
  2. Each Worker should use a job-scoped temp directory (e.g. _temp/<job-id>/) instead of sharing a single _temp path, OR
  3. TempDirectoryManager should check whether another Worker is active before cleaning _temp
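
Option 2 can be sketched as follows. This is a Python illustration under stated assumptions, not the runner's code: job_temp_dir and clean_job_temp are hypothetical helpers, and the job IDs and file name are placeholders.

```python
import shutil
import tempfile
from pathlib import Path


def job_temp_dir(shared_temp: Path, job_id: str) -> Path:
    """Create and return a temp directory private to one job."""
    path = shared_temp / job_id
    (path / "_runner_file_commands").mkdir(parents=True, exist_ok=True)
    return path


def clean_job_temp(shared_temp: Path, job_id: str) -> None:
    """Remove only the given job's subtree, leaving sibling jobs untouched."""
    shutil.rmtree(shared_temp / job_id, ignore_errors=True)


def demo() -> bool:
    """Return True if Job B's file survives Job A's late cleanup."""
    with tempfile.TemporaryDirectory() as root:
        shared = Path(root) / "_temp"
        job_a = job_temp_dir(shared, "job-a")  # placeholder job IDs
        job_b = job_temp_dir(shared, "job-b")

        b_file = job_b / "_runner_file_commands" / "set_output_example"
        b_file.touch()

        clean_job_temp(shared, "job-a")  # Job A's late cleanup
        return b_file.exists() and not job_a.exists()


if __name__ == "__main__":
    print("Job B file survived:", demo())
```

With job-scoped directories the overlap becomes harmless for temp files, although anything that reads the temp path (for example through the RUNNER_TEMP environment variable) would need to receive the job-scoped path instead of the shared one.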


Labels: bug
