BuildKit client hangs after successful image push (v0.23.2) #6131

@wujunwei

Contributing guidelines and issue reporting guide

Well-formed report checklist

  • I have found a bug that the documentation does not mention anything about my problem
  • I have found a bug that there are no open or closed issues that are related to my problem
  • I have provided version/information about my environment and done my best to provide a reproducer

Description of bug

Summary

BuildKit client (buildctl) hangs indefinitely after pushing an image to a registry, even though the push itself completes successfully (the registry returns HTTP 201 Created). The client process never exits and must be forcefully terminated.

Environment

  • BuildKit Version: v0.23.2 (official Docker Hub image)
  • BuildKit Image: moby/buildkit:v0.23.2 from Docker Hub
  • Platform: Linux (Kubernetes environment)
  • Container Runtime: containerd
  • Registry: Harbor (private registry)
  • Worker: containerd worker
  • Network: Host networking mode
  • Deployment: Kubernetes using official Docker Hub image

Steps to Reproduce

  1. Execute a build with push to registry:
buildctl --addr tcp://buildkitd-service:12346 build \
    --frontend dockerfile.v0 \
    --no-cache \
    --output type=image,name=registry.example.com/repo/image:tag,push=true \
    --opt platform=linux/amd64 \
    --opt force-network-mode=host \
    --allow network.host \
    --local context=. \
    --local dockerfile=. \
    --progress plain
  2. Observe that:
    • Build completes successfully
    • Image layers are pushed successfully
    • Manifest is pushed successfully (HTTP 201 Created)
    • Image appears in containerd: ctr -n buildkit images ls
    • Image appears in registry
    • But buildctl process never exits
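
Because the hang is intermittent, it can help to run the reproduction under a deadline so that a stuck client is flagged automatically instead of blocking the job. The sketch below is one way to do that with GNU coreutils `timeout`. `run_with_deadline` is a hypothetical helper name, and the 124/137 exit codes are those documented for `timeout`, not anything specific to buildctl.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command under a deadline and flag a hang.
# Relies on GNU coreutils 'timeout'; nothing here is buildctl-specific.
run_with_deadline() {
    local seconds="$1"
    shift
    local status=0
    # SIGTERM is sent after $seconds; SIGKILL follows 10 s later if the
    # command is still alive.
    timeout --kill-after=10 "$seconds" "$@" || status=$?
    # 124: timed out and stopped by SIGTERM. 137: SIGKILL was needed.
    if [ "$status" -eq 124 ] || [ "$status" -eq 137 ]; then
        echo "HANG: '$1' did not exit within ${seconds}s" >&2
        return 124
    fi
    # Otherwise pass through the command's own exit status.
    return "$status"
}
```

For example, `run_with_deadline 900 buildctl --addr tcp://buildkitd-service:12346 build ...` would return 124 if the client is still running after 15 minutes. The deadline needs to be comfortably longer than a normal build and push.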

Expected Behavior

The buildctl process should exit normally after the push operation completes successfully.

Actual Behavior

The buildctl process hangs indefinitely and must be killed with SIGKILL. The process appears to be waiting for some internal operation to complete.

Debugging Information

Client Process Analysis

Using strace on the hanging client process shows:

[pid 20119] futex(0x198e420, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>

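Multiple threads are parked in FUTEX_WAIT_PRIVATE. This is consistent with the main goroutine blocking in errgroup.Wait(), although idle Go runtime threads also park on futexes, so the strace output alone does not identify the blocked call.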
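
strace shows only that threads are parked, not which goroutine is blocked or where. A goroutine dump from the hanging client would narrow that down. The sketch below assumes buildctl leaves Go's default SIGQUIT behavior in place, in which case the runtime prints all goroutine stacks to the process's stderr and then exits. That assumption has not been verified against the buildctl source, and `dump_go_stacks` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: ask a Go process for a goroutine dump via SIGQUIT.
# Assumption: the target does not install its own SIGQUIT handler.
# Note: this terminates the target, so use it on the hung client only.
dump_go_stacks() {
    local pid="$1"
    # 'kill -0' sends no signal; it only checks that the process exists
    # and that we are allowed to signal it.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "process $pid is not running or not signalable" >&2
        return 1
    fi
    # The dump is written to the target's stderr, not to this shell.
    kill -QUIT "$pid"
}
```

The client's stderr has to be captured for this to be useful, for example by starting the build with `2> buildctl.stderr`. The dump will not appear if the process was started with SIGQUIT ignored, which bash does for background jobs in non-interactive shells.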

Daemon Process Analysis

The BuildKit daemon (official v0.23.2 Docker Hub image) shows:

[pid 29510] futex(0x2f4d920, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>

Network connections remain established:

tcp6  0  0  daemon:12346  client:57532  ESTABLISHED

But no data is being transmitted (Recv-Q and Send-Q are both 0).

Log Analysis - Key Finding

Client logs show the push completing successfully:

11:48:51  #17 pushing manifest for harbor.XXX.io/test/XXXXX:main-271313406-18@sha256:1444138e6cb07041a256d102a9ec7f27628f2e4afd691e24340c94a03d68ff6e
11:48:52  #17 pushing manifest for harbor.XXX.io/test/XXXXX:main-271313406-18@sha256:1444138e6cb07041a256d102a9ec7f27628f2e4afd691e24340c94a03d68ff6e 0.4s done
11:48:52  #17 DONE 1.3s

Daemon logs (from official Docker Hub image) stop abruptly after the successful push:

time="2025-08-11T11:16:43Z" level=debug msg="fetch response received" response.status="201 Created" span="exporting to image"

🔍 Critical Observation: The buildkitd logs show no corresponding "session finished" message after the push completes. In normal operations, we would expect to see:

time="2025-08-11T11:16:43Z" level=debug msg="session finished: <nil>" spanID=xxx traceID=xxx

This missing "session finished" log entry strongly suggests that the session cleanup process is stuck or never initiated, which explains why the client remains hanging.
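
To check a saved daemon log for this pattern without reading it by eye, a small filter along the following lines may help. It is a sketch that assumes the log contains one entry per line in the logfmt style quoted above and covers a single build. With several concurrent sessions the result would be ambiguous. `check_session_finished` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a "session finished" entry follows
# the last "201 Created" response in a buildkitd debug log file.
check_session_finished() {
    awk '
        # Every 201 resets the state, so only the last push response counts.
        /response\.status="201 Created"/ { pushed = 1; finished = 0 }
        /msg="session finished/          { if (pushed) finished = 1 }
        END {
            if (!pushed)       { print "no-push-seen";         exit 2 }
            else if (finished) { print "session-finished";     exit 0 }
            else               { print "session-not-finished"; exit 1 }
        }
    ' "$1"
}
```

Called as `check_session_finished buildkitd.log`, it prints `session-not-finished` and returns 1 for a log that ends the way the excerpt above does.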

Session State Analysis

The absence of the "session finished" log is consistent with one of the following:

  1. Session cleanup never started - The session termination process was never triggered
  2. Session cleanup stuck - The cleanup process started but got stuck somewhere
  3. Session reference leak - The session object is still being referenced, preventing cleanup

This correlates with the resource state showing Usage count: 0 but Reclaimable: false.

Resource State

Using buildctl du shows:

ID:             1jjc57vswpm46fex2yvsj97jt
Created at:     2025-08-11 11:16:22.613823175 +0000 UTC
Mutable:        false
Reclaimable:    false
Shared:         false
Size:           0B
Description:    local source for context
Usage count:    0

The resource has Usage count: 0 but Reclaimable: false, indicating a resource leak that correlates with the missing session cleanup.
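
When many records are present, the ones matching this pattern can be picked out mechanically. The sketch below reads per-record output in the format shown above from stdin and prints the ID of every record with Usage count 0 and Reclaimable false. `find_unreclaimable_unused` is a hypothetical helper name, and the field labels are taken from the output above rather than from a documented format.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list record IDs with "Usage count: 0" and
# "Reclaimable: false" from per-record disk usage output on stdin.
find_unreclaimable_unused() {
    awk '
        # Print the current record if it matches, then reset the state.
        function emit_record() {
            if (id != "" && reclaimable == "false" && usage == "0") print id
            id = ""; reclaimable = ""; usage = ""
        }
        /^ID:/          { emit_record(); id = $2 }  # a new record starts
        /^Reclaimable:/ { reclaimable = $2 }
        /^Usage count:/ { usage = $3 }
        END             { emit_record() }           # flush the last record
    '
}
```

Assuming the verbose form of the command prints this layout, usage would look like `buildctl du --verbose | find_unreclaimable_unused`.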

Additional Context

  • Using the official open-source Docker Hub image moby/buildkit:v0.23.2
  • Key diagnostic: Missing "session finished" log entries in daemon logs
  • Issue occurs in Kubernetes environment with official image
  • The problem appears to be in the session cleanup mechanism after successful operations
  • Session state becomes inconsistent (usage count 0 but not reclaimable)
  • The problem is reproducible but intermittent

Note: Happy to provide additional debugging information if needed.
