Replies: 5 comments
-
Share the proxy/upstream logs from llama-swap as well. llama-swap doesn't have a 60s request timeout. My guess is that it may be a limit in nginx to prevent long-running requests.
-
I can share them later when I have access; however, I did not see anything in there pointing to a timeout.
-
On my box I have llama-swap running and clients talk directly to it. I have plenty of requests that run for several minutes and do not abort.
-
Pretty sure this is caused by proxy_buffering in nginx. Try turning it off: proxy_buffering off;
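For anyone hitting the same thing, a minimal sketch of what that could look like in an nginx location block; the path, upstream address, and port are assumptions for illustration, not from this thread:

```nginx
# Hypothetical reverse-proxy block in front of llama-swap; adjust to your setup.
location / {
    proxy_pass http://127.0.0.1:8080;  # assumed llama-swap listen address
    proxy_buffering off;               # stream tokens to the client instead of buffering the whole response
    proxy_http_version 1.1;
    proxy_set_header Connection "";
}
```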
-
ok got it. will give it a try thanks!
-
Describe the bug
I have some models running behind llama-swap, and if generation of the first token takes longer than a minute I get an initial 504 timeout error and only later start receiving the response.
I've confirmed my proxy in front of llama-swap has a long timeout configured.
Expected behaviour
A timeout should not occur for long generation times.
Operating system and version
Asahi Linux
Mac Studio (M1)
Proxy Logs
Example log from my nginx proxy
Notice: "request_time": "60.006"