common : use common_chat_templates for add_bos and add_eos #15326
base: master
Conversation
You're saying double BOS is being added to the instruction-tuned model, but only without jinja? I can't verify the model config as it's gated, though looking at f.ex. the MLX versions it seems BOS is ... So that doesn't make much sense.
Here is my understanding:
Maybe I don't understand the logic completely, but this seems very confusing. I can't tell when ...
Sorry about the confusion, it was late yesterday and I was a little rushed creating this PR. I've not looked at this part of the code base much, but I'll take a closer look today and try to understand this issue better.
@CISC Maybe this is the root of the problem - I'm pretty sure that when I tested yesterday with ...
Here is a repro using the following commands:

$ huggingface-cli download google/gemma-3-270m-it --local-dir google/gemma-3-270m-it
$ python3 convert_hf_to_gguf.py google/gemma-3-270m-it/ --outfile ./models/gemma-3-270m-it/ggml-model-bf16.gguf --outtype bf16
$ ./bin/llama-cli -m ../models/gemma-3-270m-it/ggml-model-bf16.gguf -c 0 -fa --jinja -p "Test" --verbose-prompt
...
0.00.118.683 I llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 0
0.00.118.684 I llama_model_loader: - kv 34: tokenizer.ggml.add_bos_token bool = true
0.00.118.684 I llama_model_loader: - kv 35: tokenizer.ggml.add_sep_token bool = false
0.00.118.686 I llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
...
0.00.354.187 I
0.00.354.385 W tokenize: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
0.00.354.389 I main: prompt: 'Test'
0.00.354.389 I main: number of tokens in prompt = 11
0.00.354.390 I 2 -> '<bos>'
0.00.354.390 I 2 -> '<bos>'
0.00.354.390 I 105 -> '<start_of_turn>'
0.00.354.392 I 2364 -> 'user'
0.00.354.392 I 107 -> '
'
0.00.354.394 I 3694 -> 'Test'
0.00.354.394 I 106 -> '<end_of_turn>'
0.00.354.394 I 107 -> '
'
0.00.354.394 I 105 -> '<start_of_turn>'
0.00.354.395 I 4368 -> 'model'
0.00.354.395 I 107 -> '
'
0.00.354.395 I
0.00.354.397 I main: interactive mode on.
0.00.354.410 I sampler seed: 3041241033
...

In this case the final prompt starts with two BOS tokens, as the warning above points out.
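To make the interaction concrete, here is a minimal hedged sketch (not code from the repo; show_double_bos and the exact common_tokenize signature are assumptions) of how the two BOS tokens arise: the rendered template already starts with a literal <bos> from {{ bos_token }}, and tokenizing with add_special = true prepends another one because tokenizer.ggml.add_bos_token is true.

// Hypothetical illustration only: a rendered chat template that already
// contains "<bos>" gains a second BOS when tokenized with add_special = true
// (which honours tokenizer.ggml.add_bos_token).
#include "common.h" // assumed to provide common_tokenize()
#include "llama.h"

#include <string>
#include <vector>

static void show_double_bos(const llama_vocab * vocab) {
    // The Jinja template renders {{ bos_token }}, so the text already starts with <bos>.
    const std::string rendered =
        "<bos><start_of_turn>user\nTest<end_of_turn>\n<start_of_turn>model\n";

    // add_special = true prepends another BOS on top of the literal one above.
    std::vector<llama_token> with_auto_bos =
        common_tokenize(vocab, rendered, /*add_special*/ true, /*parse_special*/ true);

    // add_special = false keeps only the BOS that the template rendered.
    std::vector<llama_token> template_only =
        common_tokenize(vocab, rendered, /*add_special*/ false, /*parse_special*/ true);

    // In the broken case with_auto_bos starts with two BOS tokens, template_only with one.
    (void) with_auto_bos;
    (void) template_only;
}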
I noticed that the instruction-tuned model has the following:

(venv) $ head ~/work/ai/models/gemma-3-270m-it/tokenizer_config.json
{
"add_bos_token": true,
"add_eos_token": false,
"added_tokens_decoder": {
...

The pretrained/base model also has the same settings.
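As a side note, these tokenizer_config.json flags become the tokenizer.ggml.add_bos_token / tokenizer.ggml.add_eos_token metadata shown in the loader log above, and they can be inspected at runtime through the same vocab accessors that main.cpp uses. A small sketch, assuming a model that is already loaded (log_special_token_flags is a hypothetical helper):

// Hypothetical helper: print the add-BOS/EOS flags the loader read from the
// GGUF metadata (tokenizer.ggml.add_bos_token / tokenizer.ggml.add_eos_token).
#include "llama.h"

#include <cstdio>

static void log_special_token_flags(const llama_model * model) {
    const llama_vocab * vocab = llama_model_get_vocab(model);

    // For gemma-3-270m-it this prints add_bos = 1 and add_eos = 0 (matching the log above).
    printf("add_bos = %d\n", llama_vocab_get_add_bos(vocab) ? 1 : 0);
    printf("add_eos = %d\n", llama_vocab_get_add_eos(vocab) ? 1 : 0);
}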
Ok, that should not happen, it should have been removed here: Lines 791 to 800 in f75b830
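For readers without the code open, here is a rough paraphrase of what that removal step is supposed to do (an approximation with my own naming, not the actual lines 791 to 800): when the vocab reports that BOS/EOS are added automatically during tokenization, the literal BOS/EOS that the template rendered at the edges of the prompt are stripped so they are not duplicated later.

// Rough paraphrase of the duplicate-BOS/EOS stripping idea (assumption: this
// only mirrors the intent of the referenced chat.cpp code, it is not a copy).
#include <string>

static void strip_rendered_bos_eos(std::string & prompt,
                                   const std::string & bos, bool add_bos,
                                   const std::string & eos, bool add_eos) {
    // If tokenization will prepend BOS anyway, drop the literal BOS that the
    // template already rendered at the start of the prompt.
    if (add_bos && !bos.empty() && prompt.rfind(bos, 0) == 0) {
        prompt.erase(0, bos.size());
    }
    // Likewise drop a trailing literal EOS when tokenization appends one.
    if (add_eos && !eos.empty() && prompt.size() >= eos.size() &&
        prompt.compare(prompt.size() - eos.size(), eos.size(), eos) == 0) {
        prompt.erase(prompt.size() - eos.size());
    }
}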
It should, the problem is just that for some reason it's not automatically removed from the chat template (which technically is the wrong approach, we really should disable ...)
Ah, wait, is perhaps the problem that the token is tokenized with a prepended space? Lines 574 to 585 in f75b830
Edit: Nope.
This does not seem to happen for all templates, for example in Line 258 in 4227c9b
This will be false when --jinja is used (assuming we are not using the workaround in this PR; I reverted it locally). And then later we have: Lines 293 to 299 in 4227c9b
But this is not setting add_bos in the inputs. Something like the following diff would address that:

diff --git a/tools/main/main.cpp b/tools/main/main.cpp
index dc776f59e..04379201e 100644
--- a/tools/main/main.cpp
+++ b/tools/main/main.cpp
@@ -255,7 +255,7 @@ int main(int argc, char ** argv) {
}
}
- const bool add_bos = llama_vocab_get_add_bos(vocab) && !params.use_jinja;
+ const bool add_bos = llama_vocab_get_add_bos(vocab);
if (!llama_model_has_encoder(model)) {
GGML_ASSERT(!llama_vocab_get_add_eos(vocab));
}
@@ -294,6 +294,7 @@ int main(int argc, char ** argv) {
common_chat_templates_inputs inputs;
inputs.use_jinja = g_params->use_jinja;
inputs.messages = chat_msgs;
+ inputs.add_bos = add_bos;
inputs.add_generation_prompt = !params.prompt.empty();
prompt = common_chat_templates_apply(chat_templates.get(), inputs).prompt;
@danbev Yep, you are right, I overlooked this codepath.
@danbev Mind adding a new PR after testing? Don't forget to pass ...
It might need fixing elsewhere too: https://github.com/search?q=repo%3Aggml-org%2Fllama.cpp%20common_chat_templates_apply&type=code
@danbev Actually, looking at this more closely I think I made a mistake in #15086.

Edit: Sorry for the back-and-forth, but I think this is the only change that needs to be made; the cause of the problem is here: Lines 2064 to 2065 in f75b830
No worries at all! Sounds good, I'll try that, thanks!
This commit updates common_chat_templates_apply_jinja to use the add_bos and add_eos parameters from the chat template instead of the inputs. The motivation for this is that if the `add_bos` and `add_eos` from the input parameters are used, there can be a mismatch between the model and the chat template, which can prevent the removal of duplicate BOS/EOS tokens in chat.cpp `apply` from happening, leading to two BOS tokens being added to the prompt.
fcc2931 to b4d28e9
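To picture the change the commit message describes, here is a hedged sketch (the struct and member names below are illustrative assumptions, not the real identifiers in chat.cpp): inside common_chat_templates_apply_jinja the add_bos/add_eos values are taken from the chat-templates object itself rather than from the caller-supplied inputs, so they can no longer disagree with what the template actually renders.

// Illustrative sketch of the commit's idea; all names below are assumptions.
struct chat_templates_sketch {
    bool add_bos; // whether the template/model expects BOS (e.g. it renders {{ bos_token }})
    bool add_eos;
};

struct chat_inputs_sketch {
    bool add_bos; // whatever the caller happened to pass in
    bool add_eos;
};

struct chat_params_sketch {
    bool add_bos;
    bool add_eos;
};

static void apply_jinja_sketch(const chat_templates_sketch & tmpls,
                               const chat_inputs_sketch    & inputs,
                               chat_params_sketch          & params) {
    // Before: the caller-provided values were trusted and could mismatch the template.
    // params.add_bos = inputs.add_bos;
    // params.add_eos = inputs.add_eos;

    // After: use what the chat-templates object itself knows, so the duplicate
    // BOS/EOS stripping in chat.cpp sees flags consistent with the rendered prompt.
    params.add_bos = tmpls.add_bos;
    params.add_eos = tmpls.add_eos;

    (void) inputs; // kept only to show what is no longer consulted
}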
Thanks, everything works now with/without jinja?
Yes, I think this looks good now. llama-cli and llama-server outputs:
(venv) $ build/bin/llama-cli -m models/gemma-3-270m-it.gguf -c 0 -fa --jinja -p "Test" --verbose-prompt
...
main: prompt: 'Test'
main: number of tokens in prompt = 10
2 -> '<bos>'
105 -> '<start_of_turn>'
2364 -> 'user'
107 -> '
'
3694 -> 'Test'
106 -> '<end_of_turn>'
107 -> '
'
105 -> '<start_of_turn>'
4368 -> 'model'
107 -> '
'

And without --jinja:

(venv) $ build/bin/llama-cli -m models/gemma-3-270m-it.gguf -c 0 -fa -p "Test" --verbose-prompt
...
main: prompt: 'Test'
main: number of tokens in prompt = 10
2 -> '<bos>'
105 -> '<start_of_turn>'
2364 -> 'user'
107 -> '
'
3694 -> 'Test'
106 -> '<end_of_turn>'
107 -> '
'
105 -> '<start_of_turn>'
4368 -> 'model'
107 -> '
'

And:

(venv) $ build/bin/llama-server -m models/gemma-3-270m-it.gguf -c 0 -fa --verbose-prompt -t 1 --threads-http 1
...
main: chat template, chat_template: {{ bos_token }}
{%- if messages[0]['role'] == 'system' -%}
{%- if messages[0]['content'] is string -%}
{%- set first_user_prefix = messages[0]['content'] + '
' -%}
{%- else -%}
{%- set first_user_prefix = messages[0]['content'][0]['text'] + '
' -%}
{%- endif -%}
{%- set loop_messages = messages[1:] -%}
{%- else -%}
{%- set first_user_prefix = "" -%}
{%- set loop_messages = messages -%}
{%- endif -%}
{%- for message in loop_messages -%}
{%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
{{ raise_exception("Conversation roles must alternate user/assistant/user/assistant/...") }}
{%- endif -%}
{%- if (message['role'] == 'assistant') -%}
{%- set role = "model" -%}
{%- else -%}
{%- set role = message['role'] -%}
{%- endif -%}
{{ '<start_of_turn>' + role + '
' + (first_user_prefix if loop.first else "") }}
{%- if message['content'] is string -%}
{{ message['content'] | trim }}
{%- elif message['content'] is iterable -%}
{%- for item in message['content'] -%}
{%- if item['type'] == 'image' -%}
{{ '<start_of_image>' }}
{%- elif item['type'] == 'text' -%}
{{ item['text'] | trim }}
{%- endif -%}
{%- endfor -%}
{%- else -%}
{{ raise_exception("Invalid content type") }}
{%- endif -%}
{{ '<end_of_turn>
' }}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{'<start_of_turn>model
'}}
{%- endif -%}
, example_format: '<start_of_turn>user
You are a helpful assistant
Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model
'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 23
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 23, n_tokens = 23, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 23, n_tokens = 23
slot update_slots: id 0 | task 0 | SWA checkpoint create, pos_min = 0, pos_max = 22, size = 0.338 MiB, total = 1/3 (0.338 MiB)
slot release: id 0 | task 0 | stop processing: n_past = 32, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 97.37 ms / 23 tokens ( 4.23 ms per token, 236.21 tokens per second)
eval time = 275.07 ms / 10 tokens ( 27.51 ms per token, 36.35 tokens per second)
total time = 372.44 ms / 33 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200

And:

(venv) $ build/bin/llama-server -m models/gemma-3-270m-it.gguf -c 0 -fa --verbose-prompt -t 1 --threads-http 1
...
main: chat template, chat_template: {{ bos_token }}
{%- if messages[0]['role'] == 'system' -%}
{%- if messages[0]['content'] is string -%}
{%- set first_user_prefix = messages[0]['content'] + '
' -%}
{%- else -%}
{%- set first_user_prefix = messages[0]['content'][0]['text'] + '
' -%}
{%- endif -%}
{%- set loop_messages = messages[1:] -%}
{%- else -%}
{%- set first_user_prefix = "" -%}
{%- set loop_messages = messages -%}
{%- endif -%}
{%- for message in loop_messages -%}
{%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
{{ raise_exception("Conversation roles must alternate user/assistant/user/assistant/...") }}
{%- endif -%}
{%- if (message['role'] == 'assistant') -%}
{%- set role = "model" -%}
{%- else -%}
{%- set role = message['role'] -%}
{%- endif -%}
{{ '<start_of_turn>' + role + '
' + (first_user_prefix if loop.first else "") }}
{%- if message['content'] is string -%}
{{ message['content'] | trim }}
{%- elif message['content'] is iterable -%}
{%- for item in message['content'] -%}
{%- if item['type'] == 'image' -%}
{{ '<start_of_image>' }}
{%- elif item['type'] == 'text' -%}
{{ item['text'] | trim }}
{%- endif -%}
{%- endfor -%}
{%- else -%}
{{ raise_exception("Invalid content type") }}
{%- endif -%}
{{ '<end_of_turn>
' }}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{'<start_of_turn>model
'}}
{%- endif -%}
, example_format: '<start_of_turn>user
You are a helpful assistant
Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model
'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
srv log_server_r: request: GET / 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 23
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 23, n_tokens = 23, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 23, n_tokens = 23
slot update_slots: id 0 | task 0 | SWA checkpoint create, pos_min = 0, pos_max = 22, size = 0.338 MiB, total = 1/3 (0.338 MiB)
slot release: id 0 | task 0 | stop processing: n_past = 32, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 108.48 ms / 23 tokens ( 4.72 ms per token, 212.03 tokens per second)
eval time = 326.19 ms / 10 tokens ( 32.62 ms per token, 30.66 tokens per second)
total time = 434.66 ms / 33 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200

Let me know if there is anything else I should test to verify this.
Perfect, thanks again! :)
I've tried this using newly converted models and the BOS duplication is no longer there. If this solution is accepted I'll re-convert the instruction-tuned models and upload them to ggml-org.