common : use common_chat_templates for add_bos and add_eos #15326

Open

danbev wants to merge 1 commit into master from gemma-3-convert-add_bos

Conversation

danbev
Member

@danbev danbev commented Aug 14, 2025

This commit updates common_chat_templates_apply_jinja to use the
add_bos and add_eos parameters from the chat template instead of
the inputs.

The motivation for this is that if the add_bos and add_eos values
from the input parameters are used, there can be a mismatch between
the model and the chat template. This can prevent the duplicate
BOS/EOS token removal in chat.cpp's apply from happening, leading to
two BOS tokens being added to the prompt.


I've tried this using newly converted models and the BOS duplication is not there. If this solution is accepted, I'll re-convert the instruction-tuned models and upload them to ggml-org.
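
For illustration, here is a rough sketch of what using common_chat_templates for these flags could look like; the member names and init site below are assumptions for illustration, not quotes from the actual patch:

// Hypothetical sketch only: record the vocab's flags on the templates object at init time,
// so the jinja apply path can consult them instead of the per-call inputs.
struct common_chat_templates {
    // ... existing members (template sources, bos/eos token strings, ...) ...
    bool add_bos = false;   // assumed: filled from llama_vocab_get_add_bos(vocab)
    bool add_eos = false;   // assumed: filled from llama_vocab_get_add_eos(vocab)
};

// e.g. in common_chat_templates_init():
//   tmpls->add_bos = llama_vocab_get_add_bos(vocab);
//   tmpls->add_eos = llama_vocab_get_add_eos(vocab);

This way the BOS/EOS stripping in apply always agrees with what the tokenizer will do, regardless of what the caller put into inputs.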

@CISC
Collaborator

CISC commented Aug 14, 2025

You're saying double BOS is being added to the instruction tuned model, but only without jinja?

I can't verify the model config as it's gated, though looking at e.g. the MLX versions it seems the BOS is <bos>.

So that doesn't make much sense, the gemma chat template doesn't have BOS, and SPM has add_bos set by default, meaning there should be only one BOS being added. For jinja there's a BOS in the chat template, but as long as add_bos is true this should be automatically removed.

@github-actions github-actions bot added the python python script changes label Aug 14, 2025
@ggerganov
Member

Here is my understanding:

  • Without this change, if you run the IT model without --jinja it uses a single BOS (correct). But if you add --jinja it will have 2 BOS tokens (wrong)
  • With this change, if you run the IT model without --jinja it uses zero BOS tokens (wrong) and with --jinja it uses 1 BOS token (correct)

Maybe I don't understand the logic completely, but this seems very confusing. I can't tell when --jinja should be used and when it should not. Can we improve this somehow?

@danbev
Member Author

danbev commented Aug 15, 2025

Sorry about the confusion, it was late yesterday and I was a little rushed creating this PR. I've not looked at this part of the code base much, but I'll take a closer look today and try to understand this issue better.

@ggerganov
Member

ggerganov commented Aug 15, 2025

For jinja there's a BOS in the chat template, but as long as add_bos is true this should be automatically removed.

@CISC Maybe this is the root of the problem - I'm pretty sure that when I tested yesterday with --jinja and without the patch from this PR, the second BOS was not removed. Will double check now to confirm.

@ggerganov
Member

Here is a repro using master:

$ huggingface-cli download google/gemma-3-270m-it --local-dir google/gemma-3-270m-it

$ python3 convert_hf_to_gguf.py google/gemma-3-270m-it/ --outfile ./models/gemma-3-270m-it/ggml-model-bf16.gguf --outtype bf16

$ ./bin/llama-cli -m ../models/gemma-3-270m-it/ggml-model-bf16.gguf -c 0 -fa --jinja -p "Test" --verbose-prompt

...

0.00.118.683 I llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 0
0.00.118.684 I llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
0.00.118.684 I llama_model_loader: - kv  35:               tokenizer.ggml.add_sep_token bool             = false
0.00.118.686 I llama_model_loader: - kv  36:               tokenizer.ggml.add_eos_token bool             = false

...

0.00.354.187 I 
0.00.354.385 W tokenize: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
0.00.354.389 I main: prompt: 'Test'
0.00.354.389 I main: number of tokens in prompt = 11
0.00.354.390 I      2 -> '<bos>'
0.00.354.390 I      2 -> '<bos>'
0.00.354.390 I    105 -> '<start_of_turn>'
0.00.354.392 I   2364 -> 'user'
0.00.354.392 I    107 -> '
'
0.00.354.394 I   3694 -> 'Test'
0.00.354.394 I    106 -> '<end_of_turn>'
0.00.354.394 I    107 -> '
'
0.00.354.394 I    105 -> '<start_of_turn>'
0.00.354.395 I   4368 -> 'model'
0.00.354.395 I    107 -> '
'
0.00.354.395 I 
0.00.354.397 I main: interactive mode on.
0.00.354.410 I sampler seed: 3041241033

...

In this case add_bos == true and the Jinja template also emits BOS, which results in 2 BOS tokens with the --jinja flag.
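
To make the mechanism concrete, here is a minimal sketch of the two steps involved (setup of ctx/vocab/tmpls/inputs omitted; the helpers are llama.cpp's common API, but treat this as an illustration rather than the exact call sites):

// 1) The jinja template renders a prompt that literally starts with "<bos>",
//    because the Gemma template begins with {{ bos_token }}.
std::string prompt = common_chat_templates_apply(tmpls, inputs).prompt;

// 2) Tokenization then prepends another BOS because the vocab has add_bos_token = true.
const bool add_special = llama_vocab_get_add_bos(vocab);   // true for this model
std::vector<llama_token> toks = common_tokenize(ctx, prompt, add_special, /*parse_special=*/true);
// toks now starts with two BOS tokens unless the "<bos>" was stripped from the rendered prompt first.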

The llama-server ... --jinja behaves the same way:

0.11.182.122 D ubatch_print:   token     = [
0.11.182.123 D ubatch_print:     0: id =      2 (           <bos>), pos =    0, n_seq_id =  1, seq_id = [0], output = 0
0.11.182.124 D ubatch_print:     1: id =      2 (           <bos>), pos =    1, n_seq_id =  1, seq_id = [0], output = 0
0.11.182.126 D ubatch_print:     2: id =    105 ( <start_of_turn>), pos =    2, n_seq_id =  1, seq_id = [0], output = 0
0.11.182.127 D ubatch_print:     3: id =   2364 (            user), pos =    3, n_seq_id =  1, seq_id = [0], output = 0

@danbev
Member Author

danbev commented Aug 15, 2025

I noticed that the instruction tuned model has the following:

(venv) $ head ~/work/ai/models/gemma-3-270m-it/tokenizer_config.json
{
  "add_bos_token": true,
  "add_eos_token": false,
  "added_tokens_decoder": {
   ...

The pretrained/base model also has add_bos_token set to true, which I think is correct, but I don't think this should be true for the instruction-tuned model?

@CISC
Collaborator

CISC commented Aug 15, 2025

...
0.00.118.684 I llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
...
0.00.354.390 I      2 -> '<bos>'
0.00.354.390 I      2 -> '<bos>'
0.00.354.390 I    105 -> '<start_of_turn>'
...

Ok, that should not happen, it should have been removed here:

llama.cpp/common/chat.cpp

Lines 791 to 800 in f75b830

    // To avoid double BOS / EOS tokens, we're manually removing begining / trailing tokens
    // instead of using `chat_template_options.use_bos_token = false`, since these tokens
    // may be needed inside the template / between messages too.
    auto result = tmpl.apply(tmpl_inputs, tmpl_opts);
    if (inputs.add_bos && string_starts_with(result, tmpl.bos_token())) {
        result = result.substr(tmpl.bos_token().size());
    }
    if (inputs.add_eos && string_ends_with(result, tmpl.eos_token())) {
        result = result.substr(0, result.size() - tmpl.eos_token().size());
    }

@CISC
Collaborator

CISC commented Aug 15, 2025

The pretrained/base model also has add_bos_token set to true, which I think is correct, but I don't think this should be true for the instruction-tuned model?

It should, the problem is just that for some reason it's not automatically removed from the chat template (which technically is the wrong approach, we really should disable add_bos/add_eos when using jinja chat templates instead).

@CISC
Collaborator

CISC commented Aug 15, 2025

Ah, wait, is the problem perhaps that the token is tokenized with a prepended space?

llama.cpp/common/chat.cpp

Lines 574 to 585 in f75b830

    const auto get_token = [&](llama_token token, const char * name, const char * jinja_variable_name) {
        if (token == LLAMA_TOKEN_NULL) {
            if (default_template_src.find(jinja_variable_name) != std::string::npos
                || template_tool_use_src.find(jinja_variable_name) != std::string::npos) {
                LOG_WRN("common_chat_templates_init: warning: vocab does not have a %s token, jinja template won't work as intended.\n", name);
            }
            return std::string();
        }
        return common_token_to_piece(vocab, token, true);
    };
    token_bos = get_token(llama_vocab_bos(vocab), "BOS", "bos_token");
    token_eos = get_token(llama_vocab_eos(vocab), "EOS", "eos_token");

Edit: Nope, add_space_prefix is false.

@danbev
Member Author

danbev commented Aug 15, 2025

Ok, that should not happen, it should have been removed here:

This does not seem to happen for all code paths, for example in tools/main/main.cpp:

const bool add_bos = llama_vocab_get_add_bos(vocab) && !params.use_jinja;

With --jinja this will be false (assuming we are not using the workaround in this PR; I reverted it locally). And then later we have:

if (!params.system_prompt.empty() || !params.prompt.empty()) {
    common_chat_templates_inputs inputs;
    inputs.use_jinja = g_params->use_jinja;
    inputs.messages = chat_msgs;
    inputs.add_generation_prompt = !params.prompt.empty();
    prompt = common_chat_templates_apply(chat_templates.get(), inputs).prompt;

But this is not setting add_bos on inputs, so it will be false. Perhaps this should be:

diff --git a/tools/main/main.cpp b/tools/main/main.cpp
index dc776f59e..04379201e 100644
--- a/tools/main/main.cpp
+++ b/tools/main/main.cpp
@@ -255,7 +255,7 @@ int main(int argc, char ** argv) {
         }
     }
 
-    const bool add_bos = llama_vocab_get_add_bos(vocab) && !params.use_jinja;
+    const bool add_bos = llama_vocab_get_add_bos(vocab);
     if (!llama_model_has_encoder(model)) {
         GGML_ASSERT(!llama_vocab_get_add_eos(vocab));
     }
@@ -294,6 +294,7 @@ int main(int argc, char ** argv) {
                 common_chat_templates_inputs inputs;
                 inputs.use_jinja = g_params->use_jinja;
                 inputs.messages = chat_msgs;
+                inputs.add_bos = add_bos;
                 inputs.add_generation_prompt = !params.prompt.empty();
 
                 prompt = common_chat_templates_apply(chat_templates.get(), inputs).prompt;

@CISC
Collaborator

CISC commented Aug 15, 2025

@danbev Yep, you are right, I overlooked this codepath.

@CISC
Collaborator

CISC commented Aug 15, 2025

@danbev Mind adding a new PR after testing? Don't forget to pass add_eos too.

@CISC
Collaborator

CISC commented Aug 15, 2025

It might need fixing elsewhere too: https://github.com/search?q=repo%3Aggml-org%2Fllama.cpp%20common_chat_templates_apply&type=code

@danbev danbev requested a review from ngxson as a code owner August 15, 2025 09:01
@danbev danbev changed the title from "convert : add bos token for Gemma 3 base models" to "llama : pass add_bos and add_eos to common_chat_templates_apply" Aug 15, 2025
@CISC
Collaborator

CISC commented Aug 15, 2025

@danbev Actually, looking at this more closely I think I made a mistake in #15086: common_chat_templates_apply_jinja shouldn't get those params from inputs, but from tmpls.

Edit: Sorry for the back-and-forth, but I think this is the only change that needs to be done, the cause of the problem is here:

llama.cpp/common/chat.cpp

Lines 2064 to 2065 in f75b830

    params.add_bos = inputs.add_bos;
    params.add_eos = inputs.add_eos;
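
If the templates object carried the vocab's flags (as sketched earlier), the fix would be a change in this spirit at the lines quoted above; this is an illustration of the idea, not the literal patch:

-    params.add_bos = inputs.add_bos;
-    params.add_eos = inputs.add_eos;
+    params.add_bos = tmpls->add_bos;
+    params.add_eos = tmpls->add_eos;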

@CISC CISC requested review from CISC and removed request for ngxson August 15, 2025 09:21
@danbev
Member Author

danbev commented Aug 15, 2025

Edit: Sorry for the back-and-forth, but I think this is the only change that needs to be done, the cause of the problem is here:

No worries at all! Sounds good, I'll try that, thanks!

This commit updates common_chat_templates_apply_jinja to use the
add_bos and add_eos parameters from the chat template instead of
the inputs.

The motivation for this is that if the `add_bos` and `add_eos` values
from the input parameters are used, there can be a mismatch between
the model and the chat template. This can prevent the duplicate
BOS/EOS token removal in chat.cpp `apply` from happening, leading to
two BOS tokens being added to the prompt.
@danbev danbev force-pushed the gemma-3-convert-add_bos branch from fcc2931 to b4d28e9 on August 15, 2025 10:25
@danbev danbev changed the title from "llama : pass add_bos and add_eos to common_chat_templates_apply" to "common : use common_chat_templates for add_bos and add_eos" Aug 15, 2025
@danbev danbev requested a review from ggerganov August 15, 2025 10:34
Collaborator

@CISC CISC left a comment

Thanks, everything works now with/without jinja?

@danbev danbev removed the python python script changes label Aug 15, 2025
@danbev
Member Author

danbev commented Aug 15, 2025

Thanks, everything works now with/without jinja?

Yes, I think this looks good now:

llama-cli and llama-server outputs

llama-cli with --jinja:

(venv) $ build/bin/llama-cli -m models/gemma-3-270m-it.gguf -c 0 -fa --jinja -p "Test" --verbose-prompt
...
main: prompt: 'Test'
main: number of tokens in prompt = 10
     2 -> '<bos>'
   105 -> '<start_of_turn>'
  2364 -> 'user'
   107 -> '
'
  3694 -> 'Test'
   106 -> '<end_of_turn>'
   107 -> '
'
   105 -> '<start_of_turn>'
  4368 -> 'model'
   107 -> '
'

And llama-cli without --jinja:

(venv) $ build/bin/llama-cli -m models/gemma-3-270m-it.gguf -c 0 -fa -p "Test" --verbose-prompt
...
main: prompt: 'Test'
main: number of tokens in prompt = 10
     2 -> '<bos>'
   105 -> '<start_of_turn>'
  2364 -> 'user'
   107 -> '
'
  3694 -> 'Test'
   106 -> '<end_of_turn>'
   107 -> '
'
   105 -> '<start_of_turn>'
  4368 -> 'model'
   107 -> '
'

And llama-server with --jinja:

(venv) $ build/bin/llama-server -m models/gemma-3-270m-it.gguf -c 0 -fa --jinja --verbose-prompt -t 1 --threads-http 1
...
main: chat template, chat_template: {{ bos_token }}
{%- if messages[0]['role'] == 'system' -%}
    {%- if messages[0]['content'] is string -%}
        {%- set first_user_prefix = messages[0]['content'] + '

' -%}
    {%- else -%}
        {%- set first_user_prefix = messages[0]['content'][0]['text'] + '

' -%}
    {%- endif -%}
    {%- set loop_messages = messages[1:] -%}
{%- else -%}
    {%- set first_user_prefix = "" -%}
    {%- set loop_messages = messages -%}
{%- endif -%}
{%- for message in loop_messages -%}
    {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
        {{ raise_exception("Conversation roles must alternate user/assistant/user/assistant/...") }}
    {%- endif -%}
    {%- if (message['role'] == 'assistant') -%}
        {%- set role = "model" -%}
    {%- else -%}
        {%- set role = message['role'] -%}
    {%- endif -%}
    {{ '<start_of_turn>' + role + '
' + (first_user_prefix if loop.first else "") }}
    {%- if message['content'] is string -%}
        {{ message['content'] | trim }}
    {%- elif message['content'] is iterable -%}
        {%- for item in message['content'] -%}
            {%- if item['type'] == 'image' -%}
                {{ '<start_of_image>' }}
            {%- elif item['type'] == 'text' -%}
                {{ item['text'] | trim }}
            {%- endif -%}
        {%- endfor -%}
    {%- else -%}
        {{ raise_exception("Invalid content type") }}
    {%- endif -%}
    {{ '<end_of_turn>
' }}
{%- endfor -%}
{%- if add_generation_prompt -%}
    {{'<start_of_turn>model
'}}
{%- endif -%}
, example_format: '<start_of_turn>user
You are a helpful assistant

Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model
'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 23
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 23, n_tokens = 23, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 23, n_tokens = 23
slot update_slots: id  0 | task 0 | SWA checkpoint create, pos_min = 0, pos_max = 22, size = 0.338 MiB, total = 1/3 (0.338 MiB)
slot      release: id  0 | task 0 | stop processing: n_past = 32, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =      97.37 ms /    23 tokens (    4.23 ms per token,   236.21 tokens per second)
       eval time =     275.07 ms /    10 tokens (   27.51 ms per token,    36.35 tokens per second)
      total time =     372.44 ms /    33 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200

And llama-server without --jinja:

(venv) $ build/bin/llama-server -m models/gemma-3-270m-it.gguf -c 0 -fa --verbose-prompt -t 1 --threads-http 1
...
main: chat template, chat_template: {{ bos_token }}
{%- if messages[0]['role'] == 'system' -%}
    {%- if messages[0]['content'] is string -%}
        {%- set first_user_prefix = messages[0]['content'] + '

' -%}
    {%- else -%}
        {%- set first_user_prefix = messages[0]['content'][0]['text'] + '

' -%}
    {%- endif -%}
    {%- set loop_messages = messages[1:] -%}
{%- else -%}
    {%- set first_user_prefix = "" -%}
    {%- set loop_messages = messages -%}
{%- endif -%}
{%- for message in loop_messages -%}
    {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
        {{ raise_exception("Conversation roles must alternate user/assistant/user/assistant/...") }}
    {%- endif -%}
    {%- if (message['role'] == 'assistant') -%}
        {%- set role = "model" -%}
    {%- else -%}
        {%- set role = message['role'] -%}
    {%- endif -%}
    {{ '<start_of_turn>' + role + '
' + (first_user_prefix if loop.first else "") }}
    {%- if message['content'] is string -%}
        {{ message['content'] | trim }}
    {%- elif message['content'] is iterable -%}
        {%- for item in message['content'] -%}
            {%- if item['type'] == 'image' -%}
                {{ '<start_of_image>' }}
            {%- elif item['type'] == 'text' -%}
                {{ item['text'] | trim }}
            {%- endif -%}
        {%- endfor -%}
    {%- else -%}
        {{ raise_exception("Invalid content type") }}
    {%- endif -%}
    {{ '<end_of_turn>
' }}
{%- endfor -%}
{%- if add_generation_prompt -%}
    {{'<start_of_turn>model
'}}
{%- endif -%}
, example_format: '<start_of_turn>user
You are a helpful assistant

Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model
'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle

srv  log_server_r: request: GET / 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 23
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 23, n_tokens = 23, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 23, n_tokens = 23
slot update_slots: id  0 | task 0 | SWA checkpoint create, pos_min = 0, pos_max = 22, size = 0.338 MiB, total = 1/3 (0.338 MiB)
slot      release: id  0 | task 0 | stop processing: n_past = 32, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =     108.48 ms /    23 tokens (    4.72 ms per token,   212.03 tokens per second)
       eval time =     326.19 ms /    10 tokens (   32.62 ms per token,    30.66 tokens per second)
      total time =     434.66 ms /    33 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200

Let me know if there is anything else I should test to verify this.

@CISC
Collaborator

CISC commented Aug 15, 2025

Thanks, everything works now with/without jinja?

Yes, I think this looks good now:

Perfect, thanks again! :)
