common : use common_chat_templates for add_bos and add_eos #15326
base: master
Conversation
You're saying double BOS is being added to the instruction-tuned model, but only without jinja? I can't verify the model config as it's gated, though looking at f.ex. the MLX versions it seems BOS is ... So that doesn't make much sense.
Here is my understanding:
Maybe I don't understand the logic completely, but this seems very confusing. I can't tell when ...
Sorry about the confusion, it was late yesterday and I was a little rushed creating this PR. I've not looked at this part of the code base much, but I'll take a closer look today and try to understand this issue better.
@CISC Maybe this is the root of the problem - I'm pretty sure that when I tested yesterday with ...
Here is a repro using the following commands:

$ huggingface-cli download google/gemma-3-270m-it --local-dir google/gemma-3-270m-it
$ python3 convert_hf_to_gguf.py google/gemma-3-270m-it/ --outfile ./models/gemma-3-270m-it/ggml-model-bf16.gguf --outtype bf16
$ ./bin/llama-cli -m ../models/gemma-3-270m-it/ggml-model-bf16.gguf -c 0 -fa --jinja -p "Test" --verbose-prompt
...
0.00.118.683 I llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 0
0.00.118.684 I llama_model_loader: - kv 34: tokenizer.ggml.add_bos_token bool = true
0.00.118.684 I llama_model_loader: - kv 35: tokenizer.ggml.add_sep_token bool = false
0.00.118.686 I llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
...
0.00.354.187 I
0.00.354.385 W tokenize: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
0.00.354.389 I main: prompt: 'Test'
0.00.354.389 I main: number of tokens in prompt = 11
0.00.354.390 I 2 -> '<bos>'
0.00.354.390 I 2 -> '<bos>'
0.00.354.390 I 105 -> '<start_of_turn>'
0.00.354.392 I 2364 -> 'user'
0.00.354.392 I 107 -> '
'
0.00.354.394 I 3694 -> 'Test'
0.00.354.394 I 106 -> '<end_of_turn>'
0.00.354.394 I 107 -> '
'
0.00.354.394 I 105 -> '<start_of_turn>'
0.00.354.395 I 4368 -> 'model'
0.00.354.395 I 107 -> '
'
0.00.354.395 I
0.00.354.397 I main: interactive mode on.
0.00.354.410 I sampler seed: 3041241033
...

In this case the final prompt starts with two BOS tokens, as the warning above points out.
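To make the interaction concrete, here is a minimal hedged sketch (not code from the repo; show_double_bos and the exact common_tokenize signature are assumptions) of how the two BOS tokens arise: the rendered template already starts with a literal <bos> from {{ bos_token }}, and tokenizing with add_special = true prepends another one because tokenizer.ggml.add_bos_token is true.

// Hypothetical illustration only: a rendered chat template that already
// contains "<bos>" gains a second BOS when tokenized with add_special = true
// (which honours tokenizer.ggml.add_bos_token).
#include "common.h" // assumed to provide common_tokenize()
#include "llama.h"

#include <string>
#include <vector>

static void show_double_bos(const llama_vocab * vocab) {
    // The Jinja template renders {{ bos_token }}, so the text already starts with <bos>.
    const std::string rendered =
        "<bos><start_of_turn>user\nTest<end_of_turn>\n<start_of_turn>model\n";

    // add_special = true prepends another BOS on top of the literal one above.
    std::vector<llama_token> with_auto_bos =
        common_tokenize(vocab, rendered, /*add_special*/ true, /*parse_special*/ true);

    // add_special = false keeps only the BOS that the template rendered.
    std::vector<llama_token> template_only =
        common_tokenize(vocab, rendered, /*add_special*/ false, /*parse_special*/ true);

    // In the broken case with_auto_bos starts with two BOS tokens, template_only with one.
    (void) with_auto_bos;
    (void) template_only;
}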
I noticed that the instruction-tuned model has the following:

(venv) $ head ~/work/ai/models/gemma-3-270m-it/tokenizer_config.json
{
"add_bos_token": true,
"add_eos_token": false,
"added_tokens_decoder": {
...

The pretrained/base model also has the same settings.
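As a side note, these tokenizer_config.json flags become the tokenizer.ggml.add_bos_token / tokenizer.ggml.add_eos_token metadata shown in the loader log above, and they can be inspected at runtime through the same vocab accessors that main.cpp uses. A small sketch, assuming a model that is already loaded (log_special_token_flags is a hypothetical helper):

// Hypothetical helper: print the add-BOS/EOS flags the loader read from the
// GGUF metadata (tokenizer.ggml.add_bos_token / tokenizer.ggml.add_eos_token).
#include "llama.h"

#include <cstdio>

static void log_special_token_flags(const llama_model * model) {
    const llama_vocab * vocab = llama_model_get_vocab(model);

    // For gemma-3-270m-it this prints add_bos = 1 and add_eos = 0 (matching the log above).
    printf("add_bos = %d\n", llama_vocab_get_add_bos(vocab) ? 1 : 0);
    printf("add_eos = %d\n", llama_vocab_get_add_eos(vocab) ? 1 : 0);
}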
Ok, that should not happen, it should have been removed here: Lines 791 to 800 in f75b830
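For readers without the code open, here is a rough paraphrase of what that removal step is supposed to do (an approximation with my own naming, not the actual lines 791 to 800): when the vocab reports that BOS/EOS are added automatically during tokenization, the literal BOS/EOS that the template rendered at the edges of the prompt are stripped so they are not duplicated later.

// Rough paraphrase of the duplicate-BOS/EOS stripping idea (assumption: this
// only mirrors the intent of the referenced chat.cpp code, it is not a copy).
#include <string>

static void strip_rendered_bos_eos(std::string & prompt,
                                   const std::string & bos, bool add_bos,
                                   const std::string & eos, bool add_eos) {
    // If tokenization will prepend BOS anyway, drop the literal BOS that the
    // template already rendered at the start of the prompt.
    if (add_bos && !bos.empty() && prompt.rfind(bos, 0) == 0) {
        prompt.erase(0, bos.size());
    }
    // Likewise drop a trailing literal EOS when tokenization appends one.
    if (add_eos && !eos.empty() && prompt.size() >= eos.size() &&
        prompt.compare(prompt.size() - eos.size(), eos.size(), eos) == 0) {
        prompt.erase(prompt.size() - eos.size());
    }
}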
It should, the problem is just that for some reason it's not automatically removed from the chat template (which technically is the wrong approach, we really should disable ...)
Ah, wait, is perhaps the problem that the token is tokenized with a prepended space? Lines 574 to 585 in f75b830
Edit: Nope.
This does not seem to happen for all templates, for example in Line 258 in 4227c9b
This will be false when --jinja is used (assuming we are not using the workaround in this PR; I reverted it locally). And then later we have: Lines 293 to 299 in 4227c9b
But this is not setting add_bos in the inputs. Something like the following diff would address that:

diff --git a/tools/main/main.cpp b/tools/main/main.cpp
index dc776f59e..04379201e 100644
--- a/tools/main/main.cpp
+++ b/tools/main/main.cpp
@@ -255,7 +255,7 @@ int main(int argc, char ** argv) {
}
}
- const bool add_bos = llama_vocab_get_add_bos(vocab) && !params.use_jinja;
+ const bool add_bos = llama_vocab_get_add_bos(vocab);
if (!llama_model_has_encoder(model)) {
GGML_ASSERT(!llama_vocab_get_add_eos(vocab));
}
@@ -294,6 +294,7 @@ int main(int argc, char ** argv) {
common_chat_templates_inputs inputs;
inputs.use_jinja = g_params->use_jinja;
inputs.messages = chat_msgs;
+ inputs.add_bos = add_bos;
inputs.add_generation_prompt = !params.prompt.empty();
prompt = common_chat_templates_apply(chat_templates.get(), inputs).prompt;
@danbev Yep, you are right, I overlooked this codepath.
@danbev Mind adding a new PR after testing? Don't forget to pass ...
It might need fixing elsewhere too: https://github.com/search?q=repo%3Aggml-org%2Fllama.cpp%20common_chat_templates_apply&type=code
@danbev Actually, looking at this more closely I think I made a mistake in #15086.

Edit: Sorry for the back-and-forth, but I think this is the only change that needs to be made; the cause of the problem is here: Lines 2064 to 2065 in f75b830
No worries at all! Sounds good, I'll try that, thanks!
This commit updates common_chat_templates_apply_jinja to use the add_bos and add_eos parameters from the chat template instead of the inputs. The motivation for this is that if the `add_bos` and `add_eos` from the input parameters are used, there can be a mismatch between the model and the chat template, which can prevent the removal of duplicate BOS/EOS tokens in chat.cpp `apply` from happening, leading to two BOS tokens being added to the prompt.
fcc2931 to b4d28e9
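To picture the change the commit message describes, here is a hedged sketch (the struct and member names below are illustrative assumptions, not the real identifiers in chat.cpp): inside common_chat_templates_apply_jinja the add_bos/add_eos values are taken from the chat-templates object itself rather than from the caller-supplied inputs, so they can no longer disagree with what the template actually renders.

// Illustrative sketch of the commit's idea; all names below are assumptions.
struct chat_templates_sketch {
    bool add_bos; // whether the template/model expects BOS (e.g. it renders {{ bos_token }})
    bool add_eos;
};

struct chat_inputs_sketch {
    bool add_bos; // whatever the caller happened to pass in
    bool add_eos;
};

struct chat_params_sketch {
    bool add_bos;
    bool add_eos;
};

static void apply_jinja_sketch(const chat_templates_sketch & tmpls,
                               const chat_inputs_sketch    & inputs,
                               chat_params_sketch          & params) {
    // Before: the caller-provided values were trusted and could mismatch the template.
    // params.add_bos = inputs.add_bos;
    // params.add_eos = inputs.add_eos;

    // After: use what the chat-templates object itself knows, so the duplicate
    // BOS/EOS stripping in chat.cpp sees flags consistent with the rendered prompt.
    params.add_bos = tmpls.add_bos;
    params.add_eos = tmpls.add_eos;

    (void) inputs; // kept only to show what is no longer consulted
}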
Thanks, everything works now with/without jinja?
Yes, I think this looks good now. llama-cli and llama-server outputs:
(venv) $ build/bin/llama-cli -m models/gemma-3-270m-it.gguf -c 0 -fa --jinja -p "Test" --verbose-prompt
...
main: prompt: 'Test'
main: number of tokens in prompt = 10
2 -> '<bos>'
105 -> '<start_of_turn>'
2364 -> 'user'
107 -> '
'
3694 -> 'Test'
106 -> '<end_of_turn>'
107 -> '
'
105 -> '<start_of_turn>'
4368 -> 'model'
107 -> '
'

And without --jinja:

(venv) $ build/bin/llama-cli -m models/gemma-3-270m-it.gguf -c 0 -fa -p "Test" --verbose-prompt
...
main: prompt: 'Test'
main: number of tokens in prompt = 10
2 -> '<bos>'
105 -> '<start_of_turn>'
2364 -> 'user'
107 -> '
'
3694 -> 'Test'
106 -> '<end_of_turn>'
107 -> '
'
105 -> '<start_of_turn>'
4368 -> 'model'
107 -> '
'

And:

(venv) $ build/bin/llama-server -m models/gemma-3-270m-it.gguf -c 0 -fa --verbose-prompt -t 1 --threads-http 1
...
main: chat template, chat_template: {{ bos_token }}
{%- if messages[0]['role'] == 'system' -%}
{%- if messages[0]['content'] is string -%}
{%- set first_user_prefix = messages[0]['content'] + '
' -%}
{%- else -%}
{%- set first_user_prefix = messages[0]['content'][0]['text'] + '
' -%}
{%- endif -%}
{%- set loop_messages = messages[1:] -%}
{%- else -%}
{%- set first_user_prefix = "" -%}
{%- set loop_messages = messages -%}
{%- endif -%}
{%- for message in loop_messages -%}
{%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
{{ raise_exception("Conversation roles must alternate user/assistant/user/assistant/...") }}
{%- endif -%}
{%- if (message['role'] == 'assistant') -%}
{%- set role = "model" -%}
{%- else -%}
{%- set role = message['role'] -%}
{%- endif -%}
{{ '<start_of_turn>' + role + '
' + (first_user_prefix if loop.first else "") }}
{%- if message['content'] is string -%}
{{ message['content'] | trim }}
{%- elif message['content'] is iterable -%}
{%- for item in message['content'] -%}
{%- if item['type'] == 'image' -%}
{{ '<start_of_image>' }}
{%- elif item['type'] == 'text' -%}
{{ item['text'] | trim }}
{%- endif -%}
{%- endfor -%}
{%- else -%}
{{ raise_exception("Invalid content type") }}
{%- endif -%}
{{ '<end_of_turn>
' }}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{'<start_of_turn>model
'}}
{%- endif -%}
, example_format: '<start_of_turn>user
You are a helpful assistant
Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model
'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 23
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 23, n_tokens = 23, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 23, n_tokens = 23
slot update_slots: id 0 | task 0 | SWA checkpoint create, pos_min = 0, pos_max = 22, size = 0.338 MiB, total = 1/3 (0.338 MiB)
slot release: id 0 | task 0 | stop processing: n_past = 32, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 97.37 ms / 23 tokens ( 4.23 ms per token, 236.21 tokens per second)
eval time = 275.07 ms / 10 tokens ( 27.51 ms per token, 36.35 tokens per second)
total time = 372.44 ms / 33 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200

And:

(venv) $ build/bin/llama-server -m models/gemma-3-270m-it.gguf -c 0 -fa --verbose-prompt -t 1 --threads-http 1
...
main: chat template, chat_template: {{ bos_token }}
{%- if messages[0]['role'] == 'system' -%}
{%- if messages[0]['content'] is string -%}
{%- set first_user_prefix = messages[0]['content'] + '
' -%}
{%- else -%}
{%- set first_user_prefix = messages[0]['content'][0]['text'] + '
' -%}
{%- endif -%}
{%- set loop_messages = messages[1:] -%}
{%- else -%}
{%- set first_user_prefix = "" -%}
{%- set loop_messages = messages -%}
{%- endif -%}
{%- for message in loop_messages -%}
{%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
{{ raise_exception("Conversation roles must alternate user/assistant/user/assistant/...") }}
{%- endif -%}
{%- if (message['role'] == 'assistant') -%}
{%- set role = "model" -%}
{%- else -%}
{%- set role = message['role'] -%}
{%- endif -%}
{{ '<start_of_turn>' + role + '
' + (first_user_prefix if loop.first else "") }}
{%- if message['content'] is string -%}
{{ message['content'] | trim }}
{%- elif message['content'] is iterable -%}
{%- for item in message['content'] -%}
{%- if item['type'] == 'image' -%}
{{ '<start_of_image>' }}
{%- elif item['type'] == 'text' -%}
{{ item['text'] | trim }}
{%- endif -%}
{%- endfor -%}
{%- else -%}
{{ raise_exception("Invalid content type") }}
{%- endif -%}
{{ '<end_of_turn>
' }}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{'<start_of_turn>model
'}}
{%- endif -%}
, example_format: '<start_of_turn>user
You are a helpful assistant
Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model
'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
srv log_server_r: request: GET / 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 23
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 23, n_tokens = 23, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 23, n_tokens = 23
slot update_slots: id 0 | task 0 | SWA checkpoint create, pos_min = 0, pos_max = 22, size = 0.338 MiB, total = 1/3 (0.338 MiB)
slot release: id 0 | task 0 | stop processing: n_past = 32, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 108.48 ms / 23 tokens ( 4.72 ms per token, 212.03 tokens per second)
eval time = 326.19 ms / 10 tokens ( 32.62 ms per token, 30.66 tokens per second)
total time = 434.66 ms / 33 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200

Let me know if there is anything else I should test to verify this.
Perfect, thanks again! :)
I've tried this using newly converted models and the BOS duplication is no longer there. If this solution is accepted I'll re-convert the instruction-tuned models and upload them to ggml-org.