🤗 Hugging Face | 📝 Blog | 📖 Documentation
We introduce EXAONE Deep, a series of reasoning-focused language models ranging from 2.4B to 32B parameters, developed and released by LG AI Research. The models exhibit superior capabilities in various reasoning tasks, including math and coding benchmarks. Evaluation results show that 1) EXAONE Deep 2.4B outperforms other models of comparable size, 2) EXAONE Deep 7.8B outperforms not only open-weight models of comparable scale but also the proprietary reasoning model OpenAI o1-mini, and 3) EXAONE Deep 32B demonstrates competitive performance against leading open-weight models.
Our documentation consists of the following sections:
- Performance: Experimental results of EXAONE Deep models.
- Quickstart: A basic guide to using EXAONE Deep models with Transformers.
- Quantized Models: An explanation of quantized EXAONE Deep weights in AWQ and GGUF format.
- Run Locally: A guide to running EXAONE Deep models locally with llama.cpp and Ollama frameworks.
- Deployment: A guide to running EXAONE Deep models with TensorRT-LLM, vLLM, and SGLang deployment frameworks.
- Usage Guideline: A guide to utilizing EXAONE Deep models to achieve the expected performance.
- 2025.03.18: We release EXAONE Deep, reasoning-enhanced language models in 2.4B, 7.8B, and 32B sizes. Check out the 📖 Documentation!
Some experimental results are shown below. The full evaluation results can be found in the Documentation.
| Models | MATH-500 (pass@1) | AIME 2024 (pass@1 / cons@64) | AIME 2025 (pass@1 / cons@64) | CSAT Math 2025 (pass@1) | GPQA Diamond (pass@1) | Live Code Bench (pass@1) |
|---|---|---|---|---|---|---|
| EXAONE Deep 32B | 95.7 | 72.1 / 90.0 | 65.8 / 80.0 | 94.5 | 66.1 | 59.5 |
| DeepSeek-R1-Distill-Qwen-32B | 94.3 | 72.6 / 83.3 | 55.2 / 73.3 | 84.1 | 62.1 | 57.2 |
| QwQ-32B | 95.5 | 79.5 / 86.7 | 67.1 / 76.7 | 94.4 | 63.3 | 63.4 |
| DeepSeek-R1-Distill-Llama-70B | 94.5 | 70.0 / 86.7 | 53.9 / 66.7 | 88.8 | 65.2 | 57.5 |
| DeepSeek-R1 (671B) | 97.3 | 79.8 / 86.7 | 66.8 / 80.0 | 89.9 | 71.5 | 65.9 |
| EXAONE Deep 7.8B | 94.8 | 70.0 / 83.3 | 59.6 / 76.7 | 89.9 | 62.6 | 55.2 |
| DeepSeek-R1-Distill-Qwen-7B | 92.8 | 55.5 / 83.3 | 38.5 / 56.7 | 79.7 | 49.1 | 37.6 |
| DeepSeek-R1-Distill-Llama-8B | 89.1 | 50.4 / 80.0 | 33.6 / 53.3 | 74.1 | 49.0 | 39.6 |
| OpenAI o1-mini | 90.0 | 63.6 / 80.0 | 54.8 / 66.7 | 84.4 | 60.0 | 53.8 |
| EXAONE Deep 2.4B | 92.3 | 52.5 / 76.7 | 47.9 / 73.3 | 79.2 | 54.3 | 46.6 |
| DeepSeek-R1-Distill-Qwen-1.5B | 83.9 | 28.9 / 52.7 | 23.9 / 36.7 | 65.6 | 33.8 | 16.9 |
- You need to install `transformers>=4.43.1` to use the EXAONE Deep models. We recommend using the latest version.
Here is example code showing how to use the EXAONE Deep models.
Tip
In all the examples below, you can use a different model size by changing `7.8B` to `32B` or `2.4B`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from threading import Thread
model_name = "LGAI-EXAONE/EXAONE-Deep-7.8B"
streaming = True # choose the streaming option
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Choose your prompt:
# Math example (AIME 2024)
prompt = r"""Let $x,y$ and $z$ be positive real numbers that satisfy the following system of equations:
\[\log_2\left({x \over yz}\right) = {1 \over 2}\]\[\log_2\left({y \over xz}\right) = {1 \over 3}\]\[\log_2\left({z \over xy}\right) = {1 \over 4}\]
Then the value of $\left|\log_2(x^4y^3z^2)\right|$ is $\tfrac{m}{n}$ where $m$ and $n$ are relatively prime positive integers. Find $m+n$.
Please reason step by step, and put your final answer within \boxed{}."""
# Korean MCQA example (CSAT Math 2025)
prompt = r"""Question : $a_1 = 2$์ธ ์์ด $\{a_n\}$๊ณผ $b_1 = 2$์ธ ๋ฑ์ฐจ์์ด $\{b_n\}$์ด ๋ชจ๋ ์์ฐ์ $n$์ ๋ํ์ฌ\[\sum_{k=1}^{n} \frac{a_k}{b_{k+1}} = \frac{1}{2} n^2\]์ ๋ง์กฑ์ํฌ ๋, $\sum_{k=1}^{5} a_k$์ ๊ฐ์ ๊ตฌํ์ฌ๋ผ.
Options :
A) 120
B) 125
C) 130
D) 135
E) 140
Please reason step by step, and you should write the correct option alphabet (A, B, C, D or E) within \\boxed{}."""
messages = [
    {"role": "user", "content": prompt}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
)

if streaming:
    streamer = TextIteratorStreamer(tokenizer)
    thread = Thread(target=model.generate, kwargs=dict(
        input_ids=input_ids.to("cuda"),
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=32768,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        streamer=streamer
    ))
    thread.start()

    for text in streamer:
        print(text, end="", flush=True)
else:
    output = model.generate(
        input_ids.to("cuda"),
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=32768,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
    )
    print(tokenizer.decode(output[0]))
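The generated sequence wraps the reasoning steps in `<thought>\n...\n</thought>` before the final answer (see the Usage Guideline section). As a minimal sketch, assuming the non-streaming branch above, you could separate the two like this:

```python
# Minimal sketch: split the decoded output into reasoning and final answer.
# EXAONE Deep encloses its reasoning in <thought> ... </thought> before answering.
full_text = tokenizer.decode(output[0], skip_special_tokens=True)
if "</thought>" in full_text:
    reasoning, answer = full_text.split("</thought>", maxsplit=1)
    print("Final answer:\n", answer.strip())
else:
    print(full_text)
```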
Important
The EXAONE Deep models are trained with an optimized configuration, so we recommend following the Usage Guideline section to achieve optimal performance.
We introduce a series of quantized weights of EXAONE Deep models.
We provide AWQ-quantized weights of EXAONE Deep models, quantized using the AutoAWQ library. Please refer to the EXAONE Deep collection for pre-quantized weights, and to the AutoAWQ documentation for more details.
You need to install the latest version of the AutoAWQ library (`autoawq>=0.2.8`) to load the AWQ-quantized versions of EXAONE Deep models.
pip install autoawq
You can load the quantized models in the same way as the original models by changing only the model name; the AWQ configuration is loaded automatically. Please check the Quickstart section above for more details, and see the sketch below for an example.
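For illustration, a minimal loading sketch is shown below. The repository name `LGAI-EXAONE/EXAONE-Deep-7.8B-AWQ` is an assumption here; please confirm the exact name in the EXAONE Deep collection.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed AWQ repository name; check the EXAONE Deep collection for the exact one.
model_name = "LGAI-EXAONE/EXAONE-Deep-7.8B-AWQ"

# The AWQ quantization config stored in the checkpoint is picked up automatically.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```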
We provide GGUF weights in `BF16` format and quantized weights in `Q8_0`, `Q6_K`, `Q5_K_M`, `Q4_K_M`, and `IQ4_XS` formats.
The example below is for the 7.8B model in BF16 format. Please refer to the EXAONE Deep collection to find quantized models. You may need to install `huggingface_hub` to download the GGUF weights.
# (optional) install huggingface_hub
pip install huggingface_hub
# Download the GGUF weights
huggingface-cli download LGAI-EXAONE/EXAONE-Deep-7.8B-GGUF \
--include "EXAONE-Deep-7.8B-BF16*.gguf" \
--local-dir .
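If you prefer the Python API over the CLI, a rough equivalent using `huggingface_hub.snapshot_download` (same repository and file pattern as the command above) looks like this:

```python
from huggingface_hub import snapshot_download

# Download only the BF16 GGUF shards of the 7.8B model into the current directory.
snapshot_download(
    repo_id="LGAI-EXAONE/EXAONE-Deep-7.8B-GGUF",
    allow_patterns=["EXAONE-Deep-7.8B-BF16*.gguf"],
    local_dir=".",
)
```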
For end users, we introduce several ways to run EXAONE Deep models locally.
Note
We highly recommend using a repetition penalty of 1.0 or lower for better generation quality.
You can run EXAONE Deep models with llama.cpp as follows:
- Install llama.cpp. Please refer to the llama.cpp repository for more details.
- Download the EXAONE Deep model in GGUF format.
huggingface-cli download LGAI-EXAONE/EXAONE-Deep-7.8B-GGUF \
--include "EXAONE-Deep-7.8B-BF16*.gguf" \
--local-dir .
- Run the model with llama.cpp in conversational mode. We set the chat template explicitly to handle reasoning steps properly.
llama-cli -m ./EXAONE-Deep-7.8B-BF16.gguf \
-sys "" \
-c 32768 \
--temp 0.6 \
--top-p 0.95 \
--jinja \
--chat-template "{% for message in messages %}{% if loop.first and message['role'] != 'system' %}{{ '[|system|][|endofturn|]\n' }}{% endif %}{% set content = message['content'] %}{% if '</thought>' in content %}{% set content = content.split('</thought>')[-1].lstrip('\\n') %}{% endif %}{{ '[|' + message['role'] + '|]' + content }}{% if not message['role'] == 'user' %}{{ '[|endofturn|]' }}{% endif %}{% if not loop.last %}{{ '\n' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '\n[|assistant|]<thought>\n' }}{% endif %}"
- When using the EXAONE Deep 32B model with BF16 precision, you may need to download all split files and merge them before running the model.
# Download all split files
huggingface-cli download LGAI-EXAONE/EXAONE-Deep-32B-GGUF \
--include "EXAONE-Deep-32B-BF16*.gguf" \
--local-dir .
# Merge all split files
llama-gguf-split --merge \
./EXAONE-Deep-32B-BF16-00001-of-00002.gguf \
./EXAONE-Deep-32B-BF16.gguf
EXAONE Deep models are uploaded to the Ollama model library. You can easily use the EXAONE Deep models as follows:
- Install Ollama. Please refer to the Ollama repository for more details.
- Run the EXAONE Deep model as follows:
ollama run exaone-deep:7.8b
Note
In the above example, the model `exaone-deep:7.8b` is quantized to `Q4_K_M`. For a list of available models, please refer to the EXAONE Deep Ollama page.
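Beyond the interactive CLI, Ollama also exposes a local REST API (on port 11434 by default). The sketch below, assuming the default local server and the `exaone-deep:7.8b` model pulled above, sends a single chat request from Python:

```python
import json
import urllib.request

# Assumes a local Ollama server on the default port with exaone-deep:7.8b available.
payload = {
    "model": "exaone-deep:7.8b",
    "messages": [{"role": "user", "content": "How many golf balls can fit in a school bus?"}],
    "stream": False,
    "options": {"temperature": 0.6, "top_p": 0.95},
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["message"]["content"])
```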
Alternatively, you can create and run customized EXAONE Deep models from GGUF weights.
- Install Ollama. Please refer to the Ollama repository for more details.
- Download the EXAONE Deep model in GGUF format. Please refer to the GGUF section for more details.
- Write the `Modelfile` for EXAONE Deep.
# Model path (choose appropriate GGUF weights on your own)
FROM ./EXAONE-Deep-7.8B-BF16.gguf
# Parameter values
PARAMETER stop "[|endofturn|]"
PARAMETER repeat_penalty 1.0
PARAMETER num_ctx 32768
PARAMETER temperature 0.6
PARAMETER top_p 0.95
# Chat template
# Note: removing `<thought></thought>` steps from the context is not yet supported,
# because Ollama does not provide this feature. We will update the template when it becomes available.
TEMPLATE """{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{ if eq .Role "system" }}[|system|]{{ .Content }}[|endofturn|]
{{ continue }}
{{ else if eq .Role "user" }}[|user|]{{ .Content }}
{{ else if eq .Role "assistant" }}[|assistant|]{{ .Content }}[|endofturn|]
{{ end }}
{{- if and (ne .Role "assistant") $last }}[|assistant|]<thought>
{{ end }}
{{- end -}}"""
# System prompt
SYSTEM """"""
# License
LICENSE """EXAONE AI Model License Agreement 1.1 - NC """
- Convert the model to Ollama.
ollama create exaone -f Modelfile
- Run the model with Ollama.
ollama run exaone
You can run EXAONE Deep models on your device with LM-Studio.
- Install LM-Studio. Please refer to the LM-Studio page for more details.
- Download the EXAONE Deep model in GGUF format. You can search for and find a proper model via Model Search.
- Configure the prompt settings.
  - Set "Reasoning Section Parsing" to `<thought>` and `</thought>`.
  - Set "Template (Jinja)" to that of EXAONE 3.5. Or, you can use the custom prompt below.
{% for message in messages %}{% if loop.first and message['role'] != 'system' %}{{ '[|system|][|endofturn|]\n' }}{% endif %}{{ '[|' + message['role'] + '|]' + message['content'] }}{% if message['role'] == 'user' %}{{ '\n' }}{% else %}{{ '[|endofturn|]\n' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '[|assistant|]' }}{% endif %}
EXAONE Deep models have been integrated into various deployment frameworks.
Important
Before your deployment of EXAONE Deep models, we recommend following the Usage Guideline section to achieve the expected performance.
TensorRT-LLM has supported EXAONE language models since EXAONE 3.0. We recommend using TensorRT-LLM for the best performance. You can run EXAONE Deep models with TensorRT-LLM by following the instructions on TensorRT-LLM EXAONE Example.
Note
When you convert EXAONE Deep models to TensorRT-LLM format, you may need to set the environment variable `TRTLLM_DISABLE_UNIFIED_CONVERTER=1`.
Note
TensorRT-LLM also supports AWQ with its own quantization method. If you want to use AWQ with TensorRT-LLM, please refer to the AWQ section in the TensorRT-LLM EXAONE Example.
You can easily run EXAONE Deep models with vLLM.
- Install vLLM (`vllm>=0.6.0`). Please refer to the vLLM quickstart guide for more details.
pip install vllm
- Run the models with vLLM.
vllm serve LGAI-EXAONE/EXAONE-Deep-7.8B
- Send a request with the following curl command after the server starts.
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "LGAI-EXAONE/EXAONE-Deep-7.8B",
"messages": [
{"role": "user", "content": "How many golf balls can fit in a school bus?"}
],
"max_tokens": 30720,
"temperature": 0.6,
"top_p": 0.95
}'
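Because vLLM serves an OpenAI-compatible API, you can also send the same request from Python. A minimal sketch using the `openai` client package (an extra dependency, not required by vLLM itself):

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the api_key value is unused but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="LGAI-EXAONE/EXAONE-Deep-7.8B",
    messages=[{"role": "user", "content": "How many golf balls can fit in a school bus?"}],
    max_tokens=30720,
    temperature=0.6,
    top_p=0.95,
)
print(response.choices[0].message.content)
```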
Note
If you want to serve GGUF quantized models with vLLM, please refer to the vLLM GGUF documentation.
You can also run EXAONE Deep models with SGLang.
- Install SGLang. Please refer to the SGLang documentation for more details.
- Run the server with the following command.
python -m sglang.launch_server --model-path LGAI-EXAONE/EXAONE-Deep-7.8B \
--port 30000 --host 0.0.0.0
Note
When using the EXAONE Deep 2.4B model, you need to install `sglang>=0.3.6` and use the `--attention-backend triton` option.
Additionally, we are currently working on a PR to fix the incompatibility between flashinfer and EXAONE Deep 2.4B. Please refer to the PR for details.
- Send a request with the following curl command after the server starts.
curl -s http://0.0.0.0:30000/v1/chat/completions \
-d '{
"model": "LGAI-EXAONE/EXAONE-Deep-7.8B",
"messages": [
{"role": "user", "content": "How many golf balls can fit in a school bus?"}
],
"max_tokens": 30720,
"temperature": 0.6,
"top_p": 0.95,
}'
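SGLang also exposes an OpenAI-compatible endpoint, so the same request can be sent from Python. The sketch below streams the response, which is convenient for long reasoning outputs; it assumes the `openai` client package is installed:

```python
from openai import OpenAI

# SGLang serves an OpenAI-compatible API on the port chosen above (30000 here).
client = OpenAI(base_url="http://0.0.0.0:30000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="LGAI-EXAONE/EXAONE-Deep-7.8B",
    messages=[{"role": "user", "content": "How many golf balls can fit in a school bus?"}],
    max_tokens=30720,
    temperature=0.6,
    top_p=0.95,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```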
To achieve the expected performance, we recommend using the following configurations:
- Ensure the model starts with `<thought>\n` for its reasoning steps. The model's output quality may be degraded when you omit it. You can easily apply this by using `tokenizer.apply_chat_template()` with `add_generation_prompt=True`. Please check the example code in the Quickstart section.
- The reasoning steps of EXAONE Deep models, enclosed by `<thought>\n...\n</thought>`, usually contain many tokens, so previous reasoning steps may need to be removed in multi-turn conversations. The provided tokenizer handles this automatically; see the sketch after this list for an illustration.
- Avoid using a system prompt; build the instruction into the user prompt.
- Additional instructions help the models reason more deeply, so that they generate better output.
- For math problems, the instruction "Please reason step by step, and put your final answer within \boxed{}." is helpful.
- For more information on our evaluation settings, including prompts, please refer to our Documentation.
- In our evaluation, we use `temperature=0.6` and `top_p=0.95` for generation.
- When evaluating the models, it is recommended to test multiple times to assess the expected performance accurately.
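As a concrete illustration of the multi-turn point above, the sketch below shows how a previous assistant turn might be stripped of its `<thought>...</thought>` block before being appended to the message history. The chat template applied by `tokenizer.apply_chat_template()` performs an equivalent removal automatically, so this is only for clarity.

```python
# Illustrative sketch: strip reasoning from a previous assistant turn in a multi-turn chat.
# The bundled chat template already removes <thought>...</thought> from prior turns,
# so this only shows what that removal looks like.
def strip_thought(assistant_text: str) -> str:
    if "</thought>" in assistant_text:
        return assistant_text.split("</thought>", maxsplit=1)[-1].lstrip("\n")
    return assistant_text

first_reply = "<thought>\n...long reasoning...\n</thought>\n\nThe final answer is 33."
messages = [
    {"role": "user", "content": "Solve the problem above."},
    {"role": "assistant", "content": strip_thought(first_reply)},
    {"role": "user", "content": "Can you explain the key step?"},
]
```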
The EXAONE language model has certain limitations and may occasionally generate inappropriate responses. The model generates responses based on the output probabilities of tokens, which are determined during training from the training data. While we have made every effort to exclude personal, harmful, and biased information from the training data, some problematic content may still be included, potentially leading to undesirable responses. Please note that the text generated by the EXAONE language model does not reflect the views of LG AI Research.
- Inappropriate answers may be generated, which contain personal, harmful or other inappropriate information.
- Biased responses may be generated, which are associated with age, gender, race, and so on.
- The generated responses rely heavily on statistics from the training data, which can result in the generation of semantically or syntactically incorrect sentences.
- Since the model does not reflect the latest information, the responses may be false or contradictory.
LG AI Research strives to reduce potential risks that may arise from EXAONE language models. Users are not allowed to engage in any malicious activities (e.g., entering illegal information) that may induce the creation of inappropriate outputs violating LG AI's ethical principles when using EXAONE language models.
The model is licensed under the EXAONE AI Model License Agreement 1.1 - NC.
@article{exaone-deep,
title={EXAONE Deep: Reasoning Enhanced Language Models},
author={{LG AI Research}},
journal={arXiv preprint arXiv:2503.12524},
year={2025}
}
LG AI Research Technical Support: [email protected]