Hand-Writing a Naive CUDA Kernel Inference Engine for Qwen from Scratch (Part 1)
2026-03-25
2026-03-27

In last semester's course, our assignment was to optimize a modified llama2.c project (with the kv-cache and compiler optimizations removed). Since that project was strictly group work, I never got the chance to write and optimize an LLM inference engine entirely from scratch. So here I will try to hand-write a Qwen inference engine from scratch, first on the CPU and then with naive CUDA kernels.

We start with the Qwen2.5-1.5B model: from an inference engine's point of view, Qwen2.5 is a standard decoder-only Transformer. Qwen3.5, by contrast, uses a hybrid architecture, which we will deal with later.

LLM Inference Engine Structure

+------------------------------------------------+
| Qwen2.5-1.5B-Instruct.gguf                      |
+------------------------------------------------+
| Header                                          |
|   - magic/version                               |
|   - tensor_count                                |
|   - metadata_kv_count                           |
+------------------------------------------------+
| Metadata KV                                     |
|   - general.architecture = qwen2                |
|   - general.name                                |
|   - qwen2.block_count = 28                      |
|   - qwen2.embedding_length                      |
|   - qwen2.attention.head_count = 12             |
|   - qwen2.attention.head_count_kv = 2           |
|   - qwen2.rope.freq_base                        |
|   - qwen2.context_length = 32768                |
|   - tokenizer.ggml.tokens                       |
|   - tokenizer.ggml.merges / special tokens      |
|   - general.file_type / quantization info       |
+------------------------------------------------+
| Tensor Directory                                |
|   - tensor name                                 |
|   - n_dims                                      |
|   - shape[]                                     |
|   - ggml_type                                   |
|   - data_offset                                 |
+------------------------------------------------+
| Tensor Data                                     |
|   - token_embd.weight                           |
|   - blk.0.attn_norm.weight                      |
|   - blk.0.attn_q.bias / weight                  |
|   - blk.0.attn_k.bias / weight                  |
|   - blk.0.attn_v.bias / weight                  |
|   - blk.0.attn_output.weight                    |
|   - blk.0.ffn_norm.weight                       |
|   - blk.0.ffn_gate.weight                       |
|   - blk.0.ffn_up.weight                         |
|   - blk.0.ffn_down.weight                       |
|   - ...                                         |
|   - blk.27.*                                    |
|   - output_norm.weight                          |
|   - output.weight (or tied to the embedding)    |
+------------------------------------------------+

That is the content layout of a Qwen2.5 model in GGUF format. To actually run inference with this model, the basic logic looks like the following.
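In text form, the per-token decode flow of this decoder-only model looks roughly like this (tensor names taken from the directory above; this rough sketch stands in for the original figure):

token id -> embedding lookup (token_embd.weight)
  repeat for blk.0 .. blk.27:
    attn_norm (RMSNorm)
    attn_q / attn_k / attn_v projections (+ bias), RoPE on q and k
    attention over the kv-cache, project with attn_output
    residual add
    ffn_norm (RMSNorm)
    ffn_gate / ffn_up -> SwiGLU -> ffn_down
    residual add
output_norm (RMSNorm) -> output.weight -> logits -> sampler -> next token id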

I won't explain every Transformer step in full here; I'll only walk through a few of the components:

RMSNorm

RMSNorm rescales the hidden state during the forward pass so that deep Transformers stay numerically stable during training and inference. The formula is:

$$
y = \gamma \cdot \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}}
$$

Simply put, it normalizes the scale of the hidden state by its root mean square.
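As a concrete reference, here is a minimal CPU sketch (the function name and signature are my own; eps corresponds to qwen2.attention.layer_norm_rms_epsilon from the metadata):

#include <cmath>
#include <cstddef>
#include <vector>

// y[i] = gamma[i] * x[i] / sqrt(mean(x^2) + eps)
void rmsnorm(std::vector<float> &y, const std::vector<float> &x,
             const std::vector<float> &gamma, float eps = 1e-6f) {
    const size_t d = x.size();
    float sum_sq = 0.0f;
    for (size_t i = 0; i < d; ++i) {
        sum_sq += x[i] * x[i];
    }
    const float inv_rms = 1.0f / std::sqrt(sum_sq / static_cast<float>(d) + eps);
    for (size_t i = 0; i < d; ++i) {
        y[i] = gamma[i] * x[i] * inv_rms;
    }
}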

RoPE

Rotary Position Embedding

Once an LLM has turned tokens into high-dimensional vectors, a problem arises: order blindness. If you randomly shuffle the words of a sentence, the shuffled sentence contains exactly the same tokens as the original, so from the model's perspective the two are identical. To give the model a sense of order, we introduce RoPE.

Position information is fundamentally a distance relation between Q and K, not an attribute that V carries on its own. RoPE implements relative position encoding by way of absolute position encoding: concretely, the position encoding is expressed as a rotation in 2D space. This transform preserves the original vector's norm while injecting position information through the rotation. If one 2D vector is rotated by m radians and another by n radians, the angle between them is |m - n|; building on this idea, we can introduce relative position information effectively.
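A minimal sketch of applying RoPE to one head's query or key vector, rotating consecutive dimension pairs (2i, 2i+1). The pairing convention and the default freq_base here are assumptions; the real value comes from qwen2.rope.freq_base in the metadata:

#include <cmath>
#include <cstddef>

// Rotate consecutive pairs (x[2i], x[2i+1]) of a head vector by
// pos * theta_i, where theta_i = freq_base^(-2i / head_dim).
void apply_rope(float *x, size_t head_dim, size_t pos, float freq_base = 1000000.0f) {
    for (size_t i = 0; i + 1 < head_dim; i += 2) {
        const float theta = std::pow(freq_base, -static_cast<float>(i) / head_dim);
        const float angle = static_cast<float>(pos) * theta;
        const float c = std::cos(angle);
        const float s = std::sin(angle);
        const float x0 = x[i];
        const float x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;
        x[i + 1] = x0 * s + x1 * c;
    }
}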

SwiGLU

A gated activation structure for the feed-forward network:

$$
\mathrm{SwiGLU}(x) = (W_a x) \odot \mathrm{Swish}(W_b x)
$$

which is used in place of an ordinary activation function.
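Mapping this onto the tensor directory above, W_b corresponds to ffn_gate and W_a to ffn_up, with ffn_down projecting back to the model dimension. A minimal CPU sketch (the matmul helper and all names here are mine):

#include <cmath>
#include <cstddef>
#include <vector>

// out[r] = sum_c w[r * n_cols + c] * x[c]  (row-major weight)
std::vector<float> matmul(const std::vector<float> &w,
                          const std::vector<float> &x,
                          size_t n_rows, size_t n_cols) {
    std::vector<float> out(n_rows, 0.0f);
    for (size_t r = 0; r < n_rows; ++r) {
        for (size_t c = 0; c < n_cols; ++c) {
            out[r] += w[r * n_cols + c] * x[c];
        }
    }
    return out;
}

// Swish(x) = x * sigmoid(x); SwiGLU gates the linear branch with it.
std::vector<float> swiglu_ffn(const std::vector<float> &x,
                              const std::vector<float> &w_gate,  // blk.i.ffn_gate.weight
                              const std::vector<float> &w_up,    // blk.i.ffn_up.weight
                              const std::vector<float> &w_down,  // blk.i.ffn_down.weight
                              size_t d_model, size_t d_ff) {
    std::vector<float> gate = matmul(w_gate, x, d_ff, d_model);
    std::vector<float> up   = matmul(w_up,   x, d_ff, d_model);
    for (size_t i = 0; i < d_ff; ++i) {
        const float swish = gate[i] / (1.0f + std::exp(-gate[i]));
        gate[i] = swish * up[i];
    }
    return matmul(w_down, gate, d_model, d_ff);
}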

Sampler

Decides how the next token is picked, e.g. greedy (take the highest-probability token) or top-k (sample randomly among the k most probable tokens).
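As an illustration, a minimal greedy and top-k sampler over a logits vector (a sketch with my own naming; it assumes the logits have already been computed):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Greedy: simply take the argmax over the logits.
size_t sample_greedy(const std::vector<float> &logits) {
    return static_cast<size_t>(std::distance(
        logits.begin(), std::max_element(logits.begin(), logits.end())));
}

// Top-k: keep the k highest logits, softmax them, then sample.
size_t sample_top_k(const std::vector<float> &logits, size_t k, std::mt19937 &rng) {
    std::vector<size_t> idx(logits.size());
    std::iota(idx.begin(), idx.end(), size_t{0});
    k = std::min(k, idx.size());
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](size_t a, size_t b) { return logits[a] > logits[b]; });

    std::vector<float> probs(k);
    const float max_logit = logits[idx[0]];
    float sum = 0.0f;
    for (size_t i = 0; i < k; ++i) {
        probs[i] = std::exp(logits[idx[i]] - max_logit);
        sum += probs[i];
    }
    std::uniform_real_distribution<float> dist(0.0f, sum);
    float r = dist(rng);
    for (size_t i = 0; i < k; ++i) {
        r -= probs[i];
        if (r <= 0.0f) return idx[i];
    }
    return idx[k - 1];
}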

Reading the Metadata

First we need to read the GGUF model's metadata. As a reference, we can compare against the metadata visible on Hugging Face:

Here we split the metadata into two parts, the header and the metadata KV pairs. The header contains a GGUF magic (which identifies the file as GGUF) plus version, tensor_count, and kv_count; the remaining general, qwen2, and tokenizer entries all go into a single kv vector.

First, let's design the structs:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <limits>
#include <memory>
#include <numeric>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <variant>
#include <vector>

// GGUF metadata value types, matching the on-disk type tags
enum gguf_type {
    GGUF_TYPE_UINT8   = 0,
    GGUF_TYPE_INT8    = 1,
    GGUF_TYPE_UINT16  = 2,
    GGUF_TYPE_INT16   = 3,
    GGUF_TYPE_UINT32  = 4,
    GGUF_TYPE_INT32   = 5,
    GGUF_TYPE_FLOAT32 = 6,
    GGUF_TYPE_BOOL    = 7,
    GGUF_TYPE_STRING  = 8,
    GGUF_TYPE_ARRAY   = 9,
    GGUF_TYPE_UINT64  = 10,
    GGUF_TYPE_INT64   = 11,
    GGUF_TYPE_FLOAT64 = 12,
    GGUF_TYPE_COUNT,
};

struct gguf_header {
    uint32_t magic;
    uint32_t version;
    uint64_t tensor_count;
    uint64_t metadata_kv_count;
};

struct gguf_string {
    uint64_t len;
    std::string data;
};

struct gguf_metadata_array;

using gguf_metadata_payload = std::variant<
    std::monostate,
    uint8_t,
    int8_t,
    uint16_t,
    int16_t,
    uint32_t,
    int32_t,
    float,
    bool,
    gguf_string,
    std::shared_ptr<gguf_metadata_array>,
    uint64_t,
    int64_t,
    double
>;

struct gguf_metadata_value {
    gguf_type type = GGUF_TYPE_COUNT;
    gguf_metadata_payload data{};
};

struct gguf_metadata_array {
    gguf_type element_type = GGUF_TYPE_COUNT;
    uint64_t len = 0;
    std::vector<gguf_metadata_value> values;
};

struct gguf_metadata_kv {
    gguf_string key;
    gguf_metadata_value value;
};

struct gguf_string_hash {
    size_t operator()(const std::string &value) const noexcept {
        // FNV Hash
        constexpr size_t kOffset = 1469598103934665603ULL;
        constexpr size_t kPrime = 1099511628211ULL;

        size_t hash = kOffset;
        for (unsigned char ch : value) {
            hash ^= static_cast<size_t>(ch);
            hash *= kPrime;
        }
        return hash;
    }
};

struct gguf_metadata {
    gguf_header header;
    std::vector<gguf_metadata_kv> kvs;
    std::unordered_map<std::string, size_t, gguf_string_hash> kvs_map;
};

gguf_string read_gguf_string(std::ifstream &input);
gguf_metadata_value read_gguf_metadata_value(std::ifstream &input);
gguf_metadata_kv read_gguf_metadata_kv(std::ifstream &input);
void print_gguf_metadata(const gguf_metadata &meta);
const char * gguf_type_name(gguf_type type);

// only the tensor types our F16/F32 GGUF file actually uses
enum ggml_type : uint32_t {
    GGML_TYPE_F32 = 0,
    GGML_TYPE_F16 = 1,
};

struct gguf_tensor_info {
    gguf_string name;
    uint32_t n_dimensions = 0;
    std::vector<uint64_t> dimensions;
    ggml_type type = GGML_TYPE_F32;
    uint64_t offset = 0;
};

struct gguf_model {
    gguf_metadata metadata;
    std::vector<gguf_tensor_info> tensor_infos;
    uint64_t tensor_data_offset = 0;
    uint32_t alignment = 32;
    std::string file_path;
};

struct gguf_tensor_data {
    gguf_tensor_info info;
    std::vector<uint8_t> raw_data;
};

gguf_tensor_info read_gguf_tensor_info(std::ifstream &input);
gguf_model load_gguf_model(const std::string &path);
void print_gguf_tensor_info(const gguf_tensor_info &info);
void print_gguf_tensor_overview(const gguf_model &model, size_t limit = 8);
gguf_tensor_data load_gguf_tensor_data(const gguf_model &model, const std::string &tensor_name);
size_t ggml_type_size(ggml_type type);
size_t ggml_blck_size(ggml_type type);
const char *ggml_type_name(ggml_type type);

Here is the corresponding implementation:
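Note that the implementation also leans on a few low-level helpers elided in this post: the little-endian scalar readers read_u8 ... read_f64, checked_size (which validates a uint64_t length before using it as a size_t), align_offset, and the constants kGgufMagic and kMetadataArrayPreviewCount. A minimal reconstruction of what they might look like (the exact bodies are my guess, not the original):

constexpr uint32_t kGgufMagic = 0x46554747;       // "GGUF" read as little-endian u32
constexpr size_t kMetadataArrayPreviewCount = 8;  // array elements shown when printing

// Read one little-endian scalar of type T from the stream.
template <typename T>
T read_scalar(std::ifstream &input) {
    T value{};
    input.read(reinterpret_cast<char *>(&value), sizeof(T));
    if (!input) {
        throw std::runtime_error("failed to read scalar from GGUF file");
    }
    return value;
}

uint8_t  read_u8 (std::ifstream &input) { return read_scalar<uint8_t>(input); }
int8_t   read_i8 (std::ifstream &input) { return read_scalar<int8_t>(input); }
uint16_t read_u16(std::ifstream &input) { return read_scalar<uint16_t>(input); }
int16_t  read_i16(std::ifstream &input) { return read_scalar<int16_t>(input); }
uint32_t read_u32(std::ifstream &input) { return read_scalar<uint32_t>(input); }
int32_t  read_i32(std::ifstream &input) { return read_scalar<int32_t>(input); }
uint64_t read_u64(std::ifstream &input) { return read_scalar<uint64_t>(input); }
int64_t  read_i64(std::ifstream &input) { return read_scalar<int64_t>(input); }
float    read_f32(std::ifstream &input) { return read_scalar<float>(input); }
double   read_f64(std::ifstream &input) { return read_scalar<double>(input); }

// Reject lengths that cannot fit in size_t before casting.
size_t checked_size(uint64_t value, const char *what) {
    if (value > static_cast<uint64_t>(std::numeric_limits<size_t>::max())) {
        throw std::runtime_error(std::string("value too large for ") + what);
    }
    return static_cast<size_t>(value);
}

// Round offset up to the next multiple of alignment.
uint64_t align_offset(uint64_t offset, uint32_t alignment) {
    const uint64_t a = alignment == 0 ? 1 : alignment;
    return (offset + a - 1) / a * a;
}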

namespace {  // file-local parsing helpers, closed below before the public API

gguf_type read_gguf_type(std::ifstream &input) {
    const uint32_t raw_type = read_u32(input);
    if (raw_type >= static_cast<uint32_t>(GGUF_TYPE_COUNT)) {
        throw std::runtime_error("invalid gguf metadata type: " + std::to_string(raw_type));
    }
    return static_cast<gguf_type>(raw_type);
}

ggml_type read_ggml_type(std::ifstream &input) {
    const uint32_t raw_type = read_u32(input);
    switch (raw_type) {
        case GGML_TYPE_F32:
        case GGML_TYPE_F16:
            return static_cast<ggml_type>(raw_type);
        default:
            throw std::runtime_error("unsupported ggml tensor type: " + std::to_string(raw_type));
    }
}

uint32_t metadata_u32_or_default(const gguf_metadata &meta, const std::string &key, uint32_t default_value) {
    const auto it = meta.kvs_map.find(key);
    if (it == meta.kvs_map.end()) {
        return default_value;
    }

    const gguf_metadata_kv &kv = meta.kvs[it->second];
    if (kv.value.type != GGUF_TYPE_UINT32) {
        throw std::runtime_error("metadata key '" + key + "' is not UINT32");
    }

    return std::get<uint32_t>(kv.value.data);
}

void build_metadata_index(gguf_metadata &meta) {
    meta.kvs_map.clear();
    meta.kvs_map.reserve(meta.kvs.size());

    for (size_t i = 0; i < meta.kvs.size(); ++i) {
        const std::string &key = meta.kvs[i].key.data;
        const auto [_, inserted] = meta.kvs_map.emplace(key, i);
        if (!inserted) {
            throw std::runtime_error("duplicate metadata key: " + key);
        }
    }
}

uint64_t tensor_element_count(const gguf_tensor_info &info) {
    if (info.dimensions.empty()) {
        return 0;
    }

    return std::accumulate(
        info.dimensions.begin(),
        info.dimensions.end(),
        uint64_t{1},
        [](uint64_t lhs, uint64_t rhs) { return lhs * rhs; }
    );
}

gguf_metadata_value read_gguf_metadata_value_of_type(std::ifstream &input, gguf_type type);

void print_gguf_metadata_value(std::ostream &output, const gguf_metadata_value &value);

void print_gguf_metadata_array(std::ostream &output, const std::shared_ptr<gguf_metadata_array> &array) {
    if (array == nullptr) {
        output << "[]";
        return;
    }

    output << '[';
    size_t printed = 0;
    for (auto it = array->values.cbegin();
         it != array->values.cend() && printed < kMetadataArrayPreviewCount;
         ++it, ++printed) {
        if (it != array->values.cbegin()) {
            output << ", ";
        }
        print_gguf_metadata_value(output, *it);
    }
    if (array->values.size() > kMetadataArrayPreviewCount) {
        output << ", ...";
    }
    output << ']';
}

std::shared_ptr<gguf_metadata_array> read_gguf_metadata_array(std::ifstream &input) {
    auto array = std::make_shared<gguf_metadata_array>();
    array->element_type = read_gguf_type(input);
    array->len = read_u64(input);

    const size_t count = checked_size(array->len, "metadata array length");
    array->values.reserve(count);
    for (size_t i = 0; i < count; ++i) {
        array->values.push_back(read_gguf_metadata_value_of_type(input, array->element_type));
    }

    return array;
}

gguf_metadata_value read_gguf_metadata_value_of_type(std::ifstream &input, gguf_type type) {
    gguf_metadata_value value{};
    value.type = type;

    switch (type) {
        case GGUF_TYPE_UINT8:
            value.data.emplace<uint8_t>(read_u8(input));
            break;
        case GGUF_TYPE_INT8:
            value.data.emplace<int8_t>(read_i8(input));
            break;
        case GGUF_TYPE_UINT16:
            value.data.emplace<uint16_t>(read_u16(input));
            break;
        case GGUF_TYPE_INT16:
            value.data.emplace<int16_t>(read_i16(input));
            break;
        case GGUF_TYPE_UINT32:
            value.data.emplace<uint32_t>(read_u32(input));
            break;
        case GGUF_TYPE_INT32:
            value.data.emplace<int32_t>(read_i32(input));
            break;
        case GGUF_TYPE_FLOAT32:
            value.data.emplace<float>(read_f32(input));
            break;
        case GGUF_TYPE_BOOL:
            value.data.emplace<bool>(read_u8(input) != 0);
            break;
        case GGUF_TYPE_STRING:
            value.data.emplace<gguf_string>(read_gguf_string(input));
            break;
        case GGUF_TYPE_ARRAY:
            value.data.emplace<std::shared_ptr<gguf_metadata_array>>(read_gguf_metadata_array(input));
            break;
        case GGUF_TYPE_UINT64:
            value.data.emplace<uint64_t>(read_u64(input));
            break;
        case GGUF_TYPE_INT64:
            value.data.emplace<int64_t>(read_i64(input));
            break;
        case GGUF_TYPE_FLOAT64:
            value.data.emplace<double>(read_f64(input));
            break;
        case GGUF_TYPE_COUNT:
        default:
            throw std::runtime_error("unsupported gguf metadata type");
    }

    return value;
}

void print_gguf_metadata_value(std::ostream &output, const gguf_metadata_value &value) {
    switch (value.type) {
        case GGUF_TYPE_UINT8:
            output << static_cast<uint32_t>(std::get<uint8_t>(value.data));
            break;
        case GGUF_TYPE_INT8:
            output << static_cast<int32_t>(std::get<int8_t>(value.data));
            break;
        case GGUF_TYPE_UINT16:
            output << std::get<uint16_t>(value.data);
            break;
        case GGUF_TYPE_INT16:
            output << std::get<int16_t>(value.data);
            break;
        case GGUF_TYPE_UINT32:
            output << std::get<uint32_t>(value.data);
            break;
        case GGUF_TYPE_INT32:
            output << std::get<int32_t>(value.data);
            break;
        case GGUF_TYPE_FLOAT32:
            output << std::get<float>(value.data);
            break;
        case GGUF_TYPE_BOOL:
            output << (std::get<bool>(value.data) ? "true" : "false");
            break;
        case GGUF_TYPE_STRING:
            output << std::get<gguf_string>(value.data).data;
            break;
        case GGUF_TYPE_ARRAY:
            print_gguf_metadata_array(output, std::get<std::shared_ptr<gguf_metadata_array>>(value.data));
            break;
        case GGUF_TYPE_UINT64:
            output << std::get<uint64_t>(value.data);
            break;
        case GGUF_TYPE_INT64:
            output << std::get<int64_t>(value.data);
            break;
        case GGUF_TYPE_FLOAT64:
            output << std::get<double>(value.data);
            break;
        case GGUF_TYPE_COUNT:
        default:
            output << "<unknown>";
            break;
    }
}

}  // namespace

gguf_string read_gguf_string(std::ifstream &input) {
    gguf_string value{};
    value.len = read_u64(input);
    value.data.resize(checked_size(value.len, "gguf string length"));

    if (!value.data.empty()) {
        input.read(value.data.data(), static_cast<std::streamsize>(value.data.size()));
        if (!input) {
            throw std::runtime_error("failed to read gguf string payload");
        }
    }

    return value;
}

gguf_metadata_value read_gguf_metadata_value(std::ifstream &input) {
    return read_gguf_metadata_value_of_type(input, read_gguf_type(input));
}

gguf_metadata_kv read_gguf_metadata_kv(std::ifstream &input) {
    gguf_metadata_kv kv{};
    kv.key = read_gguf_string(input);
    kv.value = read_gguf_metadata_value(input);
    return kv;
}

gguf_tensor_info read_gguf_tensor_info(std::ifstream &input) {
    gguf_tensor_info info{};
    info.name = read_gguf_string(input);
    info.n_dimensions = read_u32(input);
    info.dimensions.reserve(info.n_dimensions);

    for (uint32_t i = 0; i < info.n_dimensions; ++i) {
        info.dimensions.push_back(read_u64(input));
    }

    info.type = read_ggml_type(input);
    info.offset = read_u64(input);
    return info;
}

gguf_model load_gguf_model(const std::string &path) {
    std::ifstream input(path, std::ios::binary);
    if (!input.is_open()) {
        throw std::runtime_error("failed to open GGUF file: " + path);
    }

    gguf_model model{};
    model.file_path = path;

    model.metadata.header.magic = read_u32(input);
    model.metadata.header.version = read_u32(input);
    model.metadata.header.tensor_count = read_u64(input);
    model.metadata.header.metadata_kv_count = read_u64(input);

    if (model.metadata.header.magic != kGgufMagic) {
        throw std::runtime_error("invalid GGUF magic");
    }

    model.metadata.kvs.reserve(checked_size(model.metadata.header.metadata_kv_count, "metadata kv count"));
    for (uint64_t i = 0; i < model.metadata.header.metadata_kv_count; ++i) {
        model.metadata.kvs.push_back(read_gguf_metadata_kv(input));
    }

    build_metadata_index(model.metadata);

    model.alignment = metadata_u32_or_default(model.metadata, "general.alignment", 32);

    model.tensor_infos.reserve(checked_size(model.metadata.header.tensor_count, "tensor count"));
    for (uint64_t i = 0; i < model.metadata.header.tensor_count; ++i) {
        model.tensor_infos.push_back(read_gguf_tensor_info(input));
    }

    const uint64_t info_end = static_cast<uint64_t>(input.tellg());
    model.tensor_data_offset = align_offset(info_end, model.alignment);
    return model;
}

void print_gguf_metadata(const gguf_metadata &meta) {
    const gguf_header &header = meta.header;
    const char magic_text[5] = {
        static_cast<char>(header.magic & 0xFF),
        static_cast<char>((header.magic >> 8) & 0xFF),
        static_cast<char>((header.magic >> 16) & 0xFF),
        static_cast<char>((header.magic >> 24) & 0xFF),
        '\0'
    };

    std::cout << "[header]" << '\n';
    std::cout << "magic: 0x"
              << std::hex << std::uppercase << header.magic
              << std::dec << " (" << magic_text << ")" << '\n';
    std::cout << "version: " << header.version << '\n';
    std::cout << "tensor_count: " << header.tensor_count << '\n';
    std::cout << "metadata_kv_count: " << header.metadata_kv_count << '\n';

    std::cout << "[metadata]" << '\n';
    for (auto it = meta.kvs.cbegin(); it != meta.kvs.cend(); ++it) {
        std::cout << it->key.data << ": ";
        print_gguf_metadata_value(std::cout, it->value);
        std::cout << '\n';
    }
}

void print_gguf_tensor_info(const gguf_tensor_info &info) {
    std::cout << info.name.data << " | dims=[";
    for (size_t i = 0; i < info.dimensions.size(); ++i) {
        if (i != 0) {
            std::cout << ", ";
        }
        std::cout << info.dimensions[i];
    }
    std::cout << "]"
              << " | type=" << ggml_type_name(info.type)
              << " | offset=" << info.offset
              << '\n';
}

void print_gguf_tensor_overview(const gguf_model &model, size_t limit) {
    std::cout << "[tensors]" << '\n';
    std::cout << "alignment: " << model.alignment << '\n';
    std::cout << "tensor_data_offset: " << model.tensor_data_offset << '\n';

    const size_t count = std::min(limit, model.tensor_infos.size());
    for (size_t i = 0; i < count; ++i) {
        print_gguf_tensor_info(model.tensor_infos[i]);
    }

    if (model.tensor_infos.size() > count) {
        std::cout << "... (" << (model.tensor_infos.size() - count) << " more tensors)" << '\n';
    }
}

gguf_tensor_data load_gguf_tensor_data(const gguf_model &model, const std::string &tensor_name) {
    const auto it = std::find_if(
        model.tensor_infos.begin(),
        model.tensor_infos.end(),
        [&](const gguf_tensor_info &info) { return info.name.data == tensor_name; }
    );

    if (it == model.tensor_infos.end()) {
        throw std::runtime_error("tensor not found: " + tensor_name);
    }

    const uint64_t element_count = tensor_element_count(*it);
    const size_t type_size = ggml_type_size(it->type);
    const size_t block_size = ggml_blck_size(it->type);
    if (block_size == 0 || element_count % block_size != 0) {
        throw std::runtime_error("invalid tensor block layout for: " + tensor_name);
    }

    const uint64_t byte_count = (element_count / block_size) * type_size;

    std::ifstream input(model.file_path, std::ios::binary);
    if (!input.is_open()) {
        throw std::runtime_error("failed to reopen GGUF file: " + model.file_path);
    }

    const uint64_t absolute_offset = model.tensor_data_offset + it->offset;
    input.seekg(static_cast<std::streamoff>(absolute_offset), std::ios::beg);
    if (!input) {
        throw std::runtime_error("failed to seek tensor data for: " + tensor_name);
    }

    gguf_tensor_data tensor{};
    tensor.info = *it;
    tensor.raw_data.resize(checked_size(byte_count, "tensor byte count"));
    input.read(reinterpret_cast<char *>(tensor.raw_data.data()), static_cast<std::streamsize>(tensor.raw_data.size()));
    if (!input) {
        throw std::runtime_error("failed to read tensor data for: " + tensor_name);
    }

    return tensor;
}

const char *gguf_type_name(gguf_type type) {
    switch (type) {
        case GGUF_TYPE_UINT8:
            return "UINT8";
        case GGUF_TYPE_INT8:
            return "INT8";
        case GGUF_TYPE_UINT16:
            return "UINT16";
        case GGUF_TYPE_INT16:
            return "INT16";
        case GGUF_TYPE_UINT32:
            return "UINT32";
        case GGUF_TYPE_INT32:
            return "INT32";
        case GGUF_TYPE_FLOAT32:
            return "FLOAT32";
        case GGUF_TYPE_BOOL:
            return "BOOL";
        case GGUF_TYPE_STRING:
            return "STRING";
        case GGUF_TYPE_ARRAY:
            return "ARRAY";
        case GGUF_TYPE_UINT64:
            return "UINT64";
        case GGUF_TYPE_INT64:
            return "INT64";
        case GGUF_TYPE_FLOAT64:
            return "FLOAT64";
        case GGUF_TYPE_COUNT:
            return "COUNT";
        default:
            return "UNKNOWN";
    }
}

size_t ggml_type_size(ggml_type type) {
    switch (type) {
        case GGML_TYPE_F32:
            return 4;
        case GGML_TYPE_F16:
            return 2;
        default:
            throw std::runtime_error("unsupported ggml type size query");
    }
}

size_t ggml_blck_size(ggml_type type) {
    switch (type) {
        case GGML_TYPE_F32:
        case GGML_TYPE_F16:
            return 1;
        default:
            throw std::runtime_error("unsupported ggml block size query");
    }
}

const char *ggml_type_name(ggml_type type) {
    switch (type) {
        case GGML_TYPE_F32:
            return "F32";
        case GGML_TYPE_F16:
            return "F16";
        default:
            return "UNKNOWN";
    }
}

Let's test the output:
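For the test, a minimal driver might look like this (the model path is a placeholder):

int main() {
    try {
        // hypothetical local path to the downloaded GGUF file
        const gguf_model model = load_gguf_model("qwen2.5-1.5b-instruct-fp16.gguf");
        print_gguf_metadata(model.metadata);
        print_gguf_tensor_overview(model);
    } catch (const std::exception &e) {
        std::cerr << "error: " << e.what() << '\n';
        return 1;
    }
    return 0;
}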

Normal output:

As you can see, we successfully read the GGUF metadata and tokenizer information, and it matches the data on Hugging Face. We can now build on this data in the steps that follow.

Here is a brief rundown of what these metadata fields mean:

Basic information

  • general.architecture: model architecture name, e.g. qwen2

  • general.type: file type, usually model

  • general.name: model name

  • general.version: model version

  • general.finetune: fine-tuning source, or the name after fine-tuning

  • general.size_label: parameter-scale label, e.g. 1.8B

  • general.file_type: weight quantization/storage type id

  • general.quantization_version: GGUF quantization format version

Model structure parameters. These determine what the network looks like (see the usage sketch after this list):

  • qwen2.block_count: number of Transformer blocks

  • qwen2.context_length: maximum context length

  • qwen2.embedding_length: hidden size / embedding dimension

  • qwen2.feed_forward_length: MLP intermediate dimension

  • qwen2.attention.head_count: number of attention heads

  • qwen2.attention.head_count_kv: number of KV heads (GQA/MQA)

  • qwen2.rope.freq_base: RoPE base frequency

  • qwen2.attention.layer_norm_rms_epsilon: RMSNorm epsilon
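For example, the engine can pull these structural parameters out of the parsed metadata with the metadata_u32_or_default helper defined earlier. The struct and function names below are my own, and the fallback values are just illustrative:

struct qwen2_config {
    uint32_t n_layers;
    uint32_t n_heads;
    uint32_t n_kv_heads;
    uint32_t d_model;
};

qwen2_config read_qwen2_config(const gguf_metadata &meta) {
    // fallbacks are illustrative defaults for Qwen2.5-1.5B
    qwen2_config cfg{};
    cfg.n_layers   = metadata_u32_or_default(meta, "qwen2.block_count", 28);
    cfg.n_heads    = metadata_u32_or_default(meta, "qwen2.attention.head_count", 12);
    cfg.n_kv_heads = metadata_u32_or_default(meta, "qwen2.attention.head_count_kv", 2);
    cfg.d_model    = metadata_u32_or_default(meta, "qwen2.embedding_length", 1536);
    return cfg;
}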

Tokenizer information. These determine how text is turned into tokens:

  • tokenizer.ggml.model: tokenizer type, e.g. gpt2

  • tokenizer.ggml.pre: pre-tokenization / compatibility scheme, e.g. qwen2

  • tokenizer.ggml.tokens: the vocabulary, mapping ids to token strings

  • tokenizer.ggml.token_type: a type tag for each token

  • tokenizer.ggml.merges: BPE merge rules

  • tokenizer.ggml.bos_token_id: BOS token id

  • tokenizer.ggml.eos_token_id: EOS token id

  • tokenizer.ggml.padding_token_id: padding token id

  • tokenizer.ggml.add_bos_token: whether to automatically add BOS when encoding

Chat template

  • tokenizer.chat_template: the chat template, defining how system/user/assistant/tool messages are assembled into the final prompt
