In last semester's course, our assignment was to optimize a modified llama2.c project (with the KV cache and compiler optimizations removed). Because that project enforced group work, I never got the chance to write and optimize an LLM inference engine entirely from scratch. So here I attempt to hand-write a Qwen inference engine from the ground up, starting with a CPU implementation and naive CUDA kernels.
We start with the Qwen2.5-1.5B model, because it is a standard decoder-only Transformer and therefore a good first target for an inference engine. Qwen3.5, by contrast, uses a hybrid architecture; we will deal with that later.
LLM Inference Engine Structure
+----------------------------------------------------------------------------------+
| Qwen2.5-1.5B-Instruct.gguf |
+----------------------------------------------------------------------------------+
| Header |
| - magic/version |
| - tensor_count |
| - metadata_kv_count |
+----------------------------------------------------------------------------------+
| Metadata KV |
| - general.architecture = qwen2 |
| - general.name |
| - qwen2.block_count = 28 |
| - qwen2.embedding_length |
| - qwen2.attention.head_count = 12 |
| - qwen2.attention.head_count_kv = 2 |
| - qwen2.rope.freq_base |
| - qwen2.context_length = 32768 |
| - tokenizer.ggml.tokens |
| - tokenizer.ggml.merges / special tokens |
| - general.file_type / quantization info |
+----------------------------------------------------------------------------------+
| Tensor Directory |
| - tensor name |
| - n_dims |
| - shape[] |
| - ggml_type |
| - data_offset |
+----------------------------------------------------------------------------------+
| Tensor Data |
| - token_embd.weight |
| - blk.0.attn_norm.weight |
| - blk.0.attn_q.bias / weight |
| - blk.0.attn_k.bias / weight |
| - blk.0.attn_v.bias / weight |
| - blk.0.attn_output.weight |
| - blk.0.ffn_norm.weight |
| - blk.0.ffn_gate.weight |
| - blk.0.ffn_up.weight |
| - blk.0.ffn_down.weight |
| - ... |
| - blk.27.* |
| - output_norm.weight |
| - output.weight (or tied to the embedding) |
+----------------------------------------------------------------------------------+
The diagram above shows the content structure of a Qwen2.5 model in GGUF format. To actually run inference with this model, the basic flow looks like this:
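In outline, one decode step of a Qwen2.5-style decoder-only model can be sketched as follows (my own summary of the standard pipeline; the head counts come from the metadata above):

```
x = embedding[token]                            # token_embd.weight lookup
for each of the 28 transformer blocks:
    h    = RMSNorm(x)                           # attn_norm
    q, k, v = h*Wq + bq, h*Wk + bk, h*Wv + bv   # attn_q / attn_k / attn_v
    q, k = RoPE(q, pos), RoPE(k, pos)           # inject position
    append k, v to the KV cache
    a    = softmax(q*K^T / sqrt(head_dim)) * V  # GQA: 12 Q heads, 2 KV heads
    x    = x + a*Wo                             # attn_output + residual
    h    = RMSNorm(x)                           # ffn_norm
    x    = x + Wdown(SiLU(Wgate*h) ⊙ Wup*h)     # SwiGLU FFN + residual
logits = RMSNorm(x) * Wout                      # output_norm + output.weight
token  = sample(logits)                         # greedy / top-k / ...
```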

I will not walk through every Transformer step in detail here; instead I will explain only a few selected pieces:
RMSNorm
RMSNorm normalizes the scale of the hidden state during the forward pass, keeping deep Transformers numerically stable in both training and inference. The formula is as follows:
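The standard RMSNorm formula, with the learned weight γ coming from blk.N.attn_norm.weight / blk.N.ffn_norm.weight and ε from qwen2.attention.layer_norm_rms_epsilon:

```
\mathrm{RMSNorm}(x)_i = \gamma_i \cdot \frac{x_i}{\sqrt{\frac{1}{d}\sum_{j=1}^{d} x_j^{2} + \epsilon}}
```

where d is the hidden dimension (qwen2.embedding_length).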
In short, it normalizes the hidden state's scale by its root mean square.
RoPE
Rotary Position Embedding
After a large model converts tokens into high-dimensional vectors, a problem arises: order blindness. From the model's point of view, if you randomly shuffle the words of a sentence, the shuffled sentence consists of exactly the same tokens as the original. To give the model a sense of order, we introduce RoPE.
Positional information is fundamentally a distance relation between Q and K, not an intrinsic property of V. RoPE achieves relative position encoding through absolute position encoding: each position is expressed as a rotation in a two-dimensional plane. This transformation preserves the original vector's properties while injecting positional information through the rotation. If one 2D vector is rotated by m radians and another by n radians, the angle between them is |m - n|; this is how relative position information effectively enters the attention computation.
SwiGLU
A gated activation structure used in the feed-forward network, replacing a plain activation function.
Sampler
Decides how to pick the next token, e.g. greedy (always take the argmax) or top-k (sample randomly among the k most probable tokens).
Reading the Metadata
First we need to obtain the model's GGUF metadata. For reference, we can compare against the metadata visible on Hugging Face:

Here we can split the metadata into two parts: the header and the metadata KV pairs. The header contains a GGUF magic (identifying the file as GGUF) plus version, tensor_count, and kv_count; the remaining general.*, qwen2.*, and tokenizer.* entries go into a vector of KV pairs.
First, let's design the structs:
// define gguf type
enum gguf_type {
GGUF_TYPE_UINT8 = 0,
GGUF_TYPE_INT8 = 1,
GGUF_TYPE_UINT16 = 2,
GGUF_TYPE_INT16 = 3,
GGUF_TYPE_UINT32 = 4,
GGUF_TYPE_INT32 = 5,
GGUF_TYPE_FLOAT32 = 6,
GGUF_TYPE_BOOL = 7,
GGUF_TYPE_STRING = 8,
GGUF_TYPE_ARRAY = 9,
GGUF_TYPE_UINT64 = 10,
GGUF_TYPE_INT64 = 11,
GGUF_TYPE_FLOAT64 = 12,
GGUF_TYPE_COUNT,
};
struct gguf_header {
uint32_t magic;
uint32_t version;
uint64_t tensor_count;
uint64_t metadata_kv_count;
};
struct gguf_string {
uint64_t len;
std::string data;
};
struct gguf_metadata_array;
using gguf_metadata_payload = std::variant<
std::monostate,
uint8_t,
int8_t,
uint16_t,
int16_t,
uint32_t,
int32_t,
float,
bool,
gguf_string,
std::shared_ptr<gguf_metadata_array>,
uint64_t,
int64_t,
double
>;
struct gguf_metadata_value {
gguf_type type = GGUF_TYPE_COUNT;
gguf_metadata_payload data{};
};
struct gguf_metadata_array {
gguf_type element_type = GGUF_TYPE_COUNT;
uint64_t len = 0;
std::vector<gguf_metadata_value> values;
};
struct gguf_metadata_kv {
gguf_string key;
gguf_metadata_value value;
};
struct gguf_string_hash {
size_t operator()(const std::string &value) const noexcept {
// FNV Hash
constexpr size_t kOffset = 1469598103934665603ULL;
constexpr size_t kPrime = 1099511628211ULL;
size_t hash = kOffset;
for (unsigned char ch : value) {
hash ^= static_cast<size_t>(ch);
hash *= kPrime;
}
return hash;
}
};
struct gguf_metadata {
gguf_header header;
std::vector<gguf_metadata_kv> kvs;
std::unordered_map<std::string, size_t, gguf_string_hash> kvs_map;
};
gguf_string read_gguf_string(std::ifstream &input);
gguf_metadata_value read_gguf_metadata_value(std::ifstream &input);
gguf_metadata_kv read_gguf_metadata_kv(std::ifstream &input);
void print_gguf_metadata(const gguf_metadata &meta);
const char * gguf_type_name(gguf_type type);
enum ggml_type : uint32_t {
GGML_TYPE_F32 = 0,
GGML_TYPE_F16 = 1,
};
struct gguf_tensor_info {
gguf_string name;
uint32_t n_dimensions = 0;
std::vector<uint64_t> dimensions;
ggml_type type = GGML_TYPE_F32;
uint64_t offset = 0;
};
struct gguf_model {
gguf_metadata metadata;
std::vector<gguf_tensor_info> tensor_infos;
uint64_t tensor_data_offset = 0;
uint32_t alignment = 32;
std::string file_path;
};
struct gguf_tensor_data {
gguf_tensor_info info;
std::vector<uint8_t> raw_data;
};
gguf_tensor_info read_gguf_tensor_info(std::ifstream &input);
gguf_model load_gguf_model(const std::string &path);
void print_gguf_tensor_info(const gguf_tensor_info &info);
void print_gguf_tensor_overview(const gguf_model &model, size_t limit = 8);
gguf_tensor_data load_gguf_tensor_data(const gguf_model &model, const std::string &tensor_name);
size_t ggml_type_size(ggml_type type);
size_t ggml_blck_size(ggml_type type);
const char *ggml_type_name(ggml_type type);
Below are the corresponding implementations:
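Note that these implementations rely on a few low-level helpers that are not shown above (read_u8 … read_f64, checked_size, align_offset, and the constants kGgufMagic and kMetadataArrayPreviewCount). A sketch of what they could look like, assuming a little-endian host (the exact signatures here are my assumption):

```cpp
#include <cstdint>
#include <fstream>
#include <limits>
#include <stdexcept>
#include <string>

constexpr uint32_t kGgufMagic = 0x46554747;        // "GGUF" read as little-endian u32
constexpr size_t kMetadataArrayPreviewCount = 8;   // array elements shown when printing

// Read one little-endian POD value from the stream (assumes LE host).
template <typename T>
T read_pod(std::ifstream &input) {
    T value{};
    input.read(reinterpret_cast<char *>(&value), sizeof(T));
    if (!input) throw std::runtime_error("unexpected end of file");
    return value;
}
inline uint8_t  read_u8 (std::ifstream &in) { return read_pod<uint8_t>(in); }
inline int8_t   read_i8 (std::ifstream &in) { return read_pod<int8_t>(in); }
inline uint16_t read_u16(std::ifstream &in) { return read_pod<uint16_t>(in); }
inline int16_t  read_i16(std::ifstream &in) { return read_pod<int16_t>(in); }
inline uint32_t read_u32(std::ifstream &in) { return read_pod<uint32_t>(in); }
inline int32_t  read_i32(std::ifstream &in) { return read_pod<int32_t>(in); }
inline uint64_t read_u64(std::ifstream &in) { return read_pod<uint64_t>(in); }
inline int64_t  read_i64(std::ifstream &in) { return read_pod<int64_t>(in); }
inline float    read_f32(std::ifstream &in) { return read_pod<float>(in); }
inline double   read_f64(std::ifstream &in) { return read_pod<double>(in); }

// Guard 64-bit counts from the file before using them as size_t.
inline size_t checked_size(uint64_t value, const char *what) {
    if (value > std::numeric_limits<size_t>::max()) {
        throw std::runtime_error(std::string("value too large: ") + what);
    }
    return static_cast<size_t>(value);
}

// Round `offset` up to the next multiple of `alignment`.
inline uint64_t align_offset(uint64_t offset, uint32_t alignment) {
    return (offset + alignment - 1) / alignment * alignment;
}
```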
gguf_type read_gguf_type(std::ifstream &input) {
const uint32_t raw_type = read_u32(input);
if (raw_type >= static_cast<uint32_t>(GGUF_TYPE_COUNT)) {
throw std::runtime_error("invalid gguf metadata type: " + std::to_string(raw_type));
}
return static_cast<gguf_type>(raw_type);
}
ggml_type read_ggml_type(std::ifstream &input) {
const uint32_t raw_type = read_u32(input);
switch (raw_type) {
case GGML_TYPE_F32:
case GGML_TYPE_F16:
return static_cast<ggml_type>(raw_type);
default:
throw std::runtime_error("unsupported ggml tensor type: " + std::to_string(raw_type));
}
}
uint32_t metadata_u32_or_default(const gguf_metadata &meta, const std::string &key, uint32_t default_value) {
const auto it = meta.kvs_map.find(key);
if (it == meta.kvs_map.end()) {
return default_value;
}
const gguf_metadata_kv &kv = meta.kvs[it->second];
if (kv.value.type != GGUF_TYPE_UINT32) {
throw std::runtime_error("metadata key '" + key + "' is not UINT32");
}
return std::get<uint32_t>(kv.value.data);
}
void build_metadata_index(gguf_metadata &meta) {
meta.kvs_map.clear();
meta.kvs_map.reserve(meta.kvs.size());
for (size_t i = 0; i < meta.kvs.size(); ++i) {
const std::string &key = meta.kvs[i].key.data;
const auto [_, inserted] = meta.kvs_map.emplace(key, i);
if (!inserted) {
throw std::runtime_error("duplicate metadata key: " + key);
}
}
}
uint64_t tensor_element_count(const gguf_tensor_info &info) {
if (info.dimensions.empty()) {
return 0;
}
return std::accumulate(
info.dimensions.begin(),
info.dimensions.end(),
uint64_t{1},
[](uint64_t lhs, uint64_t rhs) { return lhs * rhs; }
);
}
gguf_metadata_value read_gguf_metadata_value_of_type(std::ifstream &input, gguf_type type);
void print_gguf_metadata_value(std::ostream &output, const gguf_metadata_value &value);
void print_gguf_metadata_array(std::ostream &output, const std::shared_ptr<gguf_metadata_array> &array) {
if (array == nullptr) {
output << "[]";
return;
}
output << '[';
size_t printed = 0;
for (auto it = array->values.cbegin();
it != array->values.cend() && printed < kMetadataArrayPreviewCount;
++it, ++printed) {
if (it != array->values.cbegin()) {
output << ", ";
}
print_gguf_metadata_value(output, *it);
}
if (array->values.size() > kMetadataArrayPreviewCount) {
output << ", ...";
}
output << ']';
}
std::shared_ptr<gguf_metadata_array> read_gguf_metadata_array(std::ifstream &input) {
auto array = std::make_shared<gguf_metadata_array>();
array->element_type = read_gguf_type(input);
array->len = read_u64(input);
const size_t count = checked_size(array->len, "metadata array length");
array->values.reserve(count);
for (size_t i = 0; i < count; ++i) {
array->values.push_back(read_gguf_metadata_value_of_type(input, array->element_type));
}
return array;
}
gguf_metadata_value read_gguf_metadata_value_of_type(std::ifstream &input, gguf_type type) {
gguf_metadata_value value{};
value.type = type;
switch (type) {
case GGUF_TYPE_UINT8:
value.data.emplace<uint8_t>(read_u8(input));
break;
case GGUF_TYPE_INT8:
value.data.emplace<int8_t>(read_i8(input));
break;
case GGUF_TYPE_UINT16:
value.data.emplace<uint16_t>(read_u16(input));
break;
case GGUF_TYPE_INT16:
value.data.emplace<int16_t>(read_i16(input));
break;
case GGUF_TYPE_UINT32:
value.data.emplace<uint32_t>(read_u32(input));
break;
case GGUF_TYPE_INT32:
value.data.emplace<int32_t>(read_i32(input));
break;
case GGUF_TYPE_FLOAT32:
value.data.emplace<float>(read_f32(input));
break;
case GGUF_TYPE_BOOL:
value.data.emplace<bool>(read_u8(input) != 0);
break;
case GGUF_TYPE_STRING:
value.data.emplace<gguf_string>(read_gguf_string(input));
break;
case GGUF_TYPE_ARRAY:
value.data.emplace<std::shared_ptr<gguf_metadata_array>>(read_gguf_metadata_array(input));
break;
case GGUF_TYPE_UINT64:
value.data.emplace<uint64_t>(read_u64(input));
break;
case GGUF_TYPE_INT64:
value.data.emplace<int64_t>(read_i64(input));
break;
case GGUF_TYPE_FLOAT64:
value.data.emplace<double>(read_f64(input));
break;
case GGUF_TYPE_COUNT:
default:
throw std::runtime_error("unsupported gguf metadata type");
}
return value;
}
void print_gguf_metadata_value(std::ostream &output, const gguf_metadata_value &value) {
switch (value.type) {
case GGUF_TYPE_UINT8:
output << static_cast<uint32_t>(std::get<uint8_t>(value.data));
break;
case GGUF_TYPE_INT8:
output << static_cast<int32_t>(std::get<int8_t>(value.data));
break;
case GGUF_TYPE_UINT16:
output << std::get<uint16_t>(value.data);
break;
case GGUF_TYPE_INT16:
output << std::get<int16_t>(value.data);
break;
case GGUF_TYPE_UINT32:
output << std::get<uint32_t>(value.data);
break;
case GGUF_TYPE_INT32:
output << std::get<int32_t>(value.data);
break;
case GGUF_TYPE_FLOAT32:
output << std::get<float>(value.data);
break;
case GGUF_TYPE_BOOL:
output << (std::get<bool>(value.data) ? "true" : "false");
break;
case GGUF_TYPE_STRING:
output << std::get<gguf_string>(value.data).data;
break;
case GGUF_TYPE_ARRAY:
print_gguf_metadata_array(output, std::get<std::shared_ptr<gguf_metadata_array>>(value.data));
break;
case GGUF_TYPE_UINT64:
output << std::get<uint64_t>(value.data);
break;
case GGUF_TYPE_INT64:
output << std::get<int64_t>(value.data);
break;
case GGUF_TYPE_FLOAT64:
output << std::get<double>(value.data);
break;
case GGUF_TYPE_COUNT:
default:
output << "<unknown>";
break;
}
}
} // namespace
gguf_string read_gguf_string(std::ifstream &input) {
gguf_string value{};
value.len = read_u64(input);
value.data.resize(checked_size(value.len, "gguf string length"));
if (!value.data.empty()) {
input.read(value.data.data(), static_cast<std::streamsize>(value.data.size()));
if (!input) {
throw std::runtime_error("failed to read gguf string payload");
}
}
return value;
}
gguf_metadata_value read_gguf_metadata_value(std::ifstream &input) {
return read_gguf_metadata_value_of_type(input, read_gguf_type(input));
}
gguf_metadata_kv read_gguf_metadata_kv(std::ifstream &input) {
gguf_metadata_kv kv{};
kv.key = read_gguf_string(input);
kv.value = read_gguf_metadata_value(input);
return kv;
}
gguf_tensor_info read_gguf_tensor_info(std::ifstream &input) {
gguf_tensor_info info{};
info.name = read_gguf_string(input);
info.n_dimensions = read_u32(input);
info.dimensions.reserve(info.n_dimensions);
for (uint32_t i = 0; i < info.n_dimensions; ++i) {
info.dimensions.push_back(read_u64(input));
}
info.type = read_ggml_type(input);
info.offset = read_u64(input);
return info;
}
gguf_model load_gguf_model(const std::string &path) {
std::ifstream input(path, std::ios::binary);
if (!input.is_open()) {
throw std::runtime_error("failed to open GGUF file: " + path);
}
gguf_model model{};
model.file_path = path;
model.metadata.header.magic = read_u32(input);
model.metadata.header.version = read_u32(input);
model.metadata.header.tensor_count = read_u64(input);
model.metadata.header.metadata_kv_count = read_u64(input);
if (model.metadata.header.magic != kGgufMagic) {
throw std::runtime_error("invalid GGUF magic");
}
model.metadata.kvs.reserve(checked_size(model.metadata.header.metadata_kv_count, "metadata kv count"));
for (uint64_t i = 0; i < model.metadata.header.metadata_kv_count; ++i) {
model.metadata.kvs.push_back(read_gguf_metadata_kv(input));
}
build_metadata_index(model.metadata);
model.alignment = metadata_u32_or_default(model.metadata, "general.alignment", 32);
model.tensor_infos.reserve(checked_size(model.metadata.header.tensor_count, "tensor count"));
for (uint64_t i = 0; i < model.metadata.header.tensor_count; ++i) {
model.tensor_infos.push_back(read_gguf_tensor_info(input));
}
const uint64_t info_end = static_cast<uint64_t>(input.tellg());
model.tensor_data_offset = align_offset(info_end, model.alignment);
return model;
}
void print_gguf_metadata(const gguf_metadata &meta) {
const gguf_header &header = meta.header;
const char magic_text[5] = {
static_cast<char>(header.magic & 0xFF),
static_cast<char>((header.magic >> 8) & 0xFF),
static_cast<char>((header.magic >> 16) & 0xFF),
static_cast<char>((header.magic >> 24) & 0xFF),
'\0'
};
std::cout << "[header]" << '\n';
std::cout << "magic: 0x"
<< std::hex << std::uppercase << header.magic
<< std::dec << " (" << magic_text << ")" << '\n';
std::cout << "version: " << header.version << '\n';
std::cout << "tensor_count: " << header.tensor_count << '\n';
std::cout << "metadata_kv_count: " << header.metadata_kv_count << '\n';
std::cout << "[metadata]" << '\n';
for (auto it = meta.kvs.cbegin(); it != meta.kvs.cend(); ++it) {
std::cout << it->key.data << ": ";
print_gguf_metadata_value(std::cout, it->value);
std::cout << '\n';
}
}
void print_gguf_tensor_info(const gguf_tensor_info &info) {
std::cout << info.name.data << " | dims=[";
for (size_t i = 0; i < info.dimensions.size(); ++i) {
if (i != 0) {
std::cout << ", ";
}
std::cout << info.dimensions[i];
}
std::cout << "]"
<< " | type=" << ggml_type_name(info.type)
<< " | offset=" << info.offset
<< '\n';
}
void print_gguf_tensor_overview(const gguf_model &model, size_t limit) {
std::cout << "[tensors]" << '\n';
std::cout << "alignment: " << model.alignment << '\n';
std::cout << "tensor_data_offset: " << model.tensor_data_offset << '\n';
const size_t count = std::min(limit, model.tensor_infos.size());
for (size_t i = 0; i < count; ++i) {
print_gguf_tensor_info(model.tensor_infos[i]);
}
if (model.tensor_infos.size() > count) {
std::cout << "... (" << (model.tensor_infos.size() - count) << " more tensors)" << '\n';
}
}
gguf_tensor_data load_gguf_tensor_data(const gguf_model &model, const std::string &tensor_name) {
const auto it = std::find_if(
model.tensor_infos.begin(),
model.tensor_infos.end(),
[&](const gguf_tensor_info &info) { return info.name.data == tensor_name; }
);
if (it == model.tensor_infos.end()) {
throw std::runtime_error("tensor not found: " + tensor_name);
}
const uint64_t element_count = tensor_element_count(*it);
const size_t type_size = ggml_type_size(it->type);
const size_t block_size = ggml_blck_size(it->type);
if (block_size == 0 || element_count % block_size != 0) {
throw std::runtime_error("invalid tensor block layout for: " + tensor_name);
}
const uint64_t byte_count = (element_count / block_size) * type_size;
std::ifstream input(model.file_path, std::ios::binary);
if (!input.is_open()) {
throw std::runtime_error("failed to reopen GGUF file: " + model.file_path);
}
const uint64_t absolute_offset = model.tensor_data_offset + it->offset;
input.seekg(static_cast<std::streamoff>(absolute_offset), std::ios::beg);
if (!input) {
throw std::runtime_error("failed to seek tensor data for: " + tensor_name);
}
gguf_tensor_data tensor{};
tensor.info = *it;
tensor.raw_data.resize(checked_size(byte_count, "tensor byte count"));
input.read(reinterpret_cast<char *>(tensor.raw_data.data()), static_cast<std::streamsize>(tensor.raw_data.size()));
if (!input) {
throw std::runtime_error("failed to read tensor data for: " + tensor_name);
}
return tensor;
}
const char *gguf_type_name(gguf_type type) {
switch (type) {
case GGUF_TYPE_UINT8:
return "UINT8";
case GGUF_TYPE_INT8:
return "INT8";
case GGUF_TYPE_UINT16:
return "UINT16";
case GGUF_TYPE_INT16:
return "INT16";
case GGUF_TYPE_UINT32:
return "UINT32";
case GGUF_TYPE_INT32:
return "INT32";
case GGUF_TYPE_FLOAT32:
return "FLOAT32";
case GGUF_TYPE_BOOL:
return "BOOL";
case GGUF_TYPE_STRING:
return "STRING";
case GGUF_TYPE_ARRAY:
return "ARRAY";
case GGUF_TYPE_UINT64:
return "UINT64";
case GGUF_TYPE_INT64:
return "INT64";
case GGUF_TYPE_FLOAT64:
return "FLOAT64";
case GGUF_TYPE_COUNT:
return "COUNT";
default:
return "UNKNOWN";
}
}
size_t ggml_type_size(ggml_type type) {
switch (type) {
case GGML_TYPE_F32:
return 4;
case GGML_TYPE_F16:
return 2;
default:
throw std::runtime_error("unsupported ggml type size query");
}
}
size_t ggml_blck_size(ggml_type type) {
switch (type) {
case GGML_TYPE_F32:
case GGML_TYPE_F16:
return 1;
default:
throw std::runtime_error("unsupported ggml block size query");
}
}
const char *ggml_type_name(ggml_type type) {
switch (type) {
case GGML_TYPE_F32:
return "F32";
case GGML_TYPE_F16:
return "F16";
default:
return "UNKNOWN";
}
}
Let's test the output:
Normal output:

As you can see, we successfully read the GGUF metadata and tokenizer information, and it matches the data shown on Hugging Face. We can now use this data in the later stages.
Here is a brief explanation of what these metadata fields mean:
General information
general.architecture: model architecture name, e.g. qwen2
general.type: file type, usually model
general.name: model name
general.version: model version
general.finetune: fine-tune provenance, or the fine-tuned model's name
general.size_label: parameter-count label, e.g. 1.8B
general.file_type: weight quantization/storage type id
general.quantization_version: GGUF quantization format version
Model structure parameters. These determine the network's shape:
qwen2.block_count: number of Transformer blocks
qwen2.context_length: maximum context length
qwen2.embedding_length: hidden size / embedding dimension
qwen2.feed_forward_length: MLP intermediate dimension
qwen2.attention.head_count: number of attention heads
qwen2.attention.head_count_kv: number of KV heads (relevant for GQA/MQA)
qwen2.rope.freq_base: RoPE base frequency
qwen2.attention.layer_norm_rms_epsilon: RMSNorm epsilon
Tokenizer information. These determine how text turns into tokens:
tokenizer.ggml.model: tokenizer type, e.g. gpt2
tokenizer.ggml.pre: pre-tokenization/compatibility scheme, e.g. qwen2
tokenizer.ggml.tokens: vocabulary, mapping id to token string
tokenizer.ggml.token_type: per-token type flag
tokenizer.ggml.merges: BPE merge rules
tokenizer.ggml.bos_token_id: BOS token id
tokenizer.ggml.eos_token_id: EOS token id
tokenizer.ggml.padding_token_id: padding token id
tokenizer.ggml.add_bos_token: whether to automatically prepend BOS when encoding
Chat template
tokenizer.chat_template: the chat template, defining how system/user/assistant/tool messages are assembled into the final prompt