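Preface
This post only records what I tried and explored on the last day before the deadline (sigh). The model does not run correctly yet.
I will keep experimenting later.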
repo: https://github.com/MollySophia/rwkv-cix
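Exporting a single-layer model
To make debugging easier, first export a single-layer model.
Download rwkv7-g1a-0.1b-20250728-ctx4096.pth, update the path in the script, then run python export_model_singlelayer.py
A first attempt at writing a cfg
I couldn't find documentation that defines the cfg format (note: I later found it in the READMEs of the standalone modules such as https://github.com/Arm-China/Compass_Optimizer, and there is also a PDF tutorial: https://github.com/Arm-China/Compass_Optimizer/blob/main/tutorial.pdf), so I adapted an existing cfg to see what happens: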
[Common]
mode = build
[Parser]
model_type = onnx
model_name = rwkv
detection_postprocess =
model_domain = llm
input_model = ./model.onnx
output_dir = ./build
[Optimizer]
# calibration_data = None
calibration_batch_size = 1
metric_batch_size = 1
output_dir = ./
dataset = numpydataset
save_statistic_info = True
cast_dtypes_for_lib = True
quantize_method_for_activation = per_tensor_asymmetric
quantize_method_for_weight = per_channel_symmetric_restricted_range
weight_bits = 8
activation_bits = 16
[GBuilder]
target = X2_1204MP3
outputs = ./rwkv.cix
profile = True
tiling = fps
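Since the cfg is an INI-style file, a quick sanity check before a long build can be sketched with Python's standard `configparser`. This is only an illustration: `summarize_cfg` is a hypothetical helper, the embedded text is an abbreviated copy of the cfg above, and cixbuild's own parser may treat edge cases differently from `configparser`.

```python
import configparser

# Abbreviated copy of the cfg above (only the keys inspected below).
CFG_TEXT = """
[Common]
mode = build

[Parser]
model_type = onnx
model_name = rwkv
detection_postprocess =
input_model = ./model.onnx

[Optimizer]
# calibration_data = None
dataset = numpydataset
weight_bits = 8
activation_bits = 16

[GBuilder]
target = X2_1204MP3
"""


def summarize_cfg(text):
    """Hypothetical helper: pull a few quantization settings out of the cfg."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    opt = parser["Optimizer"]
    return {
        "model_name": parser["Parser"]["model_name"],
        "target": parser["GBuilder"]["target"],
        "weight_bits": opt.getint("weight_bits"),
        "activation_bits": opt.getint("activation_bits"),
        # None when the key is missing or commented out
        "calibration_data": opt.get("calibration_data"),
    }


summary = summarize_cfg(CFG_TEXT)
print(summary)
if summary["calibration_data"] is None:
    print("calibration_data is not set")
```

A check like this makes it obvious up front that `calibration_data` is commented out.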
cixbuild
First, try to export a floating-point model: cixbuild config.cfg
It turns out this fails when no calibration data is provided.
To work around it, edit line 94 of /home/molly/miniconda3/envs/cix/lib/python3.10/site-packages/AIPUBuilder/build_cix.py: if not os.path.isabs(calibration_data_path): -> if calibration_data_path is not None and not os.path.isabs(calibration_data_path):
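The reason the guard is needed can be reproduced in isolation: `os.path.isabs(None)` raises `TypeError`. The sketch below uses a hypothetical helper, `resolve_calibration_path`, that mirrors the patched condition. What the real `build_cix.py` does inside that branch is an assumption here (joining with a base directory).

```python
import os


def resolve_calibration_path(calibration_data_path, base_dir):
    """Hypothetical stand-in for the patched check in build_cix.py."""
    # The `is not None` test must come first so os.path.isabs never sees None.
    if calibration_data_path is not None and not os.path.isabs(calibration_data_path):
        # Assumed behavior: make a relative path absolute against base_dir.
        return os.path.abspath(os.path.join(base_dir, calibration_data_path))
    return calibration_data_path


# Without the guard, a missing calibration_data entry fails like this:
try:
    os.path.isabs(None)
    unguarded_raises = False
except TypeError:
    unguarded_raises = True

print(unguarded_raises)
print(resolve_calibration_path(None, os.getcwd()))
print(resolve_calibration_path("calib_dataset.npy", os.getcwd()))
```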
Run cixbuild again:
$ cixbuild config.cfg
/home/molly/miniconda3/envs/cix/lib/python3.10/site-packages/torch/cuda/__init__.py:287: UserWarning:
NVIDIA GeForce RTX 5090 with CUDA capability sm_120 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90.
If you want to use the NVIDIA GeForce RTX 5090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(
No calibration data provided, please consider use a calib dataset...
[I] Build with version 6.1.3407
[I] Parsing model....
[I] [Parser]: Begin to parse onnx model rwkv...
2025-11-14 00:19:59.045141: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-11-14 00:19:59.046232: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-11-14 00:19:59.061841: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-11-14 00:19:59.061865: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-11-14 00:19:59.061881: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-11-14 00:19:59.065816: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-11-14 00:19:59.516625: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2025-11-14 00:19:59.861724: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2025-11-14 00:19:59.863475: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2211] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[W] [Parser]: Convert unsupported type int64 to type int32 for Input Node (input)!
[W] [Parser]: the input node(s) has changed or not set, please check the IR to confirm the input tensors order.
[I] [Parser]: The input tensor(s) is/are: state2_in_0,state1_in_0,state0_in_0,input_0
[I] [Parser]: 0 error(s), 2 warning(s) generated.
[I] [Parser]: Parser done!
[I] [Parser]: Parser cost 2.71 seconds.
[I] Parse model complete
[I] Simplifying float model.
[I] [IRChecker] Start to check IR: /home/molly/rwkv-cix/build/internal/rwkv.txt
[I] [IRChecker] model_name: rwkv
[I] [IRChecker] IRChecker: All IR pass (Checker Plugin disabled)
[I] [graph.cpp :1605] loading graph weight: /home/molly/rwkv-cix/./build/internal/rwkv.bin size: 0xdc95610
[I] Start to simplify the graph...
[I] Using fixed-point full optimization, it may take long long time ....
[I] GSim simplified result:
------------------------------------------------------------------------
OpType.Reshape: -8
OpType.Mul: +1
OpType.Rsqrt: +1
OpType.Sqrt: -1
OpType.Div: -1
------------------------------------------------------------------------
[I] Simplify Done.
[I] Simplify float model Done.
[I] Optimizing model....
[I] [OPT] [00:25:02]: [arg_parser] is running.
[W] [OPT] [00:25:02]: please set 'calibration_data' field in cfg file if want to statistic quantization values. And Optimizer will use all zeros dataset for statistic tensors information.
[I] [OPT] [00:25:02]: tool name: Compass-Optimizer, version: 1.3.3407, use cuda: True, running device: cuda::0
[I] [OPT] [00:25:02]: [quantization config Info][model name]: rwkv, [quantization method for weight]: per_channel_symmetric_restricted_range, [quantization method for activation]: per_tensor_asymmetric, [calibation strategy for weight]: extrema, [calibation strategy for activation]: mean, [quantization precision]: activation_bits=8, weight_bits=8, bias_bits=32, lut_items_in_bits=8
[I] [OPT] [00:25:02]: Suggest using "aipuchecker" to validate the IR firstly if you are not sure about its validity.
[W] [OPT] [00:25:02]: Failed to parse IR with the exception msg: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[W] [OPT] [00:25:02]: Invalid IR, please use "aipuchecker" to diagnose it for more specific information.
[I] [OPT] [00:25:02]: Compass-Optimizer has done at [arg_parser] period.
[I] [OPT] [00:25:02]: [Done]cost time: 304.5274109840393
[E] Optimizing model failed! CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
I realized that cixbuild depends on torch==2.7.0, while, as far as I can tell, torch only supports 50-series GPUs from 2.8.0 onward.
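The warning at the top of the log already contains what is needed to predict this failure: the list of architectures compiled into the installed PyTorch build, and the GPU's capability (sm_120). Below is a simplified sketch of that comparison. `has_native_kernels` is a hypothetical helper, and it deliberately ignores PTX JIT fallback, so treat it as an approximation of what PyTorch actually checks.

```python
import re

# Architecture list copied from the UserWarning in the log above.
SUPPORTED = ["sm_50", "sm_60", "sm_70", "sm_75", "sm_80", "sm_86", "sm_90"]


def parse_arch(arch):
    """'sm_86' -> (8, 6), 'sm_120' -> (12, 0); returns None for other entries."""
    m = re.fullmatch(r"sm_(\d+)(\d)[a-z]?", arch)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


def has_native_kernels(device_capability, arch_list):
    """Simplified rule: same major version, and compiled minor <= device minor."""
    major, minor = device_capability
    for arch in arch_list:
        parsed = parse_arch(arch)
        if parsed is None:
            continue  # e.g. PTX entries such as 'compute_90' are ignored here
        if parsed[0] == major and parsed[1] <= minor:
            return True
    return False


print(has_native_kernels((12, 0), SUPPORTED))  # RTX 5090 is sm_120
print(has_native_kernels((8, 6), SUPPORTED))
```

On a real install the two inputs can be read from `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`.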
After trying to install torch==2.8.0:
Traceback (most recent call last):
File "/home/molly/miniconda3/envs/cix/bin/cixbuild", line 7, in <module>
sys.exit(build_main())
File "/home/molly/miniconda3/envs/cix/lib/python3.10/site-packages/AIPUBuilder/build_cix.py", line 217, in build_main
from AIPUBuilder.Optimizer.config import fields_to_str
File "/home/molly/miniconda3/envs/cix/lib/python3.10/site-packages/AIPUBuilder/Optimizer/__init__.py", line 4, in <module>
from AIPUBuilder.Optimizer.optmaster import *
File "/home/molly/miniconda3/envs/cix/lib/python3.10/site-packages/AIPUBuilder/Optimizer/optmaster.py", line 15, in <module>
from AIPUBuilder.Optimizer.ops import *
File "/home/molly/miniconda3/envs/cix/lib/python3.10/site-packages/AIPUBuilder/Optimizer/ops/__init__.py", line 15, in <module>
from . import pyramidroi
File "/home/molly/miniconda3/envs/cix/lib/python3.10/site-packages/AIPUBuilder/Optimizer/ops/pyramidroi.py", line 8, in <module>
import torchvision
File "/home/molly/miniconda3/envs/cix/lib/python3.10/site-packages/torchvision/__init__.py", line 10, in <module>
from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils # usort:skip
File "/home/molly/miniconda3/envs/cix/lib/python3.10/site-packages/torchvision/_meta_registrations.py", line 164, in <module>
def meta_nms(dets, scores, iou_threshold):
File "/home/molly/miniconda3/envs/cix/lib/python3.10/site-packages/torch/library.py", line 1069, in register
use_lib._register_fake(
File "/home/molly/miniconda3/envs/cix/lib/python3.10/site-packages/torch/library.py", line 219, in _register_fake
handle = entry.fake_impl.register(
File "/home/molly/miniconda3/envs/cix/lib/python3.10/site-packages/torch/_library/fake_impl.py", line 50, in register
if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
RuntimeError: operator torchvision::nms does not exist
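One possible cause of `operator torchvision::nms does not exist`, not confirmed here, is a torchvision build that no longer matches the upgraded torch, since each torchvision release is built against one specific torch release. A minimal sketch of that check follows. The pairing table is a partial list of the published release pairings, and `check_pair` and `installed_version` are hypothetical helpers.

```python
from importlib import metadata

# Partial torch -> torchvision release pairing (minor versions only).
TORCHVISION_FOR_TORCH = {"2.6": "0.21", "2.7": "0.22", "2.8": "0.23"}


def minor_version(version):
    """'2.8.0+cu128' -> '2.8'."""
    return ".".join(version.split("+")[0].split(".")[:2])


def check_pair(torch_version, torchvision_version):
    """True/False if the pairing is known, None if torch_version is not in the table."""
    expected = TORCHVISION_FOR_TORCH.get(minor_version(torch_version))
    if expected is None:
        return None
    return minor_version(torchvision_version) == expected


def installed_version(name):
    """Installed distribution version, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


# torch upgraded to 2.8.0 while torchvision is still the build for 2.7.x:
print(check_pair("2.8.0", "0.22.0"))
print(installed_version("torch"), installed_version("torchvision"))
```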
That doesn't seem to work, so let's have it skip CUDA instead:
$ CUDA_VISIBLE_DEVICES="" cixbuild config.cfg
No calibration data provided, please consider use a calib dataset...
[I] Build with version 6.1.3407
[I] Parsing model....
[I] [Parser]: Begin to parse onnx model rwkv...
2025-11-14 01:07:10.374290: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-11-14 01:07:10.375379: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-11-14 01:07:10.389602: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-11-14 01:07:10.389623: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-11-14 01:07:10.389640: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-11-14 01:07:10.393255: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-11-14 01:07:10.834149: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2025-11-14 01:07:11.200054: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2025-11-14 01:07:11.200076: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:168] retrieving CUDA diagnostic information for host: molly-workstation
2025-11-14 01:07:11.200079: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:175] hostname: molly-workstation
2025-11-14 01:07:11.200104: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:199] libcuda reported version is: 580.105.8
2025-11-14 01:07:11.200115: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:203] kernel reported version is: NOT_FOUND: could not find kernel module information in driver version file contents: "NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 580.105.08 Release Build (root@)
GCC version: gcc version 15.2.1 20250813 (GCC)
"
[W] [Parser]: Convert unsupported type int64 to type int32 for Input Node (input)!
[W] [Parser]: the input node(s) has changed or not set, please check the IR to confirm the input tensors order.
[I] [Parser]: The input tensor(s) is/are: state2_in_0,state1_in_0,state0_in_0,input_0
[I] [Parser]: 0 error(s), 2 warning(s) generated.
[I] [Parser]: Parser done!
[I] [Parser]: Parser cost 2.79 seconds.
[I] Parse model complete
[I] Simplifying float model.
[I] [IRChecker] Start to check IR: /home/molly/rwkv-cix/build/internal_2025_11_14_1_7_10_q1p4b/rwkv.txt
[I] [IRChecker] model_name: rwkv
[I] [IRChecker] IRChecker: All IR pass (Checker Plugin disabled)
[I] [graph.cpp :1605] loading graph weight: /home/molly/rwkv-cix/./build/internal_2025_11_14_1_7_10_q1p4b/rwkv.bin size: 0xdc95610
[I] Start to simplify the graph...
[I] Using fixed-point full optimization, it may take long long time ....
[I] GSim simplified result:
------------------------------------------------------------------------
OpType.Reshape: -8
OpType.Mul: +1
OpType.Rsqrt: +1
OpType.Sqrt: -1
OpType.Div: -1
------------------------------------------------------------------------
[I] Simplify Done.
[I] Simplify float model Done.
[I] Optimizing model....
[I] [OPT] [01:12:14]: [arg_parser] is running.
[W] [OPT] [01:12:14]: please set 'calibration_data' field in cfg file if want to statistic quantization values. And Optimizer will use all zeros dataset for statistic tensors information.
[I] [OPT] [01:12:14]: tool name: Compass-Optimizer, version: 1.3.3407, use cuda: False, running device: cpu
[I] [OPT] [01:12:14]: [quantization config Info][model name]: rwkv, [quantization method for weight]: per_channel_symmetric_restricted_range, [quantization method for activation]: per_tensor_asymmetric, [calibation strategy for weight]: extrema, [calibation strategy for activation]: mean, [quantization precision]: activation_bits=16, weight_bits=8, bias_bits=32, lut_items_in_bits=8
[I] [OPT] [01:12:14]: Suggest using "aipuchecker" to validate the IR firstly if you are not sure about its validity.
[I] [OPT] [01:12:14]: IR loaded.
Building graph: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 119/119 [00:00<00:00, 2800.49it/s]
[I] [OPT] [01:12:14]: Begin to load weights.
[I] [OPT] [01:12:14]: Weights loaded.
Deserializing bin: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 47/47 [00:00<00:00, 326.60it/s]
[I] [OPT] [01:12:14]: Successfully parsed IR with python API.
[I] [OPT] [01:12:14]: init graph by forwarding one sample filled with zeros
[I] [OPT] [01:12:14]: [graph_optimize_stage1] is running.
[I] [OPT] [01:12:14]: [statistic] is running.
[I] [OPT] [01:12:14]: Optimizer will use all zeros inputs to statistic tensor information because the config is not setted 'calibration_data'.
statistic weights and biases: 2%|██ | 2/118 [00:02<02:23, 1.24s/it]/home/molly/miniconda3/envs/cix/lib/python3.10/site-packages/AIPUBuilder/Optimizer/framework/pycore/pytensor.py:409: UserWarning: std_mean(): degrees of freedom is <= 0. Correction should be strictly less than the reduction factor (input numel divided by output numel). (Triggered internally at /pytorch/aten/src/ATen/native/ReduceOps.cpp:1839.)
running_std_key_axis, running_mean_key_axis = torch.std_mean(fbetensor, dim=other_dims)
/home/molly/miniconda3/envs/cix/lib/python3.10/site-packages/AIPUBuilder/Optimizer/framework/pycore/pytensor.py:444: UserWarning: std_mean(): degrees of freedom is <= 0. Correction should be strictly less than the reduction factor (input numel divided by output numel). (Triggered internally at /pytorch/aten/src/ATen/native/ReduceOps.cpp:1839.)
running_std, running_mean = torch.std_mean(fbetensor)
[I] [OPT] [01:12:18]: [graph_optimize_stage2] is running.
[I] [OPT] [01:12:18]: applying calibration strategy based on statistic info
calibration: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 118/118 [00:00<00:00, 31227.70it/s]
[I] [OPT] [01:12:18]: [quantize] is running.
update_tensor_quantization_attrs: 2%|█▉ | 2/118 [00:00<00:18, 6.17it/s][W] [OPT] [01:12:18]: layer_id=51, layer_type=OpType.Constant : requred bits for tensor "/blocks.0/att/mix_ka_sub/Constant_0" is 16, but actually got 8, which may cause accuracy issues.
[W] [OPT] [01:12:18]: layer_id=54, layer_type=OpType.Constant : requred bits for tensor "/blocks.0/att/mix_ka_add/Constant_0" is 16, but actually got 8, which may cause accuracy issues.
[W] [OPT] [01:12:18]: quantize 'node info: name=/blocks.0/att/mix_ka_sub/Sub_clone_, type=OpType.Sub, layer_id = 52' one input is quantize invariant and other one input is not, which may cause accuracy issue.
[W] [OPT] [01:12:18]: quantize 'node info: name=/blocks.0/att/Add_clone_, type=OpType.Add, layer_id = 67' one input is quantize invariant and other one input is not, which may cause accuracy issue.
[W] [OPT] [01:12:18]: quantize 'node info: name=/blocks.0/att/mix_ka_add/Add_clone_, type=OpType.Add, layer_id = 55' one input is quantize invariant and other one input is not, which may cause accuracy issue.
[W] [OPT] [01:12:18]: due to hardware limitations, it is actually doing per-16-channel quantization, which may cause accuracy dropping: layer_id=85, type=OpType.BatchNorm, name=/blocks.0/att/mul_ln_x/Mul_clone_, rescale values differ sharpely whithin channels,
[I] [OPT] [01:12:18]: These OPs will automatically cast dtypes to adapt to lib's dtypes' spec (may cause model accuracy loss due to corresponding spec's restriction): {'OpType.LayerNorm', 'OpType.Activation', 'OpType.Reduce', 'OpType.Mul', 'OpType.Input', 'OpType.Eltwise', 'OpType.BatchNorm', 'OpType.Add', 'OpType.Exp', 'OpType.Reshape', 'OpType.Negative', 'OpType.Constant', 'OpType.Gather', 'OpType.Sub', 'OpType.MatMul', 'OpType.Rsqrt', 'OpType.FullyConnected', 'OpType.Square'}
[W] [OPT] [01:12:18]: ''node info: name=/blocks.0/att/mix_ka_sub/Constant, type=OpType.Constant, layer_id = 51'', cast its output '/blocks.0/att/mix_ka_sub/Constant_0' dtype from Dtype.UINT8 to Dtype.UINT16 for 'node info: name=/blocks.0/att/mix_ka_sub/Sub, type=OpType.Sub, layer_id = 52' due to lib's OpType.Sub spec by insert a cast layer.
[W] [OPT] [01:12:18]: quantize 'node info: name=/blocks.0/att/mix_ka_sub/Sub_clone, type=OpType.Sub, layer_id = 52' one input is quantize invariant and other one input is not, which may cause accuracy issue.
[W] [OPT] [01:12:18]: ''node info: name=/blocks.0/att/mix_ka_add/Constant, type=OpType.Constant, layer_id = 54'', cast its output '/blocks.0/att/mix_ka_add/Constant_0' dtype from Dtype.UINT8 to Dtype.UINT16 for 'node info: name=/blocks.0/att/mix_ka_add/Add, type=OpType.Add, layer_id = 55' due to lib's OpType.Add spec by insert a cast layer.
[W] [OPT] [01:12:18]: quantize 'node info: name=/blocks.0/att/mix_ka_add/Add_clone, type=OpType.Add, layer_id = 55' one input is quantize invariant and other one input is not, which may cause accuracy issue.
update_tensor_quantization_attrs: 2%|█▉ | 2/122 [00:00<00:19, 6.20it/s][W] [OPT] [01:12:19]: quantize 'node info: name=/blocks.0/att/mix_ka_sub/Sub_clone_, type=OpType.Sub, layer_id = 52' one input is quantize invariant and other one input is not, which may cause accuracy issue.
[W] [OPT] [01:12:19]: quantize 'node info: name=/blocks.0/att/Add_clone_, type=OpType.Add, layer_id = 67' one input is quantize invariant and other one input is not, which may cause accuracy issue.
[W] [OPT] [01:12:19]: quantize 'node info: name=/blocks.0/att/mix_ka_add/Add_clone_, type=OpType.Add, layer_id = 55' one input is quantize invariant and other one input is not, which may cause accuracy issue.
[W] [OPT] [01:12:19]: due to hardware limitations, it is actually doing per-16-channel quantization, which may cause accuracy dropping: layer_id=85, type=OpType.BatchNorm, name=/blocks.0/att/mul_ln_x/Mul_clone_, rescale values differ sharpely whithin channels,
quantize each layer: 2%|██ | 2/122 [00:00<00:12, 9.64it/s][W] [OPT] [01:12:19]: quantize 'node info: name=/blocks.0/att/mix_ka_sub/Sub, type=OpType.Sub, layer_id = 52' one input is quantize invariant and other one input is not, which may cause accuracy issue.
[W] [OPT] [01:12:19]: quantize 'node info: name=/blocks.0/att/Add, type=OpType.Add, layer_id = 67' one input is quantize invariant and other one input is not, which may cause accuracy issue.
[W] [OPT] [01:12:19]: quantize 'node info: name=/blocks.0/att/mix_ka_add/Add, type=OpType.Add, layer_id = 55' one input is quantize invariant and other one input is not, which may cause accuracy issue.
[W] [OPT] [01:12:19]: due to hardware limitations, it is actually doing per-16-channel quantization, which may cause accuracy dropping: layer_id=85, type=OpType.BatchNorm, name=/blocks.0/att/mul_ln_x/Mul, rescale values differ sharpely whithin channels,
quantize each layer: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 122/122 [00:00<00:00, 502.68it/s]
[I] [OPT] [01:12:19]: [graph_optimize_stage3] is running.
[I] [OPT] [01:12:19]: [serialize] is running.
[I] [OPT] [01:12:19]: check the final graph by forwarding one sample filled with zeros
[I] [OPT] [01:12:19]: Begin to serialzie IR
Writing IR: 0it [00:00, ?it/s]/home/molly/miniconda3/envs/cix/lib/python3.10/site-packages/AIPUBuilder/Optimizer/framework/pycore/pyir.py:923: RuntimeWarning: invalid value encountered in cast
ct_value = ct_value.astype(dtype2nptype(ct.dtype))
Writing IR: 120it [00:00, 3899.99it/s]
Serializing bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 161/161 [00:00<00:00, 3635.01it/s]
[I] [OPT] [01:12:19]: IR has been saved into /home/molly/rwkv-cix/./internal_2025_11_14_1_7_10_q1p4b
[I] [OPT] [01:12:19]: Compass-Optimizer has done at [serialize] period.
[I] [OPT] [01:12:19]: [Done]cost time: 310s, and [qinfos(scale, zp, dtype)]: out: [[113409.8828125, 0, INT16], [25161.884765625, 0, INT16], [17209.119140625, 0, INT16], [911.8507690429688, 0, INT16], [34700.46875, 0, INT16]] in: [[1.0, 0, UINT16], [1.0, 0, UINT16], [1.0, 0, UINT16], [1.0, 0, INT32]]
[I] Optimizing model complete
[I] Simplifying quant model...
[I] [IRChecker] Start to check IR: /home/molly/rwkv-cix/internal_2025_11_14_1_7_10_q1p4b/rwkv_quant.txt
[I] [IRChecker] model_name: rwkv
[I] [IRChecker] IRChecker: All IR pass (Checker Plugin disabled)
[I] [graph.cpp :1605] loading graph weight: /home/molly/rwkv-cix/./internal_2025_11_14_1_7_10_q1p4b/rwkv_quant.bin size: 0x67607ce
[I] Start to simplify the graph...
[I] Using fixed-point full optimization, it may take long long time ....
[E] [IRChecker] [LayerNorm_Checker_Plugin] layer_id: 4, layer_name: /blocks.0/att/ln_1/LayerNormalization, layer_type: LayerNorm
[E] [IRChecker] :-1, key:'weights size', MSG: weights type must be the same with input data
[E] [IRChecker] [LayerNorm_Checker_Plugin] layer_id: 84, layer_name: /blocks.0/att/ln_x/LayerNormalization, layer_type: LayerNorm
[E] [IRChecker] :-1, key:'weights size', MSG: weights type must be the same with input data
[E] [IRChecker] [LayerNorm_Checker_Plugin] layer_id: 105, layer_name: /blocks.0/ffn/ln_2/LayerNormalization, layer_type: LayerNorm
[E] [IRChecker] :-1, key:'weights size', MSG: weights type must be the same with input data
[W] Simplify quant model Failed, will use original quant model. [E] [IRChecker] [LayerNorm_Checker_Plugin] layer_id: 4, layer_name: /blocks.0/att/ln_1/LayerNormalization, layer_type: LayerNorm
[E] [IRChecker] :-1, key:'weights size', MSG: weights type must be the same with input data
[E] [IRChecker] [LayerNorm_Checker_Plugin] layer_id: 84, layer_name: /blocks.0/att/ln_x/LayerNormalization, layer_type: LayerNorm
[E] [IRChecker] :-1, key:'weights size', MSG: weights type must be the same with input data
[E] [IRChecker] [LayerNorm_Checker_Plugin] layer_id: 105, layer_name: /blocks.0/ffn/ln_2/LayerNormalization, layer_type: LayerNorm
[E] [IRChecker] :-1, key:'weights size', MSG: weights type must be the same with input data
Traceback (most recent call last):
File "AIPUBuilder/simplifier/main.pyx", line 124, in AIPUBuilder.simplifier.main.main
Exception: GSim failed to simplify the graph due to pass or graph issue. The failed IR is saved:/home/molly/rwkv-cix/./internal_2025_11_14_1_7_10_q1p4b/rwkv_quant_s.txt.failed and /home/molly/rwkv-cix/./internal_2025_11_14_1_7_10_q1p4b/rwkv_quant_s.bin.failed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "AIPUBuilder/main.pyx", line 276, in AIPUBuilder.main.main
File "AIPUBuilder/simplifier/main.pyx", line 119, in AIPUBuilder.simplifier.main.main
File "AIPUBuilder/core/utils.pyx", line 30, in AIPUBuilder.core.utils.RaiseCRuntimeError.__exit__
RuntimeError: [E] [IRChecker] [LayerNorm_Checker_Plugin] layer_id: 4, layer_name: /blocks.0/att/ln_1/LayerNormalization, layer_type: LayerNorm
[E] [IRChecker] :-1, key:'weights size', MSG: weights type must be the same with input data
[E] [IRChecker] [LayerNorm_Checker_Plugin] layer_id: 84, layer_name: /blocks.0/att/ln_x/LayerNormalization, layer_type: LayerNorm
[E] [IRChecker] :-1, key:'weights size', MSG: weights type must be the same with input data
[E] [IRChecker] [LayerNorm_Checker_Plugin] layer_id: 105, layer_name: /blocks.0/ffn/ln_2/LayerNormalization, layer_type: LayerNorm
[E] [IRChecker] :-1, key:'weights size', MSG: weights type must be the same with input data
[I] Building ...
[E] Building failed! libaipu_simulator_x2.so: cannot open shared object file: No such file or directory
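The bad news is that it failed again. There seem to be two main problems: 1. for LayerNorm, the affine weight/bias type apparently has to match the input type (int16); 2. libaipu_simulator_x2.so cannot be found.
For the second problem, see @nihui's article: https://aijishu.com/a/1060000000506106
The good news is that the optimizer exported internal_2025_11_14_1_7_10_q1p4b/rwkv_opt_template.json, which gives me a peek at the format of the manual mixed precision config file mentioned in the tutorials: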
{
"state1_in": {
"q_mode_activation": "per_tensor_symmetric_restricted_range",
"q_mode_weight": "per_channel_symmetric_restricted_range",
"q_bits_activation": 16,
"q_bits_weight": 8,
"q_bits_bias": 32,
"q_strategy_activation": "mean",
"q_strategy_weight": "extrema",
"running_statistic_momentum": 0.9,
"histc_bins": 2048,
"lut_items_in_bits": 8,
"force_dtype_int": false,
"force_shift_positive": false,
"trigger_float_op": "disable",
"just_for_display": {
"quantization_info": "{'state1_in_0': {'scale': '[1.0]', 'zerop': '[0]', 'qbits': '16', 'dtype': 'Dtype.UINT16', 'qmin': '0', 'qmax': '65535', 'fmin': '[0.0]', 'fmax': '[0.0]', 'fmin_key_axis': 'None', 'fmax_key_axis': 'None', 'qinvariant': 'False'}}",
"optimization_info": "{}",
"brief_info": "layer_id = 0, layer_type = OpType.Input, similarity=None, MSE=None"
}
},
...
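To get an overview of a template like this without scrolling through every entry, the per-layer settings can be flattened into rows. The sketch below runs on a trimmed copy of the entry shown above; `summarize_template` is a hypothetical helper, and for the real file one would load it with `json.load` instead of the inline string.

```python
import json

# Trimmed copy of the "state1_in" entry from rwkv_opt_template.json above.
TEMPLATE_TEXT = """
{
  "state1_in": {
    "q_mode_activation": "per_tensor_symmetric_restricted_range",
    "q_bits_activation": 16,
    "q_bits_weight": 8,
    "trigger_float_op": "disable",
    "just_for_display": {
      "brief_info": "layer_id = 0, layer_type = OpType.Input, similarity=None, MSE=None"
    }
  }
}
"""


def summarize_template(template):
    """Return (layer name, activation bits, weight bits, float trigger, brief info) rows."""
    rows = []
    for layer_name, cfg in template.items():
        brief = cfg.get("just_for_display", {}).get("brief_info", "")
        rows.append((
            layer_name,
            cfg["q_bits_activation"],
            cfg["q_bits_weight"],
            cfg["trigger_float_op"],
            brief,
        ))
    return rows


rows = summarize_template(json.loads(TEMPLATE_TEXT))
for row in rows:
    print(row)
```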
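In my experience, RWKV v7 models can run with int16 activations plus int4/8/16 weight quantization in the vast majority of layers, with only a few core layers kept in float16.
I asked the CIX engineers:
layer names in opt_template.json are not matched as regular expressions, but regex-based quantization settings can be added to the Optimizer section of the cfg: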
[Common]
mode = build
[Parser]
model_type = onnx
model_name = rwkv
detection_postprocess =
model_domain = llm
input_model = ./model.onnx
output_dir = ./build
[Optimizer]
calibration_data = ./calib_dataset.npy
calibration_batch_size = 1
metric_batch_size = 1
output_dir = ./
dataset = numpymultiinputdataset
save_statistic_info = True
cast_dtypes_for_lib = True
quantize_method_for_activation = per_tensor_asymmetric
quantize_method_for_weight = per_channel_symmetric_restricted_range
weight_bits = 16
activation_bits = 16
trigger_float_op = disable & <{.*wkv7.*}:float16_preferred><{state*_*}:float16_preferred>
[GBuilder]
target = X2_1204MP3
outputs = ./rwkv.cix
# profile = True
tiling = fps
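The two patterns in `trigger_float_op` can be tried out with Python's `re` before running a build. Note that `state*_*` reads like a glob but, as a regular expression, means "stat" followed by any number of "e" and then any number of "_". Whether the Optimizer matches by prefix or requires a full match is an assumption I have not verified, so the sketch shows both. The wkv7 layer name is hypothetical, and the other two names come from the template and log above.

```python
import re

PATTERNS = [r".*wkv7.*", r"state*_*"]
LAYER_NAMES = [
    "state1_in",                              # from the opt template
    "/blocks.0/att/wkv7/MatMul",              # hypothetical wkv7 layer name
    "/blocks.0/ffn/ln_2/LayerNormalization",  # from the build log
]


def matched(names, pattern, matcher):
    """Names for which matcher(pattern, name) finds a match."""
    return [n for n in names if matcher(pattern, n)]


for pattern in PATTERNS:
    print(pattern, "fullmatch:", matched(LAYER_NAMES, pattern, re.fullmatch))
    print(pattern, "match:    ", matched(LAYER_NAMES, pattern, re.match))

# A regex that states the glob-like intent explicitly:
explicit = r"state.*_.*"
print(explicit, "fullmatch:", matched(LAYER_NAMES, explicit, re.fullmatch))
```

If full matching turns out to be required, `state.*_.*` would be the safer spelling for the state tensors.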
Export the full model: python export_model.py
Generate the quantization calibration dataset:
from rwkv_src.rwkv_tokenizer import RWKV_TOKENIZER, ABCTokenizer
from rwkv_src.rwkv_model import RWKV_RNN, RWKV_Config
import types
import os, sys
import torch
import numpy as np
import argparse
from pathlib import Path
from rwkv_src.rwkv_model_utils import get_dummy_state_kvcache, sample_logits
parser = argparse.ArgumentParser(description='test torch model')
parser.add_argument('model', type=Path, help='Path to RWKV pth file')
parser.add_argument('calib_dataset', type=Path, help='Path to calib text file')
parser.add_argument('--output', type=Path, default='./calib_dataset.npy', help='Path to output file')
parser.add_argument('--block_size', type=int, default=128, help='Block size')
parser.add_argument('--num_samples', type=int, default=1, help='Number of samples')
args = parser.parse_args()
model_args = RWKV_Config()
model_args.device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_args.dtype = torch.float32
model_args.convert_fc_to_conv = False
model_args.rescale_layer = 0
model_args.wkv_customop = False
model_args.output_last = False
model_args.model_name = str(args.model)
tokenizer = RWKV_TOKENIZER("./assets/rwkv_vocab_v20230424.txt")
model = RWKV_RNN(model_args)
device = model.device
with open(args.calib_dataset, 'r') as f:
    text = f.read()
token_ids = tokenizer.encode(text)
# One list per model input: 3 state tensors per layer, plus the token id.
data = {f"{i}": [] for i in range(model.model_info.n_layer * 3 + 1)}
num_samples = min(args.num_samples, len(token_ids) // args.block_size)
for i in range(num_samples):
    # Each sample starts from a fresh (zero) state.
    states = get_dummy_state_kvcache(1, model.model_info, device)
    for j in range(i * args.block_size, (i + 1) * args.block_size):
        token_input = torch.LongTensor([token_ids[j]]).to(device=device)
        # States are stored in reverse order, followed by the token input.
        for k in range(len(states)):
            data[f"{k}"].append(states[len(states) - k - 1].squeeze(0).cpu().float().detach().numpy().astype(np.float32))
        data[f"{len(states)}"].append(token_input.squeeze(0).cpu().detach().numpy().astype(np.int32))
        _, states = model(token_input, *states)
for i in range(model.model_info.n_layer * 3 + 1):
    data[f"{i}"] = np.stack(data[f"{i}"], axis=0)
np.save(args.output, data)
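Because `np.save` is given a dict here, NumPy pickles it into a 0-d object array, so reading the file back for inspection needs `allow_pickle=True` and `.item()`. The sketch below round-trips a toy dict with the same key layout ("0".."3N-1" for states, "3N" for token ids). The sizes are made up for illustration and are not the real state shapes.

```python
import os
import tempfile

import numpy as np

# Toy stand-in: 1 layer (3 state inputs) and 4 recorded tokens.
n_layer, n_tokens = 1, 4
data = {f"{i}": np.zeros((n_tokens, 1, 8), dtype=np.float32) for i in range(n_layer * 3)}
data[f"{n_layer * 3}"] = np.zeros((n_tokens,), dtype=np.int32)

path = os.path.join(tempfile.mkdtemp(), "calib_dataset.npy")
np.save(path, data)  # the dict is pickled into a 0-d object array

# .item() unwraps the 0-d object array back into the original dict.
loaded = np.load(path, allow_pickle=True).item()
for key in sorted(loaded, key=int):
    print(key, loaded[key].shape, loaded[key].dtype)
```

Printing shapes and dtypes this way is a cheap check that every model input has the same number of samples along axis 0.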
The export goes through normally.
One slightly suspicious thing is that it lowered my elementwise mul+add into a batchnorm.
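In floating point this lowering is mathematically harmless: a per-channel `x * w + b` is exactly inference-mode batch normalization with mean 0, variance 1 and epsilon 0. The NumPy sketch below checks that identity on random data. How the Compass IR actually encodes the BatchNorm node and its quantization is not covered here, so any accuracy impact would have to come from the quantized path, not from the rewrite itself.

```python
import numpy as np


def batchnorm_inference(x, gamma, beta, mean, var, eps=0.0):
    """Standard inference-mode batch norm over the last (channel) axis."""
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


rng = np.random.default_rng(0)
channels = 768
x = rng.standard_normal((1, 1, channels)).astype(np.float32)
w = rng.standard_normal(channels).astype(np.float32)
b = rng.standard_normal(channels).astype(np.float32)

# The original elementwise pattern.
mul_add = x * w + b

# The same computation expressed as batch norm with mean=0, var=1, eps=0.
as_batchnorm = batchnorm_inference(
    x,
    gamma=w,
    beta=b,
    mean=np.zeros(channels, dtype=np.float32),
    var=np.ones(channels, dtype=np.float32),
)

max_abs_diff = float(np.max(np.abs(mul_add - as_batchnorm)))
print("max abs diff:", max_abs_diff)
```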
Copy it to the o6 and run:
from rwkv_src.rwkv_tokenizer import RWKV_TOKENIZER
import types
import os, sys
import torch
import numpy as np
import argparse
from pathlib import Path
from rwkv_src.rwkv_model_utils import get_dummy_state_kvcache, sample_logits
import time
# import NOE_Engine.NOE_Engine as NOE_Engine
from utils.NOE_Engine import EngineInfer
parser = argparse.ArgumentParser(description='test cix model on npu')
parser.add_argument('model', type=Path, help='Path to compiled .cix model file')
args = parser.parse_args()
tokenizer = RWKV_TOKENIZER("./rwkv_src/rwkv_vocab_v20230424.txt")
engine = EngineInfer(str(args.model))
num_layers = 12
inputs = []
# Three zero-initialized state tensors per layer, then one token id: 12 * 3 + 1 = 37 inputs.
for _ in range(num_layers):
    inputs.append(np.zeros((1, 1, 768), dtype=np.float16))
    inputs.append(np.zeros((12, 64, 64), dtype=np.float16))
    inputs.append(np.zeros((1, 1, 768), dtype=np.float16))
inputs.append(np.array([0], dtype=np.int32))
outputs = engine.forward(inputs)
for i in range(len(outputs)):
    print(f"output{i} shape {outputs[i].shape}:", outputs[i][:10])
engine.clean()
The output turns out to be not quite right:
(.venv) cix@cix-localhost:~/rwkv-cix$ python test_cix_npu.py rwkv.cix
npu: noe_init_context success
npu: noe_load_graph success
Input tensor count is 37.
Output tensor count is 37.
npu: noe_create_job success
output0 shape (65536,): [ -8.510366 -19.918808 -15.738653 -16.180578 -16.019804 -16.227575
-15.774106 -14.960337 -14.63961 -13.467189]
output1 shape (768,): [-0.11309868 -0.0242046 0.02590046 0.06894456 -0.04961172 -0.02146038
0.02713382 -0.15392274 -0.08966493 0.06616951]
output2 shape (49152,): [ 2.5987625e-05 2.3388863e-04 2.6645660e-03 -6.4969063e-06
-3.2484531e-05 0.0000000e+00 1.2993813e-05 7.1465969e-05
6.5765381e-03 -1.9490719e-05]
output3 shape (768,): [-1.7074339 0.6561923 1.6899902 0.17392431 -0.11646259 0.25909078
0.23959485 0.14314124 0.01693068 0.98967546]
output4 shape (768,): [-1.341836 0.21636143 0.36139274 0.5710361 0.2661542 0.68870115
-0.5155132 0.42679515 0.6600506 0.1615301 ]
output5 shape (49152,): [ 3.5762787e-06 -1.0430813e-05 0.0000000e+00 0.0000000e+00
5.1259995e-05 7.3313713e-06 5.3644180e-07 4.1127205e-06
1.7881393e-07 5.7816505e-06]
output6 shape (768,): [-1.5083371 0.40554813 0.24214767 0.50332594 0.00262491 1.325578
-0.54532444 1.3698733 0.639493 0.2897241 ]
output7 shape (768,): [-1.2040163 -0.0037025 0.36478397 0.02574116 0.2627009 1.0483352
-0.66556764 0.4543491 0.4298421 0.07334468]
output8 shape (49152,): [-1.1196136e-03 -2.4676323e-04 -1.3113022e-04 -1.3422966e-04
6.1988831e-06 -1.8477440e-05 0.0000000e+00 -3.3915043e-05
-1.0788441e-05 0.0000000e+00]
output9 shape (768,): [-0.45070863 0.08093906 0.22258243 0.02134352 0.04074672 2.3646958
-0.6583229 0.52499515 0.51473916 0.27885172]
output10 shape (768,): [-1.009094 -1.1329855 0.45006236 -0.5415302 -0.21618837 0.78608924
-0.8495549 0.28241736 0.43679813 -0.13089205]
output11 shape (49152,): [ 0.0000000e+00 -1.5497208e-06 -1.5497208e-06 1.5497208e-06
-2.8312206e-05 1.0490417e-03 0.0000000e+00 4.7087669e-06
-1.4126301e-05 0.0000000e+00]
output12 shape (768,): [-1.9310815 0.4530342 1.7703756 -0.39262962 0.25616008 0.68458503
-0.62194324 -0.19016251 0.400087 -0.28524375]
output13 shape (768,): [-1.6550293 0.19041641 0.46512595 -0.45994726 0.30753446 -0.03115181
-0.3802751 0.44966954 0.20651017 -0.2861823 ]
output14 shape (49152,): [ 0.0000000e+00 1.9669533e-05 0.0000000e+00 7.7486038e-07
-7.7486038e-07 0.0000000e+00 0.0000000e+00 0.0000000e+00
-4.1723251e-07 4.1723251e-07]
output15 shape (768,): [-2.496136 0.12258673 0.7731651 -0.00723937 0.64044327 0.35617718
-1.8749977 1.3315622 -0.04005787 1.1249987 ]
output16 shape (768,): [-1.409798 0.34624484 0.62533325 0.1478171 0.98520404 0.5497823
-0.8787514 1.8823261 0.31193668 1.5849887 ]
output17 shape (49152,): [ 1.1277199e-04 0.0000000e+00 -1.6915798e-04 6.3705444e-03
0.0000000e+00 1.0147095e-03 -5.6385994e-05 -5.0903320e-02
0.0000000e+00 6.2036514e-04]
output18 shape (768,): [-5.119354 0.56720114 -0.44263187 1.4094664 2.9005034 2.4566069
0.7423569 4.7519693 0.3376649 2.0044901 ]
output19 shape (768,): [-1.6945666 0.20420782 -0.04467705 -0.16922484 0.40377933 0.56352085
-0.6231605 0.771522 -0.16501004 0.09441187]
output20 shape (49152,): [-7.5101852e-06 7.5101852e-06 -7.5101852e-06 7.5101852e-06
3.7550926e-06 -7.5101852e-06 0.0000000e+00 1.1265278e-05
-2.6285648e-05 2.2530556e-05]
output21 shape (768,): [-7.6689606 0.36478376 -0.6886066 -1.3679391 0.33464274 2.1145093
-3.130803 1.6739866 0.5023505 1.1855472 ]
output22 shape (768,): [-1.5397377 0.06069563 0.10870035 0.13380627 0.09711301 0.38045123
-0.8538771 0.8643609 0.21684892 -0.34927574]
output23 shape (49152,): [ 7.1563721e-03 -3.5119057e-04 9.3984604e-04 -1.0101318e-02
-1.1360645e-04 -1.1672974e-03 -2.3746490e-03 3.2520294e-03
0.0000000e+00 -8.2612038e-05]
output24 shape (768,): [-4.854739 1.054105 -1.0674905 0.6157312 1.3686634 -0.6266069
-2.6469748 2.6787653 1.3385462 -0.9277798]
output25 shape (768,): [-1.3455039 0.47959018 -0.9268958 1.0320157 -0.64787567 -0.77763796
-0.8617027 1.2887329 -0.541508 0.40956223]
output26 shape (49152,): [ 4.4465065e-04 0.0000000e+00 -4.1604042e-04 0.0000000e+00
-8.0156326e-04 -1.3482571e-04 6.2942505e-04 2.8610229e-06
-5.3048134e-05 6.3121319e-05]
output27 shape (768,): [-4.1553454 1.5905493 -2.2979822 3.2139807 -0.9676703 -3.16043
-3.3492668 3.8688023 -1.7587173 1.3782256]
output28 shape (768,): [-1.0621272 0.31874982 -0.4821254 0.06438263 0.69462526 -1.0571032
-0.6776923 0.55227643 0.2932573 0.6535023 ]
output29 shape (49152,): [-1.21593475e-05 4.86969948e-05 -1.21593475e-05 0.00000000e+00
0.00000000e+00 1.21593475e-05 6.07967377e-06 -2.93922424e-03
-2.61783600e-04 2.43186951e-05]
output30 shape (768,): [-2.4796357 3.4336903 -0.7022722 2.3327088 1.0013043 -6.2058854
3.9404914 5.740509 2.4692798 2.9838479]
output31 shape (768,): [-1.0755666 -0.26367813 -0.24550979 -0.121649 0.48248833 -2.1931577
-0.30254263 0.7098298 0.0453419 0.05845471]
output32 shape (49152,): [-0.07598877 -0.09240723 -0.01463318 0.00286293 0.22375488 0.00206757
-0.08746338 -0.01844788 0.09777832 -0.04754639]
output33 shape (768,): [ 0.07431868 0.42884633 -1.1153308 3.6025295 -1.25516 -4.8020883
0.9006324 4.29562 1.9267807 0.4332504 ]
output34 shape (768,): [-0.5186271 0.7486977 -0.71732444 0.6173948 -0.54961306 -1.4615679
0.17603885 1.1102649 0.15705997 0.3290319 ]
output35 shape (49152,): [ 5.9413910e-04 -1.2576580e-04 4.3821335e-04 0.0000000e+00
5.1259995e-05 -1.9817352e-03 -1.5497208e-05 1.7642975e-03
4.1723251e-06 0.0000000e+00]
output36 shape (768,): [-1.9887835 1.1087908 -0.03128863 2.4463797 -1.761941 -4.796938
-3.1591737 3.7116137 -0.5279956 3.0995297 ]
npu: noe_clean_job success
npu: noe_unload_graph success
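npu: noe_deinit_context success
output0 is the logits. The torch output is:
tensor([[[ -3.3106, -13.8115, -11.3934, ..., -17.9951, -18.0478, -18.0473]]])
The leading logits clearly differ (for example, -8.51 on the NPU versus -3.31 in torch).
Next steps
Find the cause of the incorrect output.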
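One way to start narrowing this down is to compare every NPU output tensor against a reference tensor and look for the first one that diverges. The snippet below is only a rough sketch under assumptions: compare_outputs is a hypothetical helper (it is not part of the repo or of the NOE SDK), and it assumes the reference outputs (from torch or onnxruntime, for example) have already been collected as numpy arrays in the same order as the NPU outputs.

```python
import numpy as np

def compare_outputs(npu_outputs, ref_outputs, cos_threshold=0.99):
    """Hypothetical helper: per-tensor cosine similarity and max abs error.

    Returns (report, first_bad), where report is a list of
    (index, cosine_similarity, max_abs_error) and first_bad is the index of
    the first tensor whose cosine similarity is below cos_threshold (or None).
    """
    report = []
    first_bad = None
    for i, (npu, ref) in enumerate(zip(npu_outputs, ref_outputs)):
        # Flatten both sides so (1, 1, 768) and (768,) can be compared directly
        a = np.asarray(npu, dtype=np.float64).ravel()
        b = np.asarray(ref, dtype=np.float64).ravel()
        if a.shape != b.shape:
            raise ValueError(f"output{i}: size mismatch {a.shape} vs {b.shape}")
        norm_a, norm_b = np.linalg.norm(a), np.linalg.norm(b)
        if norm_a == 0.0 and norm_b == 0.0:
            cos = 1.0  # both all-zero: treat as identical
        elif norm_a == 0.0 or norm_b == 0.0:
            cos = 0.0  # only one side is all-zero: treat as a mismatch
        else:
            cos = float(a @ b / (norm_a * norm_b))
        max_abs = float(np.max(np.abs(a - b)))
        report.append((i, cos, max_abs))
        if first_bad is None and cos < cos_threshold:
            first_bad = i
    return report, first_bad
```

Since output0 (the logits) depends on all layers, it is probably more informative to pass outputs[1:] and the matching reference states. If the state outputs really are ordered layer by layer (the repeating 768 / 49152 / 768 shapes suggest so, but this is not confirmed), the first index with a low cosine similarity hints at the layer where the NPU result starts to drift.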