Optimizing Inference Performance for Large Language Models on ARMv9 Architecture
Longhao Chen, Cheng Zhang, +6 authors, Genquan Han
TLDR
This work explores LLM inference optimization on the ARMv9 computing platform, which is mainstream for intelligent edge usage. Inference efficiency was enhanced and memory usage reduced by lowering dequantization cost, rewriting specific functions in the LLaMA.cpp framework, and applying advanced compilation optimization techniques.
Abstract
Large Language Models (LLMs) have attracted extensive attention for their remarkable performance across a variety of tasks. However, the considerable computational and memory demands of LLMs make it challenging to maintain inference speed when they are deployed in resource-constrained scenarios. Current research widely adopts model compression techniques, which greatly reduce computational and memory overhead. Meanwhile, some frameworks that serve as deployment tools for large models can provide optimized deployment strategies tailored to the characteristics of the hardware architecture. In this work, we explored LLM optimization performance on the ARMv9 computing platform, which is mainstream for intelligent edge usage. Inference efficiency was enhanced and memory usage was reduced by lowering dequantization cost, rewriting specific functions in the LLaMA.cpp framework, and using advanced compilation optimization techniques. Through optimizing and evaluating the Qwen-1.8B model on ARMv9, an inference performance improvement of 23.36× was achieved in the decoding stage and 1.68× in the prefilling stage over the baseline, with only a 1.04% loss in accuracy. All the code developed in this work is open source at https://github.com/CEATRG/LLaMA.cpp-arm.
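Since the abstract highlights lowering dequantization cost as a key optimization, the sketch below illustrates the kind of vectorized 4-bit dequantization kernel such work involves, using ARM NEON intrinsics (available on both ARMv8 and ARMv9). It is a minimal sketch, assuming a Q4_0-style block layout (32 weights sharing one float scale, packed two 4-bit quants per byte, as in llama.cpp); the struct and function names here are hypothetical stand-ins, not the authors' actual code.

```c
#include <arm_neon.h>
#include <stdint.h>
#include <stdio.h>

#define QK 32  /* weights per block (assumption, mirrors llama.cpp's Q4_0) */

/* Hypothetical 4-bit block: one float scale, 32 quants packed two per byte. */
typedef struct {
    float d;
    uint8_t qs[QK / 2];
} block_q4;

/* Dequantize one block with NEON: w[i] = (q[i] - 8) * d.
 * Low nibbles map to out[0..15], high nibbles to out[16..31]. */
static void dequant_block_neon(const block_q4 *b, float *out) {
    const uint8x16_t raw  = vld1q_u8(b->qs);      /* 16 bytes = 32 nibbles */
    const uint8x16_t mask = vdupq_n_u8(0x0F);
    const int8x16_t  off  = vdupq_n_s8(8);

    /* Split low/high nibbles and recenter them around zero. */
    int8x16_t v[2] = {
        vsubq_s8(vreinterpretq_s8_u8(vandq_u8(raw, mask)), off),
        vsubq_s8(vreinterpretq_s8_u8(vshrq_n_u8(raw, 4)), off),
    };

    /* Widen int8 -> int16 -> int32 -> float and apply the shared scale. */
    const float32x4_t d = vdupq_n_f32(b->d);
    for (int half = 0; half < 2; ++half) {
        int16x8_t w0 = vmovl_s8(vget_low_s8(v[half]));
        int16x8_t w1 = vmovl_s8(vget_high_s8(v[half]));
        float *dst = out + half * 16;
        vst1q_f32(dst +  0, vmulq_f32(vcvtq_f32_s32(vmovl_s16(vget_low_s16(w0))), d));
        vst1q_f32(dst +  4, vmulq_f32(vcvtq_f32_s32(vmovl_s16(vget_high_s16(w0))), d));
        vst1q_f32(dst +  8, vmulq_f32(vcvtq_f32_s32(vmovl_s16(vget_low_s16(w1))), d));
        vst1q_f32(dst + 12, vmulq_f32(vcvtq_f32_s32(vmovl_s16(vget_high_s16(w1))), d));
    }
}

int main(void) {
    /* Fill one block with a recognizable nibble pattern and dequantize it. */
    block_q4 b = { .d = 0.05f };
    for (int i = 0; i < QK / 2; ++i)
        b.qs[i] = (uint8_t)((i & 0x0F) | ((15 - (i & 0x0F)) << 4));

    float w[QK];
    dequant_block_neon(&b, w);
    for (int i = 0; i < QK; ++i)
        printf("w[%2d] = %+.3f\n", i, w[i]);
    return 0;
}
```

Building such a kernel with aggressive architecture-specific flags (for example, `gcc -O3 -march=armv9-a` with a recent GCC) is one plausible instance of the "advanced compilation optimization techniques" the abstract mentions; the flags actually used in the paper may differ.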
