T2I-MultiLingen: A Multilingual Image Generation Framework for Complex Text Prompting
Jiawen Guo
TLDR
In experiments, T2I-MultiLingen significantly outperforms current state-of-the-art (SOTA) models on quantitative metrics of monolingual semantic alignment for both Chinese and English, as well as in subjective evaluations of image generation quality, enabling precise visualization of complex semantic scenes.
Abstract
Most mainstream research on image synthesis driven by complex text prompts focuses on English-centric systems. Given Chinese prompts that describe complex attributes or multiple objects, existing models struggle to parse the semantics accurately, which significantly limits the effectiveness of multimodal generation technologies in Chinese scenarios. To address this, we propose T2I-MultiLingen, a training-free generation framework for Chinese and English prompts that maps complex semantic scenes to images through the chain-of-thought reasoning of multimodal large language models. The framework includes a cognitive chain guidance module, which uses the logical reasoning capabilities of large models to parse complex Chinese or English scene descriptions hierarchically, and combines this parsing with regional feature generation methods to construct an accurate mapping from text to visual representations. In addition, a dynamic semantic fusion regulation mechanism based on visual-language models generates evaluation coefficients from assessment results and uses them for adaptive weighted fusion of the dual-path outputs. In experiments, T2I-MultiLingen significantly outperforms current state-of-the-art (SOTA) models on quantitative metrics of monolingual semantic alignment for both Chinese and English, as well as in subjective evaluations of image generation quality, enabling precise visualization of complex semantic scenes. The code for the algorithm implementation is open-sourced at https://github.com/GavinGuo000/T2I-MultiLingen.
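The adaptive weighted fusion mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the fusion formula, so everything below is an assumption made for illustration: the names `fusion_weights` and `fuse` are hypothetical, the two assessment scores are assumed to be non-negative numbers produced by a visual-language model, and each path's output is assumed to be a flat list of floats of equal length. Normalizing the scores so they sum to 1 is one simple choice, not necessarily the one used in T2I-MultiLingen.

```python
from typing import List, Sequence, Tuple


def fusion_weights(score_a: float, score_b: float, eps: float = 1e-8) -> Tuple[float, float]:
    """Convert two non-negative assessment scores into weights that sum to 1.

    Assumption: a higher score means the corresponding path matches the
    prompt better. If both scores are (near) zero, fall back to equal weights.
    """
    if score_a < 0 or score_b < 0:
        raise ValueError("assessment scores must be non-negative")
    total = score_a + score_b
    if total < eps:
        return 0.5, 0.5
    return score_a / total, score_b / total


def fuse(
    path_a: Sequence[float],
    path_b: Sequence[float],
    score_a: float,
    score_b: float,
) -> List[float]:
    """Blend two equally sized outputs element-wise using score-derived weights."""
    if len(path_a) != len(path_b):
        raise ValueError("both paths must produce outputs of the same length")
    w_a, w_b = fusion_weights(score_a, score_b)
    return [w_a * a + w_b * b for a, b in zip(path_a, path_b)]


if __name__ == "__main__":
    # Hypothetical scores: path A is judged three times as good as path B.
    print(fuse([1.0, 0.0], [0.0, 1.0], score_a=3.0, score_b=1.0))
```

In this sketch the path with the higher assessment score contributes proportionally more to the fused result, and the equal-weight fallback avoids a division by zero when neither path receives a positive score.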
