CoSSHI: LLM-Powered Large-Scale Multitask Hinglish Dataset for Speech Forensics
Nisarg Trivedi, Ravindrakumar M. Purohit, Satyam Tiwari, Hemant A. Patil
TLDR
Using large language models, a large-scale, code-mixed (English+Hindi) dataset called CoSSHI is created for multiple tasks, such as deepfake detection, accent identification, and speech synthesis; the results indicate that the dataset offers competitive performance compared to existing datasets.
Abstract
Traditional datasets offer limited diversity in language and linguistic features, which makes them language-biased, poorly generalizable to other languages, and consequently less effective in real-world scenarios. At the same time, manually creating such datasets is time-consuming, expensive, and resource-intensive. To address these limitations, in this paper we use large language models to create a large-scale, code-mixed (English+Hindi) dataset called CoSSHI for multiple tasks, such as deepfake detection, accent identification (60+ hours), and speech synthesis. The proposed dataset is created using state-of-the-art LLMs (e.g., GPT-4o, Gemini 2.5 Pro, Grok-3, Mistral Large, Llama 4), text-to-speech models (160+ hours; e.g., XTTSv2, XTTSv1.1, Bark, gTTS, IndicTTS, YourTTS, VITS), and vocoders (320+ hours; e.g., HiFiGAN, BigVGAN). CoSSHI is evaluated on these tasks using task-appropriate measures: precision for deepfake detection, accuracy and precision for accent identification, and subjective and objective measures for speech synthesis. The results indicate that our dataset offers competitive performance compared to existing datasets.
