Machine Learning Strategies for Audio Deepfake Detection
Chappidi Aishwarya
TLDR
Combining deep spatial–temporal feature learning (CNN and Bi-LSTM) with ensemble classification (XGBoost) offers a strong and reliable solution for securing voice-based systems against DeepFake threats.
Abstract
The proliferation of synthetic audio produced by advanced generative models poses a significant threat to the integrity of digital communication systems. This study proposes a novel hybrid framework that combines Convolutional Neural Networks (CNNs), Bidirectional Long Short-Term Memory (Bi-LSTM) networks, and eXtreme Gradient Boosting (XGBoost) to detect audio DeepFakes. CNNs extract spatial features from Mel-frequency cepstral coefficients (MFCCs), Bi-LSTMs capture temporal dependencies, and XGBoost serves as the final decision-level classifier. Experiments on benchmark datasets show that the proposed system achieves 98% accuracy, along with high precision, recall, and robustness against unseen attacks. These results indicate that combining deep spatial–temporal feature learning with ensemble classification offers a strong and reliable solution for securing voice-based systems against DeepFake threats.
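To make the three-stage design more concrete, the following is a minimal PyTorch sketch of the feature-learning half of such a pipeline: a small CNN over the MFCC matrix followed by a Bi-LSTM over time, pooled into one fixed-length embedding per clip. It is not the study's implementation. The class name `CnnBiLstmEncoder`, the layer sizes, the 40 MFCC coefficients, the mean pooling, and the random input tensor are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import torch
import torch.nn as nn


class CnnBiLstmEncoder(nn.Module):
    """Hypothetical encoder: CNN over the MFCC matrix, then a Bi-LSTM over time."""

    def __init__(self, n_mfcc: int = 40, hidden: int = 64):
        super().__init__()
        # Two conv blocks. Pooling halves the MFCC axis only, so the
        # number of time frames is preserved for the recurrent stage.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),
        )
        self.lstm = nn.LSTM(
            input_size=32 * (n_mfcc // 4),  # channels * pooled MFCC bins
            hidden_size=hidden,
            batch_first=True,
            bidirectional=True,
        )

    def forward(self, mfcc: torch.Tensor) -> torch.Tensor:
        # mfcc: (batch, n_mfcc, frames)
        x = self.cnn(mfcc.unsqueeze(1))  # (batch, 32, n_mfcc // 4, frames)
        b, c, f, t = x.shape
        # Flatten channels and MFCC bins into one feature vector per frame.
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)
        out, _ = self.lstm(x)  # (batch, frames, 2 * hidden)
        return out.mean(dim=1)  # average over time: (batch, 2 * hidden)


torch.manual_seed(0)
encoder = CnnBiLstmEncoder().eval()

# Random stand-in for a batch of 8 MFCC matrices (40 coefficients, 120 frames).
# Real inputs would come from an MFCC extractor applied to audio clips.
batch = torch.randn(8, 40, 120)

with torch.no_grad():
    embeddings = encoder(batch)

print(embeddings.shape)

# Assumed final stage: after the encoder has been trained, the embeddings
# (converted to NumPy) and their real/fake labels would be used to fit a
# gradient-boosted classifier such as xgboost.XGBClassifier.
```

In this sketch the encoder is untrained, so the embeddings carry no discriminative information yet. One plausible arrangement, again an assumption, is to train the encoder with a temporary classification head and then fit the XGBoost classifier on the frozen embeddings.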
