SE-Conformer: Time-Domain Speech Enhancement Using Conformer
Eesung Kim,Hyeji Seo
TLDR
This paper proposes an end-to-end speech enhancement architecture (SE-Conformer), incorporating a convolutional encoder–decoder and conformer, designed to be directly applied to the time-domain signal.
Abstract
Convolution-augmented transformer (conformer) has recently shown competitive results in speech-domain applications, such as automatic speech recognition, continuous speech separation, and sound event detection. Conformer can capture both the short and long-term temporal sequence information by attending to the whole sequence at once with multi-head self-attention and convolutional neural network. However, the effectiveness of conformer in speech enhancement has not been demonstrated. In this paper, we propose an end-to-end speech enhancement architecture (SE-Conformer), incorporating a convolutional encoder–decoder and conformer, designed to be directly applied to the time-domain signal. We performed evaluations on both the VoiceBank-DEMAND Corpus (VCTK) and Librispeech datasets in terms of objective speech quality metrics. The experimental results show that the proposed model outperforms other competitive baselines in speech enhancement performance.
