Evaluating Automatic Transcription Models Utilising Cloud Platforms

TLDR

The research found that the Whisper ASR model developed by OpenAI had the lowest error rate of the services evaluated: it produced the lowest Word Error Rate (WER) on 96% of the audio files.

Abstract

Automatic Speech Recognition (ASR) technology is becoming pervasive in society and is used for language translation, customer service and disability support. ASR and transcription are also rapidly becoming a popular means of enabling qualitative research. Traditionally, transcribing interviews and focus groups has been very time-consuming and labour-intensive. In recent years, online tools have become available to help with automatic transcription. These tools have varying levels of accuracy, and most require manual correction. Moreover, they require a researcher to manually upload audio files that have already been processed or edited. This research proposes the development of an automatic framework for completing ASR and automatic transcription without the need for the researcher to perform any manual processes. The research is carried out within an industrial context, in an organisation that performs qualitative analysis and evaluation on behalf of clients in the third sector. The proposed framework uses a cloud-based API to perform the automatic transcription. This research evaluates multiple APIs for automatic transcription and selects one service for inclusion within the framework. The evaluation is performed on a self-created audio dataset named “S3QualitativeAudio”, using Word Error Rate (WER) calculated on the transcriptions together with a cost-benefit analysis. The research found that the Whisper ASR model developed by OpenAI had the lowest error rate of the services evaluated: it produced the lowest WER on 96% of the audio files. The average WER for the Whisper ASR model was 0.07246. Further evaluation was carried out with this model in an attempt to reduce the error rate further. The final automatic transcriptions could be used for sentiment analysis and text summarisation as part of further qualitative analysis.
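
For readers unfamiliar with the metric, WER is conventionally defined as the word-level edit distance between a reference transcript and the automatic transcript (substitutions + deletions + insertions), divided by the number of words in the reference. The following is a minimal sketch of that standard definition, not the evaluation code used in this research. The function name `word_error_rate`, the lowercasing and whitespace tokenisation, and the example sentences are all assumptions made for illustration; the text normalisation actually applied to the dataset is not described here and would affect the resulting values.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = word-level edit distance / number of reference words.

    Assumption: both texts are lowercased and split on whitespace.
    Real evaluations often also strip punctuation and normalise numbers.
    """
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current[j] = min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            )
        previous = current

    return previous[-1] / len(ref_words)


# Illustrative sentences only; these are not taken from the dataset.
reference = "the interview was recorded on monday"
hypothesis = "the interview was recorded monday"
print(word_error_rate(reference, hypothesis))  # one deletion over six words
```

Under this definition, a WER of 0 means the automatic transcript matches the reference word for word, and the value can exceed 1 when the transcript contains many inserted words. An average WER is then simply the mean of the per-file values.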