Predicting Utterance-Final Timing Considering Linguistic Features Using Wav2vec 2.0

EasyChair Preprint 15119
5 pages • Date: September 27, 2024

Abstract

Accurate turn-taking prediction is essential in spoken dialog systems, in order to determine whether the system or the user should make the next utterance. Previous research has significantly improved the accuracy of turn-taking prediction, allowing dialog systems to avoid unnatural pauses before responding. However, in human-to-human dialogs, responses do not always occur immediately after a speaker's utterance ends; sometimes there are deliberate pauses or responses made with overlap. Therefore, this study proposes a method to estimate in advance when the interlocutor's utterances will end, allowing the system to respond with more natural timing, including occasional overlaps. We utilized wav2vec 2.0, fine-tuned for automatic speech recognition, to estimate utterance end times by considering linguistic features, and compared these methods with prediction methods that use only acoustic features. The results of our comparison showed that considering linguistic features allows more accurate prediction of utterance-final timing. Additionally, we observed that when using the proposed method, the estimated time until the end of the utterance decreases as the utterance approaches its end.

Keyphrases: Wav2vec 2.0, spoken dialog system, turn-taking, utterance-final timing prediction