Predicting Utterance-Final Timing Considering Linguistic Features Using Wav2vec 2.0

EasyChair Preprint 15119

5 pages • Date: September 27, 2024

Takanori Kanai, Yukoh Wakabayashi, Ryota Nishimura and Norihide Kitaoka

Abstract

Accurate turn-taking prediction is essential in spoken dialog systems to determine whether the system or the user should make the next utterance. Previous research has significantly improved the accuracy of turn-taking prediction, allowing dialog systems to avoid unnatural pauses before responding. However, in human-to-human dialogs, responses do not always occur immediately after a speaker's utterance ends; sometimes there are deliberate pauses, and sometimes responses overlap the speaker's utterance. This study therefore proposes a method to estimate in advance when the interlocutor's utterance will end, allowing the system to respond with more natural timing, including occasional overlaps. We used wav2vec 2.0, fine-tuned for automatic speech recognition, to estimate utterance end times while taking linguistic features into account, and compared this approach with prediction methods that use only acoustic features. The comparison showed that considering linguistic features allows more accurate prediction of utterance-final timing. Additionally, we observed that with the proposed method, the estimated time until the end of the utterance decreases as the utterance approaches its end.
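As an illustration only, the quantity described in the abstract (the time remaining until the end of the utterance, estimated while the utterance is still in progress) can be framed as a per-frame regression target. The sketch below is not taken from the paper: the function name `remaining_time_targets` is hypothetical, and the 20 ms frame stride is an assumption based on the usual output rate of wav2vec 2.0 encoders. The paper's actual target definition and training setup may differ.

```python
def remaining_time_targets(utterance_end_s, frame_stride_s=0.02):
    """Return the time remaining (in seconds) at each frame of one utterance.

    Hypothetical helper for illustration. `frame_stride_s` defaults to
    20 ms, an assumed encoder frame stride, not a value from the paper.
    """
    # Number of whole frame steps that fit before the utterance ends.
    n_frames = int(round(utterance_end_s / frame_stride_s))
    targets = []
    for i in range(n_frames + 1):
        frame_time = i * frame_stride_s
        # Clamp at zero so rounding never yields a negative target.
        targets.append(max(utterance_end_s - frame_time, 0.0))
    return targets


if __name__ == "__main__":
    # A 0.1 s utterance sampled every 20 ms gives six targets that
    # count down toward zero as the utterance approaches its end.
    for remaining in remaining_time_targets(0.1):
        print(f"{remaining:.2f}")
```

A model trained against targets of this shape would be expected to produce estimates that shrink toward zero near the end of the utterance, which is the behavior the abstract reports observing.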

Keyphrases: Wav2vec 2.0, spoken dialog system, turn-taking, utterance-final timing prediction

Links:

https://easychair.org/publications/preprint/pDzS

BibTeX entry

BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:

@booklet{EasyChair:15119,
  author    = {Takanori Kanai and Yukoh Wakabayashi and Ryota Nishimura and Norihide Kitaoka},
  title     = {Predicting Utterance-Final Timing Considering Linguistic Features Using Wav2vec 2.0},
  howpublished = {EasyChair Preprint 15119},
  year      = {EasyChair, 2024}}
