Speech transformer models demonstrate a sensitivity to articulatory events
Presentation Number: 5pSC34
Khalil Iskarous*1, Haley Hsu1, Dani Byrd1
1Linguistics, University of Southern California, Los Angeles, California, United States
Abstract: The black box of speech recognition transformer models has yet to give up its secret of how they extract linguistically salient information from continuous audio. One hypothesis is that these systems exploit low-level correlations in the audio signal, while a complementary view holds that they capture the causal information that structures the signal in a way relevant to speech production. We address this quandary by deploying the HuBERT model on audio for which we have simultaneous real-time vocal tract MRI (rtMRI). We probe correlations of attentional dynamics that incorporate acoustic measures and those that incorporate articulatory change, indexed by proxy through MFCCs. Additionally, we extract airway edges from vocal tract rtMRI video of read speech. We use HuBERT-Large, which has 24 encoder layers with 16 attention heads in each layer. We demonstrate that while the model's attentional mechanisms can and do focus on acoustic, spectral, and articulatory events, an analysis grounded in the acoustic theory of speech production shows that these results are best explained by assuming that the system is sensitive to the causal status of the articulatory events that generate the speech signal. [Supported by NSF]
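For concreteness, the following is a minimal sketch (not the authors' code) of the kind of probing pipeline the abstract describes: extracting per-layer, per-head attention maps from HuBERT-Large and correlating a frame-to-frame attention-change index with MFCC change. The checkpoint name, audio file, and the specific attention-dynamics measure are illustrative assumptions.

```python
# Illustrative probe: HuBERT-Large attention dynamics vs. MFCC change.
# Assumes the HuggingFace "facebook/hubert-large-ll60k" checkpoint and a
# hypothetical 16 kHz mono file "utterance.wav"; the attention-change
# index used here is one plausible choice, not the authors' exact measure.
import numpy as np
import torch
import librosa
from transformers import HubertModel

model = HubertModel.from_pretrained(
    "facebook/hubert-large-ll60k", output_attentions=True
)
model.eval()

wav, sr = librosa.load("utterance.wav", sr=16000)
# HuBERT-Large expects utterance-level normalized input.
wav = (wav - wav.mean()) / (wav.std() + 1e-7)

with torch.no_grad():
    out = model(torch.from_numpy(wav).unsqueeze(0))

# out.attentions: 24 layers, each (1, 16 heads, T, T), at a ~20 ms frame rate.
attn = torch.stack(out.attentions).squeeze(1).numpy()  # (24, 16, T, T)
T = attn.shape[-1]

# Attention dynamics: frame-to-frame change in each frame's attention
# distribution, averaged over all layers and heads -> one value per step.
attn_change = np.linalg.norm(np.diff(attn, axis=2), axis=3).mean(axis=(0, 1))

# Acoustic-change proxy: MFCC delta magnitude at the same 20 ms hop
# (320 samples at 16 kHz, matching HuBERT's convolutional front end).
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=320)
mfcc_change = np.linalg.norm(np.diff(mfcc[:, :T], axis=1), axis=0)

n = min(len(attn_change), len(mfcc_change))
r = np.corrcoef(attn_change[:n], mfcc_change[:n])[0, 1]
print(f"attention-dynamics vs. MFCC-change correlation: r = {r:.3f}")
```

A full analysis along the abstract's lines would add the rtMRI-derived articulatory measures on the same frame grid; that side is omitted here because it depends on the airway edge-extraction pipeline.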