Epoch-Based Spectrum Estimation for Speech

Jon Gudnason, Guolin Fang, Mike Brookes

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

An implicit assumption when using the discrete Fourier transform for spectrum estimation is that the time signal is periodic. This assumption clashes with the quasi-periodicity of voiced speech when the traditional short-time Fourier transform (STFT) is applied to it. The mismatch causes distortion and handicaps downstream processing. This work proposes a remedy: using epochs in the signal to determine better frame boundaries for the Fourier transform. The epochs are the estimated glottal closure instants in voiced speech and significant peaks in unvoiced speech. The resulting coefficients are compared to the traditional STFT coefficients using copy-synthesis. An improvement of 15 dB in signal-to-noise ratio and a PESQ score of 2.5 to 3.5 are achieved for copy-synthesis using 20 mel-filters. The results demonstrate great potential for improving downstream speech processing applications with this approach to spectrum estimation.
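
The periodicity argument in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it assumes an idealised, perfectly periodic synthetic signal, epoch positions that are already known, and a rectangular window. The names `significant_bins`, `epochs`, `fixed_frame` and `epoch_frame` are hypothetical. It only shows that a DFT frame spanning whole pitch periods concentrates energy in fewer bins than a fixed-length frame that cuts a period short.

```python
import numpy as np

def significant_bins(frame, rel_threshold=1e-3):
    """Count DFT bins whose power exceeds rel_threshold times the peak power."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    return int(np.sum(power > rel_threshold * power.max()))

# Synthetic "voiced" signal: three harmonics of a 100 Hz fundamental sampled
# at 8 kHz, so one pitch period is exactly 80 samples (an idealised case).
period = 80
n = np.arange(4 * period)
signal = sum(a * np.cos(2 * np.pi * h * n / period)
             for h, a in [(1, 1.0), (2, 0.5), (3, 0.25)])

# Hypothetical epoch positions, one per pitch period (assumed known here;
# in practice they would have to be estimated from the speech signal).
epochs = np.arange(0, len(signal) + 1, period)

# Fixed-length frame: 200 samples = 2.5 periods, boundaries ignore the epochs.
fixed_frame = signal[:200]
# Epoch-aligned frame: spans exactly two pitch periods between epochs.
epoch_frame = signal[epochs[0]:epochs[2]]

fixed_count = significant_bins(fixed_frame)
epoch_count = significant_bins(epoch_frame)
print("fixed frame:", fixed_count, "bins; epoch-aligned frame:", epoch_count, "bins")
```

In this toy case the epoch-aligned frame places its energy on the three harmonic bins only, while the fixed-length frame leaks energy into many neighbouring bins. Real voiced speech is only quasi-periodic, so the contrast would be less clean.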

Original language: English
Title of host publication: Interspeech 2023
Pages: 4274-4278
Number of pages: 5
Volume: 2023
Publication status: Published - 2023
Event: 24th International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 to 24 Aug 2023

Publication series

Name: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
ISSN (Print): 2308-457X

Conference

Conference: 24th International Speech Communication Association, Interspeech 2023
Country/Territory: Ireland
City: Dublin
Period: 20/08/23 to 24/08/23

Bibliographical note

Publisher Copyright:
© 2023 International Speech Communication Association. All rights reserved.

Other keywords

  • copy-synthesis
  • Fourier analysis
  • speech signal processing
  • vocoding
