Complementary regional energy features for spoofed speech detection

dc.authoridDisken, Gokay/0000-0002-8680-0636
dc.contributor.authorDisken, Gokay
dc.date.accessioned2025-01-06T17:37:38Z
dc.date.available2025-01-06T17:37:38Z
dc.date.issued2024
dc.description.abstractAutomatic speaker verification systems have been found to be vulnerable to spoofing attacks such as voice conversion, text-to-speech, and replayed speech. As the security of biometric systems is vital, many countermeasures have been developed for spoofed speech detection. To keep pace with recent developments in speech synthesis, publicly available datasets have become increasingly challenging (e.g., the ASVspoof 2019 and 2021 datasets). A variety of replay attack configurations are also considered in those datasets, because replay attacks require no expertise and are therefore easy to perform. This work utilizes regional energy features, which are experimentally shown to be more effective than the traditional frame-based energy features. The proposed energy features are independent of the utterance length and are extracted over nonoverlapping time-frequency regions of the magnitude spectrum. Different configurations are considered in the experiments to verify the contribution of the regional energy features to the performance. First, a light convolutional neural network - long short-term memory (LCNN-LSTM) model with linear frequency cepstral coefficients is used to determine the optimal number of regional energy features. Then, an SE-Res2Net model with log power spectrogram features is used, which achieves results comparable to the state of the art for the ASVspoof 2019 logical access condition. The physical access condition from the ASVspoof 2019 dataset and the logical access and deepfake conditions from the ASVspoof 2021 dataset are also used in the experiments. The regional energy features achieve improvements for all conditions with almost no additional computational or memory load (less than 1% increase in the model size for SE-Res2Net). The main advantages of the regional energy features can be summarized as i) capturing nonspeech segments and ii) extracting band-limited information. Both aspects are found to be discriminative for spoofed speech detection.
dc.description.sponsorshipTUBITAK [121E057]
dc.description.sponsorshipAcknowledgment This work was supported by TUBITAK under project no. 121E057.
dc.identifier.doi10.1016/j.csl.2023.101602
dc.identifier.issn0885-2308
dc.identifier.issn1095-8363
dc.identifier.scopus2-s2.0-85180748134
dc.identifier.scopusqualityQ1
dc.identifier.urihttps://doi.org/10.1016/j.csl.2023.101602
dc.identifier.urihttps://hdl.handle.net/20.500.14669/2309
dc.identifier.volume85
dc.identifier.wosWOS:001147050600001
dc.identifier.wosqualityQ2
dc.indekslendigikaynakWeb of Science
dc.indekslendigikaynakScopus
dc.language.isoen
dc.publisherAcademic Press Ltd- Elsevier Science Ltd
dc.relation.ispartofComputer Speech and Language
dc.relation.publicationcategoryArticle - International Refereed Journal - Institutional Faculty Member
dc.rightsinfo:eu-repo/semantics/closedAccess
dc.snmzKA_20241211
dc.subjectEnergy features
dc.subjectFeature extraction
dc.subjectReplay speech detection
dc.subjectSynthetic speech detection
dc.subjectDeep learning
dc.titleComplementary regional energy features for spoofed speech detection
dc.typeArticle
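
The abstract describes energy features extracted over nonoverlapping time-frequency regions of the magnitude spectrum that are independent of the utterance length. The record does not give the exact procedure, so the following is only a minimal sketch of one way such features could be computed. The function names, the 4 x 4 grid, the STFT settings, and the use of the log of the mean squared magnitude per region are all assumptions made for illustration, not the method of the paper.

```python
import numpy as np


def magnitude_spectrogram(signal, n_fft=512, hop=160):
    """Magnitude STFT with a Hann window; returns (freq_bins, frames).

    n_fft and hop are illustrative values, not taken from the paper.
    The signal must contain at least n_fft samples.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)).T


def regional_energy(mag, n_time=4, n_freq=4, eps=1e-10):
    """Log mean energy over a fixed grid of nonoverlapping regions.

    The grid size is fixed (n_freq x n_time), so the output length does
    not depend on the number of frames. Assumes mag has at least n_freq
    rows and n_time columns, so that no region is empty.
    """
    freq_edges = np.linspace(0, mag.shape[0], n_freq + 1).astype(int)
    time_edges = np.linspace(0, mag.shape[1], n_time + 1).astype(int)
    feats = np.empty((n_freq, n_time))
    for i in range(n_freq):
        for j in range(n_time):
            region = mag[
                freq_edges[i] : freq_edges[i + 1],
                time_edges[j] : time_edges[j + 1],
            ]
            # Mean (not sum) keeps the value comparable across region sizes.
            feats[i, j] = np.log(np.mean(region ** 2) + eps)
    return feats.ravel()
```

Under these assumptions, utterances of any duration map to a vector of n_time * n_freq values, which is one way to read "independent of the utterance length". Each value summarizes a band-limited slice of one portion of the utterance, which loosely mirrors the two advantages named in the abstract.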
