Delta feature maps with application to spoofed speech detection

Date

2025

Publisher

Pergamon-Elsevier Science Ltd

Access Rights

info:eu-repo/semantics/closedAccess

Abstract

Convolutional layers are used in many deep learning architectures because of their feature extraction capabilities. Besides traditional convolution, several modified convolution techniques have been proposed. Among them, differential convolution generates additional feature maps from the differences between activations in a selected direction. It was found to be effective for image recognition, using pre-defined fixed filters that focus on two adjacent activations. For speech-related tasks, tracking dynamic information over a broader range may be beneficial. To this end, this paper proposes delta feature maps, in which the fixed filters of differential convolution are modified based on the computation of handcrafted delta cepstral features. The proposed filters can extract dynamic information, similar to delta cepstral features, within a convolutional neural network. Handcrafted delta and/or delta-delta features have proven effective, especially for synthetic speech detection. Hence, the logical access (LA) condition of the ASVspoof 2019 dataset and the recent ASVspoof 5 dataset are used to verify the effectiveness of delta feature maps. For ASVspoof 2019, the residual time-domain synthetic speech detection net (Res-TSSDNet) is used as a 1-D model and the one-class neural network with directed statistics pooling (OCNet-DSP) as a 2-D model, verifying that delta feature maps work with both dimensionalities. Because ASVspoof 5 is a more challenging dataset, data augmentation, a foundation model front-end, and a Nes2Net-X back-end are used. Delta feature maps are incorporated into Nes2Net-X in two different configurations. One configuration reduced the back-end size from 291 K to 76 K while preserving performance. The other achieved the lowest equal error rate, 4.33%, among the reported single systems with a pre-trained foundation model.
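The abstract does not give the exact filter coefficients, so the following is only a minimal sketch of the general idea, assuming the widely used regression formula for delta cepstral features, d[t] = sum_{n=1..N} n (c[t+n] - c[t-n]) / (2 sum_{n=1..N} n^2). The function names `delta_kernel` and `delta_feature_map`, the window half-width `N`, and the edge-padding choice are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def delta_kernel(N: int = 2) -> np.ndarray:
    """Fixed weights of the standard delta regression formula (assumed form).

    Index 0 of the returned array multiplies c[t-N], index 2N multiplies c[t+N].
    """
    n = np.arange(-N, N + 1, dtype=float)
    return n / (2.0 * np.sum(np.arange(1, N + 1) ** 2))


def delta_feature_map(fmap: np.ndarray, N: int = 2) -> np.ndarray:
    """Apply the fixed delta kernel along the last (time) axis.

    Boundary values are repeated at both ends (an assumption), so the
    output has the same shape as the input activation map.
    """
    k = delta_kernel(N)
    pad = [(0, 0)] * (fmap.ndim - 1) + [(N, N)]
    padded = np.pad(fmap, pad, mode="edge")
    T = fmap.shape[-1]
    out = np.zeros(fmap.shape, dtype=float)
    # Weighted sum of shifted copies, i.e. a cross-correlation with k.
    for i, w in enumerate(k):
        out += w * padded[..., i:i + T]
    return out


if __name__ == "__main__":
    # Hypothetical activation map: 3 channels, 10 time steps.
    rng = np.random.default_rng(0)
    activations = rng.standard_normal((3, 10))
    deltas = delta_feature_map(activations, N=2)
    # Stack the original and delta maps along the channel axis.
    stacked = np.concatenate([activations, deltas], axis=0)
    print(stacked.shape)
```

With N = 2 the kernel spans five time steps, in contrast to a two-tap difference between adjacent activations, which is one way to read the "broader range" motivation above. Concatenating the delta maps with the original activations is likewise just one plausible way to use them as additional feature maps.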

Keywords

Convolutional neural networks, Feature extraction, Synthetic speech detection

Source

Computers & Electrical Engineering

Volume

128
