Browse by Author "Disken, Gokay"
Showing 1 - 15 of 15
Item: A Review on Feature Extraction for Speaker Recognition under Degraded Conditions (Taylor & Francis Ltd, 2017) Disken, Gokay; Tufekci, Zekeriya; Saribulut, Lutfu; Cevik, Ulus
Speech is a signal that carries the speaker's emotion, characteristic traits, phoneme information, etc. Various methods have been proposed for speaker recognition by extracting such characteristics from a given utterance. Among them, short-term cepstral features are used extensively in speech and speaker recognition because of their low complexity and high performance in controlled environments. On the other hand, their performance decreases dramatically under degraded conditions such as channel mismatch, additive noise, emotional variability, etc. In this paper, a literature review on speaker-specific information extraction from speech is presented by considering the latest studies offering solutions to the aforementioned problem. The studies are categorized into three groups according to their robustness against channel mismatch, additive noise, and other degradations such as vocal effort, emotion mismatch, etc. For a clearer presentation, they are also summarized in two tables according to their classification methods and the data sets used.

Item: A robust polynomial regression-based voice activity detector for speaker verification (Springer International Publishing Ag, 2017) Disken, Gokay; Tufekci, Zekeriya; Cevik, Ulus
Robustness against background noise is a major research area for speech-related applications such as speech recognition and speaker recognition. One of the many solutions to this problem is to detect speech-dominant regions by using a voice activity detector (VAD). In this paper, a second-order polynomial regression-based algorithm is proposed with a function similar to that of a VAD for text-independent speaker verification systems. The proposed method aims to separate steady noise/silence regions, steady speech regions, and speech onset/offset regions.
The regression is applied independently to each filter band of a mel spectrum, which makes the algorithm fit seamlessly into the conventional extraction process of the mel-frequency cepstral coefficients (MFCCs). The k-means algorithm is also applied to estimate the average noise energy in each band for spectral subtraction. A pseudo SNR-dependent linear thresholding for the final VAD output decision is introduced based on the k-means energy centers. This thresholding considers the speech presence in each band. Conventional VADs usually neglect the deteriorative effects of the additive noise in the speech regions. In contrast, the proposed method decides not only on speech presence, but also on whether the frame is dominated by the speech or by the noise. The performance of the proposed algorithm is compared with a continuous noise tracking method and another VAD method in speaker verification experiments, where five different noise types at five different SNR levels were considered. The proposed algorithm showed superior verification performance with both the conventional GMM-UBM method and the state-of-the-art i-vector method.

Item: Complementary regional energy features for spoofed speech detection (Academic Press Ltd- Elsevier Science Ltd, 2024) Disken, Gokay
Automatic speaker verification systems are found to be vulnerable to spoof attacks such as voice conversion, text-to-speech, and replayed speech. As the security of biometric systems is vital, many countermeasures have been developed for spoofed speech detection. To keep pace with recent developments in speech synthesis, publicly available datasets have become more and more challenging (e.g., the ASVspoof 2019 and 2021 datasets). A variety of replay attack configurations were also considered in those datasets, as replay attacks do not require expertise and are hence easily performed. This work utilizes regional energy features, which are experimentally shown to be more effective than the traditional frame-based energy features.
The proposed energy features are independent of the utterance length and are extracted over nonoverlapping time-frequency regions of the magnitude spectrum. Different configurations are considered in the experiments to verify the regional energy features' contribution to the performance. First, a light convolutional neural network - long short-term memory (LCNN - LSTM) model with linear frequency cepstral coefficients is used to determine the optimal number of regional energy features. Then, an SE-Res2Net model with log power spectrogram features is used, which achieved results comparable to the state-of-the-art for the ASVspoof 2019 logical access condition. The physical access condition from the ASVspoof 2019 dataset, and the logical access and deep fake conditions from the ASVspoof 2021 dataset, are also used in the experiments. The regional energy features achieved improvements for all conditions with almost no additional computational or memory load (less than 1% increase in the model size for SE-Res2Net). The main advantages of the regional energy features can be summarized as i) capturing nonspeech segments, ii) extracting band-limited information. Both aspects are found to be discriminative for spoofed speech detection.

Item: Differential convolutional network for noise mask estimation (Elsevier Sci Ltd, 2023) Disken, Gokay
With the advancement of speech synthesis technology, audio spoof detection systems have become vital for the security of automatic speaker verification systems. Many effective solutions have been offered for clean speech data. However, additive noise has a detrimental impact on the detection performance, as in many other speech-related tasks. The noise mask is one of the methods proposed to increase the robustness against additive noise. The purpose of the noise mask is to identify the time-frequency regions dominated by the noise signal. In this work, a differential convolutional neural network is used to create noise masks.
Differential convolution considers directional changes of the activations and generates new feature maps. Compared to the traditional convolutional network, a finer noise mask can be created with this method. Once the differential network for the noise masks is trained, its outputs are given to the spoof detection systems. Linear filterbank magnitudes are used as acoustic features for both noise masks and spoof detection. Therefore, the spoof detection systems have 2-channel inputs, i.e., linear filterbank magnitudes and their corresponding masks. Probabilistic linear discriminant analysis (PLDA) with x-vectors, emphasized channel attention, propagation and aggregation time delay neural network (ECAPA-TDNN), and light convolutional neural network (LCNN) followed by long short-term memory (LSTM) layers were used as classifiers. Three different noise types are used in both training and test stages, and two different noise types are used solely in the test stage, to simulate seen and unseen conditions, respectively. Experiments conducted on the noisy version of the ASVspoof 2015 challenge dataset showed that the LCNN-LSTM network with noise masks can achieve superior performance compared to other robust systems and can compete with the state-of-the-art. Averaged over the known noise types, a 2.67% equal error rate (EER) was observed. For the unknown noise types, a 3.10% average EER was achieved. For the original (clean) ASVspoof 2015 data, the EER was 0.83%. Additionally, a 2.6% EER was observed for the logical access condition of the ASVspoof 2019 data.

Item: DUAL MODE ANTENNA DESIGN USING SPLIT RING RESONATOR ARRAY (IEEE, 2014) Disken, Gokay; Ekmekci, Evren
In this study, an array structure formed by four split ring resonators (SRR) has been used as the radiating part of the antenna. Due to the coupling between resonators placed at proper distances along the propagation direction, two modes of radiation have been shown by numerical calculations.
Changing the distance between SRR couples shifts the resonance frequencies and affects the return loss values associated with the resonance frequencies. Calculated surface current distributions show that each radiation mode is caused by different SRR structures on the array structure.

Item: Dual mode antenna design using split ring resonator array (IEEE Computer Society, 2014) Disken, Gokay; Ekmekci, Evren
In this study, an array structure formed by four split ring resonators (SRR) has been used as the radiating part of the antenna. Due to the coupling between resonators placed at proper distances along the propagation direction, two modes of radiation have been shown by numerical calculations. Changing the distance between SRR couples shifts the resonance frequencies and affects the return loss values associated with the resonance frequencies. Calculated surface current distributions show that each radiation mode is caused by different SRR structures on the array structure. © 2014 IEEE.

Item: Multilabel voice disorder classification using raw waveforms (Tubitak Scientific & Technological Research Council Turkey, 2024) Disken, Gokay
Automated voice disorder systems that distinguish pathological voices from healthy ones have been developed with the aid of machine learning methods. Both clinicians and patients can benefit from these systems, as they provide many advantages compared to invasive techniques. These systems can produce binary (healthy/pathological) or multiclass (healthy/selected pathologies) decisions. However, multiple disorders might exist in an individual's voice. Multilabel classification should be considered in such cases. To date, only a single report is available on this topic, where hand-crafted features were used and a data augmentation technique was utilized to overcome class imbalances. In this study, a similar experimental setup is followed to investigate the suitability of raw voice signals as inputs for multilabel classification.
A deep learning model which consists of residual blocks and a novel gating mechanism is proposed. The gating mechanism weighs the channels of a residual block's output based on both its output and the previous layer's output. Using a SincNet filterbank that operates directly on the raw waveform as the initial layer, 0.99 accuracy and 0.98 F1 score were observed for natural /a/ vowels of the Saarbruecken Voice Database with time domain augmentation to balance the class samples. On the other hand, reducing the number of augmented samples decreased the performance for both systems, indicating the need for a balanced dataset to avoid oversampling underrepresented classes. The proposed architecture performed consistently better than ResNet18 with deep connected attention, which verified the effectiveness of the proposed gating mechanism.

Item: Real-Time Speaker Independent Isolated Word Recognition on Banana Pi (IEEE, 2018) Disken, Gokay; Saribulut, Lutfu; Tufekci, Zekeriya; Cevik, Ulus
Devices controlled with voice commands have gained popularity over the last decade. To recognize an utterance, they usually require an internet connection or use commercial programming libraries. Therefore, their flexibility is low, and algorithm update opportunities are limited. In this study, a speaker independent isolated word recognition algorithm, embedded in a single board computer, is proposed to recognize utterances in real-time. The proposed system neither requires an internet connection nor uses external libraries. Mel Frequency Cepstral Coefficients and their deltas are used as feature vectors. Gaussian mixture models are utilized to define word models. Digits and some confirmation words of the Turkish language are recorded ten times in one session from twenty-four individuals. Seven of these records are used for training, and the others for testing the system. The off-line experimental results showed that the system operates at 99.98% accuracy.
In real-time experiments, the system's recognition accuracy was proficient for controlled environments.

Item: Real-Time Speaker Independent Isolated Word Recognition on Banana Pi (Institute of Electrical and Electronics Engineers Inc., 2018) Disken, Gokay; Saribulut, Lutfu; Tufekci, Zekeriya; Cevik, Ulus
Devices controlled with voice commands have gained popularity over the last decade. To recognize an utterance, they usually require an internet connection or use commercial programming libraries. Therefore, their flexibility is low, and algorithm update opportunities are limited. In this study, a speaker independent isolated word recognition algorithm, embedded in a single board computer, is proposed to recognize utterances in real-time. The proposed system neither requires an internet connection nor uses external libraries. Mel Frequency Cepstral Coefficients and their deltas are used as feature vectors. Gaussian mixture models are utilized to define word models. Digits and some confirmation words of the Turkish language are recorded ten times in one session from twenty-four individuals. Seven of these records are used for training, and the others for testing the system. The off-line experimental results showed that the system operates at 99.98% accuracy. In real-time experiments, the system's recognition accuracy was proficient for controlled environments. © 2018 IEEE.

Item: RECOGNITION OF NON-SPEECH SOUNDS USING MEL-FREQUENCY CEPSTRUM COEFFICIENTS AND DYNAMIC TIME WARPING METHOD (IEEE, 2015) Disken, Gokay; Ibrikci, Turgay
With developing technology, speech recognition systems are taking up more space in our daily lives. Sounds in our environment are not only pure speech. Because of this, it is important for cochlear implants, unmanned vehicles, and security systems to be able to recognize other sounds. In this work, Mel-frequency cepstrum coefficients, one of the most widely used methods for feature extraction in speech recognition, are applied to various nature and animal sounds.
Because the sounds do not all have the same duration, dynamic time warping, one of the methods used in speech recognition, is preferred to classify the feature vectors. The difference in the durations of the sounds affects the lengths of the feature vectors. With the dynamic time warping method, one can overcome these differences. One reference record and 10 test records were obtained from 10 different sound sources. The true classification rate is found to be 88%.

Item: Robust Spoofed Speech Detection with Denoised I-vectors (Gazi Univ, 2023) Disken, Gokay
Spoofed speech detection has recently been gaining the attention of researchers, as speaker verification is shown to be vulnerable to spoofing attacks such as voice conversion, speech synthesis, replay, and impersonation. Although various methods have been proposed to detect spoofed speech, their performance decreases dramatically under mismatched conditions due to additive or reverberant noise. Conventional speech enhancement methods fail to close the performance gap, hence more advanced techniques seem to be necessary to solve the noisy spoofed speech detection problem. In this work, a Denoising Autoencoder (DAE) is used to obtain clean estimates of i-vectors from their noisy versions. The ASVspoof 2015 database is used in the experiments with five different noise types, added to the original utterances at 0, 10, and 20 dB signal-to-noise ratios (SNR). The experimental results verified that the DAE provides more robust spoof detection where the conventional methods fail.

Item: Scale-invariant MFCCs for speech/speaker recognition (Tubitak Scientific & Technological Research Council Turkey, 2019) Tufekci, Zekeriya; Disken, Gokay
The feature extraction process is a fundamental part of speech processing. Mel frequency cepstral coefficients (MFCCs) are the most commonly used feature types in the speech/speaker recognition literature.
However, the MFCC framework may face numerical issues or dynamic range problems, which decrease its performance. A practical solution to these problems is adding a constant to the filter-bank magnitudes before log compression, which violates the scale-invariant property. In this work, a magnitude normalization and a multiplication constant are introduced to make the MFCCs scale-invariant and to avoid dynamic range expansion of nonspeech frames. Speaker verification experiments are conducted to show the effectiveness of the proposed scheme.

Item: Speaker Model Clustering to Construct Background Models for Speaker Verification (Polska Akad Nauk, Polish Acad Sciences, Inst Fundamental Tech Res Pas, 2017) Disken, Gokay; Tufekci, Zekeriya; Cevik, Ulus
Conventional speaker recognition systems use the Universal Background Model (UBM) as an imposter for all speakers. In this paper, speaker models are clustered to obtain better imposter model representations for speaker verification purposes. First, a UBM is trained, and speaker models are adapted from the UBM. Then, the k-means algorithm with the Euclidean distance measure is applied to the speaker models. The speakers are divided into two, three, four, and five clusters. The resulting cluster centers are used as background models of their respective speakers. Experiments showed that the proposed method consistently produced lower Equal Error Rates (EER) than the conventional UBM approach for 3, 10, and 30 second long test utterances, and also for channel mismatch conditions. The proposed method is also compared with the i-vector approach. The three-cluster model achieved the best performance with a 12.4% relative EER reduction on average, compared to the i-vector method.
The statistical significance of the results is also given.

Item: Spoofed Speech Detection with Weighted Phase Features and Convolutional Networks (Polska Akad Nauk, Polish Acad Sciences, Inst Fundamental Tech Res Pas, 2022) Disken, Gokay
Detection of audio spoofing attacks has become vital for automatic speaker verification systems. Spoofing attacks can be carried out in several ways, such as speech synthesis, voice conversion, replay, and mimicry. Extracting discriminative features from speech data can improve the accuracy of detecting these attacks. In fact, a frame-wise weighted magnitude spectrum has recently been found to be effective for detecting replay attacks. In this work, discriminative features are obtained in a similar fashion (frame-wise weighting); however, a cosine normalized phase spectrum is used, since phase-based features have shown decent performance for the given task. The extracted features are then fed to a convolutional neural network as input. In the experiments, the ASVspoof 2015 and 2017 databases are used to investigate the proposed system's spoof detection performance for synthetic and replay attacks, respectively. The results showed that the proposed approach achieved a 34.5% relative decrease in the average EER for the ASVspoof 2015 evaluation set, compared to the ordinary cosine normalized phase features. Furthermore, the proposed system outperformed the others at detecting the S10 attack type of the ASVspoof 2015 database.

Item: Very Short-Term Prosumer Electric Load Forecasting Using Deep Learning-Based Techniques (IEEE, 2024) Aydin, Bari; Zor, Kasim; Disken, Gokay
In recent years, owing to the increasing penetration of renewable energy sources into modern electric distribution networks in the age of the smart grid, the concept of the consumer in electricity markets has evolved into the concept of the prosumer, which refers to an individual who both consumes and produces electricity.
In order to maintain the crucial balance between the generation and consumption of electricity, prosumer electric load forecasting (PELF) has become a requisite for energy management and planning in today's microgrids. Deep learning (DL)-based techniques are frequently employed for forecasting the electric load, which is nonstationary and affected by several factors such as seasonal effects, climatological conditions, and random effects. The aim of this paper is to present a benchmark for PELF of a household residing in the state of California, USA, using DL-based techniques, namely convolutional neural networks (CNN) and gated recurrent unit networks (GRU), within the very short-term horizon. In addition, hourly meteorological data for the residential area has been obtained from the Modern-Era Retrospective Analysis for Research and Applications, Version 2 (MERRA-2) database of NASA. The results of the paper showed that utilizing CNN achieved better performance for PELF in terms of mean absolute error (MAE) and root mean squared error (RMSE), by 13% and 8%, respectively. Furthermore, it is considered that there is a gap in the literature for PELF, and that this paper will bridge this gap while guiding potential researchers in the field.
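
The two error metrics named in the abstract above, MAE and RMSE, can be illustrated with a minimal sketch. The function names and the sample load values are hypothetical and are not taken from the paper; the sketch only shows the standard definitions of the two metrics, not the authors' evaluation code.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference; penalizes large errors more."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)


# Hypothetical load readings (kW) and forecasts, for illustration only.
actual_load = [1.0, 2.0, 3.0, 4.0]
forecast_load = [1.5, 2.0, 2.0, 4.5]

mae = mean_absolute_error(actual_load, forecast_load)
rmse = root_mean_squared_error(actual_load, forecast_load)
print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```

Because RMSE squares each error before averaging, a forecaster can improve by different percentages on the two metrics, which is consistent with the abstract reporting separate gains for MAE and RMSE.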