Browse by author "Tufekci, Zekeriya"
Showing 1 - 7 of 7
Item: A Review on Feature Extraction for Speaker Recognition under Degraded Conditions (Taylor & Francis Ltd, 2017)
Disken, Gokay; Tufekci, Zekeriya; Saribulut, Lutfu; Cevik, Ulus
Speech is a signal that carries information about the speaker's emotion, individual characteristics, phonemes, and more. Various methods have been proposed for speaker recognition that extract features from a given utterance. Among them, short-term cepstral features are used extensively in speech and speaker recognition because of their low complexity and high performance in controlled environments. However, their performance decreases dramatically under degraded conditions such as channel mismatch, additive noise, and emotional variability. This paper presents a literature review of speaker-specific information extraction from speech, focusing on recent studies that offer solutions to these problems. The studies are grouped into three categories according to their robustness against channel mismatch, additive noise, and other degradations such as vocal effort and emotion mismatch. For clarity, they are also summarized in two tables according to the classification methods and datasets they use.

Item: A robust polynomial regression-based voice activity detector for speaker verification (Springer International Publishing Ag, 2017)
Disken, Gokay; Tufekci, Zekeriya; Cevik, Ulus
Robustness against background noise is a major research area for speech-related applications such as speech recognition and speaker recognition. One of the many solutions to this problem is to detect speech-dominant regions using a voice activity detector (VAD). In this paper, a second-order polynomial regression-based algorithm that functions similarly to a VAD is proposed for text-independent speaker verification systems. The proposed method aims to separate steady noise/silence regions, steady speech regions, and speech onset/offset regions.
The regression is applied independently to each filter band of a mel spectrum, so the algorithm fits seamlessly into the conventional extraction process of mel-frequency cepstral coefficients (MFCCs). The k-means algorithm is also applied to estimate the average noise energy in each band for spectral subtraction. A pseudo-SNR-dependent linear thresholding for the final VAD output decision is introduced based on the k-means energy centers; this thresholding takes the speech presence in each band into account. Conventional VADs usually neglect the degrading effect of additive noise within speech regions. In contrast, the proposed method decides not only whether speech is present, but also whether the frame is dominated by speech or by noise. The performance of the proposed algorithm is compared with a continuous noise tracking method and another VAD method in speaker verification experiments covering five noise types at five SNR levels. The proposed algorithm showed superior verification performance with both the conventional GMM-UBM method and the state-of-the-art i-vector method.

Item: Music emotion recognition using convolutional long short term memory deep neural networks (Elsevier - Division Reed Elsevier India Pvt Ltd, 2021)
Hizlisoy, Serhat; Yildirim, Serdar; Tufekci, Zekeriya
In this paper, we propose an approach for music emotion recognition based on the convolutional long short-term memory deep neural network (CLDNN) architecture. In addition, we construct a new Turkish emotional music database composed of 124 traditional Turkish music excerpts, each 30 s long, and evaluate the proposed approach on it. We use features obtained by feeding convolutional neural network (CNN) layers with log-mel filterbank energies and mel-frequency cepstral coefficients (MFCCs), in addition to standard acoustic features.
Classification results show that the best performance is obtained by combining the new feature set with the standard features and using the long short-term memory (LSTM) + deep neural network (DNN) classifier. The proposed system achieves an overall accuracy of 99.19% with 10-fold cross-validation. Specifically, an improvement of 6.45 percentage points is achieved. The results also show that the LSTM + DNN classifier improves music emotion recognition accuracy by 1.61, 1.61, and 3.23 points compared to the k-nearest neighbor (k-NN), support vector machine (SVM), and Random Forest classifiers, respectively. © 2020 Karabuk University. Publishing services by Elsevier B.V.

Item: Real-Time Speaker Independent Isolated Word Recognition on Banana Pi (IEEE, 2018)
Disken, Gokay; Saribulut, Lutfu; Tufekci, Zekeriya; Cevik, Ulus
Devices controlled by voice commands have gained popularity over the last decade. To recognize an utterance, they usually require an internet connection or rely on commercial programming libraries. As a result, their flexibility is low and opportunities for algorithm updates are limited. In this study, a speaker-independent isolated word recognition algorithm embedded in a single-board computer is proposed to recognize utterances in real time. The proposed system neither requires an internet connection nor uses external libraries. Mel-frequency cepstral coefficients and their deltas are used as feature vectors, and Gaussian mixture models are used to model the words. Turkish digits and some Turkish confirmation words were recorded ten times in a single session by twenty-four individuals. Seven of these recordings are used for training and the rest for testing. Offline experiments showed a recognition accuracy of 99.98%.
In real-time experiments, the system's recognition accuracy was satisfactory in controlled environments.

Item: Real-Time Speaker Independent Isolated Word Recognition on Banana Pi (Institute of Electrical and Electronics Engineers Inc., 2018)
Disken, Gokay; Saribulut, Lutfu; Tufekci, Zekeriya; Cevik, Ulus
Devices controlled by voice commands have gained popularity over the last decade. To recognize an utterance, they usually require an internet connection or rely on commercial programming libraries. As a result, their flexibility is low and opportunities for algorithm updates are limited. In this study, a speaker-independent isolated word recognition algorithm embedded in a single-board computer is proposed to recognize utterances in real time. The proposed system neither requires an internet connection nor uses external libraries. Mel-frequency cepstral coefficients and their deltas are used as feature vectors, and Gaussian mixture models are used to model the words. Turkish digits and some Turkish confirmation words were recorded ten times in a single session by twenty-four individuals. Seven of these recordings are used for training and the rest for testing. Offline experiments showed a recognition accuracy of 99.98%. In real-time experiments, the system's recognition accuracy was satisfactory in controlled environments. © 2018 IEEE.

Item: Scale-invariant MFCCs for speech/speaker recognition (Tubitak Scientific & Technological Research Council Turkey, 2019)
Tufekci, Zekeriya; Disken, Gokay
The feature extraction process is a fundamental part of speech processing. Mel-frequency cepstral coefficients (MFCCs) are the most commonly used features in the speech/speaker recognition literature. However, the MFCC framework may face numerical issues or dynamic range problems, which degrade performance.
A practical solution to these problems is to add a constant to the filter-bank magnitudes before log compression, which, however, violates the scale-invariance property. In this work, a magnitude normalization and a multiplication constant are introduced to make the MFCCs scale-invariant and to avoid dynamic range expansion of nonspeech frames. Speaker verification experiments are conducted to demonstrate the effectiveness of the proposed scheme.

Item: Speaker Model Clustering to Construct Background Models for Speaker Verification (Polska Akad Nauk, Polish Acad Sciences, Inst Fundamental Tech Res Pas, 2017)
Disken, Gokay; Tufekci, Zekeriya; Cevik, Ulus
Conventional speaker recognition systems use the Universal Background Model (UBM) as the impostor model for all speakers. In this paper, speaker models are clustered to obtain better impostor model representations for speaker verification. First, a UBM is trained, and speaker models are adapted from it. Then, the k-means algorithm with the Euclidean distance measure is applied to the speaker models, with the number of clusters set to two, three, four, and five. The resulting cluster centers are used as the background models of their respective speakers. Experiments showed that the proposed method consistently produced lower equal error rates (EERs) than the conventional UBM approach for 3-, 10-, and 30-second test utterances, as well as under channel mismatch conditions. The proposed method is also compared with the i-vector approach: the three-cluster model achieved the best performance, with an average relative EER reduction of 12.4% compared to the i-vector method. The statistical significance of the results is also reported.
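The polynomial regression-based VAD record above combines a per-band second-order fit, k-means energy centers, and a linear threshold placed between them. The following is a minimal pure-Python sketch of those ingredients under stated assumptions, not the authors' implementation: the window length, the slope threshold, the `alpha` interpolation factor, the band-vote ratio, the spectral floor and all function names are hypothetical. `quad_fit` fits y = a + b*t + c*t**2 over a symmetric window of frames, so a large slope `b` suggests an onset/offset and a small one a steady region; `two_means_1d` splits one band's log energies into a noise center and a speech center.

```python
import math


def quad_fit(window):
    """Least-squares fit of y = a + b*t + c*t**2 over a symmetric window t = -M..M.

    Returns (a, b, c): level, slope and curvature at the window centre.
    The window length must be odd and at least 3.
    """
    n = len(window)
    m = n // 2
    ts = range(-m, m + 1)
    s2 = sum(t * t for t in ts)
    s4 = sum(t ** 4 for t in ts)
    sy = sum(window)
    sty = sum(t * y for t, y in zip(ts, window))
    st2y = sum(t * t * y for t, y in zip(ts, window))
    b = sty / s2  # odd moments of t vanish on a symmetric window
    c = (n * st2y - s2 * sy) / (n * s4 - s2 * s2)
    a = (sy - c * s2) / n
    return a, b, c


def is_transition(window, slope_thr=1.0):
    """Hypothetical rule: a large local slope marks a speech onset/offset."""
    _, b, _ = quad_fit(window)
    return abs(b) > slope_thr


def two_means_1d(values, iters=20):
    """1-D k-means with k=2; returns (noise_center, speech_center)."""
    lo, hi = min(values), max(values)
    for _ in range(iters):
        mid = (lo + hi) / 2
        low = [v for v in values if v <= mid]
        high = [v for v in values if v > mid]
        if not low or not high:
            break
        lo, hi = sum(low) / len(low), sum(high) / len(high)
    return lo, hi


def frame_labels(band_log_energies, alpha=0.5, min_band_ratio=0.5):
    """band_log_energies[b][t] is the natural-log energy of mel band b at frame t.

    Each band gets its own threshold placed linearly between its noise and
    speech centers; a frame is labelled speech-dominated when at least
    min_band_ratio of the bands exceed their thresholds.
    """
    n_frames = len(band_log_energies[0])
    votes = [0] * n_frames
    for band in band_log_energies:
        noise_c, speech_c = two_means_1d(band)
        thr = noise_c + alpha * (speech_c - noise_c)
        for t, e in enumerate(band):
            votes[t] += int(e > thr)
    n_bands = len(band_log_energies)
    return [v / n_bands >= min_band_ratio for v in votes]


def subtract_noise(band, noise_log_center, floor=0.01):
    """Subtract the estimated noise energy in the linear domain, with a spectral floor."""
    noise = math.exp(noise_log_center)
    return [math.log(max(math.exp(e) - noise, floor * math.exp(e))) for e in band]
```

Using each band's own noise center both for the threshold and for subtraction mirrors the abstract's idea that one k-means pass serves both purposes, but how the paper actually maps the centers to a pseudo SNR is not reproduced here.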
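The accuracy figures in the music emotion recognition record can be cross-checked with simple arithmetic. Assuming (the abstract does not say so) that each percentage is computed over all 124 excerpts pooled across the 10 folds, every reported figure corresponds to a whole number of excerpts. The helper name `excerpts_for` is hypothetical and exists only for this check.

```python
TOTAL_EXCERPTS = 124  # size of the Turkish emotional music database


def excerpts_for(percent, total=TOTAL_EXCERPTS):
    """Return the whole number of excerpts that a reported percentage corresponds to."""
    count = round(percent / 100 * total)
    # The count must reproduce the reported figure to two decimal places.
    if abs(100 * count / total - percent) >= 0.005:
        raise ValueError(f"{percent}% is not a whole number of excerpts out of {total}")
    return count


for label, pct in [("overall accuracy", 99.19), ("improvement", 6.45),
                   ("gain vs k-NN", 1.61), ("gain vs SVM", 1.61),
                   ("gain vs Random Forest", 3.23)]:
    print(f"{label}: {pct}% = {excerpts_for(pct)}/{TOTAL_EXCERPTS} excerpts")
```

Under that assumption, 99.19% means 123 of the 124 excerpts were classified correctly, and the 1.61- and 3.23-point gains amount to 2 and 4 additional correct excerpts.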
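The Banana Pi records describe MFCCs plus their deltas as features and one Gaussian mixture model per word. A minimal pure-Python sketch of the recognition step is shown below, assuming diagonal-covariance GMMs that have already been trained and MFCC frames that have already been extracted. The regression delta formula with two frames on each side is a common textbook choice, not necessarily the one used in the papers, and the word labels and model parameters in the usage lines are made up.

```python
import math


def deltas(frames, n=2):
    """Regression deltas: d_t = sum_k k*(c[t+k] - c[t-k]) / (2*sum_k k^2).

    Edge frames are replicated so every frame gets a delta vector.
    """
    T = len(frames)
    denom = 2 * sum(k * k for k in range(1, n + 1))
    out = []
    for t in range(T):
        vec = []
        for j in range(len(frames[t])):
            acc = sum(k * (frames[min(t + k, T - 1)][j] - frames[max(t - k, 0)][j])
                      for k in range(1, n + 1))
            vec.append(acc / denom)
        out.append(vec)
    return out


def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM."""
    comps = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, m, v in zip(x, mu, var):
            ll -= 0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        comps.append(ll)
    top = max(comps)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(c - top) for c in comps))


def recognize(mfcc_frames, word_models):
    """Append deltas to each frame, then pick the word whose GMM scores highest."""
    feats = [c + d for c, d in zip(mfcc_frames, deltas(mfcc_frames))]
    scores = {word: sum(gmm_loglik(f, *params) for f in feats)
              for word, params in word_models.items()}
    return max(scores, key=scores.get)


# Toy usage: two single-component "word" models over [MFCC, delta] features.
models = {"bir": ([1.0], [[0.0, 0.0]], [[1.0, 1.0]]),
          "iki": ([1.0], [[5.0, 0.0]], [[1.0, 1.0]])}
print(recognize([[5.1], [4.9], [5.0]], models))  # iki
```

Summing per-frame log-likelihoods treats frames as independent given the word, which is what makes a plain GMM (rather than an HMM) usable for isolated words.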
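The scale-invariant MFCC record contrasts adding a constant before the log with a normalization plus a multiplication constant. The sketch below illustrates why the first breaks scale invariance and how normalizing first can restore it. It is a hypothetical construction, not the paper's formulation: it assumes normalization by the utterance-wide maximum magnitude and a constant `k` (here 1000) inside log(1 + k*x). A gain applied to the whole signal multiplies every filter-bank magnitude and the maximum by the same factor, so the normalized values and the log outputs are unchanged, while the `1 +` term keeps near-silent frames from producing large negative values. A plain log without any offset is scale-invariant only up to an additive constant (which the DCT moves into c0) and diverges as magnitudes approach zero.

```python
import math


def log_with_offset(frames, eps=1.0):
    """Common workaround: add a constant before the log (NOT scale-invariant)."""
    return [[math.log(m + eps) for m in frame] for frame in frames]


def scale_invariant_log(frames, k=1000.0):
    """Hypothetical variant: normalize by the utterance maximum, then log(1 + k*x).

    Outputs lie in [0, log(1 + k)], so near-zero (nonspeech) magnitudes
    cannot expand the dynamic range toward minus infinity.
    """
    peak = max(max(frame) for frame in frames)
    return [[math.log(1.0 + k * m / peak) for m in frame] for frame in frames]


frames = [[1.0, 4.0], [0.5, 2.0]]                          # toy filter-bank magnitudes
louder = [[10.0 * m for m in frame] for frame in frames]   # same signal, 10x gain
print(scale_invariant_log(frames) == scale_invariant_log(louder))  # True
print(log_with_offset(frames) == log_with_offset(louder))          # False
```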
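The speaker model clustering record applies k-means with the Euclidean distance to UBM-adapted speaker models and uses each cluster center as the background model of the speakers in that cluster. A minimal pure-Python sketch follows, assuming each speaker model is represented by its mean supervector (the adapted GMM means stacked into one vector); the deterministic farthest-point initialization, the function names and the toy two-dimensional vectors are illustrative choices, not details from the paper.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two supervectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def kmeans(points, k, iters=100):
    # Farthest-point initialization: each new center is the point farthest
    # from all centers chosen so far (deterministic, no random seed needed).
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    for _ in range(iters):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, assign(points, centers)


def cluster_background_models(supervectors, k):
    """Return, for every speaker, the center of the cluster it belongs to."""
    centers, labels = kmeans(supervectors, k)
    return [centers[lab] for lab in labels]


speakers = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]  # toy supervectors
print(cluster_background_models(speakers, k=2))
```

In a full system each returned center would presumably be unstacked back into GMM means (for example, reusing the UBM weights and variances) and used in place of the UBM when scoring that speaker's verification trials; that last step is also an assumption about how the centers are used.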