Similarity detection between Turkish text documents with distance metrics

Date

2017

Publisher

Institute of Electrical and Electronics Engineers Inc.

Access Rights

info:eu-repo/semantics/closedAccess

Abstract

The aim of this study is to compare the performance of various distance metrics and to determine which are most appropriate for detecting similarities among text documents written in Turkish. Computing the similarity between text documents is a basic step in plagiarism detection and in text mining tasks such as author detection, text classification, and clustering. Plagiarism detection and text mining applications can therefore be expected to perform better when they use the distance metrics identified by the results of this study. For this purpose, text chunks of different lengths were selected as the experimental dataset. Preprocessing methods were then applied to the dataset, creating additional experimental scenarios by removing stopwords, removing Turkish characters, and stemming words with Zemberek. The experimental results show that the preprocessing phase increases the accuracy of similarity detection; stemming with Zemberek in particular increases the success rate. In all cases, Cosine Similarity was observed to be more successful than the other distance metrics because it produced more realistic results. © 2017 IEEE.
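As a rough illustration of the pipeline the abstract describes, the following Python sketch applies two of the preprocessing steps (stopword removal and Turkish character handling) and then computes Cosine Similarity over term-frequency vectors. It is not the authors' implementation. The stopword list is a tiny illustrative subset, the function names are hypothetical, "removing Turkish characters" is assumed here to mean folding them to their ASCII lookalikes, and stemming with Zemberek (a separate Java library) is omitted.

```python
import math
import re
from collections import Counter

# Illustrative subset only; NOT the stopword list used in the study.
STOPWORDS = {"ve", "bir", "bu", "ile", "de", "da"}

# Assumption: Turkish-specific letters are folded to ASCII lookalikes.
TURKISH_FOLD = str.maketrans("çğıöşü", "cgiosu")


def turkish_lower(text):
    """Lowercase with Turkish casing rules for dotted and dotless I."""
    return text.replace("İ", "i").replace("I", "ı").lower()


def preprocess(text, remove_stopwords=True, fold_turkish=True):
    """Tokenize and optionally apply the two preprocessing steps."""
    tokens = re.findall(r"\w+", turkish_lower(text))
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if fold_turkish:
        tokens = [t.translate(TURKISH_FOLD) for t in tokens]
    # Stemming (done with Zemberek in the study) is not performed here.
    return tokens


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-frequency vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc1 = preprocess("Bu bir deneme metni ve örnek cümle")
    doc2 = preprocess("Bu başka bir deneme metni")
    print(cosine_similarity(doc1, doc2))
```

Toggling `remove_stopwords` and `fold_turkish` gives a simple way to reproduce the idea of separate experimental scenarios, although the actual scenarios, dataset, and metrics compared in the study are not specified in this record.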

Description

2nd International Conference on Computer Science and Engineering, UBMK 2017 -- 5 October 2017 through 8 October 2017 -- Antalya -- 132116

Keywords

Cosine Similarity, Distance metrics, Document similarity, Turkish texts, Zemberek

Source

2nd International Conference on Computer Science and Engineering, UBMK 2017
