Comparative Analysis of Semantic Similarity Word Embedding Techniques for Paraphrase Detection

EasyChair Preprint 5833 • 14 pages • Date: June 16, 2021

Abstract

A paraphrase is a restatement or rephrasing of a text or passage that conveys the same meaning. Paraphrasing has applications in many fields, including text summarization, plagiarism detection, question answering, machine translation, text grouping, and sentiment analysis. Most current state-of-the-art plagiarism detection tools focus on verbatim reproduction of a document and do not account for its semantic properties, so paraphrase plagiarism goes undetected in many cases. This paper gives an overview and comparison of the performance of five word embedding models for semantic similarity (TF-IDF, Word2Vec, Doc2Vec, FastText, and BERT) on two publicly available corpora: Quora Question Pairs (QQP) and Plagiarized Short Answers (PSA). After an extensive literature review and experiments, the most appropriate text preprocessing approaches, distance measures, and thresholds were selected for detecting semantic similarity/paraphrasing. The paper concludes that FastText is the most efficient of the five models, both in terms of evaluation metrics (accuracy, precision, recall, and F1-score) and in terms of resource consumption. It also compares all the models with one another on these metrics.

Keyphrases: Paraphrase Identification, Plagiarism, Semantic Similarity Detection, deep learning model, paraphrase detection, paraphrasing, plagiarism detection, word embedding
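
The general pipeline the abstract describes (embed two texts, measure their distance, apply a threshold, then score the predictions) can be illustrated with a minimal sketch. It uses TF-IDF, the simplest of the five models named above. The abstract does not say which distance measure or threshold the paper settled on, so the cosine similarity and the 0.5 threshold here are assumptions for illustration only, and every function name is hypothetical rather than taken from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def tfidf_vectors(docs):
    # Build sparse TF-IDF vectors (dicts) with a smoothed IDF term.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]


def cosine_similarity(u, v):
    # Cosine of the angle between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_paraphrase(a, b, threshold=0.5):
    # ASSUMPTION: 0.5 is an illustrative threshold, not the paper's value.
    u, v = tfidf_vectors([a, b])
    return cosine_similarity(u, v) >= threshold


def precision_recall_f1(y_true, y_pred):
    # Binary precision, recall, and F1 for the positive (paraphrase) class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```

Swapping in Word2Vec, Doc2Vec, FastText, or BERT would change only how the vectors are produced; the similarity, threshold, and scoring steps would keep the same shape. Because TF-IDF matches only on shared vocabulary, a sketch like this would miss paraphrases that use different words, which is the gap the semantic embedding models are meant to close.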