
Machine Learning Based Model For Detecting Similarity Of Scientific Papers

Posted on: 2019-09-15
Degree: Master
Type: Thesis
Country: China
Candidate: BALEN HAMAKAREEM AZIZ
Full Text: PDF
GTID: 2428330545950698
Subject: Computer Science and Technology
Abstract/Summary:
The number of papers published each year has increased at an unprecedented pace, and research today is often largely multidisciplinary, with applications across different fields of study. For a researcher, searching for and keeping up with papers relevant to his or her expertise can therefore become a tedious task. Several solutions have been proposed to address this problem, such as recommender systems and search engines like Google Scholar, which let researchers obtain published papers with as little effort as possible. At the heart of these systems lies the ability to measure the extent to which papers are similar. Determining the similarity between research papers is of great interest in a variety of applications, such as document clustering, and it is considered by many to be one of the hard problems in artificial intelligence. This master thesis explores the possibility of adapting machine learning based language modeling techniques to this end.

The architectures of several language models are reviewed in order to provide insight into the most historically important methods of text processing and to understand their applications and limitations. Methods of text preprocessing are also presented, since the input text data has to be cleaned and prepared before further analysis and processing. Methods of clustering over the language models created for this work, including clustering based on deep neural networks, are discussed as well.

The focus of this thesis is to capture the semantic similarity between scientific research papers from different fields and to cluster them based on the vector representations of words learnt by language models trained on a text corpus downloaded from the arXiv website. The corpus includes the title and abstract of each paper together with labels, i.e. the categories and subcategories of the abstract. A paper's abstract can have multiple categories and subcategories, but only the first category and/or subcategory is used for evaluation. The methodologies chosen to train the language models depend on word distributions or on the distribution of word sequences. Similarity between papers is determined using word vectors obtained from the title only, the abstract only, or both together. Over such a set of vectors, machine learning methods are then chosen to perform clustering.

Three algorithms were selected to create the language models that produce word vectors from the titles and abstracts of papers: four versions of the word2vec algorithm, two versions of its extension, the doc2vec algorithm, and the GloVe algorithm. The performance and quality of the results of these models are evaluated and compared, and it is examined how well they recognize relationships and semantic similarities between words. Title and abstract vectors are then used to cluster papers into categories and subcategories using K-means and spectral clustering. In order to evaluate the capacity of the algorithms to understand similarities between words, a small set of key words was chosen, from which formulas were discarded. The experiments show that it is better to create language models with the GloVe algorithm for smaller data volumes, and better to use word2vec in the case of large datasets. Also, certain features and capabilities of the word2vec algorithm cannot easily be exerted over such formal text. Another goal of the experiments is to understand how well today's best algorithms can cluster scientific articles, written by researchers in their respective fields, according to their semantic similarity, and how the clusters obtained in this way correspond to the categorization of the arxiv.org website. Although this thesis deals primarily with the texts of scientific papers, the methodologies shown can also be used to process general texts from Wikipedia, newspaper articles, and other types of text such as program code or short texts from social networks like Twitter.
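The similarity pipeline the abstract describes — clean and tokenize a title or abstract, look up a vector for each word, average the vectors into one document vector, and compare documents by cosine similarity — can be sketched as follows. This is a minimal illustration in plain Python; the tiny hand-made embedding table is purely hypothetical and stands in for vectors that word2vec, doc2vec, or GloVe would actually learn from the corpus.

```python
import math
import re

# Toy 3-dimensional "embeddings" standing in for learned word vectors
# (illustrative values only, not the output of a real word2vec/GloVe model).
EMBEDDINGS = {
    "neural":   [0.9, 0.1, 0.0],
    "network":  [0.8, 0.2, 0.1],
    "learning": [0.7, 0.3, 0.0],
    "galaxy":   [0.0, 0.1, 0.9],
    "star":     [0.1, 0.0, 0.8],
}

def preprocess(text):
    """Basic cleaning step: lowercase and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def doc_vector(text):
    """Average the vectors of known words to get one vector per document."""
    vecs = [EMBEDDINGS[w] for w in preprocess(text) if w in EMBEDDINGS]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

paper_a = "Learning with neural network models"
paper_b = "Deep neural network learning"
paper_c = "Star formation in a distant galaxy"

sim_ab = cosine(doc_vector(paper_a), doc_vector(paper_b))
sim_ac = cosine(doc_vector(paper_a), doc_vector(paper_c))
print(sim_ab > sim_ac)  # the two related papers score higher
```

Averaging word vectors is the simplest way to turn per-word embeddings into a document vector; doc2vec learns a dedicated document vector directly instead.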
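K-means, one of the two clustering methods applied to the title and abstract vectors, can likewise be sketched in a few lines. This simplified version uses fixed initial centroids for determinism; a practical run over real document vectors would use random initialization with restarts.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, centroids, iterations=10):
    """Plain K-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters, centroids

# Toy 2-D "document vectors" forming two obvious groups.
points = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0],
          [0.9, 1.0], [1.0, 0.9], [0.8, 1.1]]
clusters, centroids = kmeans(points, centroids=[[0.0, 0.0], [1.0, 1.0]])
print(len(clusters[0]), len(clusters[1]))  # → 3 3
```

In the thesis setting, each point would be a title or abstract vector and the number of centroids would match the number of arXiv categories being evaluated.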
Keywords/Search Tags: Semantic similarity, word2vec, GloVe, K-means, text preprocessing