Text Similarity Calculation With Penalty Factor

Posted on: 2016-06-04
Degree: Master
Type: Thesis
Country: China
Candidate: Y Q Han
Full Text: PDF
GTID: 2308330479450315
Subject: Computer application technology
Abstract/Summary:
With the rapid development of information technology, the volume of available information is growing explosively. Because text remains the most important and most direct carrier of information, extracting valuable information from massive text collections quickly and efficiently has become an important problem in information processing, and it has driven further research into, and wide use of, information retrieval and information filtering. Text similarity calculation, and Chinese text similarity calculation in particular, underlies these applications and is therefore of considerable significance.

This paper studies text similarity calculation with a penalty factor. Building on an existing Chinese word segmentation algorithm that is improved by means of special identifiers, it presents a new method of text similarity calculation with a penalty factor. The method combines the efficiency of statistics-based methods with the accuracy of semantic-based methods, and constructs a distance matrix model that draws on the idea of the vector space model. Taking the semantic factor as the point of entry, it then introduces the concept of a penalty factor and processes synonyms during the similarity calculation stage, so that the effects of word order and synonyms are both taken into account. The result is a new algorithm for sentence-level text similarity calculation.

First, the existing Chinese word segmentation algorithm is improved, with the Shapley model used to optimize the segmentation results. This takes two steps. In the first step, the text is segmented at non-Chinese-character and Chinese-character special identifiers, and the characters that are not special identifiers are segmented into two-character words. In the second step, the Shapley model is applied to the segmentation results to improve their accuracy, because whole words may be wrongly split at Chinese-character special identifiers, and three-character and four-character words may be split during two-character segmentation.

Second, once the word segmentation preprocessing is complete, vectors are constructed for the original sentence and for the sentence to be compared, and the two are multiplied to build a distance matrix model. The penalty factor is calculated from this matrix and substituted into the similarity formula to obtain the sentence-level similarity. When the similarity of all sentences has been calculated, the text-level similarity is obtained by integrating all the sentence-level similarities. In addition, to account for synonyms, synonyms appearing in the text are looked up in the Synonyms Dictionary and treated as the same word, which gives a better similarity result.

Finally, the method is tested on calculation cases and its results are compared with those of other text similarity calculation methods. The comparison shows that text similarity calculation with a penalty factor improves accuracy to some extent.
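
The abstract does not give the similarity formula, so the following Python sketch only illustrates the general shape of the pipeline it describes: synonyms are mapped to one word, a matrix compares every word of one sentence with every word of the other, a penalty factor lowers the score when matched words appear in a different order, and sentence scores are combined into a text score. Everything specific here is an assumption made for illustration and is not taken from the thesis: the `SYNONYMS` table, the 0/1 match matrix (a stand-in for the distance matrix), the Dice-style overlap, the inversion-based penalty with weight 0.5, and averaging as the integration step. Sentences are assumed to be already segmented into word lists.

```python
from itertools import combinations

# Hypothetical synonym table (assumption): maps a word to its canonical form.
SYNONYMS = {"quick": "fast", "rapid": "fast"}

def normalize(tokens, synonyms=SYNONYMS):
    """Replace each word by its canonical form so synonyms count as the same word."""
    return [synonyms.get(t, t) for t in tokens]

def match_matrix(a, b):
    """0/1 matrix whose entry [i][j] is 1 when a[i] and b[j] are the same word."""
    return [[1 if x == y else 0 for y in b] for x in a]

def matched_positions(matrix):
    """Greedily pair each row with the first unused matching column."""
    used, pairs = set(), []
    for i, row in enumerate(matrix):
        for j, hit in enumerate(row):
            if hit and j not in used:
                used.add(j)
                pairs.append((i, j))
                break
    return pairs

def penalty_factor(pairs):
    """Assumed penalty: 1 minus half the fraction of matched pairs that are out of order."""
    if len(pairs) < 2:
        return 1.0
    total = len(pairs) * (len(pairs) - 1) // 2
    # pairs are sorted by row index, so a smaller column later on is an inversion.
    inversions = sum(1 for (_, j1), (_, j2) in combinations(pairs, 2) if j1 > j2)
    return 1.0 - 0.5 * inversions / total

def sentence_similarity(a, b):
    """Dice-style word overlap scaled by the word-order penalty."""
    a, b = normalize(a), normalize(b)
    if not a or not b:
        return 0.0
    pairs = matched_positions(match_matrix(a, b))
    overlap = 2 * len(pairs) / (len(a) + len(b))
    return overlap * penalty_factor(pairs)

def text_similarity(text_a, text_b):
    """Average, over the sentences of text_a, of the best-matching sentence in text_b."""
    if not text_a or not text_b:
        return 0.0
    best = [max(sentence_similarity(s, t) for t in text_b) for s in text_a]
    return sum(best) / len(best)
```

Under these assumptions, two sentences with the same words in the same order score 1.0, while the same words in fully reversed order score 0.5, which is the kind of word-order sensitivity that a plain bag-of-words comparison lacks.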
Keywords/Search Tags: Chinese text similarity, Penalty factor, Chinese segmentation, Shapley model