Text Similarity Calculation With Penalty Factor

Posted on: 2016-06-04

Degree: Master

Type: Thesis

Country: China

Candidate: Y Q Han

Full Text: PDF

GTID: 2308330479450315

Subject: Computer application technology

Abstract/Summary:

With the rapid development of information technology, the volume of information is growing explosively. Text remains the most important and most direct carrier of information, so extracting valuable information from massive text collections quickly and efficiently has become an important problem in information processing, and it has driven further study and wide use of information retrieval and information filtering. Text similarity calculation, and Chinese text similarity calculation in particular, underlies these applications and is therefore of considerable significance.

This thesis studies text similarity calculation with a penalty factor. Building on an existing Chinese word segmentation algorithm improved by means of special identifiers, it presents a new method of text similarity calculation with a penalty factor. The method combines the efficiency of statistics-based methods with the accuracy of semantics-based methods, and constructs a distance matrix model that draws on the idea of the vector space model. With semantic factors as the starting point, the concept of a penalty factor is introduced and synonyms are handled at the similarity calculation stage, so that the effects of both word order and synonyms are taken into account. The result is a new algorithm for sentence-level text similarity calculation.

First, the existing Chinese word segmentation algorithm is improved, with a Shapley model used to optimize the segmentation results. This proceeds in two steps. In the first step, the text is split at non-Chinese-character and Chinese-character special identifiers, and the remaining non-identifier characters are segmented into two-character words. In the second step, the Shapley model is applied to optimize the segmentation results and improve accuracy, because some whole words may be wrongly split at Chinese-character special identifiers and some three-character and four-character words may be broken up by two-character segmentation.

Second, after segmentation preprocessing, vectors for the original sentence and the sentence to be compared are constructed and multiplied to form the distance matrix model. The penalty factor is calculated from this matrix and substituted into the similarity formula to obtain the sentence-level similarity. Once the similarity of all sentences has been computed, the text-level similarity is obtained by integrating the sentence-level similarities. In addition, to account for synonyms, a synonym dictionary is consulted and synonyms appearing in the text are treated as the same word, which yields a better similarity result.

Finally, the method is tested on worked calculation cases and its results are compared with those of other text similarity calculation methods. The comparison shows that the method with a penalty factor achieves some improvement in accuracy.
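The abstract does not give the penalty factor formula or the exact form of the distance matrix, so the pipeline it describes can only be illustrated loosely. The Python sketch below is a stand-in under stated assumptions, not the thesis's method: sentences are assumed to be already segmented into tokens, synonyms are merged through a hypothetical `SYNONYMS` table, the base score is plain bag-of-words cosine similarity, the penalty is an illustrative word-order factor based on counting inversions among shared words, and text-level similarity is taken as the mean best match per sentence. All function names and the `0.5` weight are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical synonym table (illustration only): word -> canonical form.
SYNONYMS = {"quick": "fast", "rapid": "fast"}


def normalize(tokens, synonyms=SYNONYMS):
    """Treat synonyms as the same word by mapping them to one canonical form."""
    return [synonyms.get(t, t) for t in tokens]


def cosine(a, b):
    """Bag-of-words cosine similarity between two token lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def order_penalty(a, b):
    """Illustrative word-order penalty in [0.5, 1.0] (an assumption, not the thesis formula).

    Shared words are listed in the order they first appear in `a`; each pair
    whose first positions in `b` are reversed counts as one inversion.
    """
    shared = [w for w in dict.fromkeys(a) if w in b]
    n = len(shared)
    if n < 2:
        return 1.0
    pos_b = [b.index(w) for w in shared]
    inversions = sum(
        1 for i in range(n) for j in range(i + 1, n) if pos_b[i] > pos_b[j]
    )
    return 1.0 - 0.5 * inversions / (n * (n - 1) / 2)


def sentence_similarity(a, b):
    """Cosine similarity after synonym merging, scaled by the order penalty."""
    a, b = normalize(a), normalize(b)
    return cosine(a, b) * order_penalty(a, b)


def text_similarity(sents_a, sents_b):
    """Assumed aggregation: mean over sentences in A of the best match in B."""
    if not sents_a or not sents_b:
        return 0.0
    best = [max(sentence_similarity(s, t) for t in sents_b) for s in sents_a]
    return sum(best) / len(best)
```

With this sketch, `["the", "dog", "bit", "the", "man"]` and `["the", "man", "bit", "the", "dog"]` have identical word counts, so plain cosine similarity rates them as the same sentence, while the order penalty lowers the score to about 0.75. This is the kind of word-order effect that a purely statistical vector space comparison cannot capture.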

Keywords/Search Tags:

Related items

1	Study On Chinese Text Classification Technology Based On Improved Text Similarity Algorithm
2	Research And Implementation Of Subjective Question Scoring System Based On Chinese Word Segmentation And Text Similarity
3	Advanced techniques for Chinese chunk segmentation and the similarity measure of Chinese sentences
4	Research On Text Similarity Algorithm Based On VSM Combined With Word Semantics
5	Study On Chinese Text Similarity Computing Based On Word Segmentation
6	Research On The Calculation Method Of Chinese-Lao Bilingual Text Similarity
7	Chinese-Lao Bilingual Text And Sentence Similarity Calculation Research
8	Chinese Text Dimensionality Reduction Based On Factor Analysis
9	Chinese Word Auto-segmentation Design And Algorithm Realization For Chinese Network Information Retrieval
10	OCR Error Post-correction Based On Chinese Character-level Features And Language Model