
Research And Application On Text Similarity Algorithm Based On Semantics

Posted on: 2015-12-13, Degree: Master, Type: Thesis
Country: China, Candidate: J P Zhang, Full Text: PDF
GTID: 2298330431977044, Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet and advances in technology, ever more information floods the Internet, and the total amount of data online is growing exponentially. Text is undoubtedly the major carrier among the many information carriers on the Internet, which has made text mining a new hot field in computer science. Fundamental research on text segmentation, part-of-speech tagging, text representation and so on ultimately serves similarity calculation; on the other hand, similarity calculation is a basic procedure for deeper research in many text applications. Because it links fundamental research and text applications, it is widely applied in natural language processing, text categorization, text clustering, question answering, information retrieval, search engines and many other text mining areas.

Text similarity algorithms aim to measure the degree of similarity between two texts using a certain strategy. There are currently two research directions. One is the vector space model: each text is represented as a text vector, and the cosine of the angle between the two text vectors is used as the similarity between the texts. The other is the semantic library: the optimal term matching pairs are extracted successively, and the sum of all optimal term matching pairs gives the final similarity between the two texts.
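
The vector space model direction described above can be illustrated with a minimal sketch. It assumes raw term-frequency vectors and whitespace tokenization, neither of which is specified in the abstract; the function name `cosine_similarity` is hypothetical, and a real system (especially for Chinese text) would need proper segmentation and term weighting.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the term-frequency vectors of two texts."""
    # Assumption: lowercase + whitespace split stands in for real segmentation.
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())

    # Dot product over the terms the two texts share.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())

    # Euclidean norms of the two term-frequency vectors.
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))

    # An empty text has no direction, so define its similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because the vectors are built from surface terms only, two texts that use different words for the same concept (for example "car" and "automobile") share no dimensions and score 0, which is the semantic blind spot the abstract attributes to this approach.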
Both approaches have merit in calculating the similarity between two texts. However, they lack a deep analysis of text characteristics such as length and language. The cosine measure is relatively suitable for large-scale text, but it completely ignores the semantic relationships among terms; the method based on a semantic library cannot give the correct similarity between texts because the semantic library contains only a limited number of terms.

Different text applications differ in features such as text length and language characteristics. In this paper, we therefore study text similarity calculation from the perspective of different text lengths and different language characteristics. Because large-scale text contains a rich number of terms with different meanings, the text is divided into several semantic units, and the semantic relationships among the terms in each semantic unit are obtained through different strategies (such as term occurrence frequency vote probability, term part-of-speech weight and so on). Small-scale text in different languages is divided into a part-of-speech vector according to part of speech; the semantic weight of terms is likewise defined using different strategies.

We primarily study algorithms for three cases: large-scale text against large-scale text, small-scale text against small-scale text, and small-scale text against large-scale text. The algorithm for large-scale text against large-scale text is applied to text categorization, the algorithm for small-scale text against small-scale text is applied to the FAQ area, and the algorithm for small-scale text against large-scale text is applied to search engines. The experimental results show that the improved semantic text similarity algorithm achieves good precision in both text categorization and sentence similarity calculation compared with traditional algorithms.
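
The semantic-library direction mentioned in the abstract (extracting optimal term matching pairs successively and summing them) might look roughly like the sketch below. Everything specific here is an assumption for illustration: `TOY_LIBRARY` is a hypothetical stand-in for a real semantic library, the scores in it are made up, and dividing the sum by the longer term list is one possible normalization, not the thesis's method.

```python
from typing import Dict, FrozenSet, List

# Hypothetical semantic library: pairwise term similarity in [0, 1].
TOY_LIBRARY: Dict[FrozenSet[str], float] = {
    frozenset({"car", "automobile"}): 0.9,
    frozenset({"buy", "purchase"}): 0.8,
    frozenset({"car", "purchase"}): 0.1,
}


def term_similarity(a: str, b: str) -> float:
    """Look up term similarity; terms missing from the library score 0."""
    if a == b:
        return 1.0
    return TOY_LIBRARY.get(frozenset({a, b}), 0.0)


def greedy_match_similarity(terms_a: List[str], terms_b: List[str]) -> float:
    """Repeatedly take the best remaining term pair and sum the pair scores."""
    if not terms_a or not terms_b:
        return 0.0

    # Score every cross pair, best first.
    candidates = sorted(
        (
            (term_similarity(a, b), i, j)
            for i, a in enumerate(terms_a)
            for j, b in enumerate(terms_b)
        ),
        key=lambda c: -c[0],
    )

    used_a, used_b = set(), set()
    total = 0.0
    for score, i, j in candidates:
        # Each term may take part in at most one matching pair.
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        total += score

    # Assumed normalization: unmatched terms in the longer text lower the score.
    return total / max(len(terms_a), len(terms_b))
```

With this toy library, `["buy", "car"]` and `["purchase", "automobile"]` score well above zero even though they share no surface terms. A term absent from the library contributes nothing, which mirrors the limited-vocabulary weakness the abstract points out.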
Keywords/Search Tags: text similarity algorithm, semantic unit division, POS space definition, term semantic weight definition