Study On Chinese Text Similarity Computing Based On Word Segmentation

Posted on: 2007-10-26  Degree: Master  Type: Thesis
Country: China  Candidate: B Shen  Full Text: PDF
GTID: 2178360182485964  Subject: Business management
Abstract/Summary:
In Chinese information processing, text similarity computing is widely used in information retrieval, machine translation, automatic question answering, text mining, and related areas. It is an essential problem that has long been both a research hotspot and a difficulty. Word segmentation is harder for computers to perform on Chinese than on Western alphabetic text. Because word segmentation is the foundation and precondition of Chinese text similarity computing, the accuracy of the similarity result can be greatly improved by adopting a more efficient segmentation algorithm.

This thesis analyzes and compares common Chinese word segmentation algorithms and, on that basis, puts forward an improved Maximum Matching Method together with a strategy for eliminating ambiguity. The new method improves the completeness and accuracy of word segmentation by improving the construction of the segmentation dictionary, the segmentation steps, and the handling of ambiguity. Then, after analyzing and comparing existing text similarity computing methods, the thesis presents a computer implementation of Chinese text word segmentation and similarity computing that uses the TF-IDF method based on the vector space model (VSM), combined with the segmentation algorithm described above. Technological texts are used as test examples to validate the method.

The research and its results offer a useful reference and good application prospects for many areas of Chinese information processing, especially technological text similarity computing.
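
The pipeline the abstract describes (dictionary-based segmentation, then TF-IDF weighting in a vector space model, then a similarity score) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the segmenter is the textbook forward Maximum Matching Method without the improved dictionary construction or ambiguity handling that the thesis proposes, and the toy dictionary, the `max_len` parameter, the smoothed IDF formula, cosine similarity as the measure, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def forward_max_match(text, dictionary, max_len=4):
    """Textbook forward maximum matching (not the thesis's improved variant).

    At each position, take the longest dictionary word of at most max_len
    characters; fall back to a single character when nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumption: a smoothed IDF, log((1 + n) / (1 + df)) + 1, so that terms
    shared by every document in a tiny corpus still get a nonzero weight.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy dictionary and texts, for illustration only.
toy_dictionary = {"中文", "文本", "相似", "相似度", "计算", "分词"}
tokens_a = forward_max_match("中文文本相似度计算", toy_dictionary)
# tokens_a == ['中文', '文本', '相似度', '计算']
tokens_b = forward_max_match("中文分词", toy_dictionary)
vec_a, vec_b = tfidf_vectors([tokens_a, tokens_b])
similarity = cosine(vec_a, vec_b)
```

Greedy longest-first matching is where segmentation ambiguity arises: the segmenter prefers "相似度" over "相似" simply because it is longer, which is the kind of decision a disambiguation strategy such as the one the abstract mentions would revisit.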
Keywords/Search Tags: Text similarity, Word segmentation, Maximum Matching Method, Text feature vector, TF-IDF