Research Of Chinese Word Segmentation Based On Mechanical Matching And Character Tagging

Posted on: 2010-02-04
Degree: Master
Type: Thesis
Country: China
Candidate: J Wang
GTID: 2178360275981836
Subject: Computer Science and Technology
Abstract/Summary:
With the development of computing technology and the Internet, the volume of information has exploded, and it is unrealistic to find or analyze information manually. Information processing, especially Natural Language Processing (NLP), has therefore become more important. The goal of NLP is to make human language understandable to computers. The first step is to recognize words, that is, word segmentation. Chinese word segmentation (CWS) segments strings of Chinese characters into strings of words, and it is the basis of Chinese information processing.

There are three types of CWS methods: character matching (CM), statistical methods, and linguistic methods. CM is an important basic method that is used in many fields. Character tagging (CT), a statistical method proposed in recent years, has performed outstandingly in international CWS evaluations.

Against this background, the thesis studies CM and CT in depth. It proposes a model for search engines (SE) named RMT (Reverse Matching and Matching and Tagging), which uses both CM and CT. When indexing, RMT applies several CM methods and keeps the different segmentation results to create the index. Because the keywords users enter are short, RMT applies both CM and CT when searching; this not only ensures that the search engine returns results quickly, but also finds new words and enlarges the dictionary. RMT achieves fast indexing and searching by using an advanced dictionary structure. The thesis develops a search engine based on Lucene whose word segmentation module is improved according to RMT; tests show that RMT suits search engines.

CT requires machine training on training data. After studying training models, the thesis attempts to optimize the training model based on CRF++. The optimized model can forcibly assign the tag of a character and can also export a binary model as a text model. Experimental results show that the optimized model speeds up segmentation.
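The two families of methods named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's RMT implementation: the function names, the toy dictionary, the `max_len` parameter, and the B/M/E/S tag set are all assumptions made for illustration (the abstract does not state which tag set or dictionary structure was used). The first function shows reverse maximum matching, one common mechanical matching method; the second shows how per-character tags, such as those a trained CRF model might output, are turned back into words.

```python
def reverse_max_match(text, dictionary, max_len=4):
    """Segment text right to left, taking the longest dictionary word each time.

    `dictionary` is any container of known words (hypothetical toy data here);
    `max_len` is an assumed upper bound on word length in characters.
    """
    words = []
    end = len(text)
    while end > 0:
        start = max(0, end - max_len)
        # Shrink the candidate from the left until it is a dictionary word
        # or only a single character remains (unknown characters stand alone).
        while start < end - 1 and text[start:end] not in dictionary:
            start += 1
        words.append(text[start:end])
        end = start
    words.reverse()
    return words


def tags_to_words(chars, tags):
    """Rebuild words from per-character tags.

    Assumes the common B/M/E/S scheme: B = begin, M = middle, E = end,
    S = single-character word. A word closes on every E or S tag.
    """
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # keep a trailing fragment if the tag sequence ends mid-word
        words.append(current)
    return words


if __name__ == "__main__":
    toy_dictionary = {"研究", "研究生", "生命", "的", "起源"}
    print(reverse_max_match("研究生命的起源", toy_dictionary, max_len=3))
    print(tags_to_words("研究生命", ["B", "E", "B", "E"]))
```

On the toy sentence, scanning from the right avoids the greedy left-to-right error of taking "研究生" first, which is one reason reverse matching is often preferred among mechanical matching methods.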
Keywords/Search Tags:Chinese Word Segmentation, Character Matching, Character Tagging, Conditional Random Fields, Search Engine