
Research And Implementation Of Chinese Word Segmentation System For Enterprise Information Retrieval

Posted on: 2009-07-19
Degree: Master
Type: Thesis
Country: China
Candidate: J N Chu
GTID: 2178360308477816
Subject: Computer application technology
Abstract/Summary:
With the rapid growth of enterprise information, enterprise information retrieval has become an active research topic within information retrieval. As an important step in the text-processing stage, Chinese word segmentation (CWS) directly affects the accuracy of search results. Much research has focused on CWS, but most of it addresses general-purpose algorithms, and little is tailored to enterprise information retrieval. Studying CWS for enterprise information retrieval is therefore of both theoretical and practical significance.

In this thesis, we study the key techniques and difficulties of CWS and analyze its impact on large-scale information retrieval. On this basis, and taking into account the characteristics of CWS in enterprise information retrieval, we design the EIRCWS system. Because enterprise information retrieval places high demands on segmentation speed, we design a new multi-word hash-indexed dictionary structure that improves both the dictionary lookup algorithm and the efficiency of segmentation. Ambiguity resolution and unknown word identification are the two main difficulties in CWS. Given the characteristics of enterprise information retrieval, we resolve only overlapping ambiguity in the disambiguation phase: the results of bidirectional matching are compared to detect ambiguities, and self-defined rules are used to resolve them. For unknown word identification, we propose a new method that combines quantifier identification rules, clipping of segmentation fragments with the help of auxiliary function words, and local word-frequency statistics together with the probability of single-character words. This allows the algorithm to efficiently identify different types of unknown words across many domains without a large corpus.

Our experiments show that the EIRCWS system segments text quickly and accurately and is also strong at identifying unknown words. It meets the needs of automatic Chinese word segmentation for enterprise information retrieval.
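The detection step described above (comparing forward and backward matching results over a hash-indexed dictionary) can be illustrated with a minimal Python sketch. The abstract does not specify the internal layout of the multi-word hash index or the thesis's disambiguation rules, so everything here is an assumption made for illustration: the two-level index keyed by first character and then by word length, the function names `build_index`, `in_dict`, `forward_max_match`, and `backward_max_match`, and the toy dictionary. The sketch only detects a disagreement between the two passes; it does not attempt to resolve it.

```python
def build_index(words):
    """Hypothetical two-level hash index: first character -> word length -> set of words."""
    index = {}
    for w in words:
        index.setdefault(w[0], {}).setdefault(len(w), set()).add(w)
    return index


def in_dict(index, s):
    """Look up s by hashing on its first character and then on its length."""
    return s in index.get(s[0], {}).get(len(s), ())


def forward_max_match(text, index, max_len):
    """Scan left to right, taking the longest dictionary word at each position."""
    out, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + n]
            # A single character is always accepted as a fallback.
            if n == 1 or in_dict(index, cand):
                out.append(cand)
                i += n
                break
    return out


def backward_max_match(text, index, max_len):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    out, j = [], len(text)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            cand = text[j - n:j]
            if n == 1 or in_dict(index, cand):
                out.append(cand)
                j -= n
                break
    out.reverse()
    return out


# Toy dictionary and sentence, chosen only to trigger an overlapping ambiguity.
words = {"研究", "研究生", "生命", "的", "起源"}
index = build_index(words)
text = "研究生命的起源"

fmm = forward_max_match(text, index, max_len=3)   # ['研究生', '命', '的', '起源']
bmm = backward_max_match(text, index, max_len=3)  # ['研究', '生命', '的', '起源']
ambiguous = fmm != bmm                            # True: the two passes disagree
```

In this sketch, a disagreement between the two passes marks the sentence as containing a candidate overlapping ambiguity (here, 研究生 versus 生命 competing for the character 生), which is the point at which resolution rules such as those mentioned in the abstract would be applied.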
Keywords/Search Tags:Enterprise information retrieval, Chinese word segmentation, Ambiguity resolution, Unknown words identification