Research Of Chinese Word Segmentation Oriented To Special Domain

Posted on: 2013-06-11
Degree: Master
Type: Thesis
Country: China
Candidate: T Tang
Full Text: PDF
GTID: 2248330371958511
Subject: Computer application technology
Abstract/Summary:
Chinese word segmentation is a fundamental research topic in natural language processing, and it is also one of its bottlenecks. The precision of word segmentation directly affects downstream tasks in Chinese information processing, such as information retrieval, machine translation, and text classification. At present, domestic word segmentation research focuses mainly on general domains, and the mature, practical word segmentation systems are likewise designed for general domains.

A specific domain carries its own domain knowledge, concepts, and terms. There are currently few studies of word segmentation in specific domains, and because of the nature of such domains, existing segmentation tools cannot achieve good results on them. Based on an analysis of the characteristics of aviation documents and patent documents, this paper proposes a word segmentation method for specific domains that is based on term extraction, and the method achieves good segmentation results. The details are as follows.

Firstly, in view of the characteristics of aviation documents and patent documents, this paper formulates appropriate word segmentation norms and draws on the characteristics of an aviation encyclopedic dictionary and of patent summaries to develop terminology standards. Under the guidance of these norms and standards, we manually segment words and tag terms.

Secondly, we extract terms based on the existing tagged corpus. As an important issue in domain knowledge acquisition, term extraction plays an important role in term relationship extraction and domain ontology construction. Building on the 5-best results produced by a conditional random field (CRF) model, we propose a term extraction method that combines a statistics-based method with a rule-based method. The method significantly improves the extraction of unknown terms.

Finally, focusing on the characteristics of domain documents, this paper proposes a part-segmentation method and a full-segmentation method for domain word segmentation. When segmenting a domain corpus, the word segmentation system can obtain the terms that have already been segmented. Experiments show that this method achieves higher word segmentation precision.
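The idea of letting extracted terms guide segmentation can be illustrated with a minimal sketch. It is not the thesis's part-segmentation or full-segmentation method, whose details the abstract does not give. It assumes a set of already extracted terms, finds them in the text by longest match, keeps each one as a single token, and passes the remaining spans to a general-purpose segmenter. The function name `segment_with_terms`, the `base_segment` parameter, and the example terms are all hypothetical.

```python
def segment_with_terms(text, terms, base_segment):
    """Keep known domain terms whole; segment everything else with base_segment.

    text:         the sentence to segment
    terms:        a set of domain terms (assumed to come from term extraction)
    base_segment: any callable str -> list[str]; stands in for a general segmenter
    """
    max_len = max((len(t) for t in terms), default=0)
    tokens = []
    buffer = []  # characters seen since the last matched term
    i = 0
    while i < len(text):
        # Try the longest candidate first so longer terms win over their prefixes.
        match = None
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in terms:
                match = text[i:i + n]
                break
        if match is not None:
            if buffer:
                # Flush the non-term span through the general segmenter.
                tokens.extend(base_segment("".join(buffer)))
                buffer = []
            tokens.append(match)
            i += len(match)
        else:
            buffer.append(text[i])
            i += 1
    if buffer:
        tokens.extend(base_segment("".join(buffer)))
    return tokens


# Illustrative terms only; `list` is a character-level stand-in for a real segmenter.
example_terms = {"起落架", "液压系统"}
result = segment_with_terms("起落架由液压系统驱动", example_terms, list)
print(result)
```

In practice `base_segment` would be an existing general-domain segmenter. The sketch only shows how protecting extracted terms keeps them from being split by a tool that does not know the domain vocabulary.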
Keywords/Search Tags: domain word segmentation, term extraction, conditional random fields, term, unknown term