
The Research Of Chinese Words Segmentation Algorithm Based On Statistics And Semantic Information

Posted on: 2016-10-30
Degree: Master
Type: Thesis
Country: China
Candidate: L J Li
GTID: 2308330461993546
Subject: Computer Science and Technology
Abstract/Summary:
This thesis uses domain-specific knowledge and the structural characteristics of a Semantic Web ontology to improve the dictionary-based bidirectional maximum matching algorithm, and proposes a Chinese word segmentation algorithm based on statistical and semantic information. To verify the proposed algorithm, a Chinese word segmentation system was developed. A comparison with the results of the NLPIR Chinese word segmentation system shows that, within a specific domain, the proposed algorithm is more effective than traditional segmentation methods. The work covers the following five aspects:

1. Building a plane geometry domain ontology according to the OWL standard. Using Wikipedia to understand the concepts and hierarchy of plane geometry, the thesis extracts 30 terms from this domain. Using four basic ontology relationships, it completes semi-automatic annotation and proofreading of the relationships between terms. It then constructs a database of semantic relations and supports management and editing of the domain ontology.

2. Proposing an ambiguity resolution algorithm based on statistical rules. Because ambiguous fields have a significant impact on segmentation precision, the thesis presents five statistical rules derived from an analysis of existing ambiguity processing methods. Based on these rules, it designs and implements a processing algorithm for crossing (overlapping) ambiguity.

3. Proposing a Chinese word segmentation algorithm based on semantic information. The algorithm improves the traditional dictionary-based maximum matching algorithm: the constructed domain ontology replaces the traditional Chinese word dictionary. After preprocessing, entries in the domain ontology are matched against the experimental corpus. Using the relationships between terms in the domain ontology, the algorithm adds semantic information processing to mechanical segmentation, which reduces the number of ambiguous fields and ultimately yields more accurate segmentation results.

4. Designing and implementing a Chinese word segmentation system based on statistical and semantic information. The system implements the proposed segmentation algorithm, and the thesis describes the function of each module of the framework. The system provides three functions: preprocessing, handling of semantic ambiguity in words, and word processing.

5. Validating the proposed segmentation algorithm. The test corpus is selected from plane geometry exams. The Chinese word segmentation system was tested on 50 randomly selected examples, and the test items include segmentation accuracy, ambiguity processing, unknown word recognition, and system response time. The test results are then compared with those of the NLPIR Chinese word segmentation system, and the experimental results show that, within a specific domain, the proposed algorithm is more effective than traditional segmentation methods.
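
To make the baseline concrete, the following is a minimal Python sketch of plain dictionary-based bidirectional maximum matching, the method the abstract says it improves on. It is not the thesis's algorithm: the abstract does not spell out the five statistical rules or the ontology-based processing, so the tie-breaking heuristic here (fewer tokens, then fewer single-character tokens, with ties going to the backward pass) is a common textbook choice swapped in for illustration. The function names and the small lexicon in the usage note are hypothetical.

```python
def forward_max_match(text, lexicon, max_len):
    """Scan left to right, always taking the longest lexicon entry."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, lexicon, max_len):
    """Scan right to left, always taking the longest lexicon entry."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()  # tokens were collected from the end of the text
    return tokens


def bidirectional_max_match(text, lexicon):
    """Run both passes; if they disagree, pick one by a simple heuristic.

    Assumption: the heuristic below is illustrative only and stands in for
    the thesis's five statistical rules, which the abstract does not list.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    fwd = forward_max_match(text, lexicon, max_len)
    bwd = backward_max_match(text, lexicon, max_len)
    if fwd == bwd:
        return fwd

    def cost(tokens):
        # Fewer tokens first, then fewer single-character tokens.
        return (len(tokens), sum(1 for t in tokens if len(t) == 1))

    # Ties go to the backward pass.
    return fwd if cost(fwd) < cost(bwd) else bwd
```

With a hypothetical lexicon `{"研究", "研究生", "生命", "起源"}` and the input `"研究生命起源"`, the forward pass yields `研究生 / 命 / 起源` and the backward pass yields `研究 / 生命 / 起源`. The two passes disagree because `研究生` and `生命` overlap on `生`, which is an instance of crossing ambiguity; the heuristic prefers the backward result because it contains no single-character token.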
Keywords/Search Tags: Chinese word segmentation, Semantic analysis, Domain ontology, Ambiguity processing