Research And Implement Of Chinese Word Segment Techniques Based On The Conditional Random Field

Posted on: 2012-12-20 | Degree: Master | Type: Thesis
Country: China | Candidate: X Lu
GTID: 2218330362456259 | Subject: Communication and Information System
Abstract/Summary:
The rapidly growing volume of information on the Internet poses a great challenge for information processing, especially for Chinese information processing. One of the most important requirements in this field is Chinese word segmentation, whose main purpose is to split Chinese sentences correctly into sequences of words so that computers can process them. Chinese word segmentation is the first step in many information processing tasks, including information indexing, summarization, text categorization, automatic clustering, text correction, handwriting input, and intelligent response, so strengthening research on Chinese word segmentation has become very important.

This paper successfully recasts word segmentation as the problem of combining characters into words, using the conditional random field (CRF) statistical model and introducing the concept of the position of a Chinese character within a word. Characters can then be merged according to rules for combining positions. The process is carried out through machine learning, which not only improves segmentation accuracy but also frees segmentation from its dependence on dictionaries.

The experiments are implemented in Java. The system first collects statistics on the feature information in the corpus and builds an extensible information database, then uses the Viterbi algorithm to compute the best sequence of character positions. During this process, computational complexity is reduced by removing invalid combinations. A real corpus is used for training and testing, and an evaluation algorithm is implemented to assess system performance. The method could also be used to identify unknown words through the character combination rules, and further work remains to be done in this area.
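The decoding step described above can be illustrated with a minimal Java sketch. It is not the thesis implementation. It assumes the common four-tag position scheme (B = begin, M = middle, E = end, S = single-character word), which the abstract does not name, and it takes already-computed emission and transition scores as plain arrays instead of deriving them from a trained CRF. The class name `PositionTagSegmenter` and its methods are hypothetical. Invalid tag combinations (for example B followed by B) are skipped during the Viterbi search, and the resulting tag sequence is then merged into words.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: Viterbi decoding over character position tags.
final class PositionTagSegmenter {
    // Assumed tag set: position of a character inside a word.
    static final int B = 0, M = 1, E = 2, S = 3;

    // VALID[i][j] is true when tag j may directly follow tag i.
    static final boolean[][] VALID = {
        /* B */ {false, true,  true,  false},
        /* M */ {false, true,  true,  false},
        /* E */ {true,  false, false, true},
        /* S */ {true,  false, false, true},
    };

    // emit[t][k]: score of tag k for character t; trans[i][j]: score of i -> j.
    // Returns the highest-scoring valid tag sequence.
    static int[] viterbi(double[][] emit, double[][] trans) {
        int n = emit.length;
        if (n == 0) return new int[0];
        final double NEG = Double.NEGATIVE_INFINITY;
        double[][] score = new double[n][4];
        int[][] back = new int[n][4];

        // A sentence can only start with B or S.
        for (int k = 0; k < 4; k++) {
            score[0][k] = (k == B || k == S) ? emit[0][k] : NEG;
        }
        for (int t = 1; t < n; t++) {
            for (int cur = 0; cur < 4; cur++) {
                score[t][cur] = NEG;
                for (int prev = 0; prev < 4; prev++) {
                    // Prune invalid combinations and unreachable states.
                    if (!VALID[prev][cur] || score[t - 1][prev] == NEG) continue;
                    double s = score[t - 1][prev] + trans[prev][cur] + emit[t][cur];
                    if (s > score[t][cur]) {
                        score[t][cur] = s;
                        back[t][cur] = prev;
                    }
                }
            }
        }
        // A sentence can only end with E or S.
        int[] tags = new int[n];
        tags[n - 1] = (score[n - 1][E] > score[n - 1][S]) ? E : S;
        for (int t = n - 1; t > 0; t--) {
            tags[t - 1] = back[t][tags[t]];
        }
        return tags;
    }

    // Merges characters into words: a word closes at every E or S tag.
    // Uses charAt, so it assumes characters in the Basic Multilingual Plane.
    static List<String> merge(String sentence, int[] tags) {
        List<String> words = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < tags.length; i++) {
            word.append(sentence.charAt(i));
            if (tags[i] == E || tags[i] == S) {
                words.add(word.toString());
                word.setLength(0);
            }
        }
        if (word.length() > 0) words.add(word.toString());
        return words;
    }
}
```

With pruning, each position considers only 8 of the 16 possible tag pairs, which is one way the "removing invalid combinations" idea can cut the search cost. In a real system the scores would come from weighted feature functions learned from the training corpus.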
Keywords/Search Tags: Chinese word segmentation, conditional random fields, machine learning