With the increasingly wide range of computer applications and the continuous advancement of information processing technology, natural language information processing has attracted much attention, and improving the computer's ability to understand natural language is of great significance. Automatic Chinese word segmentation is the prerequisite and foundation for text proofreading, information retrieval, speech recognition, text mining, and machine translation, so it has become a basic and core task of natural language information processing.

Because of the variability and complexity of Chinese sentences, automatic Chinese word segmentation is a bottleneck in the automatic processing of Chinese information. The main difference between processing Chinese words and processing words in other languages is that Chinese text has no explicit separators between words. The primary problem of Chinese information processing is therefore to divide a sentence into words; this is the Chinese word segmentation problem.

The main content, key techniques, and innovations of this research are as follows.

First, the research starts from the position of each character within a word, and thereby treats Chinese word segmentation as a problem of character position within words. Based on statistics from a large-scale corpus, the probability of each character occurring at each position within a word is computed, forming a dictionary of per-character position scores (a character valuation dictionary) that serves as the basis for segmentation. This approach is one of the innovative ideas of the study. Because most Chinese words contain four characters or fewer, and words of five or more characters are relatively rare, the study restricts its statistics to character positions within words of up to four characters.

Second, hidden Markov model theory addresses three basic problems, the second of which is the decoding problem; word segmentation is cast as this decoding problem.
The Viterbi algorithm can find the optimal solution; its essence is to decompose the computation of the global optimum into the computation of optimal solutions stage by stage. Using the character valuation dictionary, each character of the sentence to be segmented is scored following the Viterbi approach, and the sentence is then cut by backtracking. An important advantage of this approach is that it treats in-vocabulary words and unknown words in a balanced way, and it is therefore better able to solve the unknown word problem and most ambiguity problems. This is the second innovative idea of the research.

Finally, the third problem addressed by the hidden Markov model is the learning problem. The preliminary segmentation results are analyzed, and machine learning is driven by the size of the segmentation error. The model learns the position features of characters within words: the initial statistical character scores serve as the initial parameters, and the model parameters are then adjusted so that the computer repeatedly learns to adjust and optimize the scores in the character valuation dictionary; segmentation is then performed with the adjusted parameters. Optimizing the character score parameters through machine learning is the third innovative idea of the research.

At present, it is difficult to establish uniform standards in the field of word segmentation. In recent years a number of high-accuracy word segmentation programs have appeared, but they remain constrained by the unknown word problem and the ambiguity problem. This research therefore concentrates on better solutions to the unknown word and ambiguity problems, in order to make word segmentation better and more accurate.
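
The character-position statistics and the Viterbi decoding described above can be illustrated with a minimal sketch. It is not the implementation used in this research: the B/M/E/S tag set (begin, middle, end, single-character word), the add-one smoothing, the hard constraints on legal tag successions used in place of learned transition probabilities, the function names, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import defaultdict

TAGS = "BMES"  # assumed tag set: Begin, Middle, End, Single-character word

# Assumed legal successions: a word must be finished before the next one starts.
ALLOWED = {"B": "ME", "M": "ME", "E": "BS", "S": "BS"}


def word_to_tags(word):
    """Label each character of a word with its position tag."""
    if len(word) == 1:
        return "S"
    return "B" + "M" * (len(word) - 2) + "E"


def build_valuation(segmented_corpus):
    """Estimate log P(tag | character) from a segmented corpus (add-one smoothing)."""
    counts = defaultdict(lambda: dict.fromkeys(TAGS, 0))
    for sentence in segmented_corpus:
        for word in sentence:
            for ch, tag in zip(word, word_to_tags(word)):
                counts[ch][tag] += 1
    valuation = {}
    for ch, c in counts.items():
        total = sum(c.values()) + len(TAGS)
        valuation[ch] = {t: math.log((c[t] + 1) / total) for t in TAGS}
    return valuation


def viterbi_segment(sentence, valuation):
    """Find the best legal tag sequence, then cut the sentence by backtracking."""
    if not sentence:
        return []
    uniform = {t: math.log(1 / len(TAGS)) for t in TAGS}  # unseen characters
    neg_inf = float("-inf")
    # A sentence can only start with B or S.
    first = valuation.get(sentence[0], uniform)
    score = {t: (first[t] if t in "BS" else neg_inf) for t in TAGS}
    back = []
    for ch in sentence[1:]:
        emit = valuation.get(ch, uniform)
        new_score, pointers = {}, {}
        for t in TAGS:
            # Best previous tag among those allowed to precede t.
            prev = max((p for p in TAGS if t in ALLOWED[p]), key=lambda p: score[p])
            new_score[t] = score[prev] + emit[t]
            pointers[t] = prev
        score = new_score
        back.append(pointers)
    # A sentence can only end with E or S; follow the pointers backwards.
    tag = max("ES", key=lambda t: score[t])
    tags = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        tags.append(tag)
    tags.reverse()
    # Cut after every E or S tag.
    words, current = [], ""
    for ch, t in zip(sentence, tags):
        current += ch
        if t in "ES":
            words.append(current)
            current = ""
    return words


# Toy, hand-made corpus (hypothetical data, for illustration only).
toy_corpus = [["我", "喜欢", "北京"], ["北京", "欢迎", "你"], ["我", "爱", "你"]]
toy_valuation = build_valuation(toy_corpus)
print(viterbi_segment("我爱北京", toy_valuation))
```

Characters that never appear in the corpus fall back to a uniform score over the four tags, so the decoder can still place them using the scores of their neighbours; this is one simple way, under the stated assumptions, to treat known and unknown words in the same framework.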