| With the rapid development of the Internet and modern communication technologies, there is a growing need to obtain useful information from huge amounts of textual data quickly and intelligently by computer, and natural language processing is an important research direction for meeting this demand. Word segmentation is a branch of natural language processing and the first step of text information processing. However, little existing work focuses on Zhuangwen word segmentation, which lacks a mature theoretical basis and has no standardized evaluation system. The traditional Zhuangwen word segmentation method uses the spaces between words as separators, but in most cases this breaks up semantic units formed by adjacent consecutive words and destroys the specific, independent meaning they express together. To address this issue, this thesis studies and analyzes existing segmentation algorithms. Based on the characteristics of Zhuangwen text, and drawing on related research results for Chinese and other minority languages, it summarizes the candidate approaches and selects a word segmentation model suited to Zhuangwen. The main content of this thesis is as follows: 1. The first part is the choice of word segmentation algorithm. Analysis of existing segmentation algorithms shows that dictionary-based and rule-based methods need a complete dictionary as their basis, understanding-based methods require a great deal of linguistic knowledge, and statistics-based methods need only a statistical model computed over a large standardized text corpus. Because existing Zhuangwen resources are scarce, with no reasonably complete dictionary and few Zhuang language knowledge resources, this thesis adopts a statistics-based segmentation method for Zhuangwen word 
segmentation. 2. Building on the earlier MI method, which uses mutual information (MI) to measure the degree of association between adjacent words, a Zhuangwen word segmentation model based on mutual information is constructed. To address the tendency of MI to overestimate the binding strength between two low-frequency words and between combined word strings, MI^k and the t-test difference are introduced to Zhuangwen text segmentation for the first time, which effectively improves segmentation accuracy. After analyzing their respective advantages in evaluating the static and dynamic binding ability of adjacent words, this thesis proposes a new TD-MI^k hybrid algorithm that combines MI^k and the t-test difference for Zhuangwen word segmentation. Finally, the MI^k, t-test difference, and TD-MI^k hybrid algorithms are compared. 3. Zhuangwen text collected from the Zhuangwen edition of People's Daily serves as the training corpus, and a randomly selected group of articles, manually annotated, serves as the test corpus for the word segmentation experiments. The experimental results show that the segmentation methods used in this thesis are more accurate, and that the proposed TD-MI^k hybrid algorithm has the highest segmentation accuracy, with the accuracy rate and recall rate increased by 3.77% and 4.7%, respectively. In addition, a Zhuangwen word segmentation system with a user interface that is easy to operate was designed and developed. |
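
The mutual-information step summarized above can be illustrated with a minimal sketch. The abstract does not give formulas, so this assumes the commonly used form MI^k(x, y) = log2(p(x, y)^k / (p(x) p(y))), where k = 1 is plain MI and larger k lowers the score of rarely seen pairs. The t-test difference and the TD-MI^k combination are not reproduced, because their definitions and weighting are not stated here. The function names, the threshold, the toy tokens, and the span-based scoring (which reads "accuracy rate" as precision) are all hypothetical choices for illustration, not the thesis's implementation.

```python
import math
from collections import Counter


def train_counts(sentences):
    """Count unigrams and adjacent bigrams over space-delimited tokens."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def mi_k(x, y, unigrams, bigrams, k=1):
    """Assumed form: log2(p(x, y)^k / (p(x) * p(y))); k=1 is plain MI."""
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    if n_bi == 0 or bigrams[(x, y)] == 0:
        return float("-inf")  # unseen pair: treat as unassociated
    p_xy = bigrams[(x, y)] / n_bi
    p_x = unigrams[x] / n_uni
    p_y = unigrams[y] / n_uni
    return k * math.log2(p_xy) - math.log2(p_x) - math.log2(p_y)


def segment(sentence, unigrams, bigrams, k=1, threshold=0.0):
    """Join adjacent tokens whose MI^k score exceeds the threshold."""
    tokens = sentence.split()
    if not tokens:
        return []
    units = [[tokens[0]]]
    for prev, cur in zip(tokens, tokens[1:]):
        if mi_k(prev, cur, unigrams, bigrams, k) > threshold:
            units[-1].append(cur)   # strong association: same semantic unit
        else:
            units.append([cur])     # weak association: start a new unit
    return [" ".join(unit) for unit in units]


def precision_recall(predicted, gold):
    """Score a segmentation by exact match of (start, end) token spans."""
    def spans(units):
        out, pos = set(), 0
        for unit in units:
            n = len(unit.split())
            out.add((pos, pos + n))
            pos += n
        return out

    p, g = spans(predicted), spans(gold)
    correct = len(p & g)
    return correct / len(p), correct / len(g)


if __name__ == "__main__":
    # Placeholder tokens stand in for Zhuangwen words.
    corpus = ["a b c", "a b d", "c d", "d c a b"]
    uni, bi = train_counts(corpus)
    result = segment("d c a b", uni, bi, k=1, threshold=1.5)
    print(result)
    print(precision_recall(result, ["d", "c", "a b"]))
```

In this toy corpus the pair `a b` occurs three times and every other pair once, so only `a b` scores above the threshold and is kept together as one unit. A real system would tune `k` and the threshold on annotated data.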