Research And Implementation Of Chinese Word Segmentation Algorithm For Aerospace Field

Posted on: 2020-08-24
Degree: Master
Type: Thesis
Country: China
Candidate: G X Zheng
GTID: 2392330602452123
Subject: Computer Science and Technology
Abstract/Summary:
Since the 1960s, China's space industry has developed rapidly and accumulated a large body of space information resources. For retrieving and analyzing this information, traditional manual methods are time-consuming and laborious and can no longer keep up with space information retrieval tasks. Chinese word segmentation is a key step in search engine technology, and segmentation quality strongly affects retrieval accuracy. Although Chinese word segmentation technology in China ranks among the best in the world, most research targets the general domain and little of it is oriented to the aerospace field. A Chinese word segmentation algorithm that performs well in the aerospace field is therefore of great significance for space information retrieval.

This thesis studies Chinese word segmentation algorithms for the aerospace field. It first examines three commonly used Chinese word segmentation algorithms and analyzes and summarizes the problems of traditional methods. Building on the traditional algorithms and on the characteristics of aerospace terminology, it then proposes a multi-strategy fusion Chinese word segmentation algorithm made up of three modules: a dictionary-based initial segmentation module, a disambiguation module, and an aerospace term extraction module.

In the dictionary-based initial segmentation module, an improved maximum matching algorithm (DF-MM) is proposed to address the shortcomings of the maximum matching algorithm. In the disambiguation module, a new method combining statistics and rules is proposed to ensure both accuracy and efficiency. The corpus is segmented with the forward maximum matching algorithm and with the reverse maximum matching algorithm. If the two results contain different numbers of words, the result with fewer words is kept as the final disambiguation result, following the "minimum segmentation" principle. If they contain the same number of words, a Bi-Gram model computes the probability of each segmentation, and the one with the higher probability is kept as the final result. In the aerospace term extraction module, the widely used conditional random field (CRF) model is adopted to cast term extraction as a sequence labeling problem. Based on a summary of the characteristics of aerospace terms, a 5-tag labeling scheme is adopted, and 5 features are extracted and used to build the feature templates that complete the term extraction task.

The multi-strategy fusion algorithm combines the advantages of several traditional Chinese word segmentation algorithms. The dictionary-based algorithm is chosen as the main segmentation module to ensure the overall efficiency of the algorithm. Experiments show that the proposed multi-strategy fusion algorithm and each of its submodules perform better than the traditional methods. Finally, the algorithm is applied to the "space system" to improve the accuracy of the system's word segmentation and to provide users with more accurate retrieval results.
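The disambiguation rule described in the abstract can be illustrated with a minimal Python sketch. It uses plain forward and reverse maximum matching, not the thesis's DF-MM variant, whose details the abstract does not give. The function names, the add-one smoothing, the tie-breaking choice, and the toy dictionary and counts are all assumptions made for illustration; none of them come from the thesis.

```python
import math


def forward_max_match(text, dictionary, max_len=4):
    """Plain forward maximum matching; unknown characters become single-character words."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len=4):
    """Plain reverse maximum matching, scanning from the end of the text."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def bigram_log_prob(words, unigrams, bigrams):
    """Log-probability of a segmentation under a bigram model.

    Add-one smoothing is an assumption of this sketch, not the thesis's choice.
    """
    vocab = len(unigrams) + 1
    score, prev = 0.0, "<s>"
    for w in words:
        num = bigrams.get((prev, w), 0) + 1
        den = unigrams.get(prev, 0) + vocab
        score += math.log(num / den)
        prev = w
    return score


def disambiguate(text, dictionary, unigrams, bigrams, max_len=4):
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    # Rule step: different word counts -> keep the segmentation with fewer words.
    if len(fwd) != len(bwd):
        return min(fwd, bwd, key=len)
    # Statistical step: same word count -> keep the more probable segmentation.
    # Returning the forward result on an exact tie is an arbitrary choice here.
    if bigram_log_prob(fwd, unigrams, bigrams) >= bigram_log_prob(bwd, unigrams, bigrams):
        return fwd
    return bwd


# Toy dictionary and made-up counts, for illustration only.
toy_dict = {"研究", "研究生", "生命", "命", "起源"}
toy_unigrams = {"<s>": 10, "研究": 6, "研究生": 2, "生命": 5, "命": 1, "起源": 4}
toy_bigrams = {("<s>", "研究"): 5, ("研究", "生命"): 3,
               ("生命", "起源"): 3, ("<s>", "研究生"): 1}

print(disambiguate("研究生命起源", toy_dict, toy_unigrams, toy_bigrams))
# -> ['研究', '生命', '起源']
```

In this toy case both directions yield three words, so the word-count rule cannot decide and the bigram score selects the reverse-matching result. Trying the cheap count comparison first means the statistical model is consulted only when the rule is inconclusive, which fits the abstract's stated aim of balancing accuracy and efficiency.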
Keywords/Search Tags:Aerospace Field, Chinese Word Segmentation, Maximum Matching, Bi-Gram, Conditional Random Fields