Research On Chinese Word Segmentation Algorithm Based On News Text

Posted on: 2022-01-31, Degree: Master, Type: Thesis
Country: China, Candidate: L Wang, Full Text: PDF
GTID: 2518306602469274, Subject: Computer Science and Technology
Abstract/Summary:
Natural language is human language that follows certain rules, such as Chinese, English, and French. Natural language processing (NLP) uses computers to process the form, sound, meaning, and other information carried by natural language. For example, license plate recognition uses image recognition to process the written form of language; speech-to-text in the WeChat mobile app uses speech recognition to process its sound; and text classification and automatic summarization process its meaning. Chinese word segmentation is the first step of Chinese NLP and directly affects the results of downstream applications, so research on it has significant practical value. This thesis studies how to improve Chinese word segmentation performance on news text. The main work and contributions are as follows:

1. Word segmentation of news text. Chinese is rich and varied, and texts in different domains have different characteristics. This thesis addresses word segmentation in the news domain. Building on prior work, an existing Chinese word segmenter first segments the news text, and the ambiguous words and new words in its output are then each handled in a second pass.

2. Disambiguation of ambiguous words. To address ambiguous words in news text, a new bidirectional maximum matching segmentation algorithm based on an N-gram language model is proposed. The text is first segmented by forward maximum matching and by backward maximum matching, which gives two segmentation results. The ambiguous words are then located in the two results and resolved with the N-gram model. Experimental results show that the proposed method can effectively disambiguate some of the ambiguous words.

3. New word recognition. To address the problem that person names, place names, organization names, and other proper nouns in news text are not recognized, an unsupervised multi-word segmentation algorithm based on the N-gram model is proposed. Most current algorithms use statistics such as mutual information and information entropy to decide whether a Chinese character string forms a word. This thesis takes the opposite approach: it uses these statistics to detect weak association between adjacent words, deletes the candidate strings that do not satisfy the word-formation rules, and takes the remaining strings as the new words in the news text. Experimental results show that the proposed method is effective at identifying new words in articles.
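The disambiguation step in item 2 can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the toy dictionary, the three-sentence training corpus, the add-one smoothing, and the choice of a bigram model are all assumptions made for illustration. As a simplification, the sketch scores the two whole segmentations with the bigram model instead of locating and rescoring only the ambiguous span.

```python
import math
from collections import Counter

def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right matching; unknown single characters become words."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words

def backward_max_match(text, vocab, max_len=4):
    """Greedy right-to-left matching; the result is returned in reading order."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words

def count_ngrams(sentences):
    """Count unigrams and bigrams over pre-segmented sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        prev = "<s>"  # sentence-start marker
        unigrams[prev] += 1
        for w in words:
            unigrams[w] += 1
            bigrams[(prev, w)] += 1
            prev = w
    return unigrams, bigrams

def bigram_log_prob(words, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a segmentation (assumed smoothing)."""
    vocab_size = len(unigrams)
    score, prev = 0.0, "<s>"
    for w in words:
        score += math.log((bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size))
        prev = w
    return score

def segment(text, vocab, unigrams, bigrams):
    """Run both matchers; if they disagree, keep the higher-scoring result."""
    fwd = forward_max_match(text, vocab)
    bwd = backward_max_match(text, vocab)
    if fwd == bwd:
        return fwd
    return max((fwd, bwd), key=lambda ws: bigram_log_prob(ws, unigrams, bigrams))

# Toy data, invented for this example.
vocab = {"研究", "研究生", "生命", "的", "起源"}
corpus = [["研究", "生命", "的", "起源"],
          ["研究", "生命", "科学"],
          ["研究生", "的", "生活"]]
unigrams, bigrams = count_ngrams(corpus)
print(segment("研究生命的起源", vocab, unigrams, bigrams))
# prints ['研究', '生命', '的', '起源']
```

On this input the forward pass yields 研究生 / 命 / 的 / 起源 and the backward pass yields 研究 / 生命 / 的 / 起源, so the two results disagree and the bigram score decides between them.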
Keywords/Search Tags:Chinese word segmentation, N-gram, ambiguous words, new word recognition
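The filtering idea in item 3 of the abstract (compute mutual information and information entropy for candidate strings, delete the candidates that fail, keep the rest as new words) can be sketched as follows. This is an illustration under assumptions, not the thesis's algorithm: it considers only two-character candidates, uses pointwise mutual information and left/right neighbor entropy, and the thresholds `min_count`, `min_pmi`, and `min_entropy` are hypothetical values chosen for the toy text.

```python
import math
from collections import Counter, defaultdict

def entropy(counter):
    """Shannon entropy (natural log) of a neighbor-character distribution."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counter.values())

def is_cjk(s):
    """True if every character is in the basic CJK Unified Ideographs block."""
    return all("\u4e00" <= ch <= "\u9fff" for ch in s)

def find_new_words(text, known_words, min_count=2, min_pmi=1.0, min_entropy=1.0):
    """Delete weak two-character candidates; return the ones that survive."""
    chars = Counter(text)
    n_chars = len(text)
    pairs = Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    for i in range(len(text) - 1):
        pair = text[i:i + 2]
        pairs[pair] += 1
        if i > 0:
            left[pair][text[i - 1]] += 1   # character before the candidate
        if i + 2 < len(text):
            right[pair][text[i + 2]] += 1  # character after the candidate
    n_pairs = sum(pairs.values())

    kept = []
    for pair, count in pairs.items():
        # Skip punctuation, dictionary words, and rare strings.
        if not is_cjk(pair) or pair in known_words or count < min_count:
            continue
        # Pointwise mutual information: how strongly the two characters cohere.
        p_pair = count / n_pairs
        p_a, p_b = chars[pair[0]] / n_chars, chars[pair[1]] / n_chars
        if math.log(p_pair / (p_a * p_b)) < min_pmi:
            continue
        # Low neighbor entropy suggests the string is part of a longer unit.
        if min(entropy(left[pair]), entropy(right[pair])) < min_entropy:
            continue
        kept.append(pair)
    return kept

# Toy text, invented for this example.
text = "我在武汉。他去武汉了。武汉很大。你来武汉吧。"
print(find_new_words(text, known_words=set()))
# prints ['武汉']
```

Here 武汉 occurs four times with four different neighbors on each side, so it passes all three filters, while every other two-character string occurs only once and is deleted.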