The Research Of Chinese Word Segmentation Disambiguation Based On Word Environment Information

Posted on: 2019-08-26
Degree: Master
Type: Thesis
Country: China
Candidate: L Huang
GTID: 2428330626950118
Subject: Computer Science and Technology
Abstract/Summary:
The world has entered the information age. As global informatization advances rapidly, the Internet has become inseparable from people's work and study, and nearly every task in modern society depends on it. Using network information more efficiently and accurately is the goal of Chinese information processing (CLP). Chinese word segmentation is the cornerstone and prerequisite of CLP, and ambiguity resolution is the key difficulty in Chinese word segmentation. This thesis therefore aims at better resolution of segmentation ambiguity and its application.

Words are the smallest meaningful language units that can be used independently. Unlike many foreign languages, written Chinese has no explicit spaces between words, which is why Chinese word segmentation technology arose. After years of exploration and development, Chinese word segmentation has made great progress, but it still faces problems such as defining word boundaries, identifying unregistered (out-of-vocabulary) words, standardizing segmentation, and resolving ambiguity. The main goal of this thesis is to explore ambiguity resolution, in particular the many ambiguities that can only be segmented correctly in light of their context. Existing approaches are mainly statistical methods based on large-scale corpora, rule-based methods based on dictionaries, or combinations of the two. However, these disambiguation methods seldom take into account contextual information such as word relevance, grammar, and semantics, so some ambiguities are not well resolved. Building on an in-depth study of existing CLP theory and disambiguation models and algorithms, this thesis explores a Chinese word segmentation disambiguation method based on contextual information, aimed at ambiguities that can only be resolved with information from their context.

The main contents of this thesis are as follows:
(1) It introduces and analyzes theories of Chinese word segmentation and disambiguation, discusses the basic methods of automatic Chinese word segmentation together with their advantages and disadvantages, and describes several statistical models used in CLP.
(2) Using a segmentation method based on dummy (function) words and an improved two-way (bidirectional) maximum matching algorithm, the input text is segmented first coarsely and then finely. After preprocessing and part-of-speech tagging, segmentation is complete and several candidate segmentations of each ambiguous field are obtained.
(3) To disambiguate ambiguous fields, the thesis simulates how humans use context to disambiguate. It fuses contextual information such as word length, part of speech, TF-IDF, and semantic similarity to construct a TextRank graph model for keyword extraction, combining statistical and semantic methods so that contextual information is fully extracted and used.
(4) Based on HowNet and the context-based keyword extraction algorithm, semantic similarity and degree of relevance are combined. The thesis presents a CLP model based on contextual information that selects the correct segmentation of each ambiguous field.
(5) A CLP system is designed and implemented, and the system is tested experimentally.
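As a rough illustration of how two-way maximum matching can expose an ambiguous field, the following minimal sketch runs plain forward and backward maximum matching and flags a sentence as ambiguous when the two results differ. It is the textbook baseline, not the improved variant the thesis describes, and it omits the dummy-word step, preprocessing, and part-of-speech tagging. The function names and the toy dictionary are hypothetical and chosen only for this example.

```python
def forward_max_match(text, vocab, max_len):
    """Scan left to right, taking the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, vocab, max_len):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def bidirectional_candidates(text, vocab):
    """Return both segmentations and whether they disagree (an ambiguity signal)."""
    max_len = max(map(len, vocab), default=1)
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)
    return fwd, bwd, fwd != bwd


# Toy dictionary (an assumption for illustration only).
toy_vocab = {"研究", "研究生", "生命", "的", "起源"}
fwd, bwd, ambiguous = bidirectional_candidates("研究生命的起源", toy_vocab)
print(fwd)        # ['研究生', '命', '的', '起源']
print(bwd)        # ['研究', '生命', '的', '起源']
print(ambiguous)  # True
```

When the two passes disagree, both results can be kept as candidate segmentations of the ambiguous field and handed to a later, context-based step that chooses between them.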
Keywords/Search Tags:Chinese word segmentation, Ambiguity disambiguation, Keyword extraction