Research On Chinese Text Proofreading Algorithm Based On The Combination Of Statistical Features And Rules

Posted on: 2020-11-08 | Degree: Master | Type: Thesis
Country: China | Candidate: L P Wang | Full Text: PDF
GTID: 2438330596497514 | Subject: Electronic and communication engineering
Abstract/Summary:
Written text plays an indispensable role in today's society. It circulates through electronic publications, newspapers, and social platforms, which makes information more accessible but also produces an overwhelming volume of it. Errors are very common in this mass of text. Traditional manual proofreading is inefficient, labor-intensive, and slow, and clearly cannot meet the demand. Automatic text proofreading has therefore become a key technology and an active research topic in natural language processing.

Automatic proofreading of Chinese text is usually divided into two steps: error detection and error correction. Traditional error detection is generally built on word segmentation, with scattered-string detection and mutual-information-based detection as the common methods, so the quality of Chinese word segmentation has a crucial influence on error detection. For correction, commonly used methods include the N-gram model and the Markov model, both of which need the support of a large-scale corpus. In addition, detection and correction are usually performed as separate passes, which increases the time overhead of proofreading.

The proofreading model proposed in this thesis focuses on two error types, shape-near word errors (a word confused with a similar-looking one) and non-word errors, and it integrates error detection with correction.

First, for shape-near word errors, the minimum edit distance algorithm and a stroke-based character similarity method are used to build a shape-near word table. The table is then used to build a shape-near word candidate matrix for the text; according to the features of the text, adjacent vectors are grouped into words to obtain the candidate words. To select the best candidates, a context-based bigram (binary) model is proposed: it computes which words have the highest degree of support in the text, and the output is obtained along the path of the best candidate words.

Second, non-word errors are proofread with a language knowledge base. This is done in two parts. The first part is long-word proofreading, which mainly uses a fuzzy matching algorithm to locate long-word errors in the text and corrects them against a thesaurus; a dictionary tree (trie) index is used to speed up retrieval. The second part is the proofreading of reduplicated (overlapping) words: reduplicated words are first defined, then the overlapping words are identified in the text and used to filter the overlapping-word thesaurus, and finally the errors in the text are located and corrected.

Finally, the thesis combines and implements the two proofreading methods. Tests on real data report the proofreading rate, accuracy, and other indicators, which indicate that the proposed method performs well.
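The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of choosing a best path through a candidate matrix with a context bigram score. Everything in it is an assumption made for illustration: the tiny confusion table `CONFUSION`, the toy counts in `BIGRAMS`, the add-one smoothing, the function names, and the use of single characters instead of segmented words. It is not the thesis's actual model or data.

```python
import math

# Hypothetical confusion table: each character maps to similar-looking ones.
CONFUSION = {"末": ["未"], "未": ["末"]}

# Hypothetical bigram counts; "<s>" marks the start of the sentence.
BIGRAMS = {
    ("<s>", "我"): 30,
    ("我", "的"): 60,
    ("的", "未"): 8,
    ("的", "末"): 1,
    ("未", "来"): 40,
}

# Totals per left context and vocabulary size, used for add-one smoothing.
CONTEXT_TOTALS = {}
for (left, _right), count in BIGRAMS.items():
    CONTEXT_TOTALS[left] = CONTEXT_TOTALS.get(left, 0) + count
VOCAB_SIZE = len({token for pair in BIGRAMS for token in pair})


def candidates(ch):
    """One column of the candidate matrix: the character plus its look-alikes."""
    return [ch] + CONFUSION.get(ch, [])


def bigram_logp(prev, cur):
    """Add-one smoothed log P(cur | prev) from the toy counts."""
    num = BIGRAMS.get((prev, cur), 0) + 1
    den = CONTEXT_TOTALS.get(prev, 0) + VOCAB_SIZE
    return math.log(num / den)


def correct(text):
    """Pick the highest-scoring path through the candidate columns (Viterbi)."""
    # best maps the last chosen candidate to (score, path ending in it).
    best = {"<s>": (0.0, [])}
    for ch in text:
        nxt = {}
        for cur in candidates(ch):
            score, path = max(
                ((s + bigram_logp(prev, cur), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            nxt[cur] = (score, path + [cur])
        best = nxt
    return "".join(max(best.values(), key=lambda item: item[0])[1])


print(correct("我的末来"))
```

Because every column also contains the original character, detection and correction happen in the same pass: a position is changed only when a look-alike candidate gives the whole path a higher score.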
Keywords/Search Tags:automatic text proofing, near-word proofreading, non-word proofreading, fuzzy matching, contextual binary model