
Research On Key Technologies Of Textual Proofreading On Government Websites

Posted on: 2019-06-04
Degree: Master
Type: Thesis
Country: China
Candidate: Z Yuan
Full Text: PDF
GTID: 2348330569995806
Subject: Engineering
Abstract/Summary:
As government services gradually migrate online, more and more information can be obtained from government websites. The information published there is meant to convey timely, accurate, and authoritative messages to the public. However, as the volume of disclosed information grows rapidly, editors can easily overlook errors in the text. Given the high accuracy requirements for open government information, computer-assisted text proofreading has become an urgent need. In recent years, Chinese text proofreading has been studied for domains such as question answering, social networking, and opinion texts, but research targeting government websites is still lacking. To address this gap, this thesis applies natural language processing techniques to conduct in-depth research on the key technologies of government-website text proofreading from the perspectives of statistics and machine learning.

By analyzing the common error types in Chinese text and the characteristics of government-website text, the scope of this research was narrowed to word-level errors and short-range contextual collocation errors caused by homophone substitutions. Word-level errors are also known as "non-multi-word errors"; short-range contextual collocation errors are also known as "true multi-word errors." To handle these two error types, we approached proofreading from both the error-detection and error-correction ends. Building on existing research, this thesis makes the following three contributions from the perspectives of statistics and machine learning:

1. For "non-multi-word errors": a traditional dictionary-based proofreading scheme was implemented. Analysis of many example sentences showed that when a sentence containing a "non-multi-word error" is segmented, the erroneous word tends to be split into single-character fragments. Based on this regularity, a single-word merging algorithm was proposed to raise the error-detection rate and thereby the overall error-correction rate. Experiments verify that the single-word merging algorithm improves the error-detection rate by 6% and the error-correction rate by 3.1% over the original scheme.

2. For "true multi-word errors": the error-detection end uses a traditional N-gram model combined with thresholds. Exploiting the fact that an incorrect collocation and its correct counterpart share the same pinyin string, an error-correction scheme based on the Hidden Markov Model (HMM) is proposed at the correction end. Further, observing that users type whole words rather than individual characters, a correction scheme based on a word-based directed acyclic graph (DAG) model is also proposed. On the test set, the HMM-based and word-based DAG schemes achieved error-correction rates of 65.46% and 53.19%, respectively.

3. Text proofreading based on recurrent neural networks: the proofreading problem is modeled with an LSTM (Long Short-Term Memory network) based sequence decoding model. The LSTM's long-range memory is used to capture the semantic information of the whole sentence and, combined with sequence decoding, the model maps an erroneous sentence to its corrected form. Finally, extensive comparative experiments verify the feasibility of the proposed algorithms and their innovations.
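The single-word merging step in contribution 1 can be sketched as follows. The abstract does not give implementation details, so this is a minimal illustration: it assumes the sentence has already been segmented by an external tokenizer, and it merges runs of consecutive single-character tokens into suspect fragments for downstream checking (function and variable names are illustrative, not from the thesis):

```python
def merge_single_chars(tokens, min_run=2):
    """Merge runs of consecutive single-character tokens into candidate
    error fragments (suspected "non-multi-word errors").

    Returns a list of (tag, text) pairs where tag is "SUSPECT" for a
    merged fragment and None for tokens passed through unchanged.
    """
    merged, run = [], []

    def flush():
        # Emit the pending run: merge it if long enough, else pass through.
        if len(run) >= min_run:
            merged.append(("SUSPECT", "".join(run)))
        else:
            merged.extend((None, t) for t in run)
        run.clear()

    for tok in tokens:
        if len(tok) == 1:
            run.append(tok)          # accumulate single-character fragments
        else:
            flush()
            merged.append((None, tok))
    flush()
    return merged
```

For example, if a correct multi-character word has been corrupted so that the segmenter splits it into single characters, the run is re-joined as one suspect unit that a dictionary lookup can then flag.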
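The N-gram detector with thresholds in contribution 2 might look like the following sketch, assuming a bigram model for concreteness (the thesis abstract does not fix N or the threshold value): bigram transition probabilities are estimated from a clean corpus, and any transition whose probability falls below the threshold is flagged as a suspected collocation error.

```python
from collections import Counter

def train_bigrams(corpus_tokens):
    """Count unigrams and bigrams from a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_tokens:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    return unigrams, bigrams

def detect_errors(sent, unigrams, bigrams, threshold=0.01):
    """Flag positions whose bigram transition probability P(b|a) is
    below the threshold; returns (index, word) pairs of suspects."""
    flagged = []
    for i, (a, b) in enumerate(zip(sent, sent[1:])):
        p = bigrams[(a, b)] / unigrams[a] if unigrams[a] else 0.0
        if p < threshold:
            flagged.append((i + 1, b))
    return flagged
```

In practice the counts would come from a large government-website corpus and the threshold would be tuned on held-out data; the toy values here only illustrate the mechanism.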
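The HMM-based correction scheme in contribution 2 exploits the shared pinyin between the typed collocation and the intended one. The abstract gives no model details, so the following is an illustrative Viterbi decode under simplifying assumptions: the hidden states at each position are the same-pinyin candidates for the typed word (emission treated as uniform over that set), and transitions are bigram probabilities from a clean corpus; `candidates_of`, `trans_prob`, and `start_prob` are hypothetical inputs, not the thesis's actual components.

```python
def viterbi(observations, candidates_of, trans_prob, start_prob):
    """Decode the most likely correct word sequence.

    observations   : list of typed words
    candidates_of  : maps a typed word to its same-pinyin candidate set
    trans_prob     : dict {(prev_word, word): P(word | prev_word)}
    start_prob     : dict {word: P(word at sentence start)}
    """
    states = [list(candidates_of(w)) for w in observations]
    # V[i][s] = (best path score ending in state s at position i, backpointer)
    V = [{s: (start_prob.get(s, 1e-9), None) for s in states[0]}]
    for i in range(1, len(states)):
        layer = {}
        for s in states[i]:
            best_prev, best_score = None, 0.0
            for p in states[i - 1]:
                score = V[i - 1][p][0] * trans_prob.get((p, s), 1e-9)
                if score > best_score:
                    best_prev, best_score = p, score
            layer[s] = (best_score, best_prev)
        V.append(layer)
    # Backtrack from the best final state.
    last = max(V[-1], key=lambda s: V[-1][s][0])
    path = [last]
    for i in range(len(V) - 1, 0, -1):
        path.append(V[i][path[-1]][1])
    return list(reversed(path))
```

With candidate sets built from a pinyin-keyed homophone dictionary, the decoder replaces a miskeyed homophone with the candidate that best fits its context, which is the intuition behind pinyin-based correction.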
Keywords/Search Tags:Single-word Merging Algorithm, HMM Model, Word-based Directed Acyclic Graph, Sequential Decoding Model Based on LSTM