
Research On Tibetan Word Spelling Check Technology Based On Bidirectional LSTM

Posted on: 2022-12-12
Degree: Master
Type: Thesis
Country: China
Candidate: M C San
Full Text: PDF
GTID: 2518306752993289
Subject: Automation Technology
Abstract/Summary:
With the rapid development of information technology, the volume of electronic text has grown sharply, and traditional manual proofreading can no longer keep up with demand. Automatic spell checking (hereinafter "spell checking") uses a computer to proofread text in place of a human, which frees users from heavy proofreading work and improves proofreading efficiency. Spell checking is a basic task in natural language processing, with important applications in publishing, corpus construction, search engines, question answering systems, speech recognition, and other areas. In terms of the granularity of the text unit, Tibetan spell checking covers the syllable level, word level, grammar level, and semantic level. As Tibetan natural language processing has developed, syllable-level spell checking has become largely mature, and in recent years scholars have begun to study word-level spell checking. Taking Tibetan word-level spell checking as its research object, this thesis studies the technology from five aspects: corpus construction, analysis of spelling error types, construction of word-level evaluation data, the word-level spell checking method, and system design and implementation.

(1) Corpus construction and word-level error type analysis
Since no open-source corpus for training Tibetan language models is currently available, this thesis expands the word spelling data set built by the Tibetan natural language processing group of Qinghai Normal University into a corpus of 221.6 MB. On this basis, taking the construction rules for words, grammar, and semantics in Tibetan grammar as the theoretical foundation, it systematically summarizes the types of spelling errors found in Tibetan texts, laying a foundation for research on Tibetan spell checking.

(2) Construction of a word-level spelling evaluation set
A spell check evaluation set is a data set used to measure how well a spell checker performs. Such sets can be divided into traditional and standard evaluation sets. Because no standard evaluation set has been established so far, scholars studying word-level spell checking have built traditional evaluation sets from artificially fabricated errors in order to evaluate system performance. Drawing on the construction process of English and Chinese spell check evaluation sets, this thesis collects evaluation data on site, builds a Tibetan word-level spell check evaluation set of 1,403 KB, and analyzes the distribution of error types in it.

(3) Tibetan word spell checking method and system development
Based on an analysis of language models and recurrent neural networks, and combining the advantages of both, this thesis uses a bidirectional LSTM to build TS-BiLSTM, a language model that takes syllables as input. It designs a Tibetan word spell checking algorithm based on this language model and verifies the effectiveness of the model and the algorithm through three groups of experiments. The best model from these experiments is selected, and a visual Tibetan word spell checking system based on TS-BiLSTM is developed.
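
As a rough illustration of the idea in part (3), the sketch below scores each syllable from both its left and its right neighbour and flags syllables that fit neither context. It is not the thesis's model: the LSTM is swapped for smoothed bigram counts so the example stays self-contained, and the class name, padding tokens, smoothing constant, threshold, and toy training phrase are all assumptions made for illustration. The one fact it relies on is that Tibetan syllables are delimited by the tsheg mark (U+0F0B).

```python
from collections import Counter

TSHEG = "\u0f0b"          # Tibetan intersyllabic mark that delimits syllables
BOS, EOS = "<s>", "</s>"  # hypothetical sentence-boundary padding tokens


def split_syllables(text):
    """Split Tibetan text into syllables on the tsheg mark."""
    return [s for s in text.split(TSHEG) if s]


class BidirectionalBigramModel:
    """Count-based stand-in for a bidirectional language model (not an LSTM)."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha          # additive smoothing constant (assumed)
        self.pairs = Counter()      # counts of adjacent (left, right) syllables
        self.as_left = Counter()    # how often a syllable is the left element
        self.as_right = Counter()   # how often a syllable is the right element
        self.vocab = set()

    def train(self, sentences):
        """Count adjacent syllable pairs; call this before score() or check()."""
        for syllables in sentences:
            padded = [BOS] + list(syllables) + [EOS]
            self.vocab.update(padded)
            for left, right in zip(padded, padded[1:]):
                self.pairs[(left, right)] += 1
                self.as_left[left] += 1
                self.as_right[right] += 1

    def score(self, prev, cur, nxt):
        """Smoothed P(cur | prev) * P(cur | nxt): left and right context."""
        smooth = self.alpha * len(self.vocab)
        forward = (self.pairs[(prev, cur)] + self.alpha) / (self.as_left[prev] + smooth)
        backward = (self.pairs[(cur, nxt)] + self.alpha) / (self.as_right[nxt] + smooth)
        return forward * backward

    def check(self, syllables, threshold=0.01):
        """Return indices of syllables whose two-sided score is below threshold."""
        padded = [BOS] + list(syllables) + [EOS]
        return [i - 1 for i in range(1, len(padded) - 1)
                if self.score(padded[i - 1], padded[i], padded[i + 1]) < threshold]


# Toy data: one common greeting repeated; a real corpus would be far larger.
phrase = ["བཀྲ", "ཤིས", "བདེ", "ལེགས"]
model = BidirectionalBigramModel()
model.train([phrase] * 3)

typo = ["བཀྲ", "ཤས", "བདེ", "ལེགས"]  # second syllable has lost its vowel sign
print(model.check(typo))
```

Multiplying the forward and backward probabilities is what keeps the neighbours of the bad syllable from being flagged here: each neighbour still agrees with its context on one side, while the misspelled syllable agrees with neither. A neural model would replace the two count tables with learned left-to-right and right-to-left states, and the threshold would need tuning on an evaluation set.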
Keywords: natural language processing, spell check, error type, evaluation set, Tibetan word-level spell checking, bidirectional LSTM