
Research on Text Classification Based on the Subword-Level Masking Prediction Method of the BERT Model

Posted on: 2021-08-13
Degree: Master
Type: Thesis
Country: China
Candidate: S R Li
Full Text: PDF
GTID: 2518306725952339
Subject: Computer Science and Technology
Abstract/Summary:
With the continuous development of information and network technology, the amount of data on the Internet is growing rapidly. A large share of this data is stored as text, such as instant messages, user comments, and video titles. Such text carries a large amount of information, spreads quickly, and has wide influence, so it has clear research, learning, and application value. Text classification technology makes it easier for people to extract useful information from these texts. Taking web text as its object, this thesis carries out an in-depth study of Chinese text classification.

The study finds that web text usually spans diverse text types, uses novel wording and grammatical structures, and has sparse features. In addition, because of input-method errors and colloquial usage, typo noise easily appears in the text, which lowers text quality and reduces classification accuracy. To make effective use of such web text, this thesis analyzes different models in depth and finds that the BERT model achieves the best results on most natural language processing tasks and is currently among the strongest available models. However, when BERT is applied to Chinese web text classification, it shows certain deficiencies in handling both Chinese and web text. This thesis therefore proposes a text classification method based on subword-level masking prediction, which makes BERT better suited to Chinese web text classification. The main work of this thesis within this method is summarized as follows.

(1) A subword-level masked language model. Because BERT's masked language modeling task can only mask individual Chinese characters rather than complete words, BERT's ability to process Chinese is limited. This thesis therefore proposes a masked language model at the subword level. The model combines three techniques: representing the text at word-level granularity, masking positions at the subword level, and incorporating additional word information. This allows the model to distinguish the first character of a Chinese word from the characters that follow it, to generate subword-style token sequences from segmented words, and to mask whole Chinese words. Whole-word masking strengthens BERT's accuracy in predicting masked Chinese tokens and improves the expressiveness of its Chinese word vectors, thereby improving text classification accuracy (an illustrative sketch of this masking step is given below).

(2) A text error correction method based on subword-level masking prediction. Faced with typo noise in Chinese text, a model usually relies on text error correction to reduce the influence of the noise. However, text error correction methods are typically implemented with an encoder-decoder model: to correct the text to be classified, two models must be trained so that the text is first corrected and then classified, which consumes too many computing resources and is inefficient. This thesis therefore proposes a text error correction method based on masking prediction. The masked language model is added during BERT's fine-tuning stage, and the text first goes through a mask-and-predict step; if a predicted token disagrees with the original text, the original token is replaced by the prediction according to certain rules. This strengthens BERT's ability to correct and denoise text, improves text quality, and thereby improves classification accuracy (a sketch of this correction step also follows below).
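The following is a minimal sketch of the whole-word (subword-level) masking idea described in (1), assuming the jieba segmenter as a stand-in for the thesis's word segmentation step; the function name whole_word_mask, the 15% masking rate, and the "##" continuation tag are illustrative assumptions rather than the thesis's exact implementation.

    import random
    import jieba

    MASK = "[MASK]"

    def whole_word_mask(text, mask_prob=0.15, seed=0):
        """Segment the text into words, then mask whole words instead of
        single characters, mirroring the subword-level masking idea."""
        random.seed(seed)
        tokens = []      # character-level tokens fed to the model
        word_spans = []  # (start, end) index range of each segmented word
        for word in jieba.cut(text):
            start = len(tokens)
            for i, ch in enumerate(word):
                # Continuation characters get a "##" tag, like BERT subwords,
                # so word-initial characters can be told apart from the rest.
                tokens.append(ch if i == 0 else "##" + ch)
            word_spans.append((start, len(tokens)))
        labels = [None] * len(tokens)
        masked = list(tokens)
        for start, end in word_spans:
            if random.random() < mask_prob:
                for i in range(start, end):
                    labels[i] = tokens[i]  # prediction targets for the MLM loss
                    masked[i] = MASK       # mask every character of the word
        return masked, labels

    masked, labels = whole_word_mask("网络文本分类是自然语言处理的重要任务")
    print(masked)
    print(labels)

The resulting (masked, labels) pair would then drive the masked language modeling loss during pre-training or fine-tuning, with every character of a masked word predicted together.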
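The following is a minimal sketch of masking-prediction error correction as described in (2), assuming the Hugging Face transformers library with the public bert-base-chinese checkpoint rather than the thesis's fine-tuned model; the confidence threshold and the single-character replacement rule stand in for the "certain rules" mentioned in the abstract.

    import torch
    from transformers import BertForMaskedLM, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertForMaskedLM.from_pretrained("bert-base-chinese")
    model.eval()

    def correct(text, threshold=0.9):
        """Mask each character in turn; if the model confidently predicts a
        different character at that position, substitute the prediction."""
        chars = list(text)
        for i in range(len(chars)):
            masked = chars.copy()
            masked[i] = tokenizer.mask_token
            inputs = tokenizer("".join(masked), return_tensors="pt")
            mask_hits = (inputs.input_ids[0] == tokenizer.mask_token_id)
            mask_pos = mask_hits.nonzero(as_tuple=True)[0][0].item()
            with torch.no_grad():
                logits = model(**inputs).logits[0, mask_pos]
            probs = torch.softmax(logits, dim=-1)
            conf, pred_id = probs.max(dim=-1)
            pred = tokenizer.convert_ids_to_tokens(pred_id.item())
            # Assumed rule: replace only single-character predictions that
            # differ from the original and exceed the confidence threshold.
            if pred != chars[i] and len(pred) == 1 and conf.item() > threshold:
                chars[i] = pred
        return "".join(chars)

    print(correct("我今天很高心"))  # "高心" is a common typo for "高兴"

Masking one position per forward pass costs one pass per character; the thesis instead folds the masked-prediction step into BERT fine-tuning, so this loop is only meant to make the replacement rule concrete.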
Keywords/Search Tags: BERT model, Text classification, Subword level, Text error correction