A Study On The Identification Of Confusing Short Texts

Posted on: 2021-06-14
Degree: Master
Type: Thesis
Country: China
Candidate: K L M Y L H M Yi
Full Text: PDF
GTID: 2518306128479154
Subject: Computer application technology
Abstract/Summary:
At present, the online texts encountered in network public opinion analysis are often written in easily confused languages, use a variety of encoding forms, and are short in length. In this paper, Chinese, Japanese, Uyghur, Kazakh and Uzbek are selected as representative research objects for language identification, and two encoding forms are considered for each of Uyghur, Kazakh and Uzbek. The five languages are divided into four categories according to character encoding range: the Chinese and Japanese character encoding range, the current character encoding range, the Latin character encoding range and the Cyrillic character encoding range. Because a single character encoding range can contain two languages that are very similar in written form, the language of an online text is easily misidentified; this paper therefore carries out in-depth research on language identification for these five languages. The main research content is divided into the following aspects:

(1) First, the text data collected from the network for the above languages is analyzed and preprocessed. Through manual annotation, the language of each text is confirmed and the original corpus is constructed. For each encoding form of each language, the average text length in characters, the average number of words, the average word length in characters and other statistics are computed to support analysis and decision-making. According to the data distribution, a language encoding conversion method is used to expand the data for texts that share a language but differ in encoding. The text is then segmented at different granularities according to its characteristics.

(2) This paper proposes an effective identification method based on a combination of multiple strategies, and designs a multi-strategy language identification system (MSLIS). The core principle of the method is accurate language identification at three levels: encoding interval, character, and naive Bayes classifier. The experimental results show that the multi-strategy statistical learning method can effectively distinguish Chinese from Japanese, and can distinguish among Uyghur, Kazakh and Uzbek in their different encodings.

(3) Because the multi-strategy statistical learning method performs only moderately on texts with short character length, this paper further considers combining word-level and character-level features, and uses vectors of different combination granularity as the input of neural network models to verify their identification performance. First, the text is segmented with an n-gram model (1 ≤ n ≤ 5) at the word level and the character level. Then the combined n-gram segmentation vectors are used as the input of the neural network. The feature extraction ability and classification ability of a convolutional neural network and a bidirectional long short-term memory network are compared, and the best neural network model is selected.
Keywords/Search Tags:Language Identification, Multiple Strategies, Naive Bayes, N-gram model, Network
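
The first levels of the approach described in the abstract (bucketing a text by character encoding range, applying a character-level check, and producing n-gram features with 1 ≤ n ≤ 5) can be illustrated with a minimal Python sketch. This is not the MSLIS implementation: the Unicode ranges, the bucket names, the kana-based rule for Japanese, and the function names `script_of`, `dominant_script`, `has_kana` and `char_ngrams` are all assumptions made for illustration.

```python
from collections import Counter

# Assumed coarse buckets keyed by Unicode code-point ranges (illustrative only;
# the thesis does not list the exact ranges it uses).
SCRIPT_RANGES = {
    "cjk": [(0x3040, 0x30FF), (0x4E00, 0x9FFF)],   # kana, CJK Unified Ideographs
    "arabic": [(0x0600, 0x06FF)],                  # Arabic block
    "cyrillic": [(0x0400, 0x04FF)],                # Cyrillic block
    "latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
}


def script_of(ch):
    """Return the bucket name for one character, or None if it is in no bucket."""
    cp = ord(ch)
    for name, ranges in SCRIPT_RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return name
    return None


def dominant_script(text):
    """Encoding-interval level: pick the bucket that covers the most characters."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    return counts.most_common(1)[0][0] if counts else None


def has_kana(text):
    """Character level (assumed rule): any hiragana or katakana suggests Japanese."""
    return any(0x3040 <= ord(ch) <= 0x30FF for ch in text)


def char_ngrams(text, n_min=1, n_max=5):
    """Character n-grams for n_min <= n <= n_max, shortest n first."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]
```

In a pipeline of this shape, `dominant_script` would narrow the candidate languages, a rule such as `has_kana` would settle the cases that single characters can decide, and the output of `char_ngrams` would serve as features for a statistical classifier such as naive Bayes or as input to a neural model.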