
Research And Implementation Of Tibetan Text Classification Based On MLP And SepCNN Models

Posted on: 2022-10-12
Degree: Master
Type: Thesis
Country: China
Candidate: H J Su
Full Text: PDF
GTID: 2505306509997739
Subject: Computer application technology
Abstract/Summary:
In the information age, the types and quantities of text resources grow daily, but these resources are mostly isolated, scattered, complex, and diverse, so their effective utilization is extremely low. Tibetan digital information resources are now abundant, and research on Tibetan information processing technology has matured. Text classification is a classic problem in natural language processing, data mining, and information retrieval. Research has traditionally focused on resource-rich languages such as Chinese and English, and text classification methods for low-resource languages such as Tibetan are lacking. To address this problem, the thesis conducts research from the perspectives of data collection, feature construction, and neural network model construction. The main research contents and results are as follows:

1. Constructing a Tibetan data set. Using web crawler technology, 140,000 Tibetan texts were collected from Tibetan news websites and Tibetan e-book websites to build an initial data set. To address the sparse features of the initial samples, short texts of the same class are merged without repetition, the new texts are limited to a more balanced size range, and a data set usable for training is obtained after processing. With this data enhancement method, an MLP model was trained, validated, and tested on data of different gradients. Classification accuracy on the training set increased by 7.8%, and classification accuracy improved by 6.5%-12.8%, which raises the utilization rate of Tibetan information resources.

2. To address the problem of text offset in the classification task, a log function is introduced with reference to the total number of samples in the data set and the storage space of a single sample. A threshold is set, and the logarithm of the number of texts is taken. On this basis, a more expressive set of features is extracted with the TF-IDF method, and the feature dimension is reduced. In experimental comparison, dimensionality reduction with the log:TF-IDF method increases model classification accuracy by 1%-10%.

3. To address the weak expressiveness of a single feature in the text, 1-gram, 2-gram, 3-gram, and 1+2+3-gram features are selected on the basis of the n-gram language model, using both Tibetan word features and Tibetan syllable features, for comparative experiments. The experiments verify that the 1+2+3-gram hybrid feature method obtains a better semantic representation of Tibetan text. Compared with a single feature, it improves classification accuracy by 1%-7%, which enhances the classification performance of the model.

4. Four commonly used classification models (the KNN and Gaussian NB models and the MLP and SepCNN neural network models) are studied and analyzed, then trained and tuned on different Tibetan text classification data sets, and a stable classifier is finally built. Experiments show that the multi-layer perceptron (MLP) neural network model has a simple structure and is easy to interpret. Combined with simple feature selection and extraction algorithms, it can not only train a better classification model but also meet the classification needs of large-scale corpora in the era of big data. In the experiments, the overall classification accuracy can reach 95%.

5. Based on the above research, the thesis designs a Tibetan text classification system assistant that visually displays the effect of each stage of the text classification process. The misclassified texts obtained from the final evaluation are processed and then passed to a shallow machine learning classifier for secondary classification, which saves training time for the classification model and improves the overall classification accuracy of the system. The system is highly portable and is easy and quick for users to operate.
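
As an illustration of the 1+2+3-gram hybrid features and TF-IDF weighting mentioned in the abstract, the following is a minimal Python sketch, not the thesis's implementation. It assumes syllables can be obtained by splitting on the Tibetan tsheg mark (U+0F0B), and it uses the textbook TF-IDF form (term frequency times log(N/df)) rather than the thesis's log:TF-IDF variant, whose exact definition the abstract does not give. The function names `syllables`, `hybrid_ngrams`, and `tfidf` are hypothetical.

```python
import math
from collections import Counter

TSHEG = "\u0f0b"  # Tibetan syllable delimiter (assumed tokenization rule)


def syllables(text):
    """Split text into syllables on the tsheg mark, dropping empty pieces."""
    return [s for s in text.split(TSHEG) if s.strip()]


def hybrid_ngrams(tokens, max_n=3):
    """Return every n-gram for n = 1..max_n as a tuple (the '1+2+3-gram' set)."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[i:i + n]))
    return grams


def tfidf(docs):
    """Textbook TF-IDF over documents given as lists of n-grams.

    tf = count / document length, idf = log(N / document frequency).
    This is an assumption, not the thesis's log:TF-IDF method.
    """
    n_docs = len(docs)
    df = Counter()
    for grams in docs:
        df.update(set(grams))  # count each n-gram once per document
    vectors = []
    for grams in docs:
        counts = Counter(grams)
        total = len(grams)
        vectors.append(
            {g: (c / total) * math.log(n_docs / df[g]) for g, c in counts.items()}
        )
    return vectors
```

A document would be turned into features with `hybrid_ngrams(syllables(text))`, and the list of such feature lists passed to `tfidf`. An n-gram that occurs in every document gets weight 0 under this form, which is one simple way low-information features drop out before classification.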
Keywords/Search Tags: data enhancement, log:TF-IDF, MLP&SepCNN, Tibetan text classification, classification system