Research on Tibetan Text Classification Technology Based on TWC-CNN

Posted on: 2022-12-29
Degree: Master
Type: Thesis
Country: China
Candidate: J Z X Dao
Full Text: PDF
GTID: 2518306752493274
Subject: Automation Technology
Abstract/Summary:
Text classification automatically assigns documents of unknown category in a collection to pre-defined subject categories, based on their content and according to certain rules. It is one of the basic and important research topics in natural language processing, and is widely applied in information retrieval, intelligent recommendation, public opinion analysis, news classification, and other fields. With the rapid development of information technology and the spread of the Internet, the requirements placed on text classification keep rising, and more and more electronic documents are processed and managed by automatic text classification.

Owing to scarce data resources and immature techniques, research on Tibetan text classification has not yet made a major breakthrough. At present, Tibetan texts are classified mainly with traditional word-based machine learning methods. These methods are restricted by Tibetan word segmentation technology and also require complicated manual feature engineering. To address these shortcomings, this paper studies Tibetan text classification from three aspects: dataset construction, feature primitive selection, and classification methods.

(1) Construction of a Tibetan text classification dataset

To address the scarcity of Tibetan classified text datasets, this paper proposes a preprocessing method based on the characteristics of Tibetan texts and the basic requirements of classified text datasets. The method includes a syllable-level preprocessing model for Tibetan classified text datasets, a syllable correction algorithm, and the text normalization algorithm TC-CTCN. The experimental data show that the algorithm achieves the expected effect. A Tibetan text classification dataset with a scale of 104.8M is constructed, which lays the foundation for technical research on Tibetan text classification.

(2) Selection of feature primitives for Tibetan text classification

Because of the limits of Tibetan word segmentation technology, using words as the feature primitives has a great influence on classification performance. Based on an analysis of the text classification process and of Tibetan text structure, this paper proposes a feature primitive selection method that integrates words and syllables. The experimental data show that this method gives the best text classification performance under current technical conditions.

(3) Classification of Tibetan texts

Based on an analysis of Tibetan natural language processing technology, deep learning methods for Tibetan text classification are studied, and a Tibetan text classification method based on TWC-CNN is proposed. TWC-CNN uses double primitives, which fuse words and syllables, as feature primitives, and uses a CNN to build the classifier. Experiments verify that it performs better than the baseline models, and three conclusions are drawn:

(1) For Tibetan text classification, double-primitive classification performs better than classification with a single primitive, whether word or syllable.
(2) Among deep learning text classification methods, the classifier built with the CNN model is better than classifiers built with other models.
(3) With TWC-CNN, the accuracy, recall, and F1 value of Tibetan text classification are greatly improved, and the classification performance is better than that of the other baseline models.
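As a rough illustration of what "double primitives" could look like in practice, the sketch below splits Tibetan text into syllables on the tsheg (U+0F0B) and shad (U+0F0D) marks and interleaves word tokens with their syllables. The fusion scheme, the `W:`/`S:` prefixes, and the function names are assumptions made for this example; the abstract does not specify how TWC-CNN actually fuses the two primitives, and the word segmentation is assumed to come from an external tool.

```python
from __future__ import annotations

import re

TSHEG = "\u0f0b"  # intersyllabic mark that delimits Tibetan syllables
SHAD = "\u0f0d"   # clause/sentence-final mark

# Syllable boundaries: tsheg, shad, or any whitespace.
_DELIMS = re.compile(f"[{TSHEG}{SHAD}\\s]+")


def split_syllables(text: str) -> list[str]:
    """Split Tibetan text into syllables, dropping empty pieces."""
    return [s for s in _DELIMS.split(text) if s]


def fuse_primitives(words: list[str]) -> list[str]:
    """Hypothetical fusion: emit each word, then its syllables.

    The prefixes keep the word and syllable vocabularies apart so a
    downstream embedding layer can treat them as distinct tokens.
    """
    fused: list[str] = []
    for word in words:
        fused.append("W:" + word)
        fused.extend("S:" + s for s in split_syllables(word))
    return fused


if __name__ == "__main__":
    # Words are assumed to be pre-segmented by an external segmenter.
    words = ["བོད་ཡིག", "ཡིག་ཆ"]
    print(fuse_primitives(words))
```

The fused token sequence could then be mapped to integer IDs and fed to a convolutional classifier; that step is omitted here because the abstract gives no architectural details.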
Keywords/Search Tags: Natural language processing, Text classification, Deep learning, Tibetan, Words, Syllables