
Research On Text Macro Feature Extraction And Centroid-based Automatic Classification Methods

Posted on: 2015-07-31
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D D Wang
GTID: 1108330479978705
Subject: Computer application technology
Abstract/Summary:
Automatic text classification (hereafter, text classification) is a hot topic in Web information processing. As class hierarchies become more flexible and the number of sub-categories grows rapidly in practice, constructing a standard corpus has become a key issue in obtaining a good classifier for real applications. In addition, term-level features ignore the relationships between words, even though a document is an organic whole. Moreover, centroid-based classification methods are hard to compare because their applicable scenarios are unclear, and they usually perform poorly. To address these problems, this thesis studies three key issues in text classification: corpus construction, feature extraction, and classification algorithms. The main contributions are as follows.

Automatic text corpus construction based on webpage structure is proposed. A fixed class hierarchy cannot meet practical needs, while the large number of websites on the Web contains rich text classification knowledge. This thesis therefore proposes a novel method that makes full use of Internet resources, namely webpage structure, content, and link relationships, to construct a corpus automatically based on unsupervised clustering. Experiments show an overall accuracy of 73.73%, which demonstrates the effectiveness of the method.

Supervised macro feature selection for text classification is proposed. Traditional feature extraction methods often ignore the relationships between texts. This thesis therefore extracts features based on the relationships between texts and calls them macro features. Taking into account how the proportion of labeled data affects classification performance, we design and implement three macro feature extraction methods based on clustering and centroids. Experiments show that classification performance improves effectively when micro features are combined with each kind of macro feature.

Macro feature fusion for text classification is proposed. Because the amount of labeled data available differs across applications, this thesis proposes a novel method for fusing supervised and unsupervised macro features. From the perspectives of model fusion and feature fusion, we implement macro feature fusion methods based on bagging and feature augmentation. The unsupervised macro feature extraction methods are based on K-means clustering, Latent Dirichlet Allocation, and Deep Belief Networks. The fused macro features are combined with micro features for text classification. Experiments show that the fused features are more effective than supervised or unsupervised macro features applied separately.

Ranking-based centroid text classification is proposed. Centroid-based classification is popular because of its simple model and short training time, and several centroid-based methods exist. However, these methods rest on different principles, which makes it difficult to compare them and to improve their performance, and their overall performance is not high. This thesis therefore proposes a unified framework for centroid-based classification built on machine-learned ranking. Under this framework, text classification is treated as a ranking problem in which the prototype vectors are optimized with information retrieval techniques, and three commonly used centroid-based methods can be expressed within it. On this basis, we propose new ranking-based centroid classification methods. Experiments show that the new methods outperform the three commonly used centroid-based methods.
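
To make the terms "centroid-based classification" and "feature augmentation" concrete, the following is a minimal sketch and not the thesis's own algorithm. It assumes the textbook form of the technique: each class prototype is the unit-length mean of that class's term-weight vectors, a document is assigned to the class whose prototype has the highest cosine similarity, and those similarities are appended to the term weights as illustrative macro features. All function names, class labels, and toy values are hypothetical.

```python
import math

def normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)

def class_centroids(docs, labels):
    """Map each class label to the unit-length mean of its document vectors."""
    grouped = {}
    for vec, label in zip(docs, labels):
        grouped.setdefault(label, []).append(vec)
    return {
        label: normalize([sum(col) / len(vecs) for col in zip(*vecs)])
        for label, vecs in grouped.items()
    }

def similarities(vec, centroids):
    """Cosine similarity between one document and every class centroid."""
    unit = normalize(vec)
    return {
        label: sum(a * b for a, b in zip(unit, centroid))
        for label, centroid in centroids.items()
    }

def predict(vec, centroids):
    """Assign the document to the class with the most similar centroid."""
    sims = similarities(vec, centroids)
    return max(sims, key=sims.get)

def augment(vec, centroids):
    """Append centroid similarities (macro) to the term weights (micro)."""
    sims = similarities(vec, centroids)
    return list(vec) + [sims[label] for label in sorted(sims)]

# Toy term-weight vectors over three terms (made-up values).
train_docs = [
    [2.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 3.0],
    [0.0, 1.0, 2.0],
]
train_labels = ["sports", "sports", "finance", "finance"]

centroids = class_centroids(train_docs, train_labels)
test_doc = [1.0, 0.0, 0.0]
predicted = predict(test_doc, centroids)
augmented = augment(test_doc, centroids)
```

In this sketch the augmented vector has one extra dimension per class, so a downstream classifier sees both the individual term weights and the document's relation to each class as a whole. The ranking-based optimization of the prototype vectors described in the abstract is not reproduced here.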
Keywords/Search Tags: Text classification, Automatic corpus construction, Text feature extraction, Macro feature extraction, Centroid-based classification