Research And Application Of Ensemble Learning Based On Semantic Vector

Posted on: 2020-03-26
Degree: Master
Type: Thesis
Country: China
Candidate: M R Jin
Full Text: PDF
GTID: 2428330596476042
Subject: Communication and Information System
Abstract/Summary:
With the development of the information society, more and more text comes from the Internet. The volume of text is now huge and it is updated very quickly, while manual text categorization costs a great deal of time and labor. How to categorize text automatically has therefore become a hot topic. The first step of text categorization is to convert unstructured text into structured semantic vectors. Semantic vectors can reduce the dimensionality of the text, retain the important information, and improve the performance and efficiency of subsequent categorization. However, the semantic vectors constructed by current methods still leave room for improvement in classification performance. The second step is to classify the text vectors with a classifier and output the results. The accuracy and efficiency of commonly used classifiers can also be improved. In recent years ensemble learning has developed rapidly, and it can also serve as the classifier for text categorization. In this thesis, a new semantic vector, tdCHI, is constructed, and the resulting vectors are fed into an improved XGBoost ensemble learning algorithm for text classification. The main innovations of this thesis are as follows:

(1) An improved semantic vector for text categorization, tdCHI, is constructed. The tdCHI semantic vector combines word2vec with an improved chi-square test, so it carries rich semantic information at a low dimension. To address the low-frequency-word defect of the chi-square test, the thesis introduces a t-test to increase the weight given to word frequency. To address the fact that the chi-square value reflects only the strength of correlation, so that a large value may correspond to a negative correlation, a filter function is added to filter the results. To address the fact that the chi-square test considers only how well a feature item distinguishes categories and ignores its importance within a specific article, the TF-IDF algorithm is introduced. Experiments show that the tdCHI semantic vector improves the performance of text categorization.

(2) For the text categorization setting, the thesis improves the XGBoost ensemble learning algorithm. XGBoost's default handling of missing values does not take the characteristics of text categorization into account. Combined with the tdCHI semantic vector, the handling is improved by cyclically selecting the top-ranked feature items to fill in missing values. Experiments show that the improved method yields higher text categorization accuracy than both the default method and other common missing-value processing methods. Then, for the case of imbalanced samples, the thesis analyzes the shortcomings of XGBoost's default score function and introduces the macro-average to design a new score function. Experiments show that the new score function performs well in text categorization with both balanced and imbalanced samples.

Finally, the overall experiment shows that the XGBoost ensemble learning classifier based on the tdCHI semantic vector improves the performance of text categorization and has better operating efficiency.
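Two ideas from the abstract can be illustrated with a small sketch: filtering chi-square scores by the sign of the correlation, and scoring with a macro-average under imbalanced samples. This is not the thesis's implementation. The abstract does not give the form of its filter function or of its score function, so the function names `chi_square_positive` and `macro_f1`, the 2x2 contingency-table layout, the rule of setting negatively correlated terms to zero, and the use of F1 as the per-class metric are all assumptions made for illustration.

```python
def chi_square_positive(a, b, c, d):
    """Chi-square score of a term for a category, kept only if the
    correlation is positive (an assumed form of the filter).

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: the statistic is undefined
    diff = a * d - b * c
    if diff <= 0:
        return 0.0  # negative (or no) correlation: filtered out
    return n * diff * diff / denom


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels that occur."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# A term that appears only inside the category scores high ...
print(chi_square_positive(10, 0, 0, 10))
# ... while the mirror-image term (only outside the category) has the
# same raw chi-square value but is filtered to zero.
print(chi_square_positive(0, 10, 10, 0))

# Imbalanced case: always predicting the majority class gives 90%
# accuracy but a much lower macro-averaged F1.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
print(macro_f1(y_true, y_pred))
```

Because the macro-average weights every class equally, a classifier that ignores the minority class is penalized even when its overall accuracy looks high, which is the motivation the abstract gives for using it with imbalanced samples.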
Keywords/Search Tags: semantic vector, tdCHI, ensemble learning, XGBoost