Research And Application Of Ensemble Learning Based On Semantic Vector

Posted on:2020-03-26

Degree:Master

Type:Thesis

Country:China

Candidate:M R Jin

Full Text:PDF

GTID:2428330596476042

Subject:Communication and Information System

Abstract/Summary:

With the development of the information society, more and more text originates from the Internet. The volume of text is now enormous and it is updated very quickly, yet manual text categorization is costly in both time and labor. How to categorize text automatically has therefore become a hot topic. The first step of text categorization is to convert unstructured text into structured semantic vectors. Semantic vectors reduce the dimensionality of the text, retain the important information, and improve the performance and efficiency of subsequent text categorization. However, the semantic vectors constructed by current methods still leave room for improvement in classification performance. The second step of text categorization is to classify the text vectors with a classifier and output the results. The accuracy and efficiency of commonly used classifiers can also be improved. In recent years, ensemble learning has developed rapidly, and it can also serve as the classifier in text classification. In this thesis, a new semantic vector, tdCHI, is constructed, and the constructed semantic vectors are fed into an improved XGBoost ensemble learning algorithm for text classification. The main innovations of this thesis are as follows:

(1) An improved semantic vector, tdCHI, is constructed for text categorization. The tdCHI semantic vector combines word2vec with an improved chi-square test, giving it rich semantic information and low dimensionality. To address the low-word-frequency defect of the chi-square test, a t-test is introduced to increase the weight of word frequency. Because the chi-square value only indicates the strength of a correlation, and a large value may reflect a negative correlation, a filter function is added to filter the results. Because the chi-square test considers only how well a feature item discriminates between categories and ignores the importance of the feature item within a specific article, the TF-IDF algorithm is introduced. Experiments show that the tdCHI semantic vector improves the performance of text categorization.

(2) The XGBoost ensemble learning algorithm is improved for the text categorization setting. XGBoost's default method for handling missing values does not take the characteristics of text categorization into account. Combined with the tdCHI semantic vector, the method is improved by cyclically selecting the top-ranking feature items to fill missing values. Experiments show that the improved method raises the accuracy of text categorization more than the default method and other common missing-value handling methods. Then, for the case of imbalanced samples, the thesis analyzes the shortcomings of XGBoost's default score function and introduces the macro-average to design a new score function. Experiments show that the new score function performs well in text categorization with both balanced and imbalanced samples.

Finally, the overall experiments show that the XGBoost ensemble learning classifier based on the tdCHI semantic vector improves the performance of text categorization and offers better operational efficiency.
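
The chi-square step in innovation (1) can be illustrated with a minimal Python sketch. It computes the standard chi-square statistic for a 2x2 term/category contingency table and uses the sign of a*d - b*c to tell positive from negative association. The abstract does not specify the thesis's filter function, so `filtered_chi_square` below is a hypothetical stand-in that simply zeroes the score of negatively associated terms. The function names and the example counts are likewise assumptions, and the t-test and TF-IDF weighting are not shown.

```python
def signed_chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/category table, plus the sign of the association.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no information.
        return 0.0, 0
    diff = a * d - b * c
    chi2 = n * diff * diff / denom
    # +1: term is over-represented in the category; -1: under-represented.
    sign = (diff > 0) - (diff < 0)
    return chi2, sign


def filtered_chi_square(a, b, c, d):
    """Hypothetical filter (an assumption, not the thesis's function):
    keep the score only for positively associated terms."""
    chi2, sign = signed_chi_square(a, b, c, d)
    return chi2 if sign > 0 else 0.0


if __name__ == "__main__":
    # Made-up counts: the two tables mirror each other, so chi-square is identical.
    print(signed_chi_square(40, 10, 10, 40))    # positively associated term
    print(signed_chi_square(10, 40, 40, 10))    # negatively associated term
    print(filtered_chi_square(10, 40, 40, 10))  # the filter discards it
```

Because the statistic squares a*d - b*c, both example tables receive the same chi-square value even though the term signals the category in one and its absence in the other, which is the ambiguity a filter of this kind is meant to remove.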

Keywords/Search Tags:

semantic vector, tdCHI, ensemble learning, XGBoost

Related items

1	An XGBoost-Based Ensemble Learning Approach To Personal Credit Risk Assessment
2	Semantic Reasoning Based On Deep Learning
3	The Study And Application Of Ensemble Of Trees Based On Boosting
4	The Application Of Ensemble Learning In The Early Warning Model Of Operator User Churn
5	Research On Credit Card Fraud Detection Based On Ensemble Learning
6	The Research On Support Vector Machine Ensemble Learning Approach
7	Prediction And Research Of Market Index Based On Machine Learning
8	Ensemble Learning Based On Support Vector Machines
9	Research On Support Vector Machine Ensemble Learning Algorithm
10	Analysis Of Abnormal Behavior Of Web Users Based On MRMR-XGBoost Two-Layer Model