
Research On Chinese Text Classification Based On Improved FastText

Posted on: 2022-05-15  Degree: Master  Type: Thesis
Country: China  Candidate: Z Y Liang  Full Text: PDF
GTID: 2518306509489004  Subject: Applied Statistics
Abstract/Summary:
In the current wave of the mobile Internet, the volume of text data is growing explosively, marking the arrival of an era of information overload. Faced with this vast amount of text, users cannot find the information they need quickly by reading through it alone. To help users locate valuable information quickly, text mining techniques are needed to process the text and assign category labels that make it easier to read and understand. Text classification, a core task in natural language processing, addresses this problem directly.

With the continued development of computer technology and steady improvements in computing performance, text classification has gone through a long process of iteration. In its early days, research focused on rule-based manual methods: text data in each domain relied on expert systems to build feature engineering, which was labor-intensive and yielded low accuracy. Before 2010, shallow learning algorithms based on statistical models dominated; compared with rule-based methods, they were more accurate and more stable. After 2010, falling costs of computing resources and the spread of GPUs fueled a surge of research in deep learning. Neural network models dispense with cumbersome feature engineering and automatically learn rich semantic representations of features, and deep learning methods became the mainstream of research. In 2017, Facebook proposed FastText, a supervised learning model for word vector representation and text classification. However, the FastText algorithm, which performs well on English text, is not well suited to Chinese short text classification.

The main work of this thesis is to build an improved FastText model for Chinese short text classification and prediction. Strong feature words in documents are identified using a newly defined measure of category purity, and a norm is introduced to quantify the dimensional features of word vectors. The analysis concludes that the norm of a word vector is positively correlated with the word's category purity. FastText takes the mean of the word vectors in a document as the feature representation of that document, so strong feature words dominate the document vector and other effective keywords are ignored; this is the key factor behind the model's poor generalization ability. To address this problem, and taking into account the distribution of words across categories, this thesis designs a dropout method with non-random discard probabilities, which helps the model find effective keywords other than the strong feature words. The improved FastText model is constructed by integrating this non-random probability dropout method into the original model and is evaluated in numerical experiments. The experimental results show that, compared with the original FastText algorithm and other baseline models, the improved FastText model achieves significantly higher classification scores on the data set, which verifies the effectiveness of the non-random probability dropout method in alleviating overfitting.
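The averaging step and the idea of purity-dependent dropout can be illustrated with a minimal Python sketch. The abstract does not give the thesis's definition of category purity or its drop rule, so everything specific here is an assumption made for illustration: purity is taken as the share of a word's occurrences that fall in its most frequent class, the drop probability is taken to grow linearly with purity up to a cap `max_drop`, and the function names are hypothetical.

```python
import numpy as np


def category_purity(word_class_counts):
    """Assumed purity: fraction of a word's occurrences in its most frequent class.

    word_class_counts has shape (vocab_size, num_classes).
    """
    counts = np.asarray(word_class_counts, dtype=float)
    totals = counts.sum(axis=1)
    # Words that never occur get purity 0 instead of a division by zero.
    return np.divide(counts.max(axis=1), totals,
                     out=np.zeros_like(totals), where=totals > 0)


def drop_probabilities(purity, max_drop=0.5):
    """Assumed rule: drop probability rises linearly with purity, capped at max_drop."""
    return max_drop * np.asarray(purity, dtype=float)


def document_vector(word_ids, embeddings, drop_prob=None, rng=None):
    """Mean of word vectors, optionally after per-word (non-uniform) dropout.

    With drop_prob=None this is the plain FastText-style average.
    """
    word_ids = np.asarray(word_ids)
    if drop_prob is not None:
        if rng is None:
            rng = np.random.default_rng()
        # Each token is kept with probability 1 - drop_prob[word].
        keep = rng.random(word_ids.shape[0]) >= drop_prob[word_ids]
        # If every token was dropped, fall back to the full document.
        if keep.any():
            word_ids = word_ids[keep]
    return embeddings[word_ids].mean(axis=0)
```

In this sketch, high-purity words are removed from the average more often during training, so the classifier has to rely on the remaining keywords as well; passing `drop_prob=None` gives the unmodified average, which is how one would typically score documents at prediction time.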
Keywords/Search Tags: Text classification, FastText, Word vector, Overfitting, Dropout