
Research On FastText Text Classification Algorithm Based On TF-IDF

Posted on: 2020-03-16
Degree: Master
Type: Thesis
Country: China
Candidate: M M Sun
GTID: 2428330575493566
Subject: Software engineering
Abstract/Summary:
With the rapid development of mobile intelligent terminals, China has entered the era of universal Internet access. The number of Internet users grows daily, and Chinese text data such as news and e-books is growing on a large scale. How to classify text data automatically and accurately has become a hot issue in natural language processing, and automatic classification of Chinese text is of great significance for research in information management and text mining.

Traditional text classification algorithms based on machine learning generally use the TF-IDF algorithm to extract features from text. This ignores other features of the words and the relationships between them, so the extracted features are imprecise and classification suffers. Nowadays deep learning is widely used for text classification in natural language processing. Deep learning models have an advantage in classification quality, but as the number of hidden layers increases, so does the computational cost, which consumes a great deal of computing resources and time. The FastText text classification algorithm addresses these problems: compared with other classification algorithms, it reduces time overhead while maintaining classification accuracy. Its one drawback is that it performs no feature extraction on the input data at the input layer, which has some impact on its classification quality. This thesis therefore studies and improves the TF-IDF feature extraction algorithm and the FastText text classification algorithm. The main research contents are as follows:

(1) The TF-IDF text feature extraction algorithm is studied and improved. The traditional TF-IDF algorithm ignores all features other than word frequency, and it does not consider how feature words are distributed within and between text categories. To address this deficiency, this thesis proposes the GF-IDF-IE algorithm, which is based on TF-IDF. First, the word frequency TF is improved with a group feature factor, which comprises a part-of-speech feature factor, a word length feature factor, a word position feature factor and a word frequency feature factor. Then the inverse document frequency IDF is improved with an information entropy factor: an in-class information entropy factor and an inter-class information entropy factor are added to account for the distribution of feature words within and between text categories. The experimental results show that the improved algorithm is better suited to text feature extraction.

(2) The FastText text classification algorithm is studied and improved. Because FastText performs no feature extraction on the input data at the input layer, this thesis first uses the traditional TF-IDF algorithm to extract features at the FastText input layer. Because FastText also adds n-gram features, the input-layer data after feature extraction yields a large number of meaningless words under n-gram processing, and these need to be filtered out. After feature extraction and filtering of the n-gram results, the remaining input-layer data consists of terms that are important in the text. This improvement reduces the amount of noise in the input and improves the classification quality of FastText to a certain extent.

(3) The FastText text classification algorithm is further improved using the GF-IDF-IE algorithm proposed in (1). The feature extraction at the input layer and the filtering of n-gram results in (2) may leave the input data too short (generally fewer than 160 characters), and input that is too short can itself degrade FastText's classification quality. The input data from (2) therefore needs to be supplemented. First, the keywords of each category in the training text are extracted with the GF-IDF-IE algorithm as supplementary data. Then the length of the input data is checked; if it is too short, the supplementary data for the corresponding category is added to the input data to complete it. The final data is then passed to the hidden layer of the FastText algorithm for classification.

(4) The current mainstream text classification algorithms are implemented experimentally, including machine learning based algorithms, deep learning based algorithms and the classic FastText algorithm. The improved FastText algorithm is compared with these algorithms in terms of precision, recall, F1 value and execution time. The results show that the improved FastText text classification algorithm achieves better classification performance.
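
The input-side processing described in (2) and (3) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. Plain TF-IDF stands in for GF-IDF-IE, whose formulas are not given here. The bigram filter (keep a bigram only if both of its tokens are keywords), the function names and the `min_chars` parameter are assumptions made for illustration. Only the 160-character threshold comes from the abstract.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {token: tf-idf} dict per tokenised document (plain TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            tok: (count / len(doc)) * math.log(n_docs / df[tok])
            for tok, count in tf.items()
        })
    return scores


def top_keywords(doc_scores, k):
    """Pick the k highest-scoring tokens; ties are broken alphabetically."""
    ranked = sorted(doc_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return {tok for tok, _ in ranked[:k]}


def filter_bigrams(doc, keywords):
    """Assumed filter: keep a bigram only if both of its tokens are keywords."""
    return [f"{a}_{b}" for a, b in zip(doc, doc[1:])
            if a in keywords and b in keywords]


def build_input(doc, keywords, category_keywords, min_chars=160):
    """Keep keyword tokens and filtered bigrams; pad short inputs.

    If the resulting text is shorter than min_chars, the (hypothetical)
    per-category keyword list is appended as supplementary data.
    """
    features = [tok for tok in doc if tok in keywords]
    features += filter_bigrams(doc, keywords)
    if len(" ".join(features)) < min_chars:
        features += list(category_keywords)
    return features
```

In this sketch a training document would be tokenised, reduced with `top_keywords`, and passed through `build_input` together with the keyword list of its category before being handed to the classifier. How the supplementary keywords are chosen for unlabeled test documents is not stated in the abstract and is left open here.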
Keywords/Search Tags: FastText, TF-IDF, text classification, information entropy, feature extraction