
Research On Text Classification Based On Improved TF-IDF And FastText Algorithm

Posted on: 2021-03-20
Degree: Master
Type: Thesis
Country: China
Candidate: S Yan
Full Text: PDF
GTID: 2428330605456944
Subject: Computer technology
Abstract/Summary:
With the rapid development and popularization of the Internet, massive amounts of information now circulate online, data of all types is growing rapidly, and text makes up the bulk of this information. The way people obtain information through the Internet has also changed, from a single computer terminal to computers, smartphones, tablets, and even smart TVs and smart watches. This convenience exposes people to a large amount of information every day, so being able to quickly obtain useful, classified information is particularly important. This thesis studies the application of text classification techniques to Chinese news texts. It analyzes algorithms commonly used in the field of text classification and then improves commonly used text classification models by taking into account factors they overlook, such as the importance of keywords in the title and the distribution of feature words between and within categories. First, the data to be classified is preprocessed: after the usual word segmentation and stop-word removal, regular expressions are used to further clean the results, yielding purified text data. Then, the traditional TF-IDF algorithm as applied to text classification is analyzed, deficiencies in its practical application to Chinese news text classification are identified, and ideas for improvement are given. An improved algorithm, ETF-IDF, is proposed, which enables certain feature words to obtain higher weight values in the weight calculation. Finally, the fastText model is analyzed, and two shortcomings of its application to Chinese news classification are pointed out: 1. It does not consider the impact that a large number of interfering words at the input layer of the fastText model has on the classification result; 2. It does not consider the inaccurate keyword weight values caused by the different discriminative power of title keywords and content keywords. In response to these problems, this thesis combines the proposed ETF-IDF algorithm with the fastText model and proposes a new E-fastText model. The improved E-fastText model, the K-nearest neighbor algorithm, the Naive Bayes algorithm, and the original fastText model are compared on three Chinese news datasets of different sizes. The experimental results show that the improved algorithm and model bring a certain improvement in text classification performance.
Figure [21], table [22], reference [53]...
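The abstract does not give the ETF-IDF formula, so the following is only a minimal sketch of the traditional TF-IDF weighting it starts from, plus an illustrative title boost. The stop-word list, the `TITLE_BOOST` factor, and all function names are assumptions made for this example and are not taken from the thesis. English tokens are used to keep the sketch self-contained; Chinese text would first need a word segmenter.

```python
import math
import re
from collections import Counter

# Hypothetical values for illustration only; not taken from the thesis.
STOP_WORDS = {"the", "a", "of", "in", "and"}
TITLE_BOOST = 2.0


def tokenize(text):
    """Lowercase, keep alphabetic runs only (regex cleanup), drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tf_idf(docs):
    """Traditional TF-IDF: tf = count / document length, idf = log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


def boost_title_terms(doc_weights, title):
    """Multiply the weight of terms that also occur in the title.

    This only illustrates the idea of giving title keywords extra weight;
    it is not the ETF-IDF formula from the thesis.
    """
    title_terms = set(tokenize(title))
    return {
        t: w * TITLE_BOOST if t in title_terms else w
        for t, w in doc_weights.items()
    }


if __name__ == "__main__":
    docs = [
        "The stock market rises",
        "The football match ends",
        "Stock prices fall in the market",
    ]
    weights = tf_idf(docs)
    print(boost_title_terms(weights[0], "Stock market"))
```

Terms that occur in fewer documents receive a larger IDF, and the boost step then raises the weight of terms shared with the title. A real pipeline would feed the re-weighted or filtered terms into the classifier.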
Keywords/Search Tags: text classification, TF-IDF, algorithm, fastText