
An Improved Approach To TF-IDF Algorithm In Text Classification

Posted on: 2020-09-12 | Degree: Master | Type: Thesis
Country: China | Candidate: X M Ye | Full Text: PDF
GTID: 2428330578965975 | Subject: Management Science and Engineering
Abstract/Summary:
With the rapid development of information technology, the computing power and storage capacity of computer hardware have improved greatly. This has led to explosive growth in online information, which makes it harder for users to reach the information they need in a timely and effective way. Text classification is a supervised learning task that uses a labeled training set of texts: the trained classifier assigns one of the specified categories to a document whose category is unknown, which helps users obtain information and improves the user experience to some extent. However, as China's Internet environment has developed, a large number of information-rich new words have become popular. A new word is a word that is not included in the old dictionary published by the sixth Chinese orientation analysis, and it is treated the same as an unregistered word. The emergence of new words reduces the soundness and accuracy of Chinese word segmentation, which in turn lowers the accuracy of Chinese text classification.

Transforming text from unstructured to structured form is the cornerstone of the entire text classification task, and assigning weights to feature items is one of its most important steps. TF-IDF is currently the most frequently used feature weighting algorithm. The improved TF-IDF feature weighting algorithms proposed in recent years are mostly limited to the frequency, position, and distribution of feature items, and do not consider the particular nature of new words. This thesis therefore proposes an improved feature weighting algorithm based on new word discovery. The main work is to identify new words and to use the improved feature weighting algorithm to increase the weights of new words among the feature items. In addition, in line with the characteristics of online corpora, new word recognition is added to the Chinese text classification process and combined with the improved feature weighting algorithm to improve that process. A series of comparative experiments was conducted on the standard Sogou corpus and a manually crawled Sina corpus, using both the improved and the unimproved text classification process. The experimental results show that the improved process not only reduces the feature dimension but also improves the text classification results.
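
The abstract does not give the improved weighting formula, so the following is only a minimal sketch of the general idea: compute standard TF-IDF weights, then raise the weight of any term that a separate new-word discovery step has flagged. The function name `tfidf_with_new_word_boost`, the `new_words` set, and the multiplicative `boost` factor are all illustrative assumptions, not the method used in the thesis.

```python
import math
from collections import Counter


def tfidf_with_new_word_boost(docs, new_words, boost=1.5):
    """Return one {term: weight} dict per document.

    docs      -- list of documents, each already segmented into a list of tokens
    new_words -- set of tokens flagged as new words (assumed to come from a
                 separate new-word discovery step, not implemented here)
    boost     -- hypothetical multiplier applied to new-word weights
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weighted = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights = {}
        for term, count in counts.items():
            tf = count / total                  # relative term frequency
            idf = math.log(n_docs / df[term])   # plain, unsmoothed IDF
            weight = tf * idf
            if term in new_words:
                weight *= boost                 # assumed new-word adjustment
            weights[term] = weight
        weighted.append(weights)
    return weighted


# Toy usage with already-segmented documents.
docs = [["a", "b", "b"], ["a", "c"]]
result = tfidf_with_new_word_boost(docs, new_words={"c"}, boost=2.0)
print(result)
```

A multiplicative factor is only one possible adjustment. An additive bonus, or a factor derived from the confidence of the new-word detector, would fit the description in the abstract equally well.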
Keywords/Search Tags:New words, TF-IDF, Vector Space Model, Text classification