
Design And Implementation Of Text Classification Model Based On The Improved TF-IDF Feature Extraction

Posted on: 2017-12-20
Degree: Master
Type: Thesis
Country: China
Candidate: P P Du
Full Text: PDF
GTID: 2348330503990916
Subject: Industrial Engineering
Abstract/Summary:
In 2015, the Chinese Academy of Engineering and the National Natural Science Foundation of China launched a research program on China's engineering science and technology development strategy for 2035. Technology foresight for 3D printing is an important part of this research. Researchers use objective data such as literature and patents to support technology analysis and decision making, so classifying massive amounts of text data is the foundation of this work. However, 3D printing technology covers many sub-areas, and the volume of relevant text is huge. Manual classification is time-consuming and laborious, and it requires many people with professional knowledge. Automatic text classification by computer is therefore an effective solution.

Dimension reduction of feature words is the core step of automatic text classification, and feature extraction based on TF-IDF is one of the most commonly used dimension reduction methods. TF considers only the frequency of a word in the text and does not take the word's context into account, so it ignores the structural information of the text. The IDF algorithm, in turn, does not consider how feature words are distributed across classes. Because the interdependence relations between feature words express text structure well, this paper first builds a text network from these relations and then uses the importance value of each feature word in the network, computed by an improved PageRank algorithm, to amend TF. The paper also improves the IDF algorithm by taking into account how concentrated a feature word's distribution is within each class.

To verify the feasibility of the improved method, the Chinese text classification procedure is run on the same text data before and after the improvement. Comparing the experimental results shows that the improved method is feasible.
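The abstract gives no formulas, so the following Python sketch is only one possible reading of the two ideas it describes, not the thesis's actual method. It assumes a sliding-window co-occurrence graph as the "text network", uses plain weighted PageRank in place of the improved variant, and defines class concentration as the share of a word's documents that fall in its most frequent class. All function names, the `window` parameter, and the sample documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_graph(tokens, window=2):
    """Undirected weighted graph linking words that occur within `window` positions."""
    graph = defaultdict(Counter)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            v = tokens[j]
            if v != w:
                graph[w][v] += 1
                graph[v][w] += 1
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Plain weighted PageRank by power iteration (assumption: not the improved variant)."""
    nodes = list(graph)
    if not nodes:
        return {}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # Each neighbour m passes on rank in proportion to the edge weight m-n.
            incoming = sum(
                rank[m] * graph[m][n] / sum(graph[m].values())
                for m in graph[n]
            )
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


def weighted_tf(tokens, window=2):
    """TF scaled by each word's PageRank relative to the uniform score 1/|V|."""
    counts = Counter(tokens)
    rank = pagerank(cooccurrence_graph(tokens, window))
    n = len(rank)
    return {
        w: (c / len(tokens)) * (rank[w] * n if w in rank else 1.0)
        for w, c in counts.items()
    }


def class_aware_idf(docs, labels):
    """Smoothed IDF multiplied by a class-concentration factor (assumed definition)."""
    n_docs = len(docs)
    per_class = defaultdict(Counter)  # word -> class label -> document frequency
    for tokens, label in zip(docs, labels):
        for w in set(tokens):
            per_class[w][label] += 1
    idf = {}
    for w, by_class in per_class.items():
        df = sum(by_class.values())
        concentration = max(by_class.values()) / df  # 1.0 if the word stays in one class
        idf[w] = math.log(1 + n_docs / df) * concentration
    return idf


# Hypothetical toy corpus with two 3D-printing sub-areas.
docs = [
    "laser sintering metal powder laser".split(),
    "metal powder bed fusion".split(),
    "resin curing light resin".split(),
    "light curing polymer resin".split(),
]
labels = ["metal", "metal", "resin", "resin"]

idf = class_aware_idf(docs, labels)
features = [
    {w: tf * idf[w] for w, tf in weighted_tf(d).items()}
    for d in docs
]
```

In this sketch a word that is central in the document's network gets a TF boost above its raw frequency, and a word spread evenly over several classes has its IDF reduced, which matches the direction of the two corrections described above without claiming their exact form.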
Keywords/Search Tags: Text classification, Feature extraction, TF-IDF, PageRank, Text network