
Research Of Feature Selection And Feature Extraction Methods In Internet News Classification

Posted on: 2017-09-13  Degree: Master  Type: Thesis
Country: China  Candidate: T T Wang  Full Text: PDF
GTID: 2348330491960074  Subject: Control Science and Engineering
Abstract/Summary:
Recently, with the rapid development of Internet technology, text and other kinds of information on the Internet have grown explosively. How to extract useful knowledge from this huge amount of information is an important issue for every industry. Improving the accuracy and efficiency of Internet news text classification, so as to provide a high-quality, intelligent classification service, is therefore a current scientific problem. Feature selection and feature extraction are the primary means of text dimension reduction. However, common feature selection methods are designed for balanced datasets and perform poorly on imbalanced ones. The existing feature selection methods also have other drawbacks. For example, the TF-IDF method selects features that reflect the essential characteristics of the text set as a whole but ignores how well those features distinguish between classes. The mutual information and chi-square test methods often suffer from the "low-frequency defect" problem. As for feature extraction, the feature vectors produced by the vector space model are often high-dimensional and sparse, and they cannot capture the semantic and syntactic relationships between features. To address the imbalanced classification problem and the drawbacks of common feature selection and feature extraction methods, the main work of this dissertation is as follows.
1. To solve the imbalance problem in Internet news text datasets, this dissertation proposes two novel feature selection methods: a TF-IDF method based on document variance and a mutual information method based on the variance of the probability distribution between classes. The between-class probability distribution variance and the document variance are variances of term frequency and document frequency; they are closely related to the characteristics of the text categories rather than to the number of samples in each category.
Thus, the proposed feature selection methods can effectively select features from small classes and address the imbalanced classification problem. The experimental results show that the proposed methods improve the performance of Internet news text classification compared with common feature selection methods.
2. Based on a statistical analysis of the context words in Internet news, this dissertation proposes a Word2vec framework based on an exponential-decay model to improve the accuracy of the resulting word embeddings. The original Word2vec framework assumes that the effect of context words on predicting the target word decays linearly. However, the impact of a context word on the target word decreases rapidly as the distance between them grows. According to the statistical analysis, the exponential-decay model is closer to reality. The experimental results show that word embeddings trained by the framework based on the exponential-decay model are more accurate than those trained by the original Word2vec framework.
3. To solve the problems of the vector space model in text feature extraction, this dissertation extracts text features using word embeddings and takes the vectors constructed by superposing the features' word embeddings as the new feature vectors. The experimental results show that feature vectors built by superposing the word embeddings of features chosen by the variance-based feature selection methods perform well and further improve the classification performance on Internet news.
In summary, to solve the problems of common feature selection and feature extraction methods and the imbalanced classification problem of Internet news text datasets, this dissertation proposes two feature selection methods based on document variance and probability distribution variance. These methods can select features and hot words from small Internet news categories.
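The abstract does not give the formulas for the variance-based scores, so the following is only a minimal sketch of the general idea, under stated assumptions: each term is scored by the variance, across classes, of the fraction of documents in each class that contain it. Because the fraction is normalized by class size, a term concentrated in a small class can still score highly. The function names `class_rate_variance` and `select_top_k` and the toy data are hypothetical and are not taken from the dissertation.

```python
from collections import Counter, defaultdict


def class_rate_variance(docs, labels):
    """Score each term by the variance of its per-class document rate.

    docs:   list of token lists, one per document
    labels: list of class labels, aligned with docs

    The per-class rate is (documents of the class containing the term)
    divided by (documents in the class), so every class counts equally
    regardless of how many samples it has. This is an illustrative
    assumption, not the dissertation's exact definition.
    """
    class_sizes = Counter(labels)
    doc_freq = defaultdict(Counter)  # term -> class -> number of documents
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term][label] += 1

    scores = {}
    for term, per_class in doc_freq.items():
        rates = [per_class[c] / class_sizes[c] for c in class_sizes]
        mean = sum(rates) / len(rates)
        scores[term] = sum((r - mean) ** 2 for r in rates) / len(rates)
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]


if __name__ == "__main__":
    # Toy imbalanced corpus: four finance documents, one sports document.
    docs = [
        ["stock", "market", "news"],
        ["stock", "bank", "news"],
        ["market", "bank", "news"],
        ["stock", "market", "news"],
        ["goal", "match", "news"],
    ]
    labels = ["finance", "finance", "finance", "finance", "sports"]
    scores = class_rate_variance(docs, labels)
    print(select_top_k(scores, 2))
```

In this toy corpus, "news" appears in every document of both classes and therefore gets a variance of zero, while terms that occur only in the single sports document get the highest score even though that class is the smallest.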
In addition, the Word2vec framework is improved with the exponential-decay model to increase the accuracy of the resulting word embeddings. The feature vectors constructed from these word embeddings are then used as the new text representation to further improve the classification of Internet news.
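As a rough illustration of the two ideas above, the sketch below compares linear and exponential weighting of context words by distance and builds a document vector by summing word embeddings. The abstract does not state the decay function or its parameters, so the form `exp(-rate * (d - 1))`, the `rate` value, and all function names here are assumptions made for illustration only. The linear weights follow the common description of the original Word2vec behaviour, in which the effective window size is drawn uniformly from 1 to the maximum window.

```python
import math


def linear_weights(window):
    """Linear decay: weight of a context word at distance d = 1..window.

    If the effective window size is drawn uniformly from 1..window, a word
    at distance d is included with probability (window - d + 1) / window.
    """
    return [(window - d + 1) / window for d in range(1, window + 1)]


def exponential_weights(window, rate=0.5):
    """Hypothetical exponential decay, normalized so distance 1 has weight 1.

    The decay form and the rate are assumptions, not values from the
    dissertation.
    """
    return [math.exp(-rate * (d - 1)) for d in range(1, window + 1)]


def sum_embeddings(tokens, embeddings, dim):
    """Superpose (sum) the embeddings of the given feature words.

    tokens:     selected feature words that occur in one document
    embeddings: dict mapping word -> list of floats of length dim
    Words without an embedding are skipped.
    """
    total = [0.0] * dim
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            continue
        for i, value in enumerate(vector):
            total[i] += value
    return total


if __name__ == "__main__":
    print("linear:     ", linear_weights(5))
    print("exponential:", [round(w, 3) for w in exponential_weights(5)])

    # Toy two-dimensional embeddings, invented for this example.
    toy_embeddings = {"stock": [1.0, 0.0], "market": [0.5, 0.5]}
    print(sum_embeddings(["stock", "market", "unknown"], toy_embeddings, 2))
```

With the same window, the exponential weights fall off faster at larger distances than the linear ones whenever `rate` is large enough, which is the qualitative behaviour the abstract describes. The summed vector has a fixed, low dimension regardless of vocabulary size, in contrast to a sparse vector space model representation.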
Keywords/Search Tags: Internet news classification, feature selection, feature extraction, word embedding, Word2vec