
Research On Feature Vector Optimization Techniques In Web Text Classification

Posted on: 2008-01-10    Degree: Master    Type: Thesis
Country: China    Candidate: W L Wang    Full Text: PDF
GTID: 2178360215972136    Subject: Computer software and theory
Abstract/Summary:
With the rapid development and spread of the Internet, the amount of electronic text information is growing enormously. How to organize and process such large document collections, and how to find the information users need quickly, accurately and completely, is a great challenge for information science and technology. As a key technology for organizing and processing large amounts of document data, text classification can solve the problem of information disorder to a great extent and makes it convenient for users to find the required information quickly. Moreover, text classification has broad application prospects as the technical basis of information filtering, information retrieval, search engines, text databases, digital libraries and so on.

In web document classification, a document is usually represented with the Vector Space Model (VSM) or the Latent Semantic Indexing (LSI) model, in which each document is a vector and each unique term is one dimension of that vector. This representation is very simple, but it raises a severe problem: the high dimensionality of the feature space and the inherent sparsity of the data. In addition, it cannot handle polysemy in text data. These problems greatly interfere with the classification learning process and cause its performance to drop dramatically. It is therefore highly desirable to address them through feature vector optimization techniques.

Feature vector optimization techniques generally fall into two categories: weight adjustment and dimensionality reduction. Weight adjustment modifies a term's weight according to its importance to a document, a data set or a category, while dimensionality reduction optimizes the text representation by reducing the dimension of the feature space and includes two commonly used techniques: feature extraction and feature selection. This paper therefore carries out a comprehensive study of both weight adjustment and dimensionality reduction.

The work emphasizes two main aspects: feature selection and its weight computing method, and feature extraction and its weight computing method. The research proceeds as follows.

For feature selection, considering the insufficient attention paid to feature redundancy, we propose a new method to eliminate redundant features: relevancy analysis of the features during the feature selection process, whose importance to feature selection we analyze and argue. Taking information-theoretic measures as the basic tool, the method defines a new feature selection algorithm that also takes into account issues such as computational cost and the subjectivity of feature assessment. The algorithm discards redundant features while retaining the category-correlated ones, and achieves good results.

In the weight adjustment strategy, we analyze and improve the traditional TF*IDF formula: (1) using generalized information theory as the theoretical basis, we introduce quadratic-entropy mutual information into the formula; (2) studying the characteristics of web documents, we propose the concepts of the primary feature word, the primary feature field (PFF) and the primary feature space (PFS), and then propose a new PFS term weighting scheme that takes document frequency (DF) into account instead of the traditional IDF factor. Finally, a combination strategy for term weighting is given.
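The thesis does not reproduce its exact selection criterion in this abstract, but the redundancy-aware feature selection described above can be illustrated with a small, hedged sketch: a greedy procedure that scores each candidate term by its mutual information with the class labels and penalizes mutual information with terms already selected. The relevance-minus-redundancy score, the binary term-presence encoding, the penalty weight and the function name select_terms are illustrative assumptions, not the thesis's algorithm.

```python
# Illustrative sketch only (assumed criterion, not the thesis's algorithm):
# greedy mutual-information-based feature selection that keeps class-relevant
# terms while discarding terms redundant with those already selected.
import numpy as np
from sklearn.metrics import mutual_info_score

def select_terms(X, y, k, redundancy_penalty=1.0):
    """X: dense (n_docs, n_terms) binary term-presence matrix; y: class labels."""
    X = (np.asarray(X) > 0).astype(int)
    n_terms = X.shape[1]
    # relevance of each term = mutual information between term presence and class
    relevance = np.array([mutual_info_score(y, X[:, j]) for j in range(n_terms)])

    selected = [int(np.argmax(relevance))]          # start with the most relevant term
    candidates = set(range(n_terms)) - set(selected)
    while len(selected) < k and candidates:
        best_j, best_score = None, -np.inf
        for j in candidates:
            # redundancy = average MI between the candidate and already-selected terms
            redundancy = np.mean([mutual_info_score(X[:, j], X[:, s]) for s in selected])
            score = relevance[j] - redundancy_penalty * redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        candidates.remove(best_j)
    return selected
```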
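The PFS weighting idea above can likewise be sketched in hedged form: a TF*DF-style weight in which the IDF factor is replaced by a factor that grows with document frequency, and terms occurring in a primary feature field (for example, the title of a web page) receive an extra boost. The concrete df_factor formula and the boost value are assumptions for illustration; the quadratic-entropy mutual information term and the combination strategy of the thesis are not reproduced.

```python
# Illustrative sketch only: a TF*DF-style weight with a boost for terms that
# appear in a document's "primary feature field" (e.g. the page title).
# The df_factor formula and boost value are assumptions, not the thesis's scheme.
import math

def pfs_weight(tf, df, n_docs, in_primary_field, boost=2.0):
    """tf: term frequency in the document; df: document frequency of the term."""
    df_factor = math.log(1.0 + df) / math.log(1.0 + n_docs)   # grows with DF, unlike IDF
    weight = tf * df_factor
    return boost * weight if in_primary_field else weight
```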
For feature extraction, Latent Semantic Indexing (LSI) is one of the most important techniques. It can be divided into two categories: Global LSI and Local LSI. Global LSI does not consider class information; when it is applied to text classification, it is found that Global LSI often degrades classification performance considerably. Compared with Global LSI, Local LSI is carried out not on the entire training data but within a local semantic space. Local LSI makes good use of class information and extracts the distinct semantic structure between classes. It can improve text classification performance, but only to a limited extent; moreover, we find that Local LSI's weight adjustment method is simply inherited from the vector space model. Although the two models describe text in similar ways, their basic ideas differ substantially: VSM treats terms as the essential dimensions of the space, whereas Local LSI no longer regards words as independent dimensions but maps them to corresponding "latent concepts". This paper therefore proposes a new Local LSI weight adjustment method that improves text classification performance by performing a separate Singular Value Decomposition (SVD) on the transformed local region of each class.

As a basis of data mining technology, text categorization is the foundation and core of information filtering. Finally, we apply the vector optimization strategies in an information filtering platform, and the results are quite satisfactory.
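As a rough illustration of the per-class SVD idea described above (not the thesis's exact procedure), the following sketch fits a separate truncated SVD on the documents of each class; the plain TF-IDF weighting of the local regions, the component count and the scikit-learn API choices are assumptions.

```python
# Illustrative sketch only: "local LSI" as a separate truncated SVD fitted on
# the documents of each class, instead of one global SVD over all training data.
# The weighting of the local region is simplified here to plain TF-IDF.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def fit_local_lsi(docs, labels, n_components=50):
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)                 # global term-document matrix
    local_models = {}
    for c in set(labels):
        rows = [i for i, y in enumerate(labels) if y == c]
        k = min(n_components, len(rows) - 1, X.shape[1] - 1)
        svd = TruncatedSVD(n_components=max(k, 1))
        svd.fit(X[rows])                               # SVD restricted to class c's documents
        local_models[c] = svd
    return vectorizer, local_models
```

At classification time a test document could be projected into each class's local space and scored there, although the combination strategy used in the thesis is not specified in this abstract.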
Keywords/Search Tags: web text categorization, feature vector optimization, feature selection, feature extraction, weight adjustment