
Research on Text Representation and Feature Selection in Text Categorization

Posted on: 2014-02-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J M Yang
Full Text: PDF
GTID: 1228330395996607
Subject: Bioinformatics
Abstract/Summary:
With the rapid development of the Internet and information technologies, human society has been flooded with digital information, among which text is of special importance. Managing digital information by hand can no longer meet society's needs, so technologies that help users cope with information overload and mine valuable data have become a research hotspot. In recent years, automatic text categorization has developed considerably and has been widely applied in fields such as information filtering, information organization and management, information indexing, and digital libraries. Nevertheless, its performance still requires further improvement.

Text categorization is a supervised learning task that draws on many key technologies of machine learning and data mining. Many factors affect its performance, such as text preprocessing, feature extraction, dimensionality reduction, text representation, classifier design, and evaluation criteria. Since the traditional text representation is of high dimensionality and high sparsity, designing efficient text representations and reducing dimensionality are research hotspots in text categorization. This dissertation first reviews the background, significance, and research status of text categorization, and then details the stages of the text categorization process, such as text preprocessing, term definition, text representation, vector space dimensionality reduction, classifier design, and performance evaluation. On this basis, extensive research was carried out and significant progress was made in feature selection and text representation. The main results proposed in this dissertation are as follows:
1. A feature selection algorithm based on binomial hypothesis testing (Bi Test)

Bi Test treats the event that a feature represents only ham or only spam as a Bernoulli trial, so each feature in the original feature vector space follows a binomial distribution whose probability is constant but unknown. This probability can be estimated using a binomial hypothesis test. If the probability of a feature is close to 0 or 1, the feature carries more ham or spam category information. Bi Test therefore retains features whose probabilities are close to 0 or 1 and discards features whose probabilities are close to 0.5. Experiments show that Bi Test performs significantly better than the chi-square statistic and the Poisson distribution when a Naïve Bayes classifier is used, achieves performance comparable to the other methods when an SVM classifier is used, and executes faster than the other algorithms.

2. A feature selection algorithm that comprehensively measures both inter-category and intra-category information (CMFS)

Based on a term-to-category matrix, CMFS comprehensively measures the significance of a term for categorization from both its intra-category and its inter-category probability. We evaluated CMFS on three benchmark document collections (20 Newsgroups, Reuters-21578, and WebKB) using two classification algorithms, Naïve Bayes and Support Vector Machines, and compared it with Information Gain (IG), the CHI statistic (CHI), the improved Gini index (GINI), Document Frequency (DF), the DIA association factor (DIA), and Orthogonal Centroid Feature Selection (OCFS) in terms of accuracy and micro-F1.
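The two scoring ideas just described can be sketched as follows. This is a minimal illustration with made-up counts, not the dissertation's implementation: the Bi Test score is written here as a z-statistic against the null hypothesis p = 0.5 (normal approximation to the binomial), and the CMFS score is taken as one plausible reading, the product of P(t|c) and P(c|t).

```python
import math

# Toy per-term document counts for a two-class (ham/spam) problem.
# These numbers are illustrative, not from the dissertation's corpora.
term_counts = {
    "viagra":  {"spam": 48, "ham": 2},
    "meeting": {"spam": 3,  "ham": 47},
    "free":    {"spam": 30, "ham": 20},
    "the":     {"spam": 50, "ham": 50},
}
class_totals = {"spam": 50, "ham": 50}  # documents per class

def bi_test_score(spam, ham):
    """How far the estimated probability that an occurrence of the term
    falls in the spam class is from 0.5, expressed as a z-statistic
    under the null hypothesis p = 0.5. Discriminative terms (probability
    near 0 or 1) score high; uninformative terms score near 0."""
    n = spam + ham
    p_hat = spam / n
    return abs(p_hat - 0.5) * math.sqrt(n) / 0.5

def cmfs_score(term, cls, counts, totals):
    """One reading of CMFS: combine the intra-category strength
    P(term | class) with the inter-category strength P(class | term)."""
    p_t_given_c = counts[term][cls] / totals[cls]
    p_c_given_t = counts[term][cls] / sum(counts[term].values())
    return p_t_given_c * p_c_given_t

# Rank terms by Bi Test: "viagra"/"meeting" (one-sided) rank first,
# the evenly spread "the" ranks last.
ranked = sorted(term_counts,
                key=lambda t: bi_test_score(**term_counts[t]),
                reverse=True)
```

Feature selection would then keep the top-k terms of `ranked` (or the top CMFS scorers per category) and drop the rest.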
The experimental results show that CMFS is significantly superior to IG, CHI, DF, OCFS, and DIA when a Naïve Bayes classifier is used, and significantly outperforms IG, DF, OCFS, and DIA when Support Vector Machines are used. The time complexity of CMFS is lower than that of IG, CHI, and OCFS, similar to that of the improved Gini index, and higher than that of DF and DIA.

3. Text representation based on the key terms of a document

To reduce the sparsity of the traditional text representation, a new text representation model, named KT-of-DOC, was proposed, in which every document is represented by a certain number of key terms drawn from that document. We first selected key terms from all documents to construct the vector space, using six feature selection algorithms: the improved Gini index (Gini), Information Gain (IG), Mutual Information (MI), Odds Ratio (OR), Ambiguity Measure (AM), and the DIA association factor (DIA). We then evaluated the algorithm on three benchmark document collections (20 Newsgroups, Reuters-21578, and WebKB) using two classification algorithms, Support Vector Machines (SVM) and K-Nearest Neighbors (KNN). The experimental results show that classifier performance improves greatly when the new text representation replaces the traditional one.

4. Dimensionality reduction and text representation based on term clustering

Term clustering is one way to reduce the dimensionality of the text representation space: terms that are similar under some similarity measure are grouped together and mapped onto a single element of the vector space. We investigated the method proposed by Baker and McCallum and found that some terms that are not similar according to their method are nevertheless similar in terms of their relative contribution to text categorization.
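The key-term representation in point 3 can be sketched as below. The vocabulary and scores are invented for illustration; in practice the scores would come from one of the six feature selection measures listed above, and the dissertation's exact selection rule may differ.

```python
# Illustrative global term scores; in KT-of-DOC these would come from a
# feature selection measure such as IG or the improved Gini index.
term_score = {"gene": 0.9, "protein": 0.8, "cell": 0.6,
              "study": 0.1, "the": 0.01, "of": 0.01}

def key_terms(doc_tokens, k=3):
    """Represent a document by its k highest-scoring distinct terms;
    terms absent from the score table count as 0."""
    present = sorted(set(doc_tokens),
                     key=lambda t: term_score.get(t, 0.0),
                     reverse=True)
    return present[:k]

doc = "the study of the gene and the protein in the cell".split()
# Keeping only key terms shrinks the representation and removes the
# many near-zero dimensions that make the traditional model sparse.
```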
A new term clustering method based on the relative contribution to categorization was therefore proposed. The proposed method was evaluated on three benchmark corpora (20 Newsgroups, Reuters-21578, and Industry Sector), combined with two classification algorithms (Support Vector Machines and K-Nearest Neighbors), and compared with five well-known similarity measures (including weighted average KL divergence, City block distance, and Euclidean distance) in terms of micro-F1, macro-F1, and accuracy. The experiments show that the proposed method can reduce the dimensionality of the vector space and improve the performance of text categorization; moreover, the quality of the clusters it generates is the best among the compared methods.

High dimensionality and high sparsity are key factors that affect the performance of text categorization. In this dissertation, we made an intensive study of dimensionality reduction and text representation, and proposed feature selection and text representation methods that significantly improve the performance of text categorization. The experiments show that an efficient feature selection algorithm and a compact text representation model both help to improve that performance.
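The core idea in point 4, grouping terms whose contribution across categories is similar and mapping each group onto one dimension, can be sketched as follows. The greedy scheme, the Euclidean distance on P(c|t) distributions, the threshold, and all counts here are simplifications for illustration, not the dissertation's relative-contribution measure.

```python
import math

def p_c_given_t(counts):
    """Turn per-category document counts into the distribution P(c | t)."""
    total = sum(counts)
    return [c / total for c in counts]

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Per-term document counts over three categories (illustrative numbers:
# sports, finance, politics).
terms = {
    "goal":   [40, 2, 1],
    "match":  [35, 4, 2],
    "stock":  [1, 38, 3],
    "market": [2, 33, 5],
    "senate": [0, 3, 41],
}

def cluster(terms, threshold=0.2):
    """Greedily assign each term to the first cluster whose representative
    distribution (that of its first member) is within `threshold`;
    otherwise start a new cluster. Each resulting cluster becomes a
    single dimension of the reduced vector space."""
    clusters = []  # list of (representative distribution, [term, ...])
    for t, counts in terms.items():
        dist = p_c_given_t(counts)
        for rep, members in clusters:
            if euclid(rep, dist) < threshold:
                members.append(t)
                break
        else:
            clusters.append((dist, [t]))
    return [members for _, members in clusters]
```

Here "goal"/"match" and "stock"/"market" collapse into single dimensions because their category distributions are close, reducing five term dimensions to three cluster dimensions.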
Keywords/Search Tags: Text categorization, Feature selection, Dimensionality reduction, Text representation, Sparsity, Term clustering, Binomial hypothesis testing, Significance of the feature