A Normalized-vector Classification Algorithm For Text Retrieval On Web

Posted on: 2013-05-24
Degree: Master
Type: Thesis
Country: China
Candidate: Q G Sun
Full Text: PDF
GTID: 2248330362974054
Subject: Computer software and theory
Abstract/Summary:
Information retrieval is an important part of Internet applications and greatly helps people in their daily lives, and applications built on text documents remain the mainstream of Internet applications. How to obtain useful information from the large number of texts on the web therefore remains a key problem for researchers. Automatic text classification is not only an important branch of natural language processing but also an important basis of information retrieval and data mining. Hundreds of millions of text pages on the web are updated every day, so automatic text classification used for web information retrieval must consider not only the accuracy of the classification algorithm but also its time efficiency. To this end, this paper proposes a new classification algorithm with high accuracy and low time cost, called the normalized vector classification algorithm (NLV for short).

This paper first introduces background knowledge on information retrieval and text classification and the detailed workflow of the classification process, and then presents several typical feature selection methods and several classic text classification methods together with their advantages and disadvantages. Building on a summary of existing methods and technologies, it presents a new feature selection method based on matrix projection (MP for short) and a new classification algorithm (NLV).

The MP feature selection method belongs to the methods based on a probabilistic model: it counts not only the number of texts in which a term appears but also the term's average frequency across all texts. To verify the utility of MP feature selection, this paper reports experiments that compare MP with four commonly used feature selection methods (IG, CHI, DF and MI) and apply MP to several typical classification algorithms.

The NLV classification algorithm is based on matrix operations. It projects the higher-dimensional feature space of the training samples onto a lower-dimensional feature space and obtains a normalized feature vector through a specific normalization function, thereby reducing the feature dimensions and computing feature term weights accurately. To verify the utility of NLV, this paper reports experiments on three corpora (20_Newsgroups, TanCorpV1.0 and SogouC), with five feature selection methods (DF, CHI, IG, MI and MP) and four other classification algorithms (kNN, MBNB, MNNB and SVM). Finally, this paper analyzes the comparison results and draws the following conclusions:
1) NLV is the fastest of the five classification algorithms in terms of time performance.
2) On the Chinese corpora, the accuracy of NLV is slightly lower than that of SVM, but NLV is clearly superior to SVM in terms of time cost.
3) On the 20_Newsgroups corpus, NLV achieves the best classification accuracy and time performance among the five classification algorithms.
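
The abstract says that MP feature selection relies on two per-term statistics: the number of texts containing a term and the term's average frequency across all texts. The following is a minimal Python sketch of how those two counts might be gathered. It is not the thesis's MP scoring function, which the abstract does not specify. The function name `term_statistics`, the pre-tokenized input format, and the choice to average over all documents (rather than only those containing the term) are assumptions made for illustration.

```python
from collections import Counter


def term_statistics(docs):
    """Return {term: (document_frequency, average_frequency)}.

    `docs` is a list of token lists (hypothetical input format).
    The average is taken over ALL documents; this is an assumption,
    since the abstract does not define the averaging precisely.
    """
    doc_freq = Counter()    # number of documents that contain the term
    total_freq = Counter()  # total occurrences of the term in the corpus
    for tokens in docs:
        counts = Counter(tokens)
        total_freq.update(counts)        # add per-document occurrence counts
        doc_freq.update(counts.keys())   # each document counts once per term
    n_docs = len(docs)
    return {t: (doc_freq[t], total_freq[t] / n_docs) for t in doc_freq}


# Toy corpus of three tokenized documents (illustrative data only).
docs = [["web", "text", "web"], ["text", "retrieval"], ["web", "search"]]
stats = term_statistics(docs)
```

A feature selection method would then combine these two values into a single score per term and keep the top-scoring terms. How MP combines them is left open here because the abstract does not say.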
Keywords/Search Tags: information retrieval, text classification, feature selection, matrix projection, normalized vector