A Normalized-vector Classification Algorithm For Text Retrieval On Web

Posted on:2013-05-24

Degree:Master

Type:Thesis

Country:China

Candidate:Q G Sun

GTID:2248330362974054

Subject:Computer software and theory

Abstract/Summary:

Information retrieval is an important part of Internet applications and is of great help in people's daily lives, and applications built on text documents remain the mainstream of the web. Obtaining useful information from the large number of texts on the web therefore remains a key research problem. Automatic text classification is both an important branch of natural language processing and an important basis of information retrieval and data mining. Hundreds of millions of text pages on the web are updated every day, so automatic text classification for web information retrieval must consider not only the accuracy of the classification algorithm but also its time efficiency. To this end, this thesis proposes a new classification algorithm with high accuracy and low time cost, called the normalized vector classification algorithm (NLV for short).

The thesis first introduces background knowledge on information retrieval and text classification and the detailed workflow of the classification process, and then presents several typical feature selection methods and several classic text classification methods, together with their advantages and disadvantages. Building on a summary of existing methods and technologies, it presents a new feature selection method based on matrix projection (MP for short) and a new classification algorithm (NLV).

The MP feature selection method belongs to the family of methods based on a probabilistic model: it counts not only the number of texts in which a term appears but also the term's average frequency of occurrence across all texts. To verify the utility of MP feature selection, the thesis reports experiments that compare MP with four commonly used feature selection methods (IG, CHI, DF, and MI) and apply MP to several typical classification algorithms.

The NLV classification algorithm is based on matrix operations. It projects the higher-dimensional feature space of the training samples onto a lower-dimensional feature space and obtains a normalized feature vector through a specific normalization function, thereby reducing the feature dimensionality and computing feature term weights accurately. To verify the utility of NLV, the thesis reports experiments using three corpora (20_Newsgroups, TanCorpV1.0, and SogouC), five feature selection methods (DF, CHI, IG, MI, and MP), and four other classification algorithms (kNN, MBNB, MNNB, and SVM). Finally, the thesis analyzes the comparison results and draws the following conclusions:

1) NLV is the fastest of the five classification algorithms in terms of time performance.
2) On the Chinese corpora, NLV is slightly less accurate than SVM, but it is far superior to SVM in terms of time cost.
3) On the 20_Newsgroups corpus, NLV achieves the best classification accuracy and time performance among the five classification algorithms.
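
The abstract does not give the MP scoring formula or the NLV normalization function, so neither can be reproduced here. As a rough illustration only, the Python sketch below collects the two statistics the abstract attributes to MP (the number of texts containing a term and the term's average frequency over all texts) and then classifies with L2-normalized class vectors and a dot product. That classifier is a generic normalized-centroid stand-in, not the thesis's NLV. All function names, the toy documents, and the `df >= 2` threshold are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def term_stats(docs):
    """Return {term: (document_frequency, average_count_per_document)}."""
    df, total = Counter(), Counter()
    for tokens in docs:
        counts = Counter(tokens)
        total.update(counts)        # raw occurrences over all texts
        df.update(counts.keys())    # one vote per text containing the term
    n = len(docs)
    return {t: (df[t], total[t] / n) for t in df}

def l2_normalize(vec):
    """Scale a sparse {term: weight} vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else dict(vec)

def train_class_vectors(docs, labels, vocabulary):
    """Sum term counts per class over the kept vocabulary, then normalize."""
    sums = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        sums[label].update(t for t in tokens if t in vocabulary)
    return {label: l2_normalize(counts) for label, counts in sums.items()}

def classify(tokens, class_vectors):
    """Pick the class whose normalized vector best matches the document."""
    doc = l2_normalize(Counter(tokens))
    def score(label):
        vec = class_vectors[label]
        return sum(w * vec.get(t, 0.0) for t, w in doc.items())
    return max(class_vectors, key=score)

# Toy data (hypothetical, not from the thesis corpora).
docs = [
    ["goal", "match", "team"],
    ["team", "coach", "goal", "goal"],
    ["stock", "market", "bank"],
    ["bank", "loan", "market"],
]
labels = ["sports", "sports", "finance", "finance"]

stats = term_stats(docs)
# Assumed selection rule: keep terms that occur in at least two texts.
vocabulary = {t for t, (doc_freq, _) in stats.items() if doc_freq >= 2}
class_vectors = train_class_vectors(docs, labels, vocabulary)
prediction = classify(["goal", "team", "win"], class_vectors)
```

Because each class is reduced to a single normalized vector, classifying a document costs one sparse dot product per class, which is one plausible reason an approach of this kind can be much faster than kNN or SVM. The thesis's own timing results rest on its actual algorithm, not on this sketch.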

Keywords/Search Tags:

information retrieval, text classification, feature selection, matrix projection, normalized vector

Related items

1	Research Of Sentiment Tendency Analysis For Goods Evaluation Based On Text Classification
2	Research On Text Emotion Classification Based On Improved Feature Selection Method
3	Term Weight-Based Chinese Text Classification Algorithm
4	Research On Several Problems In Text Retrieval
5	Research On High-Performance Text Categorization
6	Research And Implementation On Text Information Classification In Big Data
7	On Research For Chinese Automatic Text Categorization Technology Based On VSM Model And Feature Selection
8	The Design And Application Of SSVM's Text Classification Based On Feature Selection Optimization
9	Research On Improvement Of Chi-square Feature Selection And Word Vector Text Representation For News Classification
10	Research And Design On Livelihood Information Classification System