
The Application Of Similarity Measurement Method Face With BOW Model On Feature Dimension Reduction

Posted on: 2016-11-05
Degree: Master
Type: Thesis
Country: China
Candidate: Y Tang
GTID: 2308330461455879
Subject: Computer Science and Technology
Abstract/Summary:
The bag-of-words (BoW) model is a vector space model in which a text or an image is treated as an unordered collection of words; syntax and word order are ignored. With a suitable similarity measure, the BoW model can be applied widely to classification, clustering, and retrieval problems for text and images. Researchers have established a number of common and effective similarity measures, including the Euclidean distance, cosine similarity, Manhattan distance, and Mahalanobis distance. However, BoW representations of text and images often suffer from high dimensionality, redundant features, polysemy, and synonymy. These shortcomings increase the computational cost of working with the BoW model and degrade the accuracy of learning algorithms. In this paper we propose a supervised method that combines the word features of the BoW model into word clusters. The main goal is to eliminate the negative impact of multiple words sharing one meaning by transforming the representation from the original word space into the space of the new word clusters, which in turn changes the computed similarity between text or image samples. First, class-conditional probability distributions are used to describe the distribution of each word, and the Jensen-Shannon divergence is used to characterize the correlation between these distributions. On this basis we design an algorithm called WCE, which reforms the vocabulary into new word clusters by restructuring and merging. We then adopt a supervised, loss-function-based evaluation model to assess the word clusters generated by WCE. Parameters such as the loss function, the term-weighting method, and the similarity measure can be chosen flexibly in this model.
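To make the Jensen-Shannon step concrete, the following is a minimal sketch, not the thesis's WCE implementation. It assumes that each word is described by a distribution over class labels estimated from per-class occurrence counts; the function names and the toy counts are hypothetical and chosen only for illustration.

```python
import math


def class_distribution(counts):
    """Normalize one word's per-class occurrence counts into a distribution.

    Assumption: the word is described by P(class | word); the abstract does
    not state the exact form of the class-conditional distribution.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("word never occurs in the labeled data")
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded by 1 when using log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical counts of three words across two classes.
car = class_distribution([30, 2])
automobile = class_distribution([15, 1])
bank = class_distribution([10, 10])

# Words with similar class distributions have a small divergence,
# which makes them natural candidates for merging into one cluster.
print(js_divergence(car, automobile), js_divergence(car, bank))
```

Under this assumption, a small divergence between two words suggests that they behave alike with respect to the class labels, so merging them loses little discriminative information.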
The algorithm finally outputs the optimal solution of the loss function and the corresponding set of word clusters, thereby achieving dimensionality reduction. The experiments verify the effectiveness and soundness of the dimensionality reduction algorithm on retrieval and classification tasks. They show that the degree of dimensionality reduction varies from case to case, as do the gains in retrieval performance and classification accuracy. For low-dimensional bags of words, the effect of the algorithm is marginal, but for high-dimensional ones it often achieves superior results. Among high-dimensional bags of words, visual bags of words for image problems, compared with those for text, show a more pronounced improvement in retrieval results and classification accuracy after word dimensionality reduction. Overall, for high-dimensional bags of words, the proposed method achieves a relatively good dimensionality reduction while maintaining retrieval and classification precision.
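The effect of moving from word space to word-cluster space on a similarity measure can be illustrated with a small sketch. It assumes that a sample's weight for a cluster is the sum of the weights of its member words and that cosine similarity is the chosen measure; the abstract does not specify either choice, and the vocabulary, clusters, and helper names below are hypothetical.

```python
import math


def to_word_space(bow, vocabulary):
    """Turn a {word: weight} bag into a dense vector over the vocabulary."""
    return [bow.get(word, 0.0) for word in vocabulary]


def to_cluster_space(bow, clusters):
    """Sum the weights of the words in each cluster (an assumed aggregation rule)."""
    return [sum(bow.get(word, 0.0) for word in cluster) for cluster in clusters]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vocabulary = ["car", "automobile", "bank"]
# Hypothetical clusters: the two synonyms are merged, so 3 dimensions become 2.
clusters = [["car", "automobile"], ["bank"]]

doc_a = {"car": 3, "bank": 1}
doc_b = {"automobile": 3, "bank": 1}

sim_words = cosine_similarity(to_word_space(doc_a, vocabulary),
                              to_word_space(doc_b, vocabulary))
sim_clusters = cosine_similarity(to_cluster_space(doc_a, clusters),
                                 to_cluster_space(doc_b, clusters))

# The two samples look dissimilar in word space but nearly identical
# once the synonyms share a dimension.
print(sim_words, sim_clusters)
```

In this toy case the synonyms occupy separate dimensions in word space and so contribute nothing to the dot product, while in cluster space they share one dimension, which raises the similarity and shortens the vectors at the same time.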
Keywords/Search Tags:Bag of words, similarity measure, feature selection