
A Study On Concept-VSM And Its Application In Text Classification

Posted on: 2003-11-23
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Huang
Full Text: PDF
GTID: 2168360122960482
Subject: Computer software and theory
Abstract/Summary:
As the volume of information available on the Internet and corporate intranets continues to grow, there is increasing interest in helping people find, filter, and manage these resources. Text classification, the assignment of natural language texts to one or more predefined categories based on their content, is an important component of many information organization and management tasks. Its most widespread application has been assigning subject categories to documents to support text retrieval, routing, and filtering. In many contexts, trained professionals are employed to categorize new items. This process is time-consuming and costly, which limits its applicability. Rule-based approaches similar to those used in expert systems are common, but they generally require manual construction of the rules, make rigid binary decisions about category membership, and are typically difficult to modify. Another strategy is to use statistical analysis to construct classifiers automatically from labeled training data. The resulting classifiers have many advantages: they are easy to construct and update, they depend only on information that is easy for people to provide, they can be customized to specific categories of interest to individuals, and they allow users to trade off precision against recall smoothly depending on their tasks. A growing number of statistical classification methods have been applied to text categorization, including the Vector Space Model, the Naive Bayes model, and the Support Vector Machine model. The Vector Space Model (VSM) was proposed by G. Salton in the 20th century. In this model, each document is represented as a vector of words, as is typically done in the popular vector representation for information retrieval.
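The vector representation described above can be illustrated with a minimal sketch. It assumes raw term counts as vector components and cosine similarity as the comparison measure; the three sample documents and all function names are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def tokenize(text):
    """Split on whitespace and lowercase; a deliberately naive tokenizer."""
    return text.lower().split()


def build_vocabulary(texts):
    """Sorted list of every distinct term that occurs in the corpus."""
    return sorted({token for text in texts for token in tokenize(text)})


def to_vector(text, vocabulary):
    """Represent a document as raw term counts, one component per vocabulary term."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus: two finance documents and one sports document.
documents = [
    "stock market prices fall",
    "market prices rise on stock news",
    "team wins football match",
]

vocabulary = build_vocabulary(documents)
vectors = [to_vector(doc, vocabulary) for doc in documents]

sim_finance = cosine(vectors[0], vectors[1])  # documents sharing three terms
sim_mixed = cosine(vectors[0], vectors[2])    # documents sharing no terms

print(round(sim_finance, 3), sim_mixed)
```

Documents that share terms receive a positive similarity, while documents with no term in common receive exactly zero, regardless of whether their topics are related.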
Text classification is essentially semantic categorization, yet the VSM represents the contents of documents and queries only by a set of index terms, which can lead to poor classification performance. Latent semantic indexing (LSI) was proposed by S. T. Dumais in 1988. It is an algebraic model that has achieved good results in information retrieval: it maps document and query vectors into a lower-dimensional space by singular value decomposition, so that the inherent vagueness of a retrieval process based on keyword sets is considerably reduced and the semantic associations among documents are highlighted. LSI is useful for finding relations between terms in cases where human effort does not produce good results. In this way the synonymy problem can be solved, and the polysemy problem can be partially solved. Guided by LSI and VSM theory, and taking papers [1][2] as the foundation, this paper investigates text classification based on concept-VSM. First, the paper gives a brief introduction to the concepts of information, information retrieval, and computer information retrieval, and to their development. Second, it discusses the types of information retrieval models and the approach and basic contents of attribute theory. Third, it introduces the fundamental principles of LSI and uses an illustration and an example to show the advantages of LSI. The focus of my work has been on building a concept space based on VSM and LSI, presenting methods for calculating word similarity and text similarity in the concept space, acquiring concepts from a large training set, converting texts to text vectors, and constructing the basis vectors. Finally, this paper discusses future work on the classification problem in the concept space.
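The LSI mapping described above can be sketched on a toy corpus. This is an illustration under stated assumptions, not the thesis's implementation: the six documents are hypothetical, the rank k = 2 is chosen by hand, and the singular value decomposition is obtained by power iteration on the document Gram matrix only to keep the example free of external libraries (a library SVD routine would normally be used instead).

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_eigenpairs(sym, k, iters=500):
    """Top-k eigenpairs of a symmetric positive semidefinite matrix,
    found by power iteration with deflation."""
    n = len(sym)
    work = [row[:] for row in sym]
    pairs = []
    for _ in range(k):
        vec = [1.0] * n
        for _ in range(iters):
            nxt = [sum(work[i][j] * vec[j] for j in range(n)) for i in range(n)]
            length = math.sqrt(sum(x * x for x in nxt))
            if length == 0.0:
                break
            vec = [x / length for x in nxt]
        # Rayleigh quotient gives the eigenvalue for the converged unit vector.
        eigval = sum(
            vec[i] * sum(work[i][j] * vec[j] for j in range(n)) for i in range(n)
        )
        pairs.append((eigval, vec))
        # Deflate: remove the component just found before the next round.
        for i in range(n):
            for j in range(n):
                work[i][j] -= eigval * vec[i] * vec[j]
    return pairs


def lsi_document_coords(term_doc, k):
    """Coordinates of each document in a k-dimensional concept space.

    With A = U S V^T, the Gram matrix A^T A has eigenvectors V and
    eigenvalues S^2, so document d maps to (s_1 * v_1[d], ..., s_k * v_k[d]).
    """
    n_docs = len(term_doc[0])
    gram = [
        [sum(row[i] * row[j] for row in term_doc) for j in range(n_docs)]
        for i in range(n_docs)
    ]
    pairs = top_eigenpairs(gram, k)
    return [
        [math.sqrt(max(val, 0.0)) * vec[d] for val, vec in pairs]
        for d in range(n_docs)
    ]


# Hypothetical corpus: "car" and "automobile" never co-occur,
# but both co-occur with "engine".
docs = [
    ["car", "engine"],
    ["automobile", "engine"],
    ["car"],
    ["automobile"],
    ["apple", "fruit"],
    ["banana", "fruit"],
]
vocab = sorted({term for doc in docs for term in doc})
term_doc = [[doc.count(term) for doc in docs] for term in vocab]  # terms x docs

raw_columns = [[row[d] for row in term_doc] for d in range(len(docs))]
concept_coords = lsi_document_coords(term_doc, k=2)

raw_sim = cosine(raw_columns[2], raw_columns[3])            # "car" vs "automobile"
lsi_sim = cosine(concept_coords[2], concept_coords[3])      # same pair, concept space
lsi_cross = cosine(concept_coords[2], concept_coords[4])    # "car" vs "apple fruit"

print(raw_sim, round(lsi_sim, 3), round(lsi_cross, 3))
```

In the term space the documents "car" and "automobile" share no terms and have zero similarity; in the two-dimensional concept space they fall on the same latent dimension because both terms co-occur with "engine", which is the synonymy effect the abstract attributes to LSI.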
In the final part of this paper, theoretical analysis and experimental results both show that classification based on concept-VSM can significantly improve classification performance, achieving high precision and recall on average. Because of synonymy and polysemy, text classification based on words is inherently deficient; my thesis presents a text classification method based on concept-VSM, using a small but stronger concept space instead of the text vector space ba...
Keywords/Search Tags:text classification, latent semantic indexing, vector space model