Research On Concise Semantic Analysis For Text Categorization

Posted on: 2012-06-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z X Li
Full Text: PDF
GTID: 1118330362454273
Subject: Computer Science and Technology
Abstract/Summary:
Text is one of the most efficient means of disseminating and preserving information. We now face an ocean of information, most of it stored as text documents such as journal articles, web pages and news reports. These documents preserve the essence of human knowledge and help guarantee the continuity and development of human civilization. However, the sheer volume of text brings a "sweet trouble": more knowledge can be mined from text documents than ever before, but the mining process grows ever more difficult and complex. Automatic text categorization is an application that classifies and sorts text documents with the aid of computers. As a fundamental application of natural language processing and automatic text processing, it has drawn a lot of attention in the past 10 years.

Text categorization is a complex process with several stages, including text pre-processing, text representation, term weighting and classifier design. Among these stages, text representation is key: it determines how human-readable text is transformed into machine-readable data. Computers have excellent computing capabilities and huge storage capacity, but they cannot analyze the semantics of text as human brains do. Under the classical bag of words representation based on the vector space model, documents are represented as collections of words, and each word is assumed to be independent of the others. This method is simple and convenient but suffers from several disadvantages, such as high dimensionality and the loss of word order. To solve these problems, this dissertation proposes a concise semantic analysis (CSA) method for text categorization. In this way, words and text fragments are represented in a low-dimensional concept space instead of a high-dimensional word space.
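The idea of replacing the word space with a concept space can be illustrated with a minimal Python sketch. It assumes one concept per category label and a length-normalised frequency as the word-to-concept weight; the abstract does not give the actual CSA formula, so the weighting and the names `build_word_concept_vectors` and `represent` are hypothetical.

```python
from collections import defaultdict


def build_word_concept_vectors(docs, labels, concepts):
    """Map each word to a vector holding one weight per concept.

    Assumed weighting (not the dissertation's formula): a word's weight for
    a concept is the sum, over training documents labelled with that concept,
    of the word's frequency divided by the document length.
    """
    index = {concept: i for i, concept in enumerate(concepts)}
    vectors = defaultdict(lambda: [0.0] * len(concepts))
    for tokens, label in zip(docs, labels):
        if not tokens:
            continue  # skip empty documents to avoid dividing by zero
        for tok in tokens:
            vectors[tok][index[label]] += 1.0 / len(tokens)
    return dict(vectors)


def represent(tokens, word_vectors, n_concepts):
    """Represent a text fragment as the sum of its known words' concept vectors."""
    vec = [0.0] * n_concepts
    for tok in tokens:
        # Words never seen in training contribute nothing.
        for i, weight in enumerate(word_vectors.get(tok, ())):
            vec[i] += weight
    return vec


docs = [["goal", "match", "goal"], ["stock", "market"]]
labels = ["sports", "finance"]
concepts = ["sports", "finance"]
word_vectors = build_word_concept_vectors(docs, labels, concepts)
fragment = represent(["goal", "stock"], word_vectors, len(concepts))
```

Whatever the vocabulary size, `fragment` has only as many dimensions as there are concepts, which is the dimensionality reduction the abstract describes.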
To examine CSA's potential on large-scale data sets, we analyze its scalability and parallelization. To make use of word order, we propose a new representation model called string of words, in which a text document is represented as a string of vectors in concept space. In addition, two methods for calculating the similarity between vector strings are proposed, and a sim-k-NN classifier is designed to test the performance of the string of words representation. The main contributions of this dissertation are listed as follows:

I. A concise semantic analysis technique based on category information is proposed. Concepts derived from category labels are used to build a concept space in which words and text fragments are interpreted. Three ways of deriving concepts are designed to suit different kinds of corpora: direct derivation, split derivation and combine derivation.

II. A new method for calculating the relationship between words and concepts is proposed. This method takes document length as a factor of term weight. The experimental results show that CSA is efficient and effective in text categorization tasks.

III. CSA is studied in depth to analyze its scalability and its capacity for parallelization. The analysis shows that CSA parallelizes very well and scales well, which makes it suitable for large-scale data sets.

IV. A string of words representation is proposed that preserves the word order of text. Under this representation, text documents are represented as ordered vector strings in concept space so that the semantic flow can be reproduced. Two methods for calculating the similarity between vector strings are then proposed so that vector strings can be classified. Experimental results show that string of words achieves better accuracy than bag of words.

V. A news recommendation system for cell phones is designed based on CSA. It has several advantages, such as low bandwidth requirements, wide information coverage and effective privacy protection.
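The string of words representation and the similarity-based k-NN classifier can be sketched as follows. The abstract does not state the two similarity measures, so the order-aware alignment score used here is only a stand-in, and the names `cosine`, `string_similarity` and `sim_knn` are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two concept vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def string_similarity(s, t):
    """Order-aware similarity between two vector strings.

    Stand-in measure: the best order-preserving alignment score, found with
    an LCS-style dynamic program that uses cosine similarity as a soft match,
    divided by the length of the longer string.
    """
    if not s or not t:
        return 0.0
    prev = [0.0] * (len(t) + 1)
    for u in s:
        cur = [0.0]
        for j, v in enumerate(t, start=1):
            cur.append(max(prev[j], cur[j - 1], prev[j - 1] + cosine(u, v)))
        prev = cur
    return prev[-1] / max(len(s), len(t))


def sim_knn(query, train, k=3):
    """Return the majority label among the k training strings most similar to query.

    `train` is a list of (vector_string, label) pairs.
    """
    ranked = sorted(train, key=lambda item: string_similarity(query, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


doc_a = [[1.0, 0.0], [0.0, 1.0]]  # concept 0 first, then concept 1
doc_b = [[0.0, 1.0], [1.0, 0.0]]  # same vectors in the opposite order
same_order = string_similarity(doc_a, doc_a)
reversed_order = string_similarity(doc_a, doc_b)
```

A bag of words model would treat `doc_a` and `doc_b` as identical, while this measure scores the reversed string lower, which is the kind of word-order sensitivity the representation is meant to provide.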
Keywords/Search Tags: Concise Semantic Analysis, Text Representation, String of Words, Scalability and Parallelization, Recommendation System