Research On Key Problems In Text Mining

Posted on: 2009-09-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J N Hu
Full Text: PDF
GTID: 1118360245469618
Subject: Signal and Information Processing
Abstract/Summary:
Text mining refers broadly to the process of deriving high-quality information from text. It is an interdisciplinary research field spanning information retrieval, data mining, machine learning, statistics, and natural language processing. This dissertation focuses on key problems in the field, namely feature extraction for text classification, clustering analysis, and query expansion, and proposes the following novel algorithms.

(1) Discriminative Semantic Analysis based Text Feature Extraction. This dissertation proposes a new Robust linear Discriminant analysis Model (RDM) for high-dimensional text data. The RDM applies regularization to improve the generalization ability of the traditional linear discriminant model, and uses an energy-adaptive criterion to avoid the complex selection of the regularization parameter. Building on this robust model, the dissertation proposes a Discriminant Semantic Feature (DSF) algorithm. The algorithm first applies latent semantic analysis to the high-dimensional feature vectors and then uses robust discriminant analysis in the semantic space to extract the most discriminative semantic features of the text. Experimental results demonstrate that the DSF algorithm outperforms other common linear discriminant analysis algorithms. Moreover, its results are not affected by changes in the dimensionality of the latent semantic space, which demonstrates the robustness of the proposed RDM.

(2) Locality Discriminating Indexing based Text Feature Extraction. This dissertation studies manifold-based data modeling and proposes a new text feature extraction method called Locality Discriminating Indexing (LDI). The algorithm uses a nearest neighbor graph to describe the local structure within each class, and introduces the concept of an invader graph to depict the manifold overlaps between different classes. LDI finds the optimal linear subspace by solving a generalized eigenvalue problem, which enhances the compactness of the within-class manifold while reducing the overlaps between classes. In this way, LDI applies manifold learning to enhance the separability of text categories. Experiments show that the proposed algorithm outperforms other feature extraction methods based on manifold learning.

(3) Text Clustering using Adaptive Subcluster Merging. This dissertation proposes an Adaptive Subcluster Merging (ASM) algorithm to address the problem of discovering heterogeneous text clustering structures. The algorithm has two stages: subcluster partition and subcluster merging. The first stage expands subclusters by nearest neighbor: while the variance of the current subcluster is below the threshold, the subcluster's nearest neighbor is used to expand it. After this stage, every text in the database has been assigned to a subcluster, and all subclusters have the same granularity. The merging stage merges subclusters whose "edge density" is larger than the average density, based on the assumption that the inner density of a cluster is larger than its outer density. Experimental results on simulated data and text data show that the proposed algorithm overcomes the homogeneous results of variance-based clustering algorithms and also avoids the complicated selection of a density parameter.

(4) Semi-supervised Text Clustering using Local Consistency and Global Smoothing. The clustering results of unsupervised learning are often far from the true clusters in the data. To address this problem, this dissertation studies semi-supervised clustering and proposes a semi-supervised clustering algorithm based on Local Consistency and Global Smoothing (LCGS). LCGS uses a constraint equation to encode the supervised information and achieve local consistency, and imposes the global smoothing hypothesis through the cost function. LCGS thereby converts semi-supervised clustering into a constrained quadratic optimization problem, from which the optimal clustering result can be obtained. Experiments on the 20-Newsgroups dataset indicate that, using only 2% of the label information, the LCGS algorithm improves cluster validity by 60%.

(5) Fusion of Statistical Relevance and Semantic Similarity for Query Expansion. In a text retrieval system, query expansion algorithms can optimize the query expressions supplied by users and enhance the precision and efficiency of the system. This dissertation first introduces a query expansion algorithm based on Global Analysis (GA). The algorithm computes the co-occurrence and distance of term pairs, and then expands the query with the most relevant terms in order to clarify a fuzzy query. The dissertation then proposes a query expansion algorithm that integrates statistical relevance with semantic similarity computed using HowNet, so that the expanded terms are not only relevant to the original query but also semantically similar to it. Experimental results show that the proposed GA method outperforms the Rocchio algorithm, and that fusing GA with HowNet further enhances retrieval performance.
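
The subcluster partition stage of ASM described in (3) might look roughly like the sketch below. The abstract does not give the exact procedure, so several details are assumptions made for illustration: the function name `partition_subclusters`, the parameter `max_variance`, seeding each subcluster with the first unassigned point, measuring nearness to the subcluster centroid, and checking the variance of the subcluster as it would be after adding the neighbor. The merging stage is not shown because the abstract does not define "edge density".

```python
def mean(points):
    """Centroid of a list of equal-length coordinate tuples."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def variance(points):
    """Mean squared distance of the points to their centroid."""
    centre = mean(points)
    return sum(sq_dist(p, centre) for p in points) / len(points)

def partition_subclusters(points, max_variance):
    """Grow each subcluster by its nearest unassigned neighbor.

    Assumption: growth stops when adding the nearest neighbor would
    push the subcluster variance above max_variance.
    """
    unassigned = list(range(len(points)))
    subclusters = []
    while unassigned:
        # Assumption: seed a new subcluster with the first unassigned point.
        members = [unassigned.pop(0)]
        while unassigned:
            centre = mean([points[i] for i in members])
            nearest = min(unassigned, key=lambda i: sq_dist(points[i], centre))
            candidate = [points[i] for i in members] + [points[nearest]]
            if variance(candidate) > max_variance:
                break
            members.append(nearest)
            unassigned.remove(nearest)
        subclusters.append(members)
    return subclusters

# Toy data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
subclusters = partition_subclusters(points, max_variance=0.05)
```

On this toy data the three points near the origin end up in one subcluster and the two points near (5, 5) in another. Real text data would use high-dimensional term vectors in place of the 2-D tuples.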
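
The co-occurrence step of the Global Analysis expansion described in (5) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names, the `window` and `top_k` parameters, and the 1/distance weighting are all assumptions chosen for illustration. The HowNet similarity fusion is not shown.

```python
from collections import defaultdict

def cooccurrence_scores(docs, window=5):
    """Score term pairs by co-occurrence, discounted by token distance.

    Assumption: a pair that is `d` tokens apart contributes 1/d, and
    pairs more than `window` tokens apart are ignored.
    """
    scores = defaultdict(float)
    for tokens in docs:
        for i, a in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                b = tokens[j]
                if a == b:
                    continue
                weight = 1.0 / (j - i)
                scores[(a, b)] += weight
                scores[(b, a)] += weight  # keep the scores symmetric
    return scores

def expand_query(query_terms, scores, top_k=2):
    """Append the top_k terms with the highest total score against the query."""
    totals = defaultdict(float)
    for (a, b), score in scores.items():
        if a in query_terms and b not in query_terms:
            totals[b] += score
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(query_terms) + [term for term, _ in ranked[:top_k]]

# Toy corpus of already-tokenized documents.
docs = [
    "text mining extracts information from text".split(),
    "query expansion improves text retrieval".split(),
    "text retrieval ranks documents for a query".split(),
]
scores = cooccurrence_scores(docs)
expanded = expand_query(["retrieval"], scores, top_k=2)
```

Here the query `["retrieval"]` is expanded with the two terms that co-occur with it most closely in the toy corpus. A fusion step in the spirit of the abstract would combine each candidate's statistical score with a separately computed semantic similarity before ranking.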
Keywords/Search Tags: text categorization, feature extraction, text clustering, semi-supervised clustering, text retrieval, query expansion