Research On Key Techniques In Text Mining

Posted on: 2011-08-23 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: F Li | Full Text: PDF
GTID: 1118360305985123 | Subject: Control theory and control engineering
Abstract/Summary:
The popularity of the Internet has produced an ever-increasing volume of textual documents, which is a major challenge for people to handle. Text mining techniques were proposed and developed in response. This thesis focuses on the following technical points in text mining.

First, we study the whole process of constructing a vector space model from a text corpus, and build vector space models for two benchmark data sets, TanCorp (Chinese) and Reuters (English). A novel scheme that applies association rules to textual data is introduced to simplify the original high-dimensional data set, and an incremental update algorithm is also put forward.

Next, we investigate non-negative matrix factorization (NMF) and its application to text clustering. Two novel NMF algorithms based on a transformation matrix are proposed to improve convergence performance. Theoretical analysis and computer simulation show that the proposed methods are more efficient than prior schemes.

Considering the nonlinearity of text data and the difficulty of classifying it, we place further emphasis on kernel clustering methods. After summarizing prior kernel schemes, such as the kernel clustering algorithm, the fuzzy kernel clustering algorithm, and the semantic-kernel-based local adaptive clustering (LAC) algorithm, we propose a Gaussian-semantic-kernel-based LAC algorithm to improve the efficiency of kernel clustering. Simulations on an artificially generated data set and on the Reuters corpus verify that the proposed algorithm is effective.

Then, to deal with hierarchical relations in text data, we study hierarchical clustering methods. We put forward an NMF-based hierarchical clustering algorithm and give two different schemes for hierarchical clustering based on NMF. Simulation on the TanCorp corpus shows that applying NMF to the feature-document matrix achieves nearly the same performance as the current scheme while greatly reducing computational complexity.

Finally, a prototype system for Science and Research information's Auto-Suggestion, based on text mining methods, is proposed to extract useful technical information; it was validated by an initial simulation.
Keywords/Search Tags: text mining, vector space model, text clustering, non-negative matrix factorization, hierarchical clustering, kernel function, local adaptive clustering, testor theory, Science and Research information's Auto-Suggestion
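
As a rough illustration of two building blocks named in the abstract (a vector space model and NMF-based text clustering), the following is a minimal Python sketch. It is not the thesis's method: the abstract does not specify the proposed transformation-matrix NMF variants, so the code uses the standard multiplicative-update rules for the Frobenius objective instead. The function names (`tfidf_matrix`, `nmf`, `frobenius_error`), the smoothed IDF weighting, the whitespace tokenizer, and the four-document toy corpus are all assumptions made for illustration; a Chinese corpus such as TanCorp would need word segmentation first.

```python
import math
import random


def tfidf_matrix(docs):
    """Build a term-by-document TF-IDF matrix (rows: terms, columns: documents)."""
    tokenized = [doc.lower().split() for doc in docs]  # assumed: whitespace tokens
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    matrix = []
    for term in vocab:
        # Document frequency: number of documents that contain the term.
        df = sum(1 for toks in tokenized if term in toks)
        # Assumed weighting: smoothed IDF, always positive.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        matrix.append([toks.count(term) * idf for toks in tokenized])
    return vocab, matrix


def matmul(a, b):
    """Plain-Python product of two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Standard multiplicative-update NMF: V (m x n) ~ W (m x rank) H (rank x n)."""
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    # Strictly positive random initialization.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    h = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(m)]
    return w, h


# Hypothetical toy corpus with two obvious topics.
docs = [
    "stock market shares trade",
    "market trade stock prices",
    "football match goal team",
    "team goal football league",
]
vocab, V = tfidf_matrix(docs)
W, H = nmf(V, rank=2)
# Assign each document to the factor with the largest coefficient in H.
labels = [max(range(2), key=lambda k: H[k][j]) for j in range(len(docs))]
print(labels)
```

Each column of `H` gives a document's weights over the `rank` latent factors, so taking the largest weight per column is one simple way to read cluster labels off the factorization. Which numeric label a topic receives depends on the random initialization.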