Research On Key Techniques In Text Mining

Posted on:2011-08-23

Degree:Doctor

Type:Dissertation

Country:China

Candidate:F Li

Full Text:PDF

GTID:1118360305985123

Subject:Control theory and control engineering

Abstract/Summary:

The popularity of the Internet has produced an ever-increasing volume of textual documents, which is a big challenge for people to handle. Text mining techniques were proposed and developed in response. This thesis focuses on the following technical points in text mining.

First, we study the whole process of constructing a vector space model from a text corpus and build vector space models for two benchmark data sets, TanCorp (Chinese) and Reuters (English). A novel scheme that applies association rules to textual data is introduced to simplify the original high-dimensional data set, and an incremental update algorithm is also put forward.

Next, we study non-negative matrix factorization (NMF) and its application to text clustering. Two novel NMF algorithms based on a transformation matrix are proposed to improve convergence performance. Theoretical analysis and computer simulation show that the proposed methods are more efficient than prior schemes.

Considering the nonlinearity of text data and the difficulty of classifying it, we also place emphasis on kernel clustering methods. After summarizing prior kernel schemes, such as the kernel clustering algorithm, the fuzzy kernel clustering algorithm, and the semantic-kernel-based local adaptive clustering (LAC) algorithm, we propose a Gaussian-semantic-kernel-based LAC algorithm to improve the efficiency of kernel clustering. Simulations on an artificially generated data set and on the Reuters corpus verify that the proposed algorithm is effective.

Then, to deal with hierarchical relations in text data, hierarchical clustering methods are studied. We put forward an NMF-based hierarchical clustering algorithm and give two different schemes of hierarchical clustering based on NMF. Simulation on the TanCorp corpus shows that applying NMF to the feature-document matrix achieves nearly the same performance as the current scheme while greatly reducing computational complexity.

Finally, a prototype system for Science and Research information's Auto-Suggestion, based on text mining methods, is proposed to extract useful technical information; it is validated by an initial simulation.
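
To make the idea of NMF-based text clustering concrete, the following is a minimal sketch only. It uses the standard multiplicative update rules for NMF, not the transformation-matrix algorithms proposed in the thesis, whose details the abstract does not give. The toy term-document matrix, the function names (`nmf`, `cluster_documents`), and the parameter choices are all illustrative assumptions.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Frobenius norm of V - WH, the reconstruction error."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5

def nmf(v, rank, iterations=500, seed=0, eps=1e-9):
    """Factor V (terms x docs) into non-negative W (terms x rank) and H (rank x docs).

    Uses the standard multiplicative updates (an assumption for illustration;
    the thesis proposes different, transformation-matrix-based algorithms).
    """
    rng = random.Random(seed)
    n_terms, n_docs = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_terms)]
    h = [[rng.random() + 0.1 for _ in range(n_docs)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_docs)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n_terms)]
    return w, h

def cluster_documents(h):
    """Assign each document (a column of H) to the factor with the largest weight."""
    return [max(range(len(h)), key=lambda k: h[k][j]) for j in range(len(h[0]))]

# Hypothetical term-document counts: rows are terms, columns are documents.
V = [
    [3, 2, 0, 0],  # "kernel"
    [2, 3, 0, 0],  # "cluster"
    [0, 0, 4, 3],  # "market"
    [0, 0, 3, 4],  # "price"
]
W, H = nmf(V, rank=2)
labels = cluster_documents(H)
print(labels, round(frobenius_error(V, W, H), 3))
```

In this sketch each column of `H` gives a document's weights over the two latent factors, and the document is assigned to the factor with the largest weight. Real corpora such as TanCorp or Reuters would need a sparse matrix library and a term-weighting step first.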

Keywords/Search Tags:

text mining, vector space model, text clustering, non-negative matrix factorization, hierarchical clustering, kernel function, local adaptive clustering, testor theory, Science and Research information's Auto-Suggestion

Related items

1	The Research On Web Text Clustering Based On DBSCAN Optimized Algorithm
2	Research Of Text Clustering Based On NMF Algorithm
3	Research On Text Clustering Problems Of Kernel Function And Self-definite Category Number
4	Researching The Kernel Clustering Algorithm And Its Application In Text Clustering
5	The Research Of Clustering Analysis's Application In Web Text Mining
6	Semi-supervised Image Clustering Based On Non-negative Matrix Factorization
7	A SOM-based Text Clustering And Apply To Search Result
8	Clustering Algorithm Based On Robust Non-negative Matrix Factorization
9	The Application Of Rough-Set-Model Based Text Clustering Algorithm In The Text Filtering
10	Research Of Text Clustering Technology Based On Colony Intelligence