The Research On Web Text Clustering Based On DBSCAN Optimized Algorithm

Posted on: 2012-11-26
Degree: Master
Type: Thesis
Country: China
Candidate: F F Xu
Full Text: PDF
GTID: 2178330335965814
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet and Web technology, the Web has become a massive, dynamic information resource. Text is the most important carrier of information, and research shows that 80% of information is contained in text. People therefore urgently need tools that discover resources and knowledge in Web text data quickly and efficiently. Web text clustering is both an effective method of text mining and one of the core technologies of Web data mining. In this thesis, the research on Web text clustering is conducted as follows:
(1) First, the thesis studies the key technologies of Web text clustering in depth and discusses the related text preprocessing techniques, such as Web page collection, text denoising and segmentation, text representation, and feature dimension reduction. To address the disadvantages of TFIDF, SDI-TFIDF is presented to compute the weights of the features.
(2) Second, it introduces methods of measuring text similarity in the vector space model, analyzes Web text clustering algorithms, compares several typical clustering methods from multiple aspects, and introduces the evaluation criteria for text clustering.
(3) Third, it introduces the idea of the traditional DBSCAN algorithm and analyzes its limitations. Building on the IF-DBSCAN algorithm, and targeting defects of traditional DBSCAN such as the time-consuming construction of a complicated R*-tree and poor results on unevenly distributed data, it presents a strategy that combines hash-table neighborhood queries with kernel function clustering to refine the DBSCAN algorithm.
(4) Finally, the collected Web texts are clustered, and the subsequent experiments show that the refined algorithm performs well in clustering Web text.
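To make the vector space model step concrete, the following is a minimal sketch of plain TF-IDF weighting and cosine similarity. It is only an illustration: the abstract does not define SDI-TFIDF, so standard TF-IDF is used instead, and the function names (`tfidf_vectors`, `cosine`) and the toy documents are assumptions, not taken from the thesis.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Standard TF-IDF (not the thesis's SDI-TFIDF, which the abstract does not define).
    """
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical, already-segmented toy documents.
docs = [
    ["web", "text", "clustering"],
    ["web", "text", "mining"],
    ["kernel", "function", "density"],
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # share "web" and "text"
sim_unrelated = cosine(vecs[0], vecs[2])  # share no terms
```

Documents that share terms get a positive similarity, while documents with no terms in common get a similarity of zero; a clustering algorithm can then work on `1 - cosine` as a distance.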
Keywords/Search Tags:Web text clustering, Vector Space Model, DBSCAN, hash table method, kernel function clustering
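
The idea of replacing an R*-tree with hash-table neighborhood queries can be sketched as follows. This is an assumption-laden illustration, not the thesis's algorithm: the abstract does not describe the actual hashing scheme, so a uniform grid with cell side `eps` is used here, the kernel function clustering step is omitted, and all names (`build_grid`, `region_query`, `dbscan`) are hypothetical. The grid approach checks 3^d neighboring cells, so it is only practical for low-dimensional points, for example text vectors after strong dimension reduction.

```python
import math
from collections import defaultdict, deque
from itertools import product

NOISE = -1

def build_grid(points, eps):
    """Hash every point index into a grid cell of side eps (assumed scheme)."""
    grid = defaultdict(list)
    for i, p in enumerate(points):
        grid[tuple(math.floor(x / eps) for x in p)].append(i)
    return grid

def region_query(points, grid, i, eps):
    """Return indices within eps of point i by scanning only adjacent cells."""
    p = points[i]
    cell = tuple(math.floor(x / eps) for x in p)
    found = []
    for offset in product((-1, 0, 1), repeat=len(cell)):
        key = tuple(c + o for c, o in zip(cell, offset))
        for j in grid.get(key, ()):
            if math.dist(p, points[j]) <= eps:
                found.append(j)
    return found

def dbscan(points, eps, min_pts):
    """Classic DBSCAN labeling, with the grid hash as the neighborhood index."""
    grid = build_grid(points, eps)
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, grid, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, grid, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels

# Hypothetical 2-D toy data: two dense groups and one isolated point.
points = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (10, 10)]
labels = dbscan(points, eps=0.3, min_pts=3)
```

Building the grid takes a single pass over the data, which is the kind of saving over constructing an R*-tree that the abstract alludes to; the isolated point in the toy data ends up labeled as noise (`-1`).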