WEB Mining System

Posted on: 2008-01-09 | Degree: Master | Type: Thesis
Country: China | Candidate: M J Guan | Full Text: PDF
GTID: 2178360215491308 | Subject: Management Science and Engineering
Abstract/Summary:
The rapid development of the Internet has led to rapid growth in online information. The resulting "information explosion" can no longer be ignored: it has created serious problems, above all by making useful information and knowledge hard to access. Some 300 million WEB pages now form a huge distributed information space that contains abundant knowledge resources. This thesis studies WEB information collection, WEB page purification, text clustering, and Chinese word segmentation, as follows. (1) Starting from the theory of website information acquisition, the algorithms currently used in this field are studied and compared. (2) To process network information efficiently, WEB pages must first be purified, that is, stripped of noise. The basic principles of WEB page purification are explained, and various purification techniques are analyzed. (3) A new WEB page purification algorithm based on the DOM tree is proposed. It works by comparing the DOM trees of pages from the same website, exploiting the fact that the noise in pages of the same website is relatively similar. (4) Popular Chinese word segmentation algorithms are compared, including dictionary-matching methods, methods based on statistical word frequency, and knowledge-based methods. (5) The construction of a WEB document feature vector using the vector space model is described in detail. (6) Two typical clustering algorithms, the k-means algorithm and the SOM (self-organizing map) algorithm, are implemented. (7) Finally, a novel WEB clustering algorithm, named the projection WEB clustering algorithm, is put forward.
Keywords/Search Tags: WEB Text Mining, Page collection, Page purification, Chinese word segmentation, WEB clustering
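
The idea in item (3), that noise is similar across pages of the same website, can be illustrated with a small sketch. The abstract does not give the details of the thesis algorithm, so everything below is an assumption made for illustration only: the names `BlockCollector`, `extract_blocks`, and `purify`, the use of a tag path as a stand-in for a node's position in the DOM tree, and the `noise_ratio` threshold are all hypothetical. A text block that appears at the same DOM path in most pages of a site is treated as noise (navigation, footer) and dropped.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # tags with no end tag
SKIP_TAGS = {"script", "style"}                           # never visible text


class BlockCollector(HTMLParser):
    """Collect (dom_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.path = []    # stack of currently open tags
        self.blocks = []  # list of (path string, normalized text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.path.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.path:
            while self.path and self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not (self.path and self.path[-1] in SKIP_TAGS):
            self.blocks.append(("/".join(self.path), text))


def extract_blocks(html):
    """Parse one page and return its (dom_path, text) blocks."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks


def purify(pages, noise_ratio=0.8):
    """Return, for each page, the text blocks that are not shared site-wide.

    noise_ratio is an assumed tuning parameter: a block found in at least
    this fraction of the pages is considered noise.
    """
    per_page = [extract_blocks(page) for page in pages]
    seen_in = Counter()
    for blocks in per_page:
        seen_in.update(set(blocks))  # count each block once per page
    threshold = noise_ratio * len(pages)
    return [
        [text for (path, text) in blocks if seen_in[(path, text)] < threshold]
        for blocks in per_page
    ]
```

Given several pages from one site that share a navigation link and a footer but differ in their article text, `purify` would keep only the article text of each page. A real system would need many sample pages per site and a more robust tree comparison than exact path-and-text matching.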