Research and Realization of Web Information Retrieval Based on K-Means Clustering

Posted on: 2013-01-13
Degree: Master
Type: Thesis
Country: China
Candidate: H M Huang
Full Text: PDF
GTID: 2248330395451096
Subject: Computer technology
Abstract/Summary:
Applying traditional Information Retrieval (IR) technology to Web search raises new challenges and difficulties. Web search has absorbed some advantages of traditional IR technology, applied some methods of its own, and opened new research areas and approaches for IR.

The approach of this thesis combines web page parsing with pre-processing of the target page's content: acquiring the tag node information in the web page, removing stop words from the node information, counting the word frequency of each node, and applying the word frequency statistics to build the web page's vector. The K-means algorithm from data mining is then used to cluster the retrieval results from the Web, and the clustered results are returned to the user. After cluster analysis, some redundant information has been filtered out of the original retrieval results, and users can more conveniently find the information that interests them in the clustered results.

The ideas in this thesis are implemented using the Eclipse IDE and the Tomcat web server, with Struts as the development framework. The completed system includes key modules such as a web information extraction module, an information feature extraction and transformation module, a feature information cluster analysis module, and a cluster results presentation module. The experimental results show that the approach is feasible in application.
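The pipeline described in the abstract (stop-word removal, word frequency counting, page vectors, K-means) can be illustrated with a minimal Java sketch. This is not the thesis's code: the class name `KMeansSketch`, the tiny stop-word list, the use of raw term counts as vector components, squared Euclidean distance, and seeding the centroids with the first k vectors are all assumptions made for illustration. The input strings stand in for text already extracted from tag nodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class KMeansSketch {

    // Illustrative stop-word list (an assumption; the thesis does not give its list).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "and", "to", "in", "is"));

    /** Counts word frequencies in one piece of node text, skipping stop words. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /** Maps each document to a raw-count vector over the shared, sorted vocabulary. */
    static double[][] vectorize(List<String> docs) {
        List<Map<String, Integer>> tfs = new ArrayList<>();
        TreeSet<String> vocabulary = new TreeSet<>();
        for (String doc : docs) {
            Map<String, Integer> tf = termFrequencies(doc);
            tfs.add(tf);
            vocabulary.addAll(tf.keySet());
        }
        List<String> terms = new ArrayList<>(vocabulary);
        double[][] vectors = new double[docs.size()][terms.size()];
        for (int i = 0; i < docs.size(); i++) {
            for (int j = 0; j < terms.size(); j++) {
                vectors[i][j] = tfs.get(i).getOrDefault(terms.get(j), 0);
            }
        }
        return vectors;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return sum;
    }

    /**
     * Lloyd-style K-means. Assumes data.length >= k >= 1.
     * The first k vectors seed the centroids (a simplification for this sketch).
     * Returns the cluster index assigned to each input vector.
     */
    static int[] cluster(double[][] data, int k, int maxIterations) {
        int dim = data[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = data[c].clone();
        int[] labels = new int[data.length];
        Arrays.fill(labels, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each vector to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(data[i], centroids[c])
                            < squaredDistance(data[i], centroids[best])) {
                        best = c;
                    }
                }
                if (labels[i] != best) {
                    labels[i] = best;
                    changed = true;
                }
            }
            if (!changed) break; // converged: no assignment moved

            // Update step: move each non-empty cluster's centroid to its mean.
            double[][] sums = new double[k][dim];
            int[] sizes = new int[k];
            for (int i = 0; i < data.length; i++) {
                sizes[labels[i]]++;
                for (int j = 0; j < dim; j++) sums[labels[i]][j] += data[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (sizes[c] == 0) continue; // keep the old centroid for an empty cluster
                for (int j = 0; j < dim; j++) centroids[c][j] = sums[c][j] / sizes[c];
            }
        }
        return labels;
    }
}
```

A caller would pass the extracted text of each retrieved page to `KMeansSketch.vectorize(...)` and then call `KMeansSketch.cluster(vectors, k, maxIterations)`; pages that share a label would be presented to the user as one group. A real system would likely weight terms (for example with TF-IDF) and choose the initial centroids more carefully than this sketch does.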
Keywords/Search Tags: K-means, Cluster Analysis, Web Information Extraction