Research And Implementation On Key Technology Of Web Text Collection And Analysis

Posted on: 2010-07-06
Degree: Master
Type: Thesis
Country: China
Candidate: B Q Ding
GTID: 2178330332478616
Subject: Communication and Information System
Abstract/Summary:
With the rapid development of the Internet, the Web has become a very important platform for people to publish and obtain information, and the number of web pages has grown rapidly. One of the most active topics in Internet technology today is how to find the information that users care about both effectively and quickly. To that end, and to address the shortcomings of most current web information search software, this thesis carries out the following studies.

First, building on a closer analysis of how web documents are structured, and weighing the strengths and weaknesses of existing web segmentation algorithms, this thesis introduces a DOM-tree-based approach built on the VIPS algorithm, which achieves accurate web page segmentation. With this segmentation in place, web page noise can be removed easily and the correct information extracted.

Second, this thesis studies update detection for web pages. After analyzing the weaknesses of currently popular algorithms, it puts forward a new update detection scheme based on web page segmentation. The new scheme makes it possible to collect Internet data incrementally and to simplify the complex information on a page.

Third, this thesis studies web page ranking algorithms and proposes a new method, BHITS, which assigns a weight to each web page block, avoids topic diffusion and topic drift, and supports topic-based search.

Following practical and effective design principles, an Internet information retrieval system was developed using focused spider technology, text categorization technology, and the key technologies above. The system provides fast, multi-mode search of Internet text information.
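The abstract does not give the details of the update detection scheme. As a rough illustration of the general idea of block-level update detection only, the sketch below assumes a page has already been segmented into blocks and fingerprints each block separately, so that a change in one block (for example a navigation bar) can be told apart from a change in the main content. The function names, the block ids, and the choice of SHA-256 are all assumptions made for this example and are not taken from the thesis.

```python
import hashlib


def block_fingerprints(blocks):
    """Map each block id to a SHA-256 digest of its whitespace-normalized text.

    `blocks` is a hypothetical dict {block_id: text} produced by some
    page segmentation step (not shown here).
    """
    return {
        block_id: hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()
        for block_id, text in blocks.items()
    }


def changed_blocks(old_fp, new_fp):
    """Return the sorted ids of blocks that were added, removed, or modified."""
    ids = set(old_fp) | set(new_fp)
    return sorted(i for i in ids if old_fp.get(i) != new_fp.get(i))
```

A crawler built along these lines could store the fingerprints from the previous visit and re-index only the blocks returned by `changed_blocks`, which is one way to read the "incremental" collection mentioned in the abstract.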
Keywords/Search Tags: Focused spider, Web page noise, Text classification, Web page sub-block, Update detection, Web page ranking