Research And Implementation Of Topic Web Crawler Oriented To Web Mining

Posted on: 2013-04-26  Degree: Master  Type: Thesis
Country: China  Candidate: X L Zhang  Full Text: PDF
GTID: 2248330395955658  Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, more and more information is available to people, and search engines have become the mainstream way to access it. However, Web resources are growing explosively and are discrete, heterogeneous, semi-structured and updated in real time, so how to mine and analyze them and extract information on a particular, user-defined topic has become an important research subject.

This thesis studies Web-based topic search oriented to Enterprise Competitive Information. After introducing the research background and the current state of the field, it discusses the key technologies of Web mining and search engines in detail. The main research work is as follows:

Topic Web Crawler: Based on a comprehensive analysis of the search algorithms that existing search engines use in Web mining, the system improves the relevant search strategy and proposes a non-greedy genetic search algorithm.

Web Document Analysis: The crawler converts a Web document to its corresponding tree structure using the HTML Tidy tool and extracts the relevant information with different traversal algorithms according to the needs of users. After the page content has been extracted and segmented, the feature vector of the text is built using an improved method for calculating the weights of feature items.

Topic Correlation Evaluation: Building on a vector space model evaluation of the topic correlation of page content, the topic web crawler calculates the topic correlation of each Web hyperlink by combining its anchor text, its URL string, and the topic-relevant pages associated with it.

On the basis of the above research work, a Topic Web Crawler system oriented to Enterprise Competitive Information is designed and implemented.
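
The topic correlation evaluation described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the improved feature-weighting method or the way the three link signals are combined, so the sketch assumes plain term-frequency vectors, cosine similarity, and a weighted sum with made-up weights. The names `term_vector`, `cosine`, `link_score` and the `weights` parameter are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_vector(text):
    """Build a raw term-frequency vector.

    Assumption: plain counts stand in for the thesis's improved
    feature-item weighting, which the abstract does not specify.
    """
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse term vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_score(topic, anchor_text, url, parent_page_text,
               weights=(0.5, 0.2, 0.3)):
    """Score a hyperlink against a topic description.

    Combines the similarity of the topic to the anchor text, to the URL
    string, and to the page that contains the link. The weights are
    illustrative assumptions, not values from the thesis.
    """
    w_anchor, w_url, w_parent = weights
    topic_vec = term_vector(topic)
    return (w_anchor * cosine(topic_vec, term_vector(anchor_text))
            + w_url * cosine(topic_vec, term_vector(url))
            + w_parent * cosine(topic_vec, term_vector(parent_page_text)))
```

A crawler could use such a score to order its frontier, visiting higher-scoring links first. A non-greedy strategy such as the one the thesis proposes would presumably still follow some lower-scoring links on occasion, rather than always taking the best one.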
Keywords/Search Tags: Web Mining, Topic Web Crawler, Correlation Calculation, Search Algorithm, Text Classification Algorithm