Design and Implementation of a Topic-Focused Crawler Based on Breadth-First Search

Posted on: 2012-02-23
Degree: Master
Type: Thesis
Country: China
Candidate: H Wang
Full Text: PDF
GTID: 2248330371465294
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet, the volume of online information has grown explosively. This information contains a wealth of business opportunities and human knowledge. Web search applications emerged in response and quickly became a hot topic in computer science, and satisfying users' information needs as fully as possible has long been a goal of software developers. Led by the large search engine companies, general-purpose search has become very mature and supports a variety of services built around data search. At the same time, practical projects that mine Internet information for research purposes have gradually appeared. However, undifferentiated general-purpose search often returns a great deal of data unrelated to research needs. In particular, pages are filled with advertising and links to unrelated pages, so the final search results differ greatly from what users expect. The existing literature on topic-based search does not give a concrete implementation that is workable in engineering practice.
This thesis borrows mature text analysis techniques to assess the value of page content and assign each page a relevance score. For engineering reasons, only the links and content of high-value articles are crawled further, so that topic-related linked pages are concentrated into "topic clouds", an approach that is achievable in practice. It also takes into account that links between topic clouds are weak (the "tunnel phenomenon"), which easily gives rise to a so-called "dark web" phenomenon.
This thesis combines a breadth-first algorithm with a dynamic text-relevance threshold. Drawing on mature existing technology, namely the Nutch crawler and the ICTCLAS Chinese text analysis system, and optimizing the performance of web data storage and URL deduplication, it implements a breadth-first topic-focused crawler. In actual use, the crawler still relies on data correction and training by a data administrator to complete the crawl for a specific topic. Nevertheless, compared with purely manual data collection, the results in practice show high overall quality and quite good efficiency.
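The combination of breadth-first traversal, a relevance threshold, and URL deduplication described above can be illustrated with a minimal sketch. This is not the thesis's implementation: the in-memory `PAGES` table, the word-fraction `relevance` score (standing in for ICTCLAS-based text analysis), and the threshold update rule (half the running mean of accepted scores, never below a base value) are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a": ("crawler search crawler index", ["b", "c"]),
    "b": ("buy cheap ads now", ["d"]),
    "c": ("search engine crawler design", ["a", "e"]),
    "d": ("crawler search spider", []),
    "e": ("spider search notes today", ["d"]),
}
TOPIC_TERMS = {"crawler", "search", "spider"}


def relevance(text, topic_terms):
    """Fraction of words that are topic terms (a stand-in for real text analysis)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in topic_terms for w in words) / len(words)


def crawl(seeds, fetch, topic_terms, base_threshold=0.2, max_pages=100):
    """Breadth-first topic crawl; returns accepted URLs in visit order."""
    queue = deque(seeds)   # FIFO queue gives breadth-first order
    seen = set(seeds)      # URL deduplication: each URL is queued at most once
    kept, scores = [], []
    threshold = base_threshold
    while queue and len(kept) < max_pages:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        score = relevance(text, topic_terms)
        if score < threshold:
            continue       # low-value page: neither stored nor expanded
        kept.append(url)
        scores.append(score)
        # Assumed update rule: half the running mean, never below the base value.
        threshold = max(base_threshold, 0.5 * sum(scores) / len(scores))
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return kept


if __name__ == "__main__":
    print(crawl(["a"], PAGES.get, TOPIC_TERMS))  # ['a', 'c', 'e', 'd']
```

In this toy graph the advertising page `b` scores below the threshold, so its links are never followed; page `d` is still reached through the relevant page `e`. A real crawler would replace `PAGES.get` with an HTTP fetcher and the word-fraction score with proper Chinese word segmentation and text analysis.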
Keywords/Search Tags: Breadth-first search, Multithreading, Spider, Web Robot, Clustering Search