Research And Implementation Of Focused Crawler Based On URL Patterns

Posted on: 2018-11-16
Degree: Master
Type: Thesis
Country: China
Candidate: P R Hu
Full Text: PDF
GTID: 2428330512983561
Subject: Computer application technology
Abstract/Summary:
With the rapid development and large-scale expansion of the Internet, retrieving the necessary information efficiently and accurately has become an increasingly urgent need. A traditional general-purpose search engine uses a generic crawler that traverses the entire network to collect its data. It can give users broad coverage, but it cannot satisfy users who need targeted search within a specific field. Vertical search engines for specific fields emerged to fill this gap. The focused crawler is the basic data source of a vertical search engine, so it directly affects the search results and the user experience. How to improve both the quality of the data collected by a focused crawler and the performance of the crawler has gradually become a research hotspot. Web page information extraction and the focused crawling strategy are the core components of a focused crawler and are closely tied to its data quality and performance. This paper therefore focuses on these two aspects. The main work is as follows:

(1) Existing web page title extraction methods mostly rely on HTML tags or simple heuristics, and their extraction accuracy is low. This paper observes that, in the way sites are built, a parent page that references a subpage often uses the anchor text to summarize the core content of that subpage, so the anchor text and the subpage title are usually strongly associated, and the crawler can obtain the anchor text directly during crawling. Based on this, the paper proposes a web page title extraction algorithm based on multiple-condition decision (TEBC) to improve the precision and recall of title extraction.

(2) Among existing web page content extraction techniques, the high-accuracy algorithms are usually based on the syntactic structure of the HTML code. They must first convert the HTML into a DOM tree, which costs additional time. This paper proposes a web page content extraction algorithm based on the comprehensive multi-characteristics of HTML valid lines (CELCC). It ignores the structural features of HTML, uses the HTML valid line as the basic unit of calculation, and constructs a statistic called comprehensive text density from multiple characteristics of each valid line. Candidate text blocks are then obtained by computing the maximum contiguous subsequence over the grouped lines, and the content is finally extracted by verifying the candidate text blocks. Experimental results show that the proposed approach extracts the text accurately and efficiently.

(3) Traditional focused crawling strategies are mostly based either on topical locality, which often requires tunneling techniques to solve the island problem, or on link analysis, which requires massive computation. To improve crawler performance, and drawing on the way sites organize information and on the features of URLs, this paper introduces the URL pattern into the focused crawler and proposes a focused crawling strategy based on URL patterns (UPFC) that runs in a double-crawler framework. In the learning crawler stage, the features obtained by web page information extraction are used to calculate topical relevance, and the required sample data is collected starting from the seeds. A URL pattern construction algorithm based on MDL then builds the URL patterns, and the HITS algorithm analyzes the generated pattern graph, combining content authority and link authority to calculate the importance of each URL pattern. In the focused crawler stage, the generated URL patterns determine the topical relevance and the guiding significance of each page, and the priority of the links to be crawled is predicted from the importance of the matching URL pattern. Experimental results show that the strategy guides the crawler to relevant pages effectively, maintains precision and recall, and improves crawling efficiency.

(4) The web page information extraction techniques and the focused crawling strategy described above, together with the other core crawler modules (crawler initialization, web page parsing, web page de-duplication, and so on), are applied to an actual crawler system, and a complete focused crawler system for the tax field is designed and implemented. Its use in production verifies the effectiveness and practicability of the proposed algorithms and strategy.
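
To make the line-based extraction idea in item (2) more concrete, the following is a minimal sketch and not the CELCC algorithm itself. The abstract does not define the comprehensive text density, so the sketch assumes a much simpler stand-in score: the visible text length of a line minus a fixed threshold. The names `line_score`, `best_block`, `extract_content` and the threshold value of 20 are all hypothetical. The maximum contiguous subsequence is found with Kadane's algorithm, and the verification step for candidate blocks is omitted.

```python
import re

# Matches any HTML tag; used only to measure the visible text on a line.
TAG_RE = re.compile(r"<[^>]+>")


def line_score(line: str, threshold: int = 20) -> int:
    """Assumed stand-in for a text density: visible text length minus a threshold.

    Lines with little visible text get a negative score, text-heavy lines a
    positive one. The thesis combines several line features instead.
    """
    text = TAG_RE.sub("", line).strip()
    return len(text) - threshold


def best_block(lines, threshold: int = 20):
    """Return (start, end), inclusive, of the contiguous run of lines with the
    largest total score, using Kadane's maximum-subarray algorithm."""
    best_sum = None
    best_range = (0, -1)
    cur_sum = 0
    cur_start = 0
    for i, line in enumerate(lines):
        score = line_score(line, threshold)
        if cur_sum <= 0:
            # A non-positive running sum cannot help; restart the run here.
            cur_sum = score
            cur_start = i
        else:
            cur_sum += score
        if best_sum is None or cur_sum > best_sum:
            best_sum = cur_sum
            best_range = (cur_start, i)
    return best_range


def extract_content(html: str, threshold: int = 20) -> str:
    """Split the HTML into non-empty lines and return the text of the best block."""
    lines = [line for line in html.splitlines() if line.strip()]
    start, end = best_block(lines, threshold)
    return "\n".join(TAG_RE.sub("", line).strip() for line in lines[start:end + 1])
```

Because the score is computed per line without building a DOM tree, the whole pass is linear in the number of lines, which matches the efficiency motivation stated in item (2). A real implementation would need a richer score and a check on the candidate block before accepting it.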
Keywords/Search Tags: Information extraction, Focused crawler strategy, HTML valid line, URL pattern, URL pattern graph