
Research On The Topic Crawler Search Strategy And Subject Discrimination

Posted on: 2018-10-27
Degree: Master
Type: Thesis
Country: China
Candidate: Z J Liu
Full Text: PDF
GTID: 2348330542470082
Subject: Computer technology
Abstract/Summary:
With the rapid development of Internet technology, the information resources available on the network are growing rapidly. Because the Internet is open and shared, it offers rich information, but that same abundance makes it difficult for general-purpose search engines to return good results for user-specific needs. In response, research on intelligent search engines has become a trend, and the topic web crawler plays an important role in it. It is therefore important to study how to acquire topic-related information quickly and efficiently and how to return the results to users effectively. Based on an analysis of topic web crawlers, this thesis focuses on the extraction of page body text during crawling, the discrimination of page topic relevance, and the crawling search strategy. The main research work is as follows:

(1) The structural distribution of body text within a page is analyzed, and a method for extracting the body text of a web page based on text-line features is proposed. The method first removes the noise of non-body content, converts the page through preprocessing into a set of text lines paired with their line numbers, then deletes the text lines that do not match the body-text features, and finally obtains the body text of the page.

(2) The distribution characteristics of words in page content are studied. Drawing on naive Bayes classification and the vector space model, an improved method based on the vector space model is proposed, in which the word-frequency weight in the formula is adjusted when term frequency is calculated. The method is used to determine the topic relevance of a page.

(3) The distribution characteristics of link structure and page content are analyzed, and a topic search strategy based on both page content and link structure is given. The strategy applies different types of search strategies at different crawling stages, so that page content and page structure are considered together. It also adds a high-threshold queue and a low-threshold queue to optimize the crawler search strategy.

Different types of page links and corpus content are used to test the above methods, and the results are calculated according to the experimental evaluation indices. The experimental results show that the proposed text extraction method extracts page body text well, and that the page topic discrimination method and the topic crawler search strategy improve the crawler's retrieval of topic-related content.
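The abstract does not give the relevance formula or the queue rules, so the following is only a minimal sketch of how a vector space relevance score and a pair of high and low threshold queues might fit together. It uses plain term-frequency cosine similarity, not the thesis's adjusted weighting, and it ignores link structure. The names `tf_vector`, `cosine`, and `TwoQueueFrontier`, the threshold values 0.5 and 0.2, and the example URLs are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, deque


def tf_vector(text):
    """Lowercase word counts: a plain term-frequency vector (assumed weighting)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class TwoQueueFrontier:
    """Hypothetical frontier: scores >= high go to a first queue,
    scores in [low, high) go to a second queue, lower scores are dropped."""

    def __init__(self, high=0.5, low=0.2):  # illustrative thresholds only
        self.high, self.low = high, low
        self.high_q, self.low_q = deque(), deque()

    def push(self, url, score):
        if score >= self.high:
            self.high_q.append(url)
        elif score >= self.low:
            self.low_q.append(url)
        # Anything below the low threshold is treated as off-topic.

    def pop(self):
        """Return the next URL, serving the high queue before the low one."""
        if self.high_q:
            return self.high_q.popleft()
        if self.low_q:
            return self.low_q.popleft()
        return None


# Topic description and page texts are made-up sample data.
topic = tf_vector("web crawler search engine crawler")
pages = {
    "http://example.com/b": "crawler news and a recipe for soup",
    "http://example.com/c": "recipe for tomato soup",
    "http://example.com/a": "focused crawler for a search engine",
}

frontier = TwoQueueFrontier()
for url, text in pages.items():
    frontier.push(url, cosine(topic, tf_vector(text)))

order = []
while True:
    url = frontier.pop()
    if url is None:
        break
    order.append(url)
print(order)
```

With this sample data the page about a focused crawler scores about 0.62 and is crawled first even though it was pushed last, the mixed page scores about 0.29 and waits in the low queue, and the soup recipe scores 0 and is discarded.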
Keywords/Search Tags:subject crawler, text extraction, correlation discrimination, search strategy