Research On The Topic Crawler Search Strategy And Subject Discrimination

Posted on:2018-10-27

Degree:Master

Type:Thesis

Country:China

Candidate:Z J Liu

Full Text:PDF

GTID:2348330542470082

Subject:Computer technology

Abstract/Summary:

With the rapid development of Internet technology, the information resources available on the network are growing quickly. Because the Internet is open and shared, it offers rich information, but general-purpose search engines struggle to return good results for user-specific needs. In response, research on intelligent search engines has become a trend, and topic-focused web crawlers play an important role in it. It is therefore important to study how to acquire topic-related information quickly and efficiently and how to feed the results back to users effectively. Based on an analysis of topic crawlers, this paper focuses on web page text extraction during crawling, topic relevance discrimination for pages, and the crawler search strategy. The main research work is as follows:

(1) The structural distribution of text content in a page is analyzed, and a method for extracting the body text of a web page based on text line features is presented. The method first removes the noise of non-body content, converts the page through preprocessing into a set of text lines and their line numbers, then deletes the text lines that do not match the body-text features, and finally obtains the body text of the page.

(2) The distribution characteristics of words in page content are studied. Drawing on naive Bayesian classification and the vector space model, an improved method based on the vector space model is proposed, in which the word frequency weight is adjusted in the formula used to calculate word frequency. The method is used to determine the topic relevance of a page.

(3) The distribution characteristics of link structure and page content are analyzed, and a topic search strategy based on web page content and link structure is given. The strategy applies different types of search strategies at different crawling stages so that the content and the structure of web pages are considered together. It also adds high and low threshold queues to optimize the crawler search strategy.

The above methods are tested with different types of page links and corpus contents, and the results are computed according to the experimental evaluation indices. The experimental results show that the proposed text extraction method performs well on page body text, and that the topic discrimination method and the topic crawler search strategy improve the crawler's retrieval of topic-related content.
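The relevance scoring in (2) and the threshold queues in (3) can be illustrated with a minimal sketch. It is not the thesis's method: the abstract gives neither the adjusted word frequency formula nor the rules for the two queues, so the code below assumes plain term-frequency cosine similarity and one plausible reading of the queues. The names `cosine_relevance` and `TwoThresholdFrontier` and the threshold values 0.2 and 0.6 are hypothetical.

```python
import heapq
import math
from collections import Counter


def cosine_relevance(page_words, topic_words):
    """Cosine similarity of raw term-frequency vectors (plain VSM).

    Assumption: the thesis adjusts the word frequency weight, but the
    abstract does not give the formula, so unweighted counts are used here.
    """
    page, topic = Counter(page_words), Counter(topic_words)
    dot = sum(page[w] * topic[w] for w in topic)
    norm = math.sqrt(sum(c * c for c in page.values())) * math.sqrt(
        sum(c * c for c in topic.values())
    )
    return dot / norm if norm else 0.0


class TwoThresholdFrontier:
    """Hypothetical crawl frontier with a high and a low threshold queue.

    Assumed rule: links scoring >= high go to the high queue, links scoring
    in [low, high) go to the low queue, and links below low are dropped.
    """

    def __init__(self, low=0.2, high=0.6):
        self.low, self.high = low, high
        self._high, self._low = [], []
        self._seen = set()

    def push(self, url, score):
        """Queue a link; return False if it is a duplicate or scores too low."""
        if url in self._seen or score < self.low:
            return False
        self._seen.add(url)
        queue = self._high if score >= self.high else self._low
        heapq.heappush(queue, (-score, url))  # negate: heapq is a min-heap
        return True

    def pop(self):
        """Return the best link, draining the high queue before the low one."""
        for queue in (self._high, self._low):
            if queue:
                return heapq.heappop(queue)[1]
        return None
```

In this reading, the crawler scores each fetched page against the topic words, pushes its outgoing links with that score, and always crawls from the high queue first, falling back to the low queue only when the high queue is empty.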

Keywords/Search Tags:

subject crawler, text extraction, correlation discrimination, search strategy

Related items

1	Intelligent Microblog Information Generation Strategy Based On Subject Crawler And Text Categorization
2	Research On Key Technology Of Subject Network Crawler
3	Theme Research And Implementation Of The Search Engine
4	Research And Implementation Of Subject-oriented Dual-bound Web Page Crawling Methods
5	Study On Subject Search Technology Of Web-oriented Text Mining
6	The Research Of Topical Crawler Search Strategy In Web Page
7	Realization Of Focused Crawler And Research Of Its Key Technologies
8	Design And Implementation Of Focused Crawler To Application Store
9	Research On APK Crawler With Automatic Pagination Detection And Search Results Extraction
10	Research And Implementation On Focused Crawler With Search Strategy