Research And Implementation Of Large-Scale Vertical Search Method

Posted on:2019-07-04

Degree:Master

Type:Thesis

Country:China

Candidate:Y G Bai

Full Text:PDF

GTID:2348330542991627

Subject:Communication and Information System

Abstract/Summary:

With the development of the Internet, a large number of new pages appear on the Web every day, and these pages vary widely. Faced with this mass of Web page resources, search engines are an important tool for obtaining information, but it is increasingly difficult for them to provide an accurate query service. Vertical search engines have therefore emerged to give users a more timely and accurate query service on topics in a particular field. The focused crawler is the core module of a vertical search engine: it crawls the Web vertically and stores topic-relevant pages locally so that the vertical search engine can index them and serve user queries. In large-scale vertical search, the two key problems are how to determine accurately whether a Web page is relevant to the topic and which search strategy to use when crawling. A search strategy based on page content takes the entire page content as the key factor in judging the topic of the page, so advertisements, images, Flash animations and other interference in the page make the discrimination accuracy very low. In addition, if the focused crawler extracts links only from topic-relevant pages, it is likely to miss valuable links in navigation pages. To address these problems, this thesis focuses on the search strategy of the focused crawler and the value assessment of page links, and proposes both a topic discrimination algorithm based on web page feature weighting and a link value evaluation method based on block extraction. The main work and innovations of this thesis are as follows:

(1) This thesis presents a topic discrimination algorithm based on web page feature weighting. An analysis of HTML tag characteristics shows that text in different HTML tags contributes to different degrees to discriminating the topic of a page. When extracting the feature words of a page, the algorithm uses TF-IDF and introduces weighting factors for HTML tags; a Naive Bayes classifier built on these weighted page features then judges the relevance between the topic and the target page. Numerical results show that the method significantly reduces the effect of interference factors on topic relevance discrimination. Discrimination accuracy improves by more than 2.5% and recall by more than 3.5%; the method also saves web page storage space.

(2) A link value assessment method based on block extraction is proposed. An analysis of the structure and layout of web pages shows that, when navigation pages and topic-relevant pages are divided into blocks using the div and table tags, a Naive Bayes classifier can be introduced to judge the topic of each block, and relevant links can then be extracted from the blocks related to the target topic. The value of each link is evaluated using the topic similarity of its anchor text together with the topic similarity of its parent page. The experimental results show that the proposed algorithm outperforms the Best-First and PageRank algorithms, improving both search efficiency and search accuracy.
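The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The tag weights in TAG_WEIGHTS, the smoothed IDF formula, the use of cosine similarity in place of the Naive Bayes classifier, and the linear mixing parameter alpha are all assumptions made for illustration, since the abstract gives none of these details. The names WeightedTermCounter, weighted_terms, tfidf, cosine and link_value are likewise hypothetical.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights; the abstract does not state the actual values.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "h2": 2.0, "a": 1.5}
SKIPPED_TAGS = {"script", "style"}  # content that carries no topic text
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never closed


def tokenize(text):
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


class WeightedTermCounter(HTMLParser):
    """Counts terms, scaling each occurrence by the weight of its innermost tag."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.weights = Counter()

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIPPED_TAGS for t in self.stack):
            return
        weight = TAG_WEIGHTS.get(self.stack[-1], 1.0) if self.stack else 1.0
        for term in tokenize(data):
            self.weights[term] += weight


def weighted_terms(html):
    """Tag-weighted term frequencies for one page."""
    parser = WeightedTermCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.weights)


def tfidf(weighted_tf, doc_freq, n_docs):
    """Smoothed TF-IDF (assumed form): tf * (log((1 + N) / (1 + df)) + 1)."""
    return {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0)
        for term, tf in weighted_tf.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def link_value(anchor_text, parent_vec, topic_vec, alpha=0.5):
    """Assumed linear mix of anchor-text and parent-page topic similarity."""
    anchor_vec = Counter(tokenize(anchor_text))
    return alpha * cosine(anchor_vec, topic_vec) + (1 - alpha) * cosine(
        parent_vec, topic_vec
    )
```

A crawler built along these lines would call weighted_terms on each fetched page, score the page against a topic vector, and order its frontier by link_value. The block-extraction step, which splits a page on div and table tags before classification, is omitted here.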

Keywords/Search Tags:

Vertical search, Focused crawler, Link evaluation, Web feature extraction

Related items

1	Research And Realization On Focused Crawler Key Technologies Of Vertical Search Engine
2	Research On An Algorithm Of Focused Crawler In Vertical Search Engine
3	The Research On Focused Crawling Algorithm In Vertical Search Engine
4	Research And Implementation On Focused Crawler With New Strategy For The Vertical Search Engine
5	The Internet Public Document Search System Based On Vertical Search Technology
6	The Optimization And Achieve For Focused Crawling Algorithm Based On The Website Content Framework
7	Customizable Focused Crawler
8	Research On A Method Of Focused Crawler For Vertical Search System
9	Research Of Main Technologies Of Vertical Search Engine
10	Research And Implementation On Focused Crawler With Search Strategy