Research And Implementation Of Large-Scale Vertical Search Method

Posted on: 2019-07-04
Degree: Master
Type: Thesis
Country: China
Candidate: Y G Bai
GTID: 2348330542991627
Subject: Communication and Information System
Abstract/Summary:
With the development of the Internet, a large number of new and widely varying pages appear on the Web every day. Faced with these massive Web page resources, search engines are an important tool for obtaining information, but it is increasingly difficult for them to provide accurate query services. Vertical search engines have therefore emerged to give users more timely and accurate query services on topics in a particular field. The focused crawler is the core module of a vertical search engine: it searches Web pages vertically and stores the topic-relevant pages locally, so that the vertical search engine can index them and provide query services to users. In large-scale vertical search, the two key problems are how to accurately determine whether a Web page is relevant to the topic and which search strategy to use when searching Web pages. A search strategy based on web content treats the entire page content as the key factor in judging the topic of a page, so interference such as advertisements, images, and Flash animations makes the discrimination accuracy very low. In addition, if the focused crawler extracts links only from topic-relevant pages, it is likely to miss valuable links in some navigation pages. To address these problems, this paper focuses on the search strategy of the focused crawler and the value assessment of page links, and proposes both a topic discrimination algorithm based on web page feature weighting and a link value evaluation method based on block extraction. The main work and innovations of this paper are as follows:

(1) A topic discrimination algorithm based on web page feature weighting is presented. Through research and analysis of the characteristics of HTML page tags, this paper finds that text in different HTML tags contributes to different degrees to discriminating the topic of a web page. When extracting web page feature words, the paper uses the TF-IDF algorithm and introduces weighting factors for HTML tags; a Naive Bayes classifier based on the weighted web page features then judges the relevance between the topic and the target web pages. Numerical analysis results show that the method significantly reduces the effect of interference factors on topic relevance discrimination. Discrimination accuracy improves by more than 2.5% and the recall rate by more than 3.5%; moreover, the method saves web page storage space.

(2) A link value assessment method based on block extraction is proposed. Through research and analysis of the structure and layout of web pages, this paper finds that when the div and table tags are used to process navigation pages and relevant pages, a Naive Bayes classifier can be introduced to judge the topic of each block of a web page, and relevant links can be extracted from the blocks that are related to the target topic. At the same time, the value of each link is evaluated using the topic similarity of the link anchor text and the topic similarity of the parent page. The experimental results show that the proposed algorithm outperforms the Best-First algorithm and the PageRank algorithm, improving both search efficiency and search accuracy.
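To make the feature-weighting idea in contribution (1) concrete, the following is a minimal sketch of tag-weighted TF-IDF, not the thesis's actual implementation. The tag weights in `TAG_WEIGHTS`, the whitespace tokenizer, the smoothed IDF formula, and all function names are assumptions chosen for illustration; the abstract does not give these details, and the Naive Bayes classification step is omitted.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights; the abstract does not state the actual values.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "h2": 2.0, "a": 1.5, "b": 1.5}
# Text inside these tags is treated as interference and ignored.
SKIPPED_TAGS = {"script", "style"}


class WeightedTermCounter(HTMLParser):
    """Counts terms, scaling each occurrence by the weight of its enclosing tags."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags
        self.counts = Counter()  # term -> weighted count

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIPPED_TAGS for t in self.stack):
            return
        # Use the largest weight among the enclosing tags (default 1.0).
        weight = max((TAG_WEIGHTS.get(t, 1.0) for t in self.stack), default=1.0)
        for word in data.lower().split():
            word = word.strip(".,;:!?()\"'")
            if word:
                self.counts[word] += weight


def weighted_tf(html):
    """Return the tag-weighted term counts of one HTML page."""
    parser = WeightedTermCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def weighted_tf_idf(pages):
    """Return one {term: score} dict per page, using a smoothed IDF."""
    tfs = [weighted_tf(page) for page in pages]
    df = Counter(term for tf in tfs for term in tf)  # document frequency
    n = len(pages)
    scores = []
    for tf in tfs:
        total = sum(tf.values()) or 1.0
        scores.append({
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return scores
```

In this sketch a word in the `title` tag counts five times as much as the same word in body text, and script content is dropped entirely, which is one simple way to reduce the influence of page noise on the feature vector.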
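The link evaluation in contribution (2) combines two signals: the topic similarity of the anchor text and the topic similarity of the parent page. The sketch below illustrates one way such a score could be computed. The linear combination, the `alpha` parameter, the use of cosine similarity over raw word counts, and the function names are all assumptions; the abstract does not specify how the two similarities are combined. Block segmentation and the Naive Bayes block classifier are outside the scope of this example.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words count vector (simple whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_value(anchor_text, parent_text, topic, alpha=0.5):
    """Score a link from anchor-text and parent-page similarity to the topic.

    alpha is a hypothetical mixing weight, not a value from the thesis.
    """
    topic_vec = vectorize(topic)
    anchor_sim = cosine(vectorize(anchor_text), topic_vec)
    parent_sim = cosine(vectorize(parent_text), topic_vec)
    return alpha * anchor_sim + (1 - alpha) * parent_sim


def rank_links(links, parent_text, topic, alpha=0.5):
    """Order (url, anchor_text) pairs by descending link value."""
    scored = [(link_value(anchor, parent_text, topic, alpha), url)
              for url, anchor in links]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in scored]
```

A crawler frontier could then visit URLs in the order returned by `rank_links`, so that links whose anchor text and parent page both match the topic are fetched first.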
Keywords/Search Tags: Vertical search, Focused crawler, Link evaluation, Web feature extraction