The Focused Crawler Based On URL And Context

Posted on: 2015-04-06
Degree: Master
Type: Thesis
Country: China
Candidate: H J Jia
Full Text: PDF
GTID: 2298330431968869
Subject: Computer application technology
Abstract/Summary:
With the development of computers, the Internet has become the world's largest repository of information. General search engines such as Baidu and Google can return a large number of related results for a keyword query and meet most users' needs. A small number of users, however, are interested only in the information of a certain industry or field and want the search engine to return only that information. To obtain information about a certain industry or field, a focused crawler algorithm can be used to improve on the general search engine.

This thesis applies entity analysis, network structure analysis, and improved algorithms to propose a focused crawler based on URL and context. The algorithm processes the focus words using entity analysis and expands them with a Chinese synonym thesaurus; the expanded focus words serve as the input to the analysis algorithm. Meanwhile, the web page is divided into several blocks, and the URL and content in each block are analyzed in terms of network structure and text content. A rating score for each text block is derived from term frequency and weight. If the score is greater than the threshold, the URL is considered to be related to the topic. The experimental results show that the proposed focused crawler achieves good search results.

This thesis includes the following contents:
1. To improve query speed, the high-performance full-text search tool Lucene.Net is used to create indexes over links, web content, anchor text, and context information, and queries are answered through index search. Creating indexes takes some time, but index creation usually runs in the background, and the created indexes can be reused.
2. After comparing the various segmentation methods offered by Lucene.Net with Pangu segmentation, the Pangu method was ultimately chosen. To achieve good segmentation results, the experiment carefully studied the differences between the latest version of the Pangu toolkit and other versions.
3. Topic relevance is calculated with the vector space model, taking the cosine similarity score as the measure of correlation. If the score is greater than the threshold, the page is considered relevant; otherwise it is considered irrelevant.
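
The relevance test described in item 3 can be illustrated with a minimal Python sketch. It is not the thesis implementation, which is built on Lucene.Net and Pangu segmentation. It assumes the text has already been segmented into tokens, and the function names, the optional per-term weights, and the threshold value of 0.3 are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def term_vector(tokens, weights=None):
    """Build a weighted term-frequency vector from pre-segmented tokens.

    `weights` is an optional, hypothetical per-term weight map;
    terms that are not listed get a weight of 1.0.
    """
    weights = weights or {}
    counts = Counter(tokens)
    return {term: tf * weights.get(term, 1.0) for term, tf in counts.items()}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors (dicts)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as unrelated
    return dot / (norm_a * norm_b)


def is_relevant(topic_tokens, block_tokens, threshold=0.3):
    """Return (relevant, score); the default threshold is an assumed value."""
    score = cosine_similarity(term_vector(topic_tokens),
                              term_vector(block_tokens))
    return score > threshold, score


if __name__ == "__main__":
    topic = ["focused", "crawler", "search"]
    block = ["crawler", "search", "engine", "index"]
    relevant, score = is_relevant(topic, block)
    print(relevant, round(score, 3))
```

In practice the threshold would be tuned experimentally, and the token lists would come from the segmenter applied to the URL, anchor text, and surrounding context of each block.
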
Keywords/Search Tags: Search Engines, Natural Language Processing, Chinese Word Segmentation, Information Retrieval, Vector Space Model