The Focused Crawler Based On URL And Context

Posted on:2015-04-06

Degree:Master

Type:Thesis

Country:China

Candidate:H J Jia

Full Text:PDF

GTID:2298330431968869

Subject:Computer application technology

Abstract/Summary:

With the development of computers, the Internet has become the world's largest repository of information. General search engines such as Baidu and Google can return a large number of related results for a keyword query and meet most users' needs. A small number of users, however, are interested only in information about a certain industry or field, and want the search engine to return only the information they care about. A focused crawler algorithm improves on the general search engine by collecting information about a specific industry or field.

This thesis applies entity analysis, link structure analysis, and improved algorithms, and proposes a focused crawler based on URL and context. The algorithm processes the focus topic using entity analysis and expands the topic words with a Chinese synonym thesaurus; the expanded words serve as input to the analysis algorithm. The algorithm then divides a page into several blocks, and the URLs and content in each block are analyzed in terms of link structure and text content. A relevance score for each text block is derived from term frequency and term weights. If the score is greater than the threshold, the URL is considered related to the topic. The experimental results show that the focused crawler based on URL and context can achieve good search results.

This thesis includes the following contents:

1. To improve query speed, the high-performance full-text search tool Lucene.Net is used to build indexes over links, web content, anchor text, and context information, and queries are answered by searching these indexes. Building the indexes takes some time, but indexing usually runs in the background, and the indexes can be reused once built.

2. After comparing the various segmenters offered by Lucene.Net with Pangu segmentation, the Pangu method was ultimately chosen. To achieve good segmentation results, the experiments carefully examine how the latest version of the Pangu toolkit differs from other versions.

3. Topic relevance is calculated with the vector space model, using the cosine similarity score as the measure of correlation. If the score is greater than the threshold, the content is considered relevant; otherwise it is considered irrelevant.
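The relevance test in item 3 can be illustrated with a minimal sketch. It is not the thesis's implementation (which is built on Lucene.Net and Pangu): it assumes the text has already been segmented into tokens, uses raw term frequencies as vector components with no additional weighting, and uses a threshold of 0.3 chosen purely for illustration. The function names `term_vector`, `cosine_similarity`, and `is_relevant` are hypothetical.

```python
import math
from collections import Counter


def term_vector(tokens):
    """Build a term-frequency vector from a list of already-segmented tokens."""
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_a * norm_b)


def is_relevant(block_tokens, topic_tokens, threshold=0.3):
    """Return True when the block's similarity to the topic exceeds the threshold.

    The default threshold is an illustrative assumption, not a value from the thesis.
    """
    score = cosine_similarity(term_vector(block_tokens), term_vector(topic_tokens))
    return score > threshold


if __name__ == "__main__":
    topic = ["crawler", "topic"]
    block = ["crawler", "topic", "crawler"]
    print(is_relevant(block, topic))
```

In practice the topic vector would be built from the synonym-expanded topic words, and the threshold would be tuned experimentally.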

Keywords/Search Tags:

Search Engines, Natural Language Processing, Chinese Word Segmentation, Information Retrieval, Vector Space Model

Related items

1	Research And Application Of Intelligent Search Interface Technology Based On Natural Language Processing
2	Research On NLP Technologies And Application In Chinese Information Processing
3	Research On Pivotal Technology Of Focused Search Engine
4	Research And Implementation On Chinese Information Retrieval System Based On Structured Vector Space Model
5	Natural Language Processing-A Study Of Vectorization Of Chinese Words And Short Texts
6	Research On The Chinese Science And Technology Document Information Retrieval System Based On The Vector Space
7	Design And Implementation Of Based On Vector Space Model Of Local Search Engine
8	Research Of Chinese Full Text Retrieval Technology
9	The Study Of Automatic Function Information Extraction And Classification Approach For Chinese Patent
10	The Search Engine Based On Chinese Natural Language Processing