
Research On Entity-level Search Crawler And Information Extraction

Posted on: 2012-05-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: N Z Zhang
Full Text: PDF
GTID: 1118330344951672
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the number of Web pages has grown exponentially, making the Web a giant information repository. Current information retrieval techniques, however, face unprecedented challenges because of the massive scale, heterogeneity, and dynamic nature of the Web. On the one hand, search engines have made great progress in meeting people's common information needs. On the other hand, keyword-based search engines have some inherent deficiencies. For example, the relevance between a page and a query is mainly determined by the occurrence features (e.g., term frequency, document frequency) of the query keywords in the page and by its link features, but it is still difficult for current search engine techniques to exactly understand the user's query intent and the contents of Web pages. In addition, query results are merely sorted lists of documents, from which users have to find the desired information manually; existing search engines cannot integrate them automatically. They therefore essentially belong to page-level search. In many cases, however, the user query is closely related to an entity such as a person, paper, organization, or product, or even an abstract event. If the information in a search engine were represented, extracted, integrated, and delivered by entities, it would offer more accurate and richer search results and better meet users' information needs.

To solve the problems mentioned above, this dissertation mainly studies entity-level search techniques. In summary, we make the following contributions:

(1) For the problem of obtaining domain-specific resources on the Web, we propose an algorithm based on joint link similarity evaluation. The main idea is that the topic similarity of the anchor text of the current link is first computed to obtain direct evidence. Then a Web link graph is built from the on-topic Web pages fetched by the focused crawler.
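The direct-evidence step above can be sketched as a bag-of-words cosine similarity between a link's anchor text and a topic description, combined linearly with an indirect score. The function names and the weighting parameter `alpha` below are illustrative assumptions, not details taken from the dissertation:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def link_priority(anchor_text, topic_description, indirect_score, alpha=0.7):
    """Combine direct evidence (anchor-text topic similarity) with
    indirect evidence (e.g., a score learned from the link graph).
    alpha is an assumed weighting parameter for illustration."""
    direct = cosine_similarity(anchor_text, topic_description)
    return alpha * direct + (1 - alpha) * indirect_score
```

In a crawler frontier, links would be dequeued in descending order of this priority so that on-topic regions of the Web are explored first.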
We present a Q-learning based algorithm that incrementally learns the Web link graph to acquire the mapping between the current link and its topic similarity, yielding indirect evidence. Finally, we combine the direct and indirect evidence to calculate the topic similarity of the current link in order to guide focused crawling. The experimental results show that this algorithm can significantly improve the efficiency and precision of focused crawling.

(2) To address the coarse granularity of existing Web information extraction tasks, which mostly treat the whole Web page as the basic processing unit, we present a vision-based Webpage Block Labeling (WPBL) algorithm. The algorithm first transforms an HTML document into a DOM tree. According to the text and link features of the nodes in the DOM tree, the Web page is segmented into three types of blocks: text blocks, mixed blocks, and link blocks, which are then automatically labeled. To identify the importance of different blocks, an effective ranking algorithm based on a block's location and its visual features (e.g., width, height, background color, font) on the Web page is proposed, which helps find important contents or links on a Web page and eliminate noisy information such as navigation bars, copyright and privacy notices, advertisements, and decoration. By means of the WPBL algorithm, the Web information extraction task can be performed at a finer granularity. The experimental results show that the WPBL algorithm obviously improves the performance of information extraction.

(3) We systematically investigate techniques related to entity-level Web information extraction and propose an entity information extraction framework based on iterative extraction. The process of entity information extraction is as follows: we first model the domain-specific Web entity, and then the basic attribute information is extracted using a CRF model.
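The text/link-based block labeling described above can be sketched as a simple rule over per-block statistics. The thresholds and feature set here are illustrative assumptions; the dissertation's actual algorithm additionally uses visual cues (width, height, background color, font) and block location:

```python
def classify_block(text_len, link_text_len, link_count,
                   link_ratio_low=0.2, link_ratio_high=0.7):
    """Label a page block as 'text', 'mix', or 'link' from simple
    text/link statistics. The ratio of anchor-text length to total
    text length is the deciding feature; thresholds are assumed."""
    if text_len == 0:
        # No text at all: a block with links is a link block.
        return "link" if link_count > 0 else "text"
    ratio = link_text_len / text_len
    if ratio < link_ratio_low:
        return "text"   # mostly plain content, e.g., an article body
    if ratio > link_ratio_high:
        return "link"   # mostly anchors, e.g., a navigation bar
    return "mix"        # content interleaved with links
```

Blocks labeled "link" with small area or peripheral position would then be the natural candidates for noise (navigation, copyright, advertisements) to eliminate.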
We exploit keyword search together with the entity's basic attribute information to obtain all Web pages related to the specific entity. The WPBL algorithm is then used to segment these Web pages and extract the associated text blocks. Finally, a naive Bayes classifier identifies the target text blocks, from which the associated attribute information of the entity is extracted. After several iterations of extraction, a specific entity with complete descriptive information can be obtained.

(4) We discuss some key techniques for integrating user social data recommendation into an entity search engine, which can provide the entity search engine with more accurate information and complement the information fetched automatically from the Web. The goal of social data recommendation is to turn the search engine into a content provider and to solve some of the challenges faced by the traditional search engine architecture, such as limited resources and accurate search. To this end, we describe the storage format of the user-recommended social data and the methods for submitting it. To fuse this structured information into the entity search engine, we present formal definitions related to Web entity fusion, give several important fusion operators, and discuss their properties. Finally, we propose a Web entity fusion algorithm that exploits natural language processing techniques such as sentence similarity computation and sentence fusion. Our experimental results show that the proposed algorithms are effective.
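The sentence similarity computation used during entity fusion can be sketched with a simple Jaccard word-overlap measure feeding a greedy de-duplication pass. Both functions and the `threshold` parameter are illustrative assumptions standing in for the dissertation's actual similarity and fusion operators:

```python
def sentence_similarity(s1, s2):
    """Jaccard word-overlap similarity between two sentences,
    a simple stand-in for the sentence similarity computation."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if not (w1 or w2):
        return 0.0
    return len(w1 & w2) / len(w1 | w2)

def fuse_descriptions(sentences, threshold=0.6):
    """Greedy fusion sketch: keep a candidate sentence only if it is
    not too similar to any already-kept sentence, so redundant
    descriptions of the same entity collapse to one."""
    kept = []
    for s in sentences:
        if all(sentence_similarity(s, k) < threshold for k in kept):
            kept.append(s)
    return kept
```

A real fusion operator would also merge complementary sentences rather than only dropping near-duplicates, but the de-duplication step illustrates where sentence similarity enters the pipeline.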
Keywords/Search Tags:Entity Search, Focused Crawler, Web Page Segmentation, Entity Modeling, Web Information Extraction, User Recommendation, Entity Fusion