Research On Information Extraction And Search Based On Web

Posted on:2015-04-27

Degree:Master

Type:Thesis

Country:China

Candidate:A L Zhou

Full Text:PDF

GTID:2308330473952033

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of the Internet, the volume of data online is growing explosively, so efficient information retrieval technology that helps users find useful information is of great importance. Current search engine technology can basically meet users' demands for information retrieval, but most of it is based on page search, which has inherent defects: query results are returned as web links, and users must then find the information within those pages themselves. In many cases, however, the user's search target is entity information, such as papers, place names, or commodity information. Entity search technology extracts and integrates web information, which makes the returned results more accurate. This paper studies entity search engine technology and puts forward some new solutions on the basis of current research. The main work includes:

1. This paper proposes a focused crawler that uses a link template tree, built on the traditional crawler algorithm. Analysis of website links leads to the conclusion that web links can be summarized as templates. We induce the category links within a site to solve the tunneling problem in web pages. In the experiment, the proposed algorithm is implemented as a plugin for the open source crawler Nutch. Comparisons show that the proposed method achieves a better recall rate.

2. We propose an entity information extraction method based on the DOM tree and XSL. In this method, the web pages are first preprocessed, then the path rules of the entity information in the web pages are extracted from training data, and finally the entity information in the web pages is extracted into an XML file. We also propose a corresponding solution for extracting complex entities from a single web page: on the basis of single entity extraction, we extract the maximum data sub-tree of the page, then extract the rules of the complex entities within that maximum sub-tree, and thereby realize multiple entity information extraction. Experimental results show that the proposed method can extract entity information effectively.

3. We study entity information search technology by analyzing the architecture and code of Lucene, the open source full-text indexing toolkit. On the basis of the document index structure, an index structure suited to entity information is presented. Lucene's scoring mechanism is improved: the IDF values of the words in the entity data are calculated and a database of IDF values is established. During query execution, we set the importance of each word based on its IDF value, then calculate the score each entity obtains, and finally produce the sorted results. Experiments show that this method can get better results.
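
The IDF-based ranking described in the third item can be illustrated with a minimal sketch. It is not the thesis's Lucene implementation: the abstract does not give the exact formula, so the smoothed form 1 + ln(N / df), the whitespace tokenizer, the sample entities, and every function name below are assumptions made for illustration only.

```python
import math
from collections import Counter

# Hypothetical entity records (id -> field name -> text); not data from the thesis.
ENTITIES = {
    "e1": {"title": "focused crawler survey", "venue": "web journal"},
    "e2": {"title": "entity search engine", "venue": "web conference"},
    "e3": {"title": "entity extraction with dom tree", "venue": "web journal"},
}


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; a real analyzer does more)."""
    return text.lower().split()


def entity_words(fields):
    """Collect the distinct words that occur in any field of one entity."""
    words = set()
    for value in fields.values():
        words.update(tokenize(value))
    return words


def build_idf(entities):
    """Build the word -> IDF table, using the assumed form 1 + ln(N / df)."""
    n = len(entities)
    df = Counter()
    for fields in entities.values():
        df.update(entity_words(fields))
    return {word: 1.0 + math.log(n / count) for word, count in df.items()}


def score(query, fields, idf):
    """Sum the IDF weights of the query words that the entity contains."""
    words = entity_words(fields)
    return sum(idf.get(w, 0.0) for w in set(tokenize(query)) if w in words)


def search(query, entities, idf):
    """Return ids of matching entities, highest score first (ties broken by id)."""
    scored = [(score(query, fields, idf), eid) for eid, fields in entities.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [eid for s, eid in scored if s > 0]


if __name__ == "__main__":
    idf_table = build_idf(ENTITIES)
    print(search("entity search", ENTITIES, idf_table))
```

In this sketch a rare word such as "search" carries more weight than a word such as "web" that appears in every entity, which is the effect of setting word importance from the IDF table at query time.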

Keywords/Search Tags:

search engine, focused crawler, entity information extraction, entity search

Related items

1	Research On Information Extraction And Search Based On Web
2	The Design, Realization And Research For A Campus-Objected Entity And Social Search Engine
3	Research And Realization On Focused Crawler Key Technologies Of Vertical Search Engine
4	Research On Key Techniques Of Entity Search For Deep Web
5	The Research On Focused Crawling Algorithm In Vertical Search Engine
6	Research And Implementation Of A Time-based Focused Search Engine
7	Research And Design On Focused Crawler Of Search Engine
8	Research On Entity Linking Using The Extended Information From Search Engine
9	The Design And Implementation Of Enterprise Information-Oriented Web Focused Search
10	Research Of Main Technologies Of Vertical Search Engine