Research on Information Extraction and Search Based on the Web

Posted on: 2015-04-27
Degree: Master
Type: Thesis
Country: China
Candidate: A L Zhou
Full Text: PDF
GTID: 2308330473952033
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, the volume of data online is growing explosively, so efficient information retrieval technology that helps users find useful information is of great importance. Current search engine technology can largely meet users' retrieval needs, but most of it is based on page-level search, which has an inherent limitation: query results are returned as web links, and users must still locate the information within those pages. In many cases, however, the user's search target is an entity, such as a paper, a place name, or a product. Entity search extracts and integrates information from web pages so that the returned results are more accurate. This thesis studies entity search engine technology and proposes several new solutions that build on current research. The main work is as follows.

1. We propose a focused crawler that extends the traditional crawler algorithm with a link template tree. By analyzing website links, we conclude that web links can be generalized into templates. We induce the templates of a site's category links to address the tunneling problem in web pages. In the experiments, we implement the proposed algorithm as a plug-in for the open source crawler Nutch. Comparative experiments show that the proposed method achieves a better recall rate.

2. We propose an entity information extraction method based on the DOM tree and XSL. The web pages are first preprocessed; the path rules of the entity information are then learned from training data; finally, the entity information is extracted into an XML file.
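The idea of learning path rules from a training page and applying them to a new page can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes the preprocessing step has already produced well-formed markup, it replaces the XSL step with a plain path lookup from Python's standard library, and the function names and sample pages are hypothetical.

```python
import xml.etree.ElementTree as ET

def learn_path_rules(page: str, labeled: dict[str, str]) -> dict[str, str]:
    """Map each field name to the tag/index path of the node holding its labeled value."""
    root = ET.fromstring(page)
    rules: dict[str, str] = {}

    def walk(node: ET.Element, path: str) -> None:
        text = (node.text or "").strip()
        for field, value in labeled.items():
            if text == value:
                rules[field] = path
        counts: dict[str, int] = {}
        for child in node:
            # Position among siblings with the same tag (1-based, as in XPath).
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, ".")
    return rules

def extract_entity(page: str, rules: dict[str, str]) -> ET.Element:
    """Apply the learned paths to a page and collect the values in an XML element."""
    root = ET.fromstring(page)
    entity = ET.Element("entity")
    for field, path in rules.items():
        node = root.find(path)
        child = ET.SubElement(entity, field)
        child.text = (node.text or "").strip() if node is not None else ""
    return entity

# Hypothetical, already-cleaned pages that share one layout.
TRAIN = "<html><body><div><h1>Red Kettle</h1><span>19.90</span></div></body></html>"
NEW = "<html><body><div><h1>Blue Mug</h1><span>4.50</span></div></body></html>"

rules = learn_path_rules(TRAIN, {"name": "Red Kettle", "price": "19.90"})
xml_out = ET.tostring(extract_entity(NEW, rules), encoding="unicode")
print(xml_out)
```

Matching on exact node text is the simplest possible labeling scheme; a real page would need tolerance for whitespace, nested markup, and layout variation between pages.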
We also propose a corresponding solution for extracting complex entities from a single web page. Building on single-entity extraction, we first identify the sub-tree of the page that holds the most data (the maximum sub-tree), then derive the rules of the complex entity within that sub-tree, and thereby extract multiple entities. Experimental results show that the proposed method extracts entity information effectively.

3. We study entity search technology by analyzing the architecture and code of Lucene, the open source full-text indexing toolkit. Based on its document index structure, we present an index structure suited to entity information. We also improve Lucene's scoring mechanism: the IDF values of the words in the entity data are computed and stored in a database of IDF values. At query time, we weight each word by its IDF value, compute the score of each entity, and return the results sorted by score. Experiments show that this method yields better results.
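The IDF-weighted ranking described in item 3 can be sketched in a few lines. This is only an illustration under stated assumptions: it does not use Lucene or reproduce its scoring formula, the IDF definition `log(N / df) + 1` is one common choice rather than the one used in the thesis, and the function names and sample entities are hypothetical.

```python
import math
from collections import Counter

def build_idf(entities: list[str]) -> dict[str, float]:
    """Precompute an IDF value for every word that occurs in the entity data."""
    n = len(entities)
    df: Counter[str] = Counter()
    for text in entities:
        df.update(set(text.lower().split()))  # count each word once per entity
    return {word: math.log(n / count) + 1.0 for word, count in df.items()}

def rank(query: str, entities: list[str], idf: dict[str, float]) -> list[tuple[str, float]]:
    """Score each entity by the summed IDF of the query words it contains."""
    terms = query.lower().split()
    scored = []
    for text in entities:
        words = set(text.lower().split())
        score = sum(idf.get(term, 0.0) for term in terms if term in words)
        if score > 0:
            scored.append((text, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical entity records reduced to plain text.
entities = ["red steel kettle", "red ceramic mug", "blue ceramic mug"]
idf = build_idf(entities)
ranked = rank("red kettle", entities, idf)
for text, score in ranked:
    print(f"{score:.3f}  {text}")
```

Because "kettle" occurs in fewer entities than "red", it carries more weight, so an entity matching the rarer word ranks above one matching only the common word.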
Keywords/Search Tags: search engine, focused crawler, entity information extraction, entity search