Research On Extraction Of Web Data Entities Based On Domain Features

Posted on: 2010-06-16
Degree: Master
Type: Thesis
Country: China
Candidate: S H Wang
Full Text: PDF
GTID: 2178360278472410
Subject: Computer software and theory
Abstract/Summary:
As society becomes increasingly information-driven, most human activities depend heavily on rich information, so extracting useful information efficiently from massive sources has become a key research problem. Research on information integration is valuable because it helps people find information that is interesting and useful to them. Today, 90% of the global top 500 enterprises have established well-defined market intelligence analysis systems, since effective analysis of market intelligence is of great significance to the survival and development of an enterprise.

The Web has become a major source of information, but as the amount of Web data keeps growing, it is hard for people to find the information they actually need on their own. How to filter out less important information and surface the important information, so that people can easily find the "real data", has therefore become a critical issue. Some information processing systems collect information for a particular industry and treat structured data as the smallest unit. To find data organized by some structure, a vertical search system recognizes information by identifying fields and returns it to the user after some processing. However, extracting data entities from unstructured pages remains a major problem.

This thesis studies methods for extracting data entities from the Web based on domain information. Combining traditional information retrieval technology, it proposes a framework for domain-based data entity extraction and designs a system for extracting tourism information data entities. Around this framework, the thesis studies several key issues in extracting data entities from web pages of a given domain. The main research work includes:

1. Describe the categories of entities on the Surface Web, then propose a data entity extraction framework for tourist route information to support the study. Building on traditional information retrieval and indexing technology, the framework adds specific steps that recognize domain entities after word segmentation and after filtering web pages according to domain-specific features.

2. Present how to filter according to a domain vocabulary and how to store information in XML. The framework recognizes information step by step, and XML makes it easier to express the relation between cities and scenic spots, which facilitates both the refinement and the presentation of the information.

3. Building on existing research on named entities, the framework recognizes new vocabulary to enrich the tourist route entities. Dividing web pages into blocks also yields more accurate information, because the description within each block is highly relevant.

4. Combine XML with the index model of traditional information retrieval to facilitate information queries. Tourist route information is stored as XML and combined with the index model, which makes it easier to query for specific geographic information and for related information along tourist routes.

This thesis is an exploratory study of how to locate entities effectively within a given domain, and aims to offer a workable idea and method for this problem. It builds on current information processing technologies to propose ideas and methods for Internet information search and information integration, which gives the research both theoretical and practical value.
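The pipeline outlined in points 2 and 4 (vocabulary filtering, XML storage, index-based lookup) can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the thesis's implementation: the vocabulary entries, the `route`/`city`/`spot` element names, and the function names are hypothetical, and plain substring matching on English text stands in for Chinese word segmentation.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical domain vocabulary: city -> known scenic spots.
DOMAIN_VOCAB = {
    "Beijing": ["Great Wall", "Forbidden City"],
    "Hangzhou": ["West Lake"],
}

def extract_entities(block_text):
    """Return {city: [spots]} for vocabulary terms found in one page block."""
    found = {}
    for city, spots in DOMAIN_VOCAB.items():
        hits = [spot for spot in spots if spot in block_text]
        # Keep a city if it is named directly or implied by one of its spots.
        if city in block_text or hits:
            found[city] = hits
    return found

def to_xml(entities):
    """Store the city/spot relation as a small XML tree."""
    root = ET.Element("route")
    for city, spots in entities.items():
        city_el = ET.SubElement(root, "city", name=city)
        for spot in spots:
            ET.SubElement(city_el, "spot", name=spot)
    return root

def build_index(root):
    """Inverted index: lowercased term -> set of cities it belongs to."""
    index = defaultdict(set)
    for city_el in root.iter("city"):
        city = city_el.get("name")
        index[city.lower()].add(city)
        for spot_el in city_el.iter("spot"):
            index[spot_el.get("name").lower()].add(city)
    return index

block = "Day 1: arrive in Beijing and visit the Great Wall. Day 3: West Lake boat tour."
entities = extract_entities(block)
route = to_xml(entities)
index = build_index(route)
```

Looking up `index["west lake"]` then leads back to the enclosing city element, which is the sense in which the XML structure and the index complement each other in this sketch.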
Keywords/Search Tags: market intelligence, information retrieval, Chinese word segmentation, page segmentation, named entity