
Extracting Enterprise Competitive Intelligence From The Web

Posted on: 2010-05-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y Chen
GTID: 2178360302959904
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, more and more enterprises publish information on the Web, and the acquisition of Enterprise Competitive Intelligence (CI) is gradually shifting from traditional channels to the Web. How to obtain Enterprise Competitive Intelligence from the Web has become a hot topic in research on both Web information extraction and Enterprise Competitive Intelligence.

In this thesis, we take Web pages as the object of study and mainly discuss how to collect Enterprise Competitive Intelligence from them using Web information extraction and relation extraction techniques. We focus on ontology-based acquisition of Enterprise Competitive Intelligence, use Web pages as the extraction source, and study approaches for extracting Enterprise Competitive Intelligence: and . Experiments on large-scale Web pages show that the approach can effectively extract the competitive intelligence contained in the pages. Ontology-based extraction of Enterprise Competitive Intelligence has several advantages: it provides a unified model for acquiring Enterprise Competitive Intelligence, reduces the work required in subsequent intelligence analysis, and improves the accuracy of the generated Enterprise Competitive Intelligence.

The main contributions of this thesis are as follows:

1. We propose an ontology-based framework for acquiring Enterprise Competitive Intelligence. A unified, structured description of the Enterprise Competitive Intelligence contained in Web pages provides a domain ontology as a reference for acquisition, and Enterprise Competitive Intelligence can then be constructed by instantiating the ontology. We first analyze why an ontology is a suitable basis for the CI information extraction process on the Web, and then describe the design of the enterprise CI ontology in detail.

2. We propose a new named entity recognition algorithm based on the DOM tree and hierarchical role-based HMM tagging. Experimental results show that the algorithm achieves better recognition results. The approach first uses the DOM tree to strip the HTML tags. Product named entities are then recognized in the leaf content through word segmentation and part-of-speech tagging with a hierarchical role-based HMM. The category, series, and type are tagged in the first-level model, and the product entity is tagged in the second. The Viterbi algorithm is used to find the most probable role sequence, and named entities are recognized from the role sequence by matching predefined patterns.

3. We propose and implement an entity relation extraction algorithm based on pattern matching, which extracts relations from Chinese Web pages. The algorithm is general and can extract different types of relations. To improve extraction accuracy, we calculate the reliability of the patterns and of the entity instances. Experimental results show that the algorithm extracts relations well from Chinese free text. The content of product Web pages is divided into tables and free text. Table text is handled mainly according to top and bottom position. Free text is handled with pattern matching approaches. We use bootstrapping to generate patterns, and compute the confidence of the patterns and of the entity relation instances to control their quality.
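
As an illustration of the first step of contribution 2 (removing the HTML tags and keeping the leaf content), the following is a minimal sketch using Python's standard `html.parser` module. It is not the thesis's implementation: the names `LeafTextExtractor` and `leaf_texts` are hypothetical, and the sketch streams over text nodes instead of building a full DOM tree, which is a simplifying assumption.

```python
from html.parser import HTMLParser


class LeafTextExtractor(HTMLParser):
    """Collect non-empty text nodes, skipping script and style content.

    Hypothetical helper: each text node is treated as one piece of
    "leaf content"; no explicit DOM tree is built.
    """

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.leaves = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.leaves.append(text)


def leaf_texts(html):
    """Return the visible text fragments of an HTML string, in document order."""
    parser = LeafTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.leaves


if __name__ == "__main__":
    page = "<html><body><h1>Nova X200</h1><p>Series: <b>Nova</b></p></body></html>"
    print(leaf_texts(page))
```

Each returned fragment could then be passed to word segmentation and role tagging; keeping fragments separate avoids joining text from unrelated page regions.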
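
The role tagging step of contribution 2 (finding the most probable role sequence with the Viterbi algorithm, then recognizing entities by patterns over that sequence) can be sketched as below. This is a single-level toy model under stated assumptions, not the thesis's hierarchical model: the role labels `CAT`, `SER`, `TYP`, `OTH`, all probabilities, the tokens, and the function names `viterbi` and `match_pattern` are invented for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable state (role) sequence for the observations.

    Probabilities are combined in log space; missing or zero entries are
    replaced by `floor` (an assumed smoothing value) to avoid log(0).
    """
    if not observations:
        return []

    def lg(p):
        return math.log(p if p > 0 else floor)

    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: lg(start_p.get(s, 0)) + lg(emit_p[s].get(observations[0], 0))
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + lg(trans_p[p].get(s, 0))) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + lg(emit_p[s].get(observations[t], 0))
            back[t][s] = prev

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def match_pattern(tokens, roles, pattern):
    """Return token spans whose roles equal `pattern` (a tuple of role labels)."""
    n = len(pattern)
    return [" ".join(tokens[i:i + n])
            for i in range(len(roles) - n + 1)
            if tuple(roles[i:i + n]) == tuple(pattern)]


# Toy model: every number below is an assumption chosen for the example.
STATES = ["CAT", "SER", "TYP", "OTH"]
START = {"CAT": 0.4, "SER": 0.2, "TYP": 0.1, "OTH": 0.3}
TRANS = {
    "CAT": {"CAT": 0.1, "SER": 0.6, "TYP": 0.1, "OTH": 0.2},
    "SER": {"CAT": 0.05, "SER": 0.1, "TYP": 0.7, "OTH": 0.15},
    "TYP": {"CAT": 0.05, "SER": 0.05, "TYP": 0.1, "OTH": 0.8},
    "OTH": {"CAT": 0.3, "SER": 0.2, "TYP": 0.1, "OTH": 0.4},
}
EMIT = {
    "CAT": {"laptop": 0.9},
    "SER": {"Nova": 0.8},
    "TYP": {"X200": 0.8},
    "OTH": {"is": 0.5, "laptop": 0.05},
}

if __name__ == "__main__":
    tokens = ["laptop", "Nova", "X200", "is"]
    roles = viterbi(tokens, STATES, START, TRANS, EMIT)
    print(roles)
    print(match_pattern(tokens, roles, ("SER", "TYP")))
```

In a hierarchical setting, the output of a first-level tagger like this one would feed a second model that tags the product entity; here a fixed role pattern stands in for that second level.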
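
The abstract does not give the formulas used in contribution 3 to score patterns and entity relation instances. As one possible reading, the sketch below uses a scoring scheme common in bootstrapping-based relation extraction: a pattern's confidence is the share of its extractions that agree with known seed instances, and an instance's confidence grows with the number and confidence of the patterns that produced it. The formulas, the function names, the English example sentences, and the product and company names are all assumptions for illustration, not the thesis's method.

```python
import re


def pattern_confidence(pattern, sentences, seeds):
    """Score a pattern as positive / (positive + negative) against seed pairs.

    `pattern` is a compiled regex with named groups `e1` and `e2`.
    `seeds` maps a known first entity to its known second entity.
    An extraction is positive if it agrees with the seeds, negative if the
    first entity is known but the second differs, and ignored otherwise.
    """
    positive = negative = 0
    for sentence in sentences:
        for match in pattern.finditer(sentence):
            e1, e2 = match.group("e1"), match.group("e2")
            if e1 in seeds:
                if seeds[e1] == e2:
                    positive += 1
                else:
                    negative += 1
    total = positive + negative
    return positive / total if total else 0.0


def instance_confidence(pattern_confidences):
    """Combine the confidences of all patterns that extracted one instance.

    Computes 1 - prod(1 - c): the instance is doubted only if every
    supporting pattern is doubted.
    """
    miss = 1.0
    for c in pattern_confidences:
        miss *= 1.0 - c
    return 1.0 - miss


if __name__ == "__main__":
    seeds = {"X200": "Acme", "Z5": "Globex"}  # hypothetical seed relation instances
    sentences = [
        "X200 is produced by Acme",
        "Z5 is produced by Initech",
        "Q7 is produced by Acme",
    ]
    produced_by = re.compile(r"(?P<e1>\w+) is produced by (?P<e2>\w+)")
    print(pattern_confidence(produced_by, sentences, seeds))
    print(instance_confidence([0.5, 0.5]))
```

In a bootstrapping loop, patterns and instances whose confidence falls below a chosen threshold would be discarded before the next iteration, which is how such scores control quality.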
Keywords/Search Tags: Competitive Intelligence, Web information extraction, domain ontology, named entity recognition, entity relation extraction