
Business Information Extraction Based On Internet

Posted on: 2011-03-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Liu
Full Text: PDF
GTID: 2178360308455379
Subject: Computer application technology
Abstract/Summary:
The Internet is expanding rapidly, and the large amount of data it holds makes it an important source for acquiring competitive intelligence. However, it is still difficult for enterprises to obtain the competitive intelligence they need from this ocean of information. To address this problem, researchers have developed technology for business information extraction. The quality of the extraction results therefore becomes a vital factor in the quality of the final competitive intelligence.

In this paper, we study business information extraction from the Web and focus on two aspects: relation extraction and entity recognition. For each kind of extraction target, we analyze its distinctive features and develop a suitable method in order to improve extraction performance. Position relation extraction serves as our example of business relation extraction: we investigate how position relation instances appear on the Web and adopt a structure-based algorithm to extract them. For entity recognition, we study organization name recognition and present an algorithm based on a Semantic Hidden Markov Model (SHMM). The two algorithms effectively improve the performance of their respective extraction tasks and provide a reference for other kinds of business information extraction.

The main contributions of this paper can be summarized as follows:

(1) We present an algorithm to extract position relations from the Web. A position relation, meaning a person's position in a corporation, is a significant kind of competitive intelligence for enterprises. Our algorithm is based on the structural features of position relations in Web content. We first introduce the structural coefficient and the structural file segment to describe these features, and then employ a pattern-matching method to extract position relations from the structural file segments. Finally, we conduct experiments on a real data set and evaluate the precision and recall of our approach. The experimental results show that our algorithm achieves a precision above 96% and a recall above 87%.

(2) We propose an SHMM-based algorithm for Chinese organization name recognition. The Semantic Hidden Markov Model rests on two linguistic viewpoints: the dependence of syntax on semantics, and the symbiotic word field. A sentence is treated as a sequence of words, and this sequence implies a semantic sequence that determines the construction of the sentence. We first perform semantic tagging on the words inside an organization name and in its context, and then construct a Semantic Hidden Markov Model for organization name recognition. When selecting the context of an organization name, we use the symbiotic word field phenomenon to decide the boundary of the context. In effect, the algorithm attempts to exploit the relevance between an organization name and its context to improve recognition. The experimental results show that our algorithm achieves better results than the other approaches compared and is better able to handle different types of content.
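To make the idea in contribution (2) concrete, the following is a minimal sketch of decoding a sequence of semantic tags with a plain Hidden Markov Model and the standard Viterbi algorithm. It is not the SHMM of the thesis: the two hidden states (`CTX` for context, `ORG` for the inside of an organization name), the semantic tag names, every probability value, and the example sentence are assumptions invented for illustration. The abstract does not specify the actual tag set, the model parameters, or how the symbiotic word field bounds the context.

```python
import math

# Hypothetical hidden states: surrounding context vs. inside an organization name.
STATES = ["CTX", "ORG"]

# All probabilities are invented for illustration; none come from the thesis.
START = {"CTX": 0.8, "ORG": 0.2}
TRANS = {
    "CTX": {"CTX": 0.7, "ORG": 0.3},
    "ORG": {"CTX": 0.4, "ORG": 0.6},
}
# Emission probabilities over hypothetical semantic tags assigned to words.
EMIT = {
    "CTX": {"PLACE": 0.10, "TRADE": 0.05, "ORG_SUFFIX": 0.05, "ACTION": 0.40, "OTHER": 0.40},
    "ORG": {"PLACE": 0.35, "TRADE": 0.30, "ORG_SUFFIX": 0.30, "ACTION": 0.02, "OTHER": 0.03},
}


def viterbi(tags):
    """Return the most probable hidden-state sequence for a list of semantic tags."""
    if not tags:
        return []
    # score[s] is the best log-probability of any path that ends in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][tags[0]]) for s in STATES}
    backpointers = []
    for tag in tags[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[best_prev] + math.log(TRANS[best_prev][s])
                            + math.log(EMIT[s][tag]))
            pointer[s] = best_prev
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return path


def org_spans(words, path):
    """Group consecutive words labeled ORG into candidate organization names."""
    spans, current = [], []
    for word, state in zip(words, path):
        if state == "ORG":
            current.append(word)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Made-up example sentence with hand-assigned semantic tags.
words = ["He", "joined", "Shanghai", "Steel", "Group", "yesterday"]
tags = ["OTHER", "ACTION", "PLACE", "TRADE", "ORG_SUFFIX", "OTHER"]
print(org_spans(words, viterbi(tags)))  # prints ['Shanghai Steel Group']
```

The sketch works in log space to avoid numeric underflow on longer sentences. In a real system the start, transition, and emission probabilities would be estimated from a semantically tagged corpus instead of being written by hand.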
Keywords/Search Tags: business information, competitive intelligence, information extraction, relation extraction, named entity recognition