
Commercial Social Network Creation Based On Information Extraction Technology

Posted on: 2011-07-15
Degree: Master
Type: Thesis
Country: China
Candidate: N X Ji
GTID: 2178330338989575
Subject: Computer Science and Technology
Abstract/Summary:
Among the many types of information on the Internet are numerous electronic documents on financial analysis and from stock analysts, which mention a large number of commercial entities and business relations. A Commercial Social Network (CSN) is a social network built from such texts using text processing technology. Structurally, a CSN is a graph that links commercial organizations through complex business relationships: each node is the name of a commercial entity, and each arc represents a commercial relation. Information Extraction (IE) extracts specific information from structured or semi-structured text; the extracted information is then formatted and stored in a database for users to query and analyze.

In this thesis, IE technology is applied, taking the characteristics of the financial domain into account, to construct a CSN automatically. We focus on commercial entity identification using NLP and on business relationship extraction based on a bootstrapping algorithm. For commercial entity name recognition, we first use word segmentation and statistical methods to determine the part-of-speech (POS) composition of company names, which serves as a training feature for a Conditional Random Field (CRF); we then use statistical methods to identify the context in which company names appear; finally, we use the CRF to integrate the selected features and train a model that recognizes company names in plain text. Evaluated with N-fold cross-validation, full company name recognition achieved a precision of 94.6%, a recall of 91.4%, and an F-value of 92.9%. We also use a CRF to label company abbreviations, training a new model on characteristic features. Evaluated with N-fold cross-validation, it achieved a precision of 93.4%, a recall of 85.6%, and an F-value of 89.3%.

To extract relationships automatically, we first label the company names in plain text and then apply the bootstrapping algorithm, given a good seed set, to extract business relationships. Compared against manual annotation on a small, randomly selected sample, the accuracy reached 66.8%.
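
The abstract does not spell out how the bootstrapping step works, so the following is only a minimal sketch of a generic bootstrapping loop for relation extraction, not the thesis's actual implementation. Everything in it is an assumption made for illustration: the `[ORG:Name]` labeling format, the use of the words between two company names as the "pattern", the function names `candidate_triples`, `bootstrap`, and `build_csn`, and the example sentences and company names.

```python
import re

# Assumption: company names were already labeled as [ORG:Name] by an earlier step.
ORG_PAIR = re.compile(r"\[ORG:([^\]]+)\](.*?)\[ORG:([^\]]+)\]")


def candidate_triples(sentences):
    """Yield (company_a, context, company_b) for labeled pairs in each sentence."""
    for sentence in sentences:
        for a, middle, b in ORG_PAIR.findall(sentence):
            context = middle.strip()
            if context:  # skip pairs with no words between them
                yield a, context, b


def bootstrap(sentences, seed_pairs, max_rounds=5):
    """Grow a set of related company pairs from a small seed set."""
    triples = list(candidate_triples(sentences))
    pairs, patterns = set(seed_pairs), set()
    for _ in range(max_rounds):
        # 1. Learn context patterns from sentences that mention a known pair.
        patterns |= {ctx for a, ctx, b in triples if (a, b) in pairs}
        # 2. Apply the patterns to find pairs that occur in the same contexts.
        found = {(a, b) for a, ctx, b in triples if ctx in patterns}
        if found <= pairs:  # nothing new was found, so stop early
            break
        pairs |= found
    return pairs, patterns


def build_csn(pairs):
    """Build the graph: node = company name, arc = extracted relation."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())  # make sure the target is also a node
    return graph


# Hypothetical example data; the company names are invented.
sentences = [
    "[ORG:Alpha Corp] acquired [ORG:Beta Ltd] last year.",
    "[ORG:Gamma Inc] acquired [ORG:Delta Co] in March.",
    "[ORG:Gamma Inc] signed a supply deal with [ORG:Delta Co].",
    "[ORG:Epsilon SA] signed a supply deal with [ORG:Zeta AG].",
]
seeds = {("Alpha Corp", "Beta Ltd")}

pairs, patterns = bootstrap(sentences, seeds)
csn = build_csn(pairs)
print(sorted(pairs))
print(sorted(patterns))
```

In this toy run a single seed pair teaches the loop one context, which yields a second pair, which in turn teaches a second context. Because the sketch accepts every pattern without scoring it, errors can propagate from round to round, which is one reason the quality of the seed set matters in bootstrapping approaches generally.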
Keywords/Search Tags: Commercial Social Network, IE, Conditional Random Field, Bootstrapping