
Research On Information Extraction Method For Multi-sources Data

Posted on: 2016-12-28    Degree: Master    Type: Thesis
Country: China    Candidate: Y L Lin    Full Text: PDF
GTID: 2298330467479678    Subject: Computer technology
Abstract/Summary:
With the development of Internet technology, the volume of electronic documents has grown enormously, and these documents contain a great deal of useful information. Users can usually make direct use of structured information only, while the information they need is mostly contained in unstructured or semi-structured text, so it is difficult to use the information in electronic documents directly. To make better use of this information, the target information must be extracted from these documents and stored in a structured form.

In this paper, we extract competing enterprise entities from prospectuses. By examining the documents, we found that each prospectus has a specific section describing the competitors of a company. We call it the Competitor Description Section (CDS). Competitors are described in three forms: list, table and free text. Different kinds of CDS require different extraction methods and present different levels of difficulty. Therefore, we propose a multi-strategy learning algorithm to solve the information extraction task on multi-source data.

The work is divided into two main parts: Competitor Description Section Detection and Multi-Strategy Learning. The first part locates the CDS, while the second part extracts company names from the located CDS. To locate the CDS, we first use heuristic rules to extract most of the CDSs, then label the sections as positive or negative examples, and then perform feature selection over all the sections to select a representative set of words. Finally, we use a classifier to categorize the sections and find the corresponding CDS.

The idea of the multi-strategy learning algorithm is as follows. The list type is the easiest to handle, so we first extract the competitors from the list-type corpus. We then use the extracted competitors as seeds to annotate the other two types. Finally, we use the annotations to automatically generate extraction models, and use these models to extract company names from the other two corpora. Distant supervised learning is used in these steps to avoid manual labeling effort. A benefit of our approach is that no named entity recognition (NER) step is required to identify competitors. Experimental results show that our approach achieves higher precision and recall than traditional NER methods.
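
The seed-based annotation step can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the B-COMP/I-COMP/O tagging scheme, the whitespace tokenization and the sample company names are all assumptions made for illustration. It only shows how names taken from a list-type CDS could be used to label free text without manual annotation.

```python
import re


def extract_list_competitors(list_section: str) -> list[str]:
    """Take one company name per bullet or numbered line of a list-style CDS."""
    names = []
    for line in list_section.splitlines():
        # Strip a leading enumerator or bullet such as "1.", "(2)", "-", "*".
        cleaned = re.sub(r"^\s*(?:\(?\d+[.)]|[-*•])\s*", "", line).strip()
        if cleaned:
            names.append(cleaned)
    return names


def annotate_with_seeds(text: str, seeds: list[str]) -> list[tuple[str, str]]:
    """Label whitespace tokens with B-COMP / I-COMP / O wherever a seed occurs.

    The tag names are hypothetical; the abstract does not specify a scheme.
    """
    tokens = text.split()
    labels = ["O"] * len(tokens)
    # Try longer seeds first so a long name wins over a shorter one inside it.
    for seed in sorted(seeds, key=lambda s: len(s.split()), reverse=True):
        seed_tokens = seed.split()
        n = len(seed_tokens)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            # Ignore punctuation attached to the tokens when comparing.
            window = [t.strip(",.;:()") for t in tokens[i:i + n]]
            untouched = all(label == "O" for label in labels[i:i + n])
            if window == seed_tokens and untouched:
                labels[i] = "B-COMP"
                for j in range(i + 1, i + n):
                    labels[j] = "I-COMP"
    return list(zip(tokens, labels))


# Invented example data: a list-type CDS and a free-text CDS.
list_cds = "1. Acme Corp\n2. Globex Ltd\n- Initech"
free_text_cds = "Our main rivals are Acme Corp, Globex Ltd and others."

seeds = extract_list_competitors(list_cds)
annotated = annotate_with_seeds(free_text_cds, seeds)
```

The labeled tokens in `annotated` could then serve as automatically generated training data for an extraction model over the table and free-text corpora, which is the role distant supervision plays in the approach described above.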
Keywords/Search Tags: Competitor Mining, Unsupervised Learning, Distant Supervision, Wrapper Induction