The Method Of Extracting Complex Indicators From Long Text

Posted on: 2017-01-23
Degree: Master
Type: Thesis
Country: China
Candidate: X M Fan
GTID: 2308330503487055
Subject: Computer technology
Abstract/Summary:
In power grid engineering design review, a persistent problem is that experts cannot accurately extract the key technical indicators they need from power grid engineering design documents. As a result, the experts responsible for evaluation have to gather a large amount of information about key technical indicators on site, which leads to inconsistent evaluation criteria, low evaluation efficiency, and poor results. This thesis proposes an intelligent method that automatically extracts a large number of complex indicators from long texts in the power grid domain. The method not only improves the efficiency and quality of review work but also saves manpower and material resources.

The preliminary design documents and feasibility study reports of power grid engineering may contain tens of thousands or even hundreds of thousands of characters, the structure between paragraphs is complex, and there are many key technical indicators to extract. These indicators, which include substation indicators and line indicators, fall into 6 categories and total about 276 indicators.

To address this problem, we propose a sequence labeling method based on conditional random fields (CRFs) to extract index values, index attributes, alternative plans, and sub-project names. Following the indicator extraction procedure, we focus on constructing the indicator system and label system, extracting features and constructing dictionaries, training the model with CRFs, and improving the model based on error analysis. For pre-processing and post-processing, we introduce a method that identifies the structure of a document, segments it, and recombines the segments into six sections, in order to eliminate interference between different indicators, achieve parallel processing, and reduce response time.
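The sequence labeling formulation described above can be illustrated with a minimal sketch of the step that turns per-token labels into extracted spans. The abstract does not give the thesis's actual label system, so the BIO scheme and the label names `ATTR` and `VALUE` below are assumptions made for illustration, as is the toy sentence. The labels are supplied by hand here; in the described system they would be predicted by the trained CRF model.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], labels: List[str], sep: str = " ") -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label_type, text) spans.

    A "B-X" label opens a span of type X, "I-X" continues an open span
    of the same type, and anything else (including a mismatched "I-")
    closes the current span.
    """
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Emit the span collected so far, if any.
        if current_type is not None:
            spans.append((current_type, sep.join(current_tokens)))

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()
            current_type = label[2:]
            current_tokens = [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            flush()
            current_type = None
            current_tokens = []
    flush()
    return spans


# Hypothetical example: label names and the sentence are illustrative only.
tokens = ["main", "transformer", "capacity", "is", "180", "MVA"]
labels = ["B-ATTR", "I-ATTR", "I-ATTR", "O", "B-VALUE", "I-VALUE"]
print(decode_bio(tokens, labels))
```

For character-level labeling of Chinese text, passing `sep=""` would join the characters of a span without inserting spaces.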
At the same time, by using structure information recognition, the method can match the acronym or nickname of a sub-project with its full standard name and identify the scope of each sub-project. It can also identify the scope of each plan, the plan recommended by the document's authors, and all the attributes of each indicator. We integrated the system with the South Network Transmission Design Review Platform through a communication mechanism based on database flags.

We evaluate the system by comparing experimental results. The results show that our system substantially outperforms rule-based methods. The test dataset contains ten documents whose indicators were manually extracted by experts from power grid text. Our system achieves an F1 score of 80.28%, while the rule-based method achieves only 30.24%. In this thesis, the model is optimized through horizontal comparison experiments, and the performance of the structure information recognition subtask is evaluated. Our method has been applied to the South Network Transmission Design Review Platform.
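For readers unfamiliar with the F1 score reported above, the following is a minimal sketch of how precision, recall, and F1 can be computed by comparing extracted indicators against expert annotations. It assumes exact matching of (indicator, value) pairs, which the abstract does not confirm as the thesis's matching criterion, and the data are made-up toy values, not results from the thesis.

```python
from typing import Set, Tuple


def precision_recall_f1(predicted: Set[Tuple[str, str]],
                        gold: Set[Tuple[str, str]]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 over exactly matching items."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (hypothetical): (indicator name, extracted value) pairs.
gold = {
    ("voltage level", "220 kV"),
    ("main transformer capacity", "180 MVA"),
    ("line length", "35 km"),
    ("number of circuits", "2"),
}
predicted = {
    ("voltage level", "220 kV"),
    ("main transformer capacity", "180 MVA"),
    ("line length", "53 km"),      # wrong value, counts as an error
    ("number of circuits", "2"),
}

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

F1 is the harmonic mean of precision and recall, so a system scores well only when it both finds most of the annotated indicators and produces few incorrect ones.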
Keywords/Search Tags: complex indicator extraction, sequence labeling, information extraction, structure information recognition