Hierarchical Information Extraction From Research Papers Based On Conditional Random Fields

Posted on:2010-04-02

Degree:Master

Type:Thesis

Country:China

Candidate:L L Mo

GTID:2178360278962399

Subject:Computer system architecture

Abstract/Summary:

Faced with the massive volume of text generated by the information explosion, how to obtain the needed information quickly and accurately is an issue of common concern. Research on text information extraction arose against this background. Its purpose is to provide tools and methods for obtaining useful information from massive online text quickly and accurately.

Extracting information from research papers makes it possible not only to organize and manage these papers effectively and improve users' retrieval efficiency, but also to carry out many statistical tasks, such as topic analysis, related-paper statistics, and citation analysis of journals, research institutes, individual papers, or scholars. It also helps to identify research hot spots and trends. Automatic extraction of information from research papers is therefore of great research value.

At present, methods based on statistical learning are a relatively new approach to text information extraction. They have achieved good results and are considered to be of great practical value. Among them, text information extraction based on conditional random fields (CRFs) has attracted particular attention.

After a comprehensive analysis of various text information extraction approaches, approaches to information extraction from research papers based on CRFs were mainly studied. The traditional approaches were found to have two limitations: ① the granularity of the text object to be extracted was fixed either at the word level or at the text-block level, so these approaches could not segment and extract the text flexibly at the granularity appropriate to the circumstances; ② during extraction, they could not adequately use the rich overall feature information contained in the text, nor its rich context information.
These limitations were particularly evident when the text consisted of complex fields or contained a large amount of information.

Building on the results of domestic and international researchers, a hierarchical method of information extraction from research papers based on CRFs was proposed. First, according to the layout information, lines whose first character is not a space were combined with the preceding lines into big lines, which were processed as the basic units of extraction. Second, according to the requirements of CRF-based information extraction from research papers, appropriate feature functions were designed for the CRFs. Third, the algorithm used format information such as list separators, newline characters, and line-header characters, and combined it with the CRF feature functions to segment the text hierarchically into lines, blocks, and words of appropriate granularity. Finally, the CRF parameters were obtained through training, and the trained CRFs were applied to information extraction from research papers in specific fields. Experimental results show that the proposed method performs better than CRF-based methods that simply segment the text entirely into words or entirely into blocks.
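
The layout-based preprocessing described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names (`merge_into_big_lines`, `split_into_blocks`, `line_features`), the choice of `,` and `;` as list separators, the treatment of blank lines, and the specific features are all assumptions made for illustration. The sketch only shows how physical lines might be merged into big lines, how a big line might be split into blocks and words, and what per-line features could be handed to a CRF toolkit.

```python
import re
from typing import Dict, List

# Assumed list separators; the thesis does not enumerate them here.
LIST_SEPARATOR_PATTERN = re.compile(r"\s*[,;]\s*")


def merge_into_big_lines(lines: List[str]) -> List[str]:
    """Merge physical lines into 'big lines'.

    Rule taken from the abstract: a line whose first character is not a
    space is combined with the preceding line. Assumption: blank lines
    are skipped and force the next line to start a new big line.
    """
    big_lines: List[str] = []
    start_new = True
    for raw in lines:
        if not raw.strip():
            start_new = True
            continue
        if start_new or raw[0] == " ":
            big_lines.append(raw.strip())
        else:
            big_lines[-1] += " " + raw.strip()
        start_new = False
    return big_lines


def split_into_blocks(big_line: str) -> List[List[str]]:
    """Split a big line into blocks at list separators, then into words."""
    blocks = [b for b in LIST_SEPARATOR_PATTERN.split(big_line) if b]
    return [block.split() for block in blocks]


def line_features(big_line: str) -> Dict[str, object]:
    """Hypothetical layout features for one big line."""
    return {
        "starts_with_digit": big_line[:1].isdigit(),
        "has_list_separator": bool(re.search(r"[,;]", big_line)),
        "has_at_sign": "@" in big_line,
        "word_count": len(big_line.split()),
    }


if __name__ == "__main__":
    page = [
        "  Hierarchical Information Extraction",
        "From Research Papers",
        "  Author One, Author Two",
        "",
        "one@example.org",
    ]
    for unit in merge_into_big_lines(page):
        print(unit, split_into_blocks(unit), line_features(unit))
```

In a full pipeline, the feature dictionaries for the sequence of big lines (and, at the lower levels, for blocks and words) would be passed to a CRF trainer together with field labels such as title or author; that training step is omitted here.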

Keywords/Search Tags:

Information extraction, Conditional random fields, Research papers, Hierarchy, Text lines

Related items

1	Information Recognition And Extraction From Chinese Periodical Papers Based On Conditional Random Fields
2	Research Of The Automatic Metadata Extraction Based On The Conditional Random Fields
3	Web Information Extraction Research Based On Conditional Random Fields
4	Research On Personnel Resume Intelligent Extraction System Based On Conditional Random Fields
5	Research On Web Text Segmentation Based On Conditional Random Fields
6	The Research On Short Text Mining With Conditional Random Fields And Improved LSTM
7	Text Categorization Based On The Conditional Random Fields
8	Research Of Web Text Named Entity Recognition Based On Conditional Random Fields
9	Metadata Extraction Based On Third-order Conditional Random Fields
10	Research And Applications On Text Features Extraction From Science And Technical Literatures