Research Of The Automatic Metadata Extraction Based On The Conditional Random Fields

Posted on: 2010-10-08
Degree: Master
Type: Thesis
Country: China
Candidate: N Hou
Full Text: PDF
GTID: 2178360302959005
Subject: Computer application technology
Abstract/Summary:
With the development of digital libraries, electronic documents have become the main source of information for many people. To help readers find research papers efficiently and effectively, metadata extraction technology has attracted the attention of many researchers. Automatic metadata extraction addresses a long-standing problem: traditionally, people had to read each document to locate the metadata and then enter it into a database by hand. It helps organize information in an orderly way, manage it appropriately, and find it easily. As machine learning theory has matured, automatic metadata extraction has become a current research hotspot. This thesis focuses on automatic metadata extraction based on conditional random fields.

First, it proposes a text segmentation technique. Traditional metadata extraction methods operate on the individual words that make up the research paper header, which makes the extraction task large and the accuracy low. The segmentation process is described in detail. After segmentation, the fields to be extracted correspond to text blocks. Because some states contain special words, the states of some blocks can be decided using extraction rules. The states of the remaining blocks are then computed using a heuristic search algorithm.

Second, to extract citation metadata accurately, given that citation formats vary and the fields to be extracted are adjacent to one another, a reranking-based approach is proposed. This method takes the results produced by the conditional random fields and reranks the candidate labels to complete the citation metadata extraction.

Finally, all the techniques presented in this thesis are analyzed and verified. They are then compared with existing typical algorithms, and prospects for future research are discussed.
Keywords/Search Tags: Metadata extraction, Conditional random fields, Text block, Heuristic search, Reranking
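
The two ideas named in the abstract, deciding some header blocks by rule and reranking candidate label sequences for citations, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the cue patterns in `RULES`, the label names, the `fragmentation` penalty, and the `penalty_weight` value are hypothetical and are not taken from the thesis, which does not state its rules or its reranking features in this abstract. The candidate scores stand in for scores that a conditional random field would supply.

```python
import re

# Hypothetical cue patterns; the thesis abstract does not list its actual rules.
RULES = [
    ("email", re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")),
    ("abstract", re.compile(r"^\s*abstract\b", re.IGNORECASE)),
    ("keyword", re.compile(r"^\s*(keywords?|index terms)\b", re.IGNORECASE)),
    ("affiliation", re.compile(r"\b(university|institute|department)\b", re.IGNORECASE)),
]


def label_blocks_by_rules(blocks):
    """Return one label per text block, or None when no rule fires.

    Blocks left as None would be handed to a later step (the abstract
    mentions a heuristic search, which is not reproduced here).
    """
    labels = []
    for block in blocks:
        label = None
        for name, pattern in RULES:
            if pattern.search(block):
                label = name
                break
        labels.append(label)
    return labels


def fragmentation(labels):
    """Count how often a field label reappears after being interrupted."""
    seen = set()
    previous = None
    count = 0
    for label in labels:
        if label != previous:
            if label in seen:
                count += 1
            seen.add(label)
            previous = label
    return count


def rerank(candidates, penalty_weight=2.0):
    """Sort (label_sequence, score) pairs by score minus a fragmentation penalty.

    The score is assumed to be a log-score from a sequence model such as a CRF;
    the penalty is an illustrative global feature, not the thesis's own.
    """
    return sorted(
        candidates,
        key=lambda c: c[1] - penalty_weight * fragmentation(c[0]),
        reverse=True,
    )
```

In this sketch a candidate that splits one field into two separated runs (for example author, title, author) is pushed below a slightly lower-scoring candidate that keeps each field contiguous, which is one plausible reason to rerank when citation fields sit next to each other.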