A Research To CRF-based Automatic Subject Indexing For Chinese Books

Posted on: 2014-02-01
Degree: Master
Type: Thesis
Country: China
Candidate: J L Zou
GTID: 2248330395995920
Subject: Information Science
Abstract/Summary:
As the amount of information explodes, making full and effective use of information resources requires organizing and describing them well, so that efficient information retrieval systems can be built. Books are among the most important carriers of information, so describing them well, including classification and subject indexing, has real practical significance. Automatic subject indexing of Chinese books is difficult because, unlike Western languages, Chinese has no explicit word delimiters and its semantics are complex. This thesis converts the task into a sequence labeling problem, which allows machine learning methods from the field of information extraction to be applied. We train on a large body of existing manual subject indexing data for Chinese books to produce a template that contains the sequential semantic relationships between entities together with the rule definitions, and then use the template to predict the subject terms of a book automatically. As for the choice of machine learning model, the Naive Bayes model and the Maximum Entropy model both require a conditional independence assumption, which ignores the fact that the random variables are connected, and the Hidden Markov Model suffers from the label-bias problem and cannot reflect long-distance dependencies between entities. The Conditional Random Fields (CRF) model avoids these problems and performs well on sequence labeling tasks, so it is the model chosen in this thesis.
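The conversion to sequence labeling described above can be illustrated with a minimal sketch. It assumes character-level tokens and a B/I/O tag scheme; neither is stated in the abstract, and the function name `bio_tags` and the sample title are hypothetical.

```python
def bio_tags(chars, subject_terms):
    """Tag each character as B (begins a subject term), I (inside one), or O (outside).

    Assumption: character-level tokens and a B/I/O scheme; the thesis may
    use a different token unit or tag set.
    """
    tags = ["O"] * len(chars)
    # Try longer terms first so a long term is not split by a shorter one.
    for term in sorted(subject_terms, key=len, reverse=True):
        n = len(term)
        if n == 0:
            continue
        for i in range(len(chars) - n + 1):
            span_is_free = all(t == "O" for t in tags[i:i + n])
            if span_is_free and "".join(chars[i:i + n]) == term:
                tags[i] = "B"
                for j in range(i + 1, i + n):
                    tags[j] = "I"
    return tags


# Hypothetical book title with one manually assigned subject term.
title = list("信息检索系统设计")
print(list(zip(title, bio_tags(title, ["信息检索"]))))
```

Pairs of characters and tags like these would form the training sequences; a trained model then predicts the tags for an unseen title, and the B/I runs are read back as subject terms.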
CRF parameter selection also affects the labeling performance of the system. The author therefore runs a series of comparison experiments to determine the best values of the parameters involved in automatic subject indexing for Chinese books: the size of the training set, the word-window length of the feature template, the number of feature groups in the feature template, the feature function frequency threshold, and the soft-boundary parameter. Further experiments investigate how different observable features influence subject indexing, and identify four observable features that improve indexing performance. On this basis, we establish a CRF-based automatic subject indexing model for Chinese books, and experiments show the feasibility and practicality of the model. Finally, we summarize some key issues in the model-building process and in follow-up work.
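Two of the parameters listed above, the word-window length and the feature frequency threshold, can be sketched as follows. This is an illustration under assumptions, not the thesis's template: the feature string format `U[offset]=token`, the `<PAD>` marker, and both function names are hypothetical.

```python
from collections import Counter


def window_features(tokens, i, window=2):
    """Return one feature per position in the window [-window, +window] around token i."""
    feats = []
    for off in range(-window, window + 1):
        j = i + off
        tok = tokens[j] if 0 <= j < len(tokens) else "<PAD>"  # pad past the sequence edges
        feats.append(f"U[{off}]={tok}")
    return feats


def build_feature_set(sequences, window=2, min_freq=2):
    """Keep only features seen at least min_freq times in the training sequences."""
    counts = Counter()
    for tokens in sequences:
        for i in range(len(tokens)):
            counts.update(window_features(tokens, i, window))
    return {feat for feat, c in counts.items() if c >= min_freq}


print(window_features(list("abc"), 0, window=1))
print(build_feature_set([list("ab"), list("ac")], window=0, min_freq=2))
```

A wider window gives the model more context but produces more features, and a higher threshold prunes rare ones, which is why such values are typically tuned by comparison experiments.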
Keywords/Search Tags: CRF, Information Extraction, Subject Indexing, Automatic Indexing, Information Description