Domain Term Automatic Acquisition From Unstructured Texts

Posted on: 2008-05-25
Degree: Master
Type: Thesis
Country: China
Candidate: Z M Su
Full Text: PDF
GTID: 2178360215458077
Subject: Computer application technology
Abstract/Summary:
With the rapid development of new technologies, digital literature such as technical documents and white papers, which is a kind of unstructured text, is increasing dramatically. Efficient acquisition of domain terms from these unstructured texts plays an important role in constructing digital libraries, domain ontologies, domain gazetteers, and so on. The dictionary-based, rule-based, and statistical approaches each have shortcomings: the dictionary-based and rule-based approaches need the help of domain specialists and cost considerable time and manual labor, and the statistical approach cannot represent the various kinds of features of domain terms. Following the theory of statistical learning and research on information extraction, this thesis therefore studies three approaches to automatic domain term acquisition: the classification approach, the sequence data labeling approach, and the reranking approach.
First, this thesis converts the automatic domain term acquisition problem into an information extraction (IE) task, defines the input, output, and task description from the perspective of IE, and then proposes the mechanism and procedure of domain term acquisition based on statistical learning theory. It also identifies three core research tasks for term acquisition: text preprocessing, feature representation, and the comparison and choice of the statistical learning model.
Next, this thesis studies the mechanisms of the classification approach, the sequence data labeling approach, and the reranking approach, and analyzes the problems each poses for term acquisition. It also proposes a different feature representation strategy for each of the three approaches, and reports extensive experiments that verify the performance of the proposed approaches.
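As a rough illustration of how term acquisition can be cast as sequence data labeling, the sketch below reads terms off a sentence whose tokens carry B (begin), I (inside), and O (outside) tags. The BIO scheme, the function name `extract_terms`, and the example sentence are assumptions made for illustration; the abstract does not state which tag set the thesis uses.

```python
from typing import List


def extract_terms(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect each B-then-I run of a BIO-tagged sentence as one term."""
    terms: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new term starts; close any term that was still open.
            if current:
                terms.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            # Continue the term that is currently open.
            current.append(token)
        else:
            # "O" (or a stray "I" with no open term) ends the current term.
            if current:
                terms.append(" ".join(current))
            current = []
    if current:
        terms.append(" ".join(current))
    return terms


# Hypothetical biomedical sentence with hand-assigned tags.
tokens = ["The", "IL-2", "gene", "activates", "T", "cells"]
tags = ["O", "B", "I", "O", "B", "I"]
terms = extract_terms(tokens, tags)
```

In this framing a labeling model such as CRF would predict the tag sequence, and the terms follow mechanically from the tags.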
Experimental results show that our feature representation strategy supports automatic domain term acquisition from unstructured text well and yields a large performance improvement over the baseline approach provided by the Genia project. Moreover, we combine the reranking approach, which has not been studied by other researchers recently, with a sequence data labeling model such as CRF: terms are acquired serially, the several candidates are then reranked by ranking SVM, and finally only the top candidate of the reranked results is used. The terms are then obtained from the top candidate and its sentence. Further experiments show that the reranking approach outperforms the other two approaches, the classification model and the sequence data labeling model.
Although the three proposed statistical-learning-based approaches perform better than the baseline approach provided by the Genia project, their recall, precision, and F-measure could be improved by using a richer feature set and external resources such as gazetteers and the MEDLINE corpus. Also, bringing the idea of cost-sensitive learning into ranking SVM may reduce the error rate of the top candidate, so that domain terms are acquired more accurately.
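The reranking step and the evaluation measures can be sketched as follows, under stated assumptions. Each candidate is a pair of a term list and a feature vector, and a linear scoring function stands in for a trained ranking SVM. The feature values, the weights, the gold term set, and all function names are hypothetical and chosen only to make the example concrete; they are not taken from the thesis.

```python
from typing import List, Sequence, Set, Tuple

# (terms proposed by this candidate labeling, feature vector of the candidate)
Candidate = Tuple[List[str], List[float]]


def linear_score(features: Sequence[float], weights: Sequence[float]) -> float:
    """Score a candidate with a linear function, as a ranking SVM would after training."""
    return sum(w * f for w, f in zip(weights, features))


def rerank(candidates: List[Candidate], weights: Sequence[float]) -> List[Candidate]:
    """Order candidates from highest to lowest score."""
    return sorted(candidates, key=lambda c: linear_score(c[1], weights), reverse=True)


def precision_recall_f1(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F-measure over sets of terms."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical features: [log-probability from the labeling model, number of terms].
candidates: List[Candidate] = [
    (["IL-2 gene"], [-1.0, 1.0]),             # the labeling model's own first choice
    (["IL-2 gene", "T cells"], [-1.2, 2.0]),  # lower model score, but more terms
]
weights = [1.0, 0.5]  # hand-set here; a ranking SVM would learn these

# Only the top candidate of the reranked list is used.
top_terms, _ = rerank(candidates, weights)[0]

gold = {"IL-2 gene", "T cells", "NF-kappa B"}
precision, recall, f1 = precision_recall_f1(set(top_terms), gold)
```

With these made-up numbers the second candidate overtakes the labeling model's first choice after reranking, which is the kind of correction the reranking approach is meant to provide.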
Keywords/Search Tags:Domain term, Statistical learning, Information extraction, Sequence data labeling, Reranking