The Learning Of Single Class Center And The Application In Binary Relation Extraction

Posted on: 2009-12-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z S Li
GTID: 1118360272485625
Subject: Computer application technology
Abstract/Summary:
Extracting binary relations from the web is an important research direction in information extraction. Many published methods rely on a self-training mechanism: an initial system is trained on a small labeled data set, and the system then labels reliable candidate data and retrains on it to improve its performance. The literature shows that these methods are effective for binary relation extraction, but no prior work analyzes them rigorously.

This dissertation recasts pattern learning in binary relation extraction as learning the centre of a single text class. In the text vector space, vectors in a small neighborhood of the initial centre can be labeled as reliable data. The key question is: what properties must a data set have so that the self-labeled data are guaranteed to improve the learning of the single class centre?

The dissertation answers this question by studying the properties of the text vector space. To overcome the shortcomings of the Gaussian mixture model in describing the distribution of "hard" data sets, it proposes a new model, the TGMK model, built on the partitions produced by the k-means algorithm, and clarifies the relationships among the TGMK model, the k-means algorithm, and the Gaussian mixture model. Experimental results show that the TGMK model is suitable for describing multi-class text data sets. Building on k-means, the dissertation also proposes a new algorithm, the single-mean algorithm, and proves that if a multi-class data set is well described by the 1-TGMR model, a generalization of the TGMK model, then the centre output by the single-mean algorithm converges from the initial centre to the actual centre of the data set. Together, these results answer the key question.

Experiments show that the single-mean algorithm is effective on multi-class text data sets, which in turn indicates that self-training-based learning is effective for binary relation extraction. Based on the single-mean algorithm, the dissertation builds a formal learning model for binary relation extraction and proposes a new scoring method for candidates and a new self-labeling method tailored to the particular characteristics of extracting binary relations from the web.

The formal model is applied to acquire Chinese Q-A patterns and Chinese-English terminology pairs. Unlike previous work, the dissertation learns Chinese Q-A patterns and Chinese-English terminology patterns through self-training, which minimizes the dependence on labeled data. Heuristic rules are also used to improve the scoring of patterns and candidates. Experimental results show that, compared with similar systems, the resulting systems perform better with smaller labeled data sets.
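
The abstract does not give the update rule of the single-mean algorithm, so the following is only a rough sketch of the general idea it describes: start from an initial centre, treat the vectors in a small neighborhood of that centre as reliably self-labeled, and re-estimate the centre from them. The function names, the cosine-similarity neighborhood, and the `threshold` parameter are all assumptions made for illustration and are not taken from the dissertation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def single_centre_self_training(vectors, initial_centre, threshold=0.9, max_iter=50):
    """Hypothetical single-centre self-training loop (not the author's algorithm).

    Repeatedly self-labels the vectors whose similarity to the current
    centre is at least `threshold`, then moves the centre to their mean.
    Stops when the self-labeled set is empty or no longer changes.
    """
    centre = list(initial_centre)
    members = []
    for _ in range(max_iter):
        new_members = [i for i, v in enumerate(vectors)
                       if cosine(v, centre) >= threshold]
        if not new_members or new_members == members:
            break
        members = new_members
        # Re-estimate the centre as the mean of the self-labeled vectors.
        centre = [sum(vectors[i][d] for i in members) / len(members)
                  for d in range(len(centre))]
    return centre, members
```

In this sketch only one class centre is tracked, and vectors outside the neighborhood are simply ignored rather than assigned to competing centres as in k-means. Whether such a loop reaches the true class centre depends on the distribution of the data, which is the question the dissertation addresses.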
Keywords/Search Tags: self-training, the learning of centre of single class, the extraction of binary relation, Gaussian mixture model, learning of Q-A patterns, learning of terminology patterns