Research Of Chinese Personal Social Relation Extraction Based On News Data

Posted on: 2017-02-24 | Degree: Master | Type: Thesis
Country: China | Candidate: W Liu | Full Text: PDF
GTID: 2308330485451676 | Subject: Computer application technology
Abstract/Summary:
As the Internet grows, the volume of information and data it holds increases continuously. Information extraction aims to extract structured data from the vast amount of unstructured data on the Internet. Entity relation extraction is a subtask of information extraction and has become a hot research topic in data mining and information retrieval. Personal relation extraction is a form of entity relation extraction; the personal relation tuples it produces can be used to build personal relation networks and question answering systems, and therefore have quite high application value. However, current research on relation extraction focuses mainly on English corpora; progress on Chinese data is slow, and the task is more complex. Relation extraction methods based on machine learning have become the current research focus because they perform quite well. According to how the training data are acquired, this dissertation studies three methods, based on semi-supervised learning, distant supervision, and unsupervised learning. The main contributions are as follows:

1. Relation extraction based on supervised learning depends heavily on manually annotated training data, and manual annotation is too costly. To improve relation extraction performance with less annotated data, the dissertation studies semi-supervised relation extraction. A semi-supervised learning algorithm based on label propagation can improve relation extraction performance with a small amount of annotated data, but selecting the training samples at random affects that performance. To improve label-propagation-based relation extraction, the dissertation combines label propagation with active learning for personal relation extraction. The method actively selects the samples that are most helpful for relation classification, which reduces the number of ineffective annotated samples and improves system performance for the same amount of labeled data.

2. In current relation extraction research, distant supervision is usually used to construct training data automatically, but the distant supervision assumption is inaccurate and introduces noise into the training data. The dissertation addresses this problem and proposes a noise filtering method for training data based on a scoring function, which reduces the noise in training data obtained by distant supervision. In addition, because the precision of current relation extraction systems is unsatisfactory, the dissertation applies word embedding technology, extracting word-embedding-based features from the sentence text to improve the performance of the personal relation extraction system.

3. The methods above require relation classes to be predefined before relation extraction is performed to acquire relation samples. This limits the relation classes that the extraction models can acquire, so relation tuples of new relation classes cannot be obtained. The dissertation proposes a relation extraction method based on unsupervised learning, which needs neither training data nor predefined relation classes. First, highly related person pairs are obtained from news title data. Second, news data containing the related person pairs are crawled and preprocessed, and keywords in the sentences that contain the person pairs are extracted with TF-IDF. Third, correlations between words are derived from word co-occurrence information, and a keyword correlation network is constructed. Finally, the personal relations are obtained by graph clustering analysis on the correlation network.
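
The unsupervised pipeline in the third contribution (TF-IDF keywords, a co-occurrence network, graph clustering) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the dissertation: the toy English tokens stand in for segmented Chinese sentences with the person names already removed, `top_k` and `min_count` are hypothetical thresholds, and connected components over thresholded edges is a simple stand-in for the graph clustering step, whose actual algorithm the abstract does not name.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

# Toy tokenized sentences (assumed to mention a person pair, names removed).
SENTENCES = [
    ["said", "married", "wedding"],
    ["said", "wedding", "married"],
    ["said", "coach", "team"],
    ["said", "team", "coach"],
]

def tfidf_keywords(sentences, top_k=2):
    """Return the top_k tokens of each sentence ranked by TF-IDF."""
    n = len(sentences)
    # Document frequency: number of sentences containing each token.
    df = Counter(tok for sent in sentences for tok in set(sent))
    result = []
    for sent in sentences:
        tf = Counter(sent)
        scores = {t: (c / len(sent)) * math.log(n / df[t]) for t, c in tf.items()}
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:top_k])
    return result

def cooccurrence_graph(keyword_lists, min_count=2):
    """Link two keywords if they co-occur in at least min_count sentences."""
    counts = Counter()
    for kws in keyword_lists:
        for a, b in combinations(sorted(set(kws)), 2):
            counts[(a, b)] += 1
    graph = defaultdict(set)
    for (a, b), c in counts.items():
        if c >= min_count:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def connected_components(graph):
    """Group keywords into clusters (stand-in for graph clustering)."""
    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters

keywords = tfidf_keywords(SENTENCES, top_k=2)
graph = cooccurrence_graph(keywords, min_count=2)
clusters = connected_components(graph)
print([sorted(c) for c in clusters])  # [['coach', 'team'], ['married', 'wedding']]
```

In this toy run, each resulting cluster groups words that hint at one relation type (a marriage relation and a coaching relation), without any predefined relation classes. The token "said" occurs in every sentence, so its IDF is zero and it is never chosen as a keyword.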
Keywords/Search Tags: Information Extraction, Personal Social Relation Extraction, Label Propagation, Active Learning, Distant Supervision Learning, Word Embedding, Word Correlation, Correlation Network, Graph Clustering