Research And Implementation Of Entity Relation Extraction In Massive Internet Text

Posted on: 2018-05-27
Degree: Master
Type: Thesis
Country: China
Candidate: H M Xin
GTID: 2348330518494410
Subject: Computer technology

Abstract/Summary:
With the rapid development of the Internet, massive Internet text data has brought new opportunities and challenges to research on entity relation extraction. Unlike traditional entity relation extraction, open entity relation extraction needs neither a predefined system of relation types nor a labeled training set, which makes it suitable for Internet text. However, limited by the hardware of a single machine, open entity relation extraction methods cannot process the huge volume of Internet text. In this thesis, we use the Hadoop distributed computing framework to parallelize the single-machine algorithm so that massive Internet text can be processed. The relation indicator words obtained from open entity relation extraction contain many synonyms. By clustering these words, relation types can be abstracted further and entity relations described better. Based on the above, the main work of this thesis is as follows:

1. This thesis proposes UCOERE, a new method for Chinese open entity relation extraction. UCOERE has four stages: preprocessing, sentence decomposition, triple extraction, and triple filtering. Preprocessing performs sentence segmentation, word segmentation, and entity recognition. For the triple extraction stage, the thesis presents an algorithm based on the syntactic parse tree, called the shortest connected distance algorithm. Decomposing complex sentences improves the performance of the parser. Finally, the algorithm filters the relation triples against a dictionary.

2. At present there is no unified evaluation criterion for open entity relation extraction, and it is difficult to compute the precision (correct rate), recall, and other indicators directly. The thesis proposes formulas for precision, recall, and F1 based on sentence sampling, and uses them to evaluate UCOERE.

3. To address the problem that a single machine cannot process massive data, the thesis combines UCOERE with the Hadoop framework and proposes PUCOERE, a parallel implementation of UCOERE.

4. Spectral clustering is robust, performs well, and requires only a similarity matrix as input, which makes it convenient to apply to words. The thesis proposes a new method that builds relation types automatically using spectral clustering and similarity computation between words.
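
The abstract does not reproduce the sampling-based formulas. As a minimal sketch of one plausible reading, assume that a random sample of sentences is annotated by hand with gold triples and that the extractor's output on the same sentences is compared against them by exact match. Under that assumption the correct rate (precision), recall rate, and F1 value could be computed as below. The function names `sample_sentence_ids` and `evaluate`, the dictionary layout, and the exact-match criterion are all hypothetical and chosen for illustration; they are not taken from the thesis.

```python
import random
from typing import Dict, List, Set, Tuple

# A relation triple: (entity 1, relation indicator word, entity 2).
Triple = Tuple[str, str, str]


def sample_sentence_ids(sentence_ids: List[int], k: int, seed: int = 0) -> List[int]:
    """Draw k sentence ids uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(sentence_ids, k)


def evaluate(
    extracted: Dict[int, Set[Triple]],
    gold: Dict[int, Set[Triple]],
    sampled_ids: List[int],
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over the sampled sentences only.

    Assumption: a triple counts as correct only if it exactly matches
    a hand-annotated gold triple for the same sentence.
    """
    n_extracted = n_gold = n_correct = 0
    for sid in sampled_ids:
        ext = extracted.get(sid, set())
        ref = gold.get(sid, set())
        n_extracted += len(ext)
        n_gold += len(ref)
        n_correct += len(ext & ref)

    precision = n_correct / n_extracted if n_extracted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Counting triples over the whole sample before dividing (micro-averaging) keeps sentences with no extracted triples from causing a division by zero; averaging per-sentence scores instead would be an equally defensible choice.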
Keywords/Search Tags: Massive internet text, Open entity extraction, Hadoop, Relation types automated building, Spectral Clustering