Research On Distributed Representation Based On Bigram

Posted on: 2018-05-10    Degree: Master    Type: Thesis
Country: China    Candidate: C Y Ma    Full Text: PDF
GTID: 2348330518495431    Subject: Information and Communication Engineering
Abstract/Summary:
In the field of natural language processing, words and sentences are the most basic units of representation. A word is an abstract representation that often carries multiple meanings, and the relations between different words vary; a sentence can be regarded as a word sequence with a specific syntactic structure and a richer connotation. The objective of research on distributed representation is to assign an appropriate vector representation to each word and sentence, in service of downstream tasks such as information retrieval and semantic mining.

The choice of language model is the basis of research on distributed representation. At present, distributed representation methods based on neural networks adopt the n-gram language model. Under a conditional independence assumption on the text, the n-gram model can be simplified to a bigram model, which reduces the parameter space and alleviates the data sparsity problem. This paper proposes improved distributed representation methods based on the bigram language model, which integrate positional information and syntactic dependency information into the distributed representations. In addition, the construction of a Chinese relation extraction dataset is completed. The main research contents and results are as follows:

First, for the distributed representation of words, a method based on positional information is proposed. This paper argues that the weights in existing dynamic window methods are set manually and therefore cannot reflect the relations between words, so two improved dynamic window weighting schemes are proposed. The first is an adaptive weighting factor method, in which different weighting factors are learned for different corpora. The second is a weight vector method based on KL divergence, which computes a dedicated weight vector for each target word. Both schemes yield significant improvements on word similarity and on semantic and syntactic evaluation benchmarks.

Second, a Chinese relation extraction dataset is constructed. This paper proposes a weakly supervised, semi-automatic construction method for a Chinese relation extraction dataset. With the aid of Wikipedia, the SogouCA news corpus, and the Baidu API, weakly supervised sentence extraction is achieved, and semantic annotation is performed with a recurrent neural network. The resulting dataset was selected as the evaluation corpus for the Chinese Opinion Analysis Evaluation (COAE) task, which has played a role in promoting the development of Chinese relation extraction.

Third, for the distributed representation of sentences, this paper proposes an improved relation extraction algorithm based on dependency paths. Dependency parsing is used to change the input structure of the neural network. A series of comparative experiments shows that incorporating traditional natural language processing features into the neural network structure is very effective.
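As a worked illustration of the bigram simplification mentioned above: under the conditional independence assumption, the chain-rule factorization of a sentence probability reduces each history to the single preceding word,

P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1}),

so that only pairwise (bigram) statistics need to be estimated, which is what shrinks the parameter space and eases data sparsity.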
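The abstract does not give the exact formulation of the KL-divergence-based weight vectors, so the following Python sketch is only an illustration of the general idea; the function name positional_weights and all implementation details are assumptions, not the thesis's actual method. It scores each window offset by how strongly the word distribution observed at that offset diverges from the overall unigram distribution, so that more informative positions receive larger weights.

import math
from collections import Counter, defaultdict

def positional_weights(corpus, window=5):
    """Illustrative sketch (not the thesis's method): weight each absolute
    window offset d by the KL divergence between the word distribution seen
    at offset d and the corpus-wide unigram distribution."""
    background = Counter()            # overall unigram counts
    per_offset = defaultdict(Counter) # counts of context words at each offset
    for sentence in corpus:
        for i, w in enumerate(sentence):
            background[w] += 1
            for d in range(1, window + 1):
                if i + d < len(sentence):
                    per_offset[d][sentence[i + d]] += 1
                if i - d >= 0:
                    per_offset[d][sentence[i - d]] += 1
    total_bg = sum(background.values())
    weights = {}
    for d, counts in per_offset.items():
        total = sum(counts.values())
        kl = 0.0
        for w, c in counts.items():
            p = c / total                 # P(word | offset d)
            q = background[w] / total_bg  # P(word) overall
            kl += p * math.log(p / q)
        weights[d] = kl
    z = sum(weights.values()) or 1.0      # normalise weights over offsets
    return {d: v / z for d, v in weights.items()}

# usage on a toy corpus
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "lay", "on", "the", "rug"]]
print(positional_weights(corpus, window=2))

In the thesis's per-target-word variant, a separate weight vector would presumably be estimated from each target word's own context distributions; the sketch above only shows a corpus-level version.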
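For the dependency-path-based relation extraction described above, the key preprocessing step is to keep only the tokens on the syntactic path between the two candidate entities and feed that pruned sequence to the network instead of the whole sentence. Below is a minimal, self-contained Python sketch, assuming the sentence has already been parsed into a head array; the function name dependency_path and the input format are assumptions for illustration, not the thesis's exact pipeline.

from collections import deque

def dependency_path(heads, i, j):
    """Shortest path between tokens i and j in a dependency tree, where
    heads[k] is the index of the head of token k (-1 for the root)."""
    # build an undirected adjacency list over the tree
    adj = {k: [] for k in range(len(heads))}
    for k, h in enumerate(heads):
        if h >= 0:
            adj[k].append(h)
            adj[h].append(k)
    # breadth-first search from i towards j
    prev = {i: None}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        if u == j:
            break
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                queue.append(v)
    # reconstruct the path from j back to i
    path, u = [], j
    while u is not None:
        path.append(u)
        u = prev[u]
    return path[::-1]

# toy example (not from the thesis): "Ma works at Huawei"
tokens = ["Ma", "works", "at", "Huawei"]
heads = [1, -1, 1, 2]                 # Ma->works, works=root, at->works, Huawei->at
path = dependency_path(heads, 0, 3)
print([tokens[k] for k in path])      # ['Ma', 'works', 'at', 'Huawei']

The pruned token sequence returned this way would then replace the raw sentence as the neural network's input, which is the change of input structure the abstract refers to.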
Keywords/Search Tags: distributed representation, bigram, position weight, dataset, relation extraction