
Research On Immunology Principles Based Word Representation And Its Application

Posted on: 2016-12-09  Degree: Doctor  Type: Dissertation
Country: China  Candidate: F Yang  Full Text: PDF
GTID: 1108330479478648  Subject: Computer application technology
Abstract/Summary:
Machine learning is the main research method in natural language processing (NLP), and words need to be represented as mathematical objects. A word representation is a mathematical object associated with each word; in current research it is usually a vector in which the value of each dimension corresponds to a feature. Word representations can be learned in advance in a task-unspecific and model-agnostic manner. Once learned, they are easily shared with other researchers and easily integrated into existing NLP systems. Word representations have been widely applied to many NLP tasks, such as word similarity computing, relational similarity computing, part-of-speech (POS) tagging, sentence parsing, named entity recognition (NER), and sentiment analysis.

Word representation methods are inspired by the distributional hypothesis, i.e., words that occur in the same contexts tend to have similar meanings. Therefore, most current research on word representation applies statistical machine learning methods to learn a word vector from the context of each word in a large corpus. Since most statistical machine learning methods lack the ability of continuous learning, word representations must be learned in one lump sum on a corpus of a certain scale. The critical problem of existing word representation methods is this lack of continuous learning, which constrains their further application. To overcome this problem, this research introduces human immunology principles to construct a multi-word-agent autonomous learning model that learns word representations from a corpus. First, analogies between language and the immune system are systematically analyzed; then, on the basis of these analogies, words are simulated as lymphocytes, and clonal selection and the immune network are introduced to construct the multi-word-agent autonomous learning model; finally, the proposed word representation method is applied to the tasks of word similarity computing, relational similarity computing, and NER on Chinese electronic medical records (CEMR) to validate its effectiveness. This thesis consists of the following five parts.

First, analogies between language and the immune system are systematically compared, and the inspirations drawn from them are analyzed comprehensively. As the theoretical foundation of this research, the analogies are compared in detail from three aspects. The first analogy is the shared characteristic of continuous learning; the second exists between words, the basic structural units of language, and B cells, important lymphocytes of the immune system; the third is that the language network and the immune network share the properties of complex networks. These three aspects of analogy greatly inspire the design and construction of the proposed model.

Secondly, an immunology-principle-based word representation method is proposed, together with a multi-word-agent autonomous learning model for learning word representations. With the agent-based modeling method under the framework of autonomy-oriented computing, the model is constructed based on the clonal selection principle and the immune network theory. In the model, words are simulated as B cells, and the properties of words, divided into dominant properties and dependent properties, are simulated as receptors of B cells; a minimal sketch of this word-agent structure is given below.
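The following Python sketch illustrates this word-agent view under stated assumptions: the class name WordAgent, the vector dimension, and the cosine-based affinity are illustrative choices rather than the thesis's actual implementation, and the recognition method anticipates the matching mechanism described in the next paragraph.

import numpy as np

class WordAgent:
    """A word simulated as a B cell, carrying two receptor vectors."""
    def __init__(self, word, dim=50, rng=None):
        rng = rng or np.random.default_rng()
        self.word = word
        # Dominant properties: how the word governs other words in context.
        self.dominant = rng.normal(size=dim)
        # Dependent properties: how the word is governed by other words.
        self.dependent = rng.normal(size=dim)

    def affinity(self, other):
        """Recognition between B cells: match this agent's dominant receptors
        against another agent's dependent receptors (cosine is an assumption)."""
        a, b = self.dominant, other.dependent
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))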
Moreover, dependency relations between words are viewed as recognition between B cells, realized by matching the dominant properties of one word against the dependent properties of another. Following these immunology principles, the proposed model regulates the combination strength between words, and word representations are learned as a result.

Thirdly, a word similarity computing method is proposed based on the word representation method, and experiments are conducted to validate the effectiveness of both the word representation method and the word similarity computing method. In this thesis, each word is represented as two vectors, one for dominant properties and the other for dependent properties. The distributional hypothesis is extended as follows: if two words share both dominant contexts and dependent contexts, the two words may be similar. Word similarity therefore needs to combine the similarity of dominant properties with the similarity of dependent properties (sketched below). The proposed word similarity computing method performs effectively on the evaluation data.

Fourthly, a relation representation method and a relational similarity computing method are proposed based on the word representation method, and experiments are conducted to validate the effectiveness of both methods. Combinative relations between words are generalized to semantic relations, so a word relation is represented as a vector derived from the match between the dominant properties of one word and the dependent properties of another. Since a word relation is directional, each relation is represented as two vectors, one for the positive direction and one for the negative direction, corresponding to the two ways of matching word properties. Considering this directionality, two relations are similar only if they are similar in both the positive and the negative direction, so relational similarity computing combines the similarity in both directions (also sketched below). The proposed relational similarity computing method performs effectively on the evaluation data.

Finally, word representations are introduced into an NER model to improve its performance on the CEMR corpus. Words are extracted from the CEMR text, and their vector representations are obtained from the repository of word representations learned from the Gigaword corpus. Based on these vector representations, the words in the CEMR corpus are clustered into a certain number of clusters. Each word belongs to a cluster, and its cluster is used as a feature of the NER model (a sketch of this cluster feature follows below). Comparisons of experimental results show that word clusters based on the B-cell-style word representations effectively improve the performance of the NER model on CEMR.

In conclusion, focusing on the challenge of continuous learning of word representations, this research draws inspiration from the analogies between language and the immune system, simulates words as lymphocytes, and constructs a multi-word-agent autonomous learning model to learn word representations based on adaptive immunology principles. The proposed word representation method is validated on the tasks of word similarity computing, relational similarity computing, and NER on CEMR. This research has achieved some preliminary results, which we expect can further motivate research on continuous learning in NLP.
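As a rough illustration of the two similarity computations referred to above, the following sketch assumes word agents with dominant and dependent vectors as in the earlier WordAgent sketch; cosine similarity, the element-wise product used to form a relation vector, and the simple averaging of the two components are assumptions, not the thesis's exact formulas.

import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def word_similarity(w1, w2):
    # Combine similarity over dominant properties and dependent properties.
    return 0.5 * (cos(w1.dominant, w2.dominant) + cos(w1.dependent, w2.dependent))

def relation_vector(w1, w2):
    # One direction of a relation: dominant properties of w1 matched against
    # dependent properties of w2 (element-wise match is an assumption).
    return w1.dominant * w2.dependent

def relational_similarity(pair_a, pair_b):
    # Two relations are similar only if similar in both directions.
    (a1, a2), (b1, b2) = pair_a, pair_b
    pos = cos(relation_vector(a1, a2), relation_vector(b1, b2))
    neg = cos(relation_vector(a2, a1), relation_vector(b2, b1))
    return 0.5 * (pos + neg)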
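The cluster feature used in the NER experiments can likewise be sketched as follows; scikit-learn's KMeans, the cluster count, and the concatenation of the two property vectors are illustrative choices rather than the setup reported in the thesis.

import numpy as np
from sklearn.cluster import KMeans

def build_cluster_feature(agents, n_clusters=100, seed=0):
    words = [a.word for a in agents]
    vecs = np.stack([np.concatenate([a.dominant, a.dependent]) for a in agents])
    labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(vecs)
    # Map each word to its cluster id; the id is then emitted as a discrete
    # feature (e.g. "CLUSTER=37") alongside standard features in the NER model.
    return {w: int(c) for w, c in zip(words, labels)}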
Keywords/Search Tags: word representation, word agent, word similarity, relational similarity, named entity recognition, adaptive immunology principle, agent-based modeling