Research On Text Representation Based On Distributed Representations Of Words

Posted on: 2018-12-02
Degree: Master
Type: Thesis
Country: China
Candidate: Z P Chen
Full Text: PDF
GTID: 2348330542465249
Subject: Computer Science and Technology
Abstract/Summary:
Text representation is a fundamental task in natural language processing (NLP), and the word is its basic semantic unit. Traditional word representations cannot fully reflect a word's semantic information, so researchers usually have to design better word features for each specific NLP task. In recent years, as deep learning methods have become popular, researchers have induced distributed representations of words with neural language models. Each dimension of a distributed word representation is regarded as a latent feature of the word that captures useful syntactic and semantic properties. This thesis studies how to use distributed representations of words to improve text representations.

(1) This thesis proposes a method for improving a traditional text representation model. The method, called "word-extended", adds word features to the text representation. It uses distributed representations of words to find words similar to each target word. Unlike traditional word-extended methods based on a knowledge base, it establishes semantic relationships between words by unsupervised learning and needs only a large amount of raw text. In text correlation experiments, our model outperforms the traditional model.

(2) This thesis studies the off-topic phenomenon in text representation. We regard a student's essay as a kind of text representation, so the task is to detect which essays are off-topic. Our solution first selects a model essay from the essay set based on the center vector and then calculates the similarity between each student's essay and the model essay. An off-topic essay is identified by comparing this similarity value with an off-topic threshold. To improve system performance, we analyze the relationship between the divergence of the essays and the off-topic threshold, and then propose an approach that sets the threshold dynamically. The experimental results show that our methods detect off-topic essays effectively.

(3) This thesis studies the summary representation of large-scale comment texts. After analyzing the weaknesses of the existing approach, we propose a new representation method that uses "hierarchical property word with sentiment word" labels. Our method focuses on property words. We first use distributed representations of words and a hypernym-hyponym relationship matrix to identify property words. We then construct the hierarchical relationship of property words from a knowledge base and prior knowledge. Finally, we build a joint classification model to handle property words that are not in the hierarchical structure. The whole process requires very little manual work. Experiments demonstrate the effectiveness of these methods. Based on these methods, we build a summary representation system for comment texts.
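
The off-topic detection steps in part (2) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the abstract does not say how an essay vector is built, so the sketch assumes it is the mean of the essay's word vectors. It also assumes cosine similarity, and takes the threshold as a plain parameter because the dynamic threshold rule is not given in the abstract. All function names and the toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def essay_vector(tokens, embeddings):
    """ASSUMPTION: an essay is represented by the mean of its known word vectors."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        raise ValueError("essay contains no word with a known embedding")
    return mean_vector(known)


def detect_off_topic(essays, embeddings, threshold):
    """Return (index of the model essay, similarity of each essay to it, off-topic flags)."""
    vectors = [essay_vector(tokens, embeddings) for tokens in essays]
    # Center vector of the whole essay set.
    center = mean_vector(vectors)
    # Model essay: the essay whose vector is closest to the center vector.
    model_idx = max(range(len(vectors)), key=lambda i: cosine(vectors[i], center))
    # Similarity between every essay and the model essay.
    sims = [cosine(v, vectors[model_idx]) for v in vectors]
    # ASSUMPTION: a fixed threshold stands in for the thesis's dynamic one.
    flags = [s < threshold for s in sims]
    return model_idx, sims, flags


# Toy two-dimensional embeddings, invented purely for illustration.
embeddings = {
    "school": [1.0, 0.0],
    "teacher": [0.9, 0.1],
    "football": [0.0, 1.0],
    "goal": [0.1, 0.9],
}
essays = [
    ["school", "teacher"],
    ["teacher", "school", "school"],
    ["football", "goal"],
]
model_idx, sims, flags = detect_off_topic(essays, embeddings, threshold=0.5)
print(model_idx, flags)
```

In this toy set the third essay points in a different direction from the other two, so it falls below the threshold and is flagged. A dynamic threshold of the kind the thesis describes would replace the fixed `threshold` argument with a value derived from how divergent the essay set is.
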
Keywords/Search Tags: Distributed Representations of Words, Text Representation, Off-topic Detection, Hierarchy Construction