
A Research On Text Vector Representation Based On Semantics

Posted on: 2018-02-01
Degree: Master
Type: Thesis
Country: China
Candidate: W K Rui
Full Text: PDF
GTID: 2348330512482618
Subject: Computer application technology
Abstract/Summary:
The development of the Internet, especially the mobile Internet, gives people fast access to vast amounts of information. As a result, people rely on the Internet more and more to obtain the information they want. Information on the Internet exists mainly as text, and the volume of text is growing explosively. To help users find the information they need, service providers need to classify, cluster and rank texts. These tasks normally convert texts to vectors so that they can be used in machine learning models. From the users' perspective, the aim of these tasks is to classify, cluster and rank texts according to their meanings, which are higher-level and more abstract features compared with the word features of the widely used Bag-of-Words text representation. The Bag-of-Words representation lacks generalization, so extracting high-level semantic information from text has been studied extensively. Topic models such as LDA and pLSI extract latent topic information by learning words' distributions over topics and topics' distributions over documents from a collection of texts. Deep neural networks can learn multiple levels of features or representations of data and are used to learn high-level semantic representations of texts. The main concern of this paper is representing texts based on their semantic information, and the work conducted in this paper is as follows:

1. The Bag-of-Words representation fails to capture similarities between words, which makes it lack generalization and exposes it to the curse of dimensionality. We propose the Bag of Word cLusters (BOWL) representation. Each word cluster consists of semantically close words and represents an "aspect" or "concept", a higher-level feature than a single word, so the BOWL representation includes words' semantic information. BOWL weights each dimension by k-max pooling. The experiments demonstrate the representation's effectiveness and efficiency. (A sketch of this construction follows the list.)

2. Neural networks with complicated structures can capture more accurate information, but their training is time-consuming and relies on GPUs. In this paper, a simple neural network that uses the average of word embeddings as its input layer is used. The hidden layers project texts into a higher-level semantic feature space by nonlinear transformation, and classification is conducted in that space. Experiments show better performance than the low-level BOW representation. How the neural network works and its optimization process are also analyzed. (See the second sketch below.)

3. In the task of extracting opinion tags from product comment texts, exact matching of words lacks generalization. In this paper, extracting opinion tags by calculating semantic similarity is proposed. We design different methods to calculate semantic similarity for short and long sentences. This kernel method transfers texts into a semantic space and calculates their distances implicitly. The experiments show it improves recall significantly, which indicates a more general model. (See the third sketch below.)
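The abstract does not spell out how BOWL vectors are built, so the following is a minimal sketch under stated assumptions: word clusters come from k-means over pretrained word embeddings, per-word weights (e.g., tf-idf) are supplied in a dict, and "k-max pooling" is read as averaging the k largest word weights that fall in each cluster. All names here (build_clusters, bowl_vector, word_weight) are illustrative, not from the thesis.

    # Sketch of a Bag-of-Word-cLusters (BOWL) style representation.
    import numpy as np
    from sklearn.cluster import KMeans

    def build_clusters(vocab, embeddings, n_clusters=100, seed=0):
        """Group semantically close words into clusters (hypothetical setup)."""
        km = KMeans(n_clusters=n_clusters, random_state=seed).fit(embeddings)
        return {word: label for word, label in zip(vocab, km.labels_)}

    def bowl_vector(doc_tokens, word2cluster, word_weight, n_clusters, k=3):
        """Map a token list to an n_clusters-dim vector via k-max pooling."""
        per_cluster = [[] for _ in range(n_clusters)]
        for tok in doc_tokens:
            if tok in word2cluster:
                per_cluster[word2cluster[tok]].append(word_weight.get(tok, 1.0))
        vec = np.zeros(n_clusters)
        for c, weights in enumerate(per_cluster):
            if weights:
                top_k = sorted(weights, reverse=True)[:k]
                vec[c] = float(np.mean(top_k))  # k-max pooled cluster weight
        return vec

Because each dimension is a word cluster rather than a single word, the resulting vector is far lower-dimensional than a Bag-of-Words vector, and semantically close words contribute to the same dimension.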
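For the second contribution, the abstract only states that the input layer is the average of word embeddings and that hidden layers apply a nonlinear projection. A minimal sketch of such a network, assuming PyTorch, a frozen pretrained embedding matrix, one tanh hidden layer, and illustrative sizes (none of which are specified in the abstract):

    # Sketch of an averaged-word-embedding classifier.
    import torch
    import torch.nn as nn

    class AvgEmbedClassifier(nn.Module):
        def __init__(self, pretrained, hidden_dim=128, n_classes=5):
            super().__init__()
            # The average of word embeddings forms the input layer.
            self.embed = nn.EmbeddingBag.from_pretrained(pretrained, mode="mean")
            self.hidden = nn.Sequential(
                nn.Linear(pretrained.size(1), hidden_dim),
                nn.Tanh(),  # nonlinear projection into a semantic feature space
            )
            self.out = nn.Linear(hidden_dim, n_classes)

        def forward(self, token_ids, offsets):
            doc_vec = self.embed(token_ids, offsets)  # mean-pooled text vector
            return self.out(self.hidden(doc_vec))     # classify in that space

Averaging the embeddings keeps the network small enough to train without a GPU, which is the trade-off against more complicated structures that the abstract points out.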
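The abstract does not give the exact similarity functions used for short and long sentences, so the third sketch substitutes the simplest plausible choice: cosine similarity between averaged word vectors, with an illustrative threshold. The function names and the threshold value are assumptions for illustration only.

    # Sketch of opinion-tag extraction by semantic similarity
    # rather than exact word matching.
    import numpy as np

    def text_vector(tokens, embeddings):
        """Average the word vectors of in-vocabulary tokens."""
        vecs = [embeddings[t] for t in tokens if t in embeddings]
        return np.mean(vecs, axis=0) if vecs else None

    def match_tags(comment_tokens, tag_token_lists, embeddings, threshold=0.7):
        """Return every candidate tag whose similarity passes the threshold."""
        doc = text_vector(comment_tokens, embeddings)
        if doc is None:
            return []
        matched = []
        for tag_tokens in tag_token_lists:
            tag = text_vector(tag_tokens, embeddings)
            if tag is None:
                continue
            sim = doc @ tag / (np.linalg.norm(doc) * np.linalg.norm(tag))
            if sim >= threshold:
                matched.append(tag_tokens)
        return matched

Matching in embedding space is what lets the method recall tags whose wording differs from the comment text, which exact string matching cannot do.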
Keywords/Search Tags:Text representation, Semantics, Text classification, Opinion extraction, Word embeddings, Neural networks