
A Research On Text Vector Representation Based On Semantics

Posted on: 2018-02-01
Degree: Master
Type: Thesis
Country: China
Candidate: W K Rui
Full Text: PDF
GTID: 2348330512482618
Subject: Computer application technology
Abstract/Summary:
The development of the Internet, especially the mobile Internet, gives people fast access to vast amounts of information. As a result, people rely on the Internet more and more to obtain the information they want. Information on the Internet exists mainly as text, and the volume of text is growing explosively. To help users find the information they need, service providers need to classify, cluster and rank texts. These tasks normally convert texts to vectors so that they can be used in machine learning models. From the users' perspective, the aim of these tasks is to classify, cluster and rank texts according to their meanings, which are higher-level and more abstract features compared with the word features of the widely used Bag-of-Words text representation. The Bag-of-Words representation lacks generalization, so extracting high-level semantic information from text has been studied extensively. Topic models such as LDA and pLSI extract latent topic information by learning words' distributions over topics and topics' distributions over documents from a collection of texts. Deep neural networks can learn multiple levels of features or representations of data and are used to learn high-level semantic representations of texts. The main concern of this paper is representing texts based on their semantic information, and the work conducted in this paper is as follows:

1. The Bag-of-Words representation fails to capture similarities between words, which makes it lack generalization and exposes it to the curse of dimensionality. We propose the Bag of Word cLusters (BOWL) representation. Each word cluster consists of semantically close words and represents an "aspect" or "concept", a higher-level feature than a single word, so the BOWL representation includes words' semantic information. BOWL weights each dimension by k-max pooling. The experiments demonstrate the representation's effectiveness and efficiency. (A sketch of this construction follows the list.)

2. Neural networks with complicated structures can capture more accurate information, but their training is time-consuming and relies on GPUs. In this paper, a simple neural network that uses the average of word embeddings as its input layer is used. The hidden layers project texts into a higher-level semantic feature space by nonlinear transformation, and classification is conducted in that space. Experiments show better performance than the low-level BOW representation. How the neural network works and its optimization process are also analyzed. (See the second sketch below.)

3. In the task of extracting opinion tags from product comment texts, exact matching of words lacks generalization. In this paper, extracting opinion tags by calculating semantic similarity is proposed. We design different methods to calculate semantic similarity for short and long sentences. This kernel method transfers texts into a semantic space and calculates their distances implicitly. The experiments show it improves recall significantly, which indicates a more general model. (See the third sketch below.)
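The abstract does not spell out how BOWL vectors are built, so the following is a minimal sketch under stated assumptions: word clusters come from k-means over pretrained word embeddings, per-word weights (e.g., tf-idf) are supplied in a dict, and "k-max pooling" is read as averaging the k largest word weights that fall in each cluster. All names here (build_clusters, bowl_vector, word_weight) are illustrative, not from the thesis.

    # Sketch of a Bag-of-Word-cLusters (BOWL) style representation.
    import numpy as np
    from sklearn.cluster import KMeans

    def build_clusters(vocab, embeddings, n_clusters=100, seed=0):
        """Group semantically close words into clusters (hypothetical setup)."""
        km = KMeans(n_clusters=n_clusters, random_state=seed).fit(embeddings)
        return {word: label for word, label in zip(vocab, km.labels_)}

    def bowl_vector(doc_tokens, word2cluster, word_weight, n_clusters, k=3):
        """Map a token list to an n_clusters-dim vector via k-max pooling."""
        per_cluster = [[] for _ in range(n_clusters)]
        for tok in doc_tokens:
            if tok in word2cluster:
                per_cluster[word2cluster[tok]].append(word_weight.get(tok, 1.0))
        vec = np.zeros(n_clusters)
        for c, weights in enumerate(per_cluster):
            if weights:
                top_k = sorted(weights, reverse=True)[:k]
                vec[c] = float(np.mean(top_k))  # k-max pooled cluster weight
        return vec

Because each dimension is a word cluster rather than a single word, the resulting vector is far lower-dimensional than a Bag-of-Words vector, and semantically close words contribute to the same dimension.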
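For the second contribution, the abstract only states that the input layer is the average of word embeddings and that hidden layers apply a nonlinear projection. A minimal sketch of such a network, assuming PyTorch, a frozen pretrained embedding matrix, one tanh hidden layer, and illustrative sizes (none of which are specified in the abstract):

    # Sketch of an averaged-word-embedding classifier.
    import torch
    import torch.nn as nn

    class AvgEmbedClassifier(nn.Module):
        def __init__(self, pretrained, hidden_dim=128, n_classes=5):
            super().__init__()
            # The average of word embeddings forms the input layer.
            self.embed = nn.EmbeddingBag.from_pretrained(pretrained, mode="mean")
            self.hidden = nn.Sequential(
                nn.Linear(pretrained.size(1), hidden_dim),
                nn.Tanh(),  # nonlinear projection into a semantic feature space
            )
            self.out = nn.Linear(hidden_dim, n_classes)

        def forward(self, token_ids, offsets):
            doc_vec = self.embed(token_ids, offsets)  # mean-pooled text vector
            return self.out(self.hidden(doc_vec))     # classify in that space

Averaging the embeddings keeps the network small enough to train without a GPU, which is the trade-off against more complicated structures that the abstract points out.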
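The abstract does not give the exact similarity functions used for short and long sentences, so the third sketch substitutes the simplest plausible choice: cosine similarity between averaged word vectors, with an illustrative threshold. The function names and the threshold value are assumptions for illustration only.

    # Sketch of opinion-tag extraction by semantic similarity
    # rather than exact word matching.
    import numpy as np

    def text_vector(tokens, embeddings):
        """Average the word vectors of in-vocabulary tokens."""
        vecs = [embeddings[t] for t in tokens if t in embeddings]
        return np.mean(vecs, axis=0) if vecs else None

    def match_tags(comment_tokens, tag_token_lists, embeddings, threshold=0.7):
        """Return every candidate tag whose similarity passes the threshold."""
        doc = text_vector(comment_tokens, embeddings)
        if doc is None:
            return []
        matched = []
        for tag_tokens in tag_token_lists:
            tag = text_vector(tag_tokens, embeddings)
            if tag is None:
                continue
            sim = doc @ tag / (np.linalg.norm(doc) * np.linalg.norm(tag))
            if sim >= threshold:
                matched.append(tag_tokens)
        return matched

Matching in embedding space is what lets the method recall tags whose wording differs from the comment text, which exact string matching cannot do.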
Keywords/Search Tags:Text representation, Semantics, Text classification, Opinion extraction, Word embeddings, Neural networks