
Research On Semantic Analysis And Generation Technology For Text Sequence Data

Posted on: 2022-05-21    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Q Yang    Full Text: PDF
GTID: 1488306572974759    Subject: Computer application technology
Abstract/Summary:
Text data is the most common data type in natural language processing tasks, with the word as the basic semantic unit. Since the words in a natural sentence can be regarded as observations arriving at discrete time steps, consecutive words constitute text sequence data. The words in a sequence are clearly not independent of one another; hidden correlations exist among them. Nevertheless, the bag-of-words (BOW) model, which treats words as independent items, remains the most widely used feature representation for text processing. Representation learning and semantic understanding methods built on the BOW model discard order-dependent features and therefore cannot accurately model the dependency structure of the context. In addition, the multiple hashtags attached to a social media post also form a text sequence, yet the semantic correlations between hashtags have received little analysis. Building on the semantic dependency features of sequence data, this dissertation studies distributed text representation, word sense understanding, and text generation patterns and their applications.

The standard learning process of word vectors ignores the latent semantic correlations between words. Given the dependencies between the latent topics of words in sequence data, a natural choice is to learn enhanced word embeddings that exploit this implicit semantic dependency information to improve representational power, with a sequence topic model capturing the dependency features; an enhanced word embedding learning method combined with topic correlation is therefore proposed. A hidden topic model is then applied to word sense disambiguation: owing to its scalability and its modeling ability for text sequences, it can model sense dependencies under a first-order Markov chain assumption. Finally, hashtags in social networks also exhibit semantic correlations. Because hashtags are closely tied to the data content, the multiple hashtags of a post are treated as the target sequence of a text generation process, so that a hashtag generation model for multi-modal social network data captures the semantic correlations among hashtags; a text sequence generation architecture oriented to multi-modal features is proposed accordingly. The three contributions are detailed below.

First, a word vector representation learning method that incorporates topic dependency is proposed. The typical word vector learning procedure relies on the co-occurrence of words within a context window and ignores topic dependency information. The proposed model integrates this information without changing the original learning procedure: it first obtains the topic dependency information from a sequence topic model and then incorporates the latent topic dependency weights between words into the training of the vector representations within the original learning framework. Only a small amount of additional space and computation is required to inject the semantic information and improve the quality of the word vectors. The method is validated on word similarity and word analogy tasks, and the experimental results show that the model trained with latent topic dependency information outperforms the baseline methods. The performance advantage is more significant when the amount of training data is smaller.
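For illustration only, the sketch below shows one way such topic dependency weights could enter a skip-gram-style embedding update. The toy vocabulary, the per-word topic assignments, the topic transition matrix, and the pair weight 1 + P(topic_context | topic_center) are all assumptions made for the example; they stand in for the output of the sequence topic model and are not the dissertation's actual formulation.

```python
# Illustrative sketch: a skip-gram-with-negative-sampling update in which each
# (center, context) pair is re-weighted by a latent topic dependency score.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["stock", "market", "price", "river", "bank", "water"]
word2id = {w: i for i, w in enumerate(vocab)}

# Assumed outputs of a sequence topic model: a dominant topic per word and a
# topic-to-topic dependency (transition) matrix; both are toy values.
word_topic = np.array([0, 0, 0, 1, 1, 1])
topic_trans = np.array([[0.8, 0.2],
                        [0.3, 0.7]])

dim, lr = 16, 0.05
W_in = 0.1 * rng.standard_normal((len(vocab), dim))    # center-word vectors
W_out = 0.1 * rng.standard_normal((len(vocab), dim))   # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(center, context, label):
    """One SGNS update, re-weighted by the topic dependency between the pair."""
    # The weight grows when the context word's topic is likely to follow the
    # center word's topic under the (assumed) sequence topic model.
    weight = 1.0 + topic_trans[word_topic[center], word_topic[context]]
    v_c, v_o = W_in[center].copy(), W_out[context].copy()
    grad = weight * (label - sigmoid(v_c @ v_o))
    W_in[center] += lr * grad * v_o
    W_out[context] += lr * grad * v_c

corpus = [["stock", "market", "price"], ["river", "bank", "water"]]
for sent in corpus:
    ids = [word2id[w] for w in sent]
    for i, center in enumerate(ids):
        for j in range(max(0, i - 1), min(len(ids), i + 2)):
            if j != i:
                train_pair(center, ids[j], label=1.0)                    # observed pair
                train_pair(center, rng.integers(len(vocab)), label=0.0)  # negative sample
```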
Second, hidden sense correlations are modeled with a knowledge-based method built on a sequence topic model. The same word expresses different senses in different contexts, and the choice of sense depends not only on the surrounding words but also on the senses of those words. Given the characteristics of sequence data, the word sense dependency is modeled with a sequence topic model and word sense disambiguation is solved on the basis of these sense correlations. A hidden dependency assumption is introduced that uses a first-order Markov chain over the word order to model the correlation between word senses. The approach combines contextual dependency information with global sense correlations in a hidden topic model and makes full use of the prior knowledge provided by WordNet; several information enrichment strategies are further adopted to mitigate the sparsity problem. Experimental results show that the approach achieves competitive performance on different evaluation datasets and a significant improvement over existing knowledge-based methods.

Third, a text sequence generation architecture based on multi-modal features is presented and applied to the hashtag recommendation task. Hashtags are a common form of auxiliary information in social networks; as words or short phrases they extract topics and summarize content, so they are strongly correlated with the data content and weakly semantically related to one another. Exploiting the sequential nature of hashtags, a sequence generation architecture is proposed that models these correlations and recommends hashtags jointly. The model uses an attention-based neural network to extract multi-modal features from the data and an encoder-decoder architecture to generate a hashtag sequence corresponding to the content: attention in the encoder selects the features most relevant to the tag sequence, and a recurrent neural network in the decoder generates the hashtag sequence. Traditional methods cast the problem as multi-label or multi-class classification and ignore the correlations within the label sequence; the sequence generation formulation models the correlations between hashtags and overcomes the limited flexibility and single-result recommendations of previous methods. The model is verified on several public datasets and performs well in different experimental settings. The results demonstrate the effectiveness of text generation for the hashtag recommendation task, confirm the semantic correlation between hashtags, and show that the model performs well in a multi-modal data environment.
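As an illustration of the second contribution, the following sketch performs first-order Viterbi decoding over candidate word senses. The toy sense inventory, emission scores, and transition scores are stand-ins for what a real system would derive from WordNet glosses and the hidden topic model; none of them reflects the dissertation's actual scoring functions.

```python
# Illustrative sketch: first-order Viterbi decoding over word senses, where the
# transition score stands in for the sense-dependency correlation and the
# emission score stands in for context fit. All values are toy assumptions.
import numpy as np

sentence = ["deposit", "money", "bank"]
senses = {"deposit": ["deposit.v.01"],
          "money": ["money.n.01"],
          "bank": ["bank.n.finance", "bank.n.river"]}

def emission(word, sense):
    # Assumed context-fit score; a real system would score the sense's
    # WordNet gloss against the surrounding words.
    return {("bank", "bank.n.finance"): 0.9,
            ("bank", "bank.n.river"): 0.1}.get((word, sense), 1.0)

def transition(prev_sense, sense):
    # Assumed sense-to-sense dependency under the first-order Markov assumption.
    related = {("money.n.01", "bank.n.finance"): 0.8}
    return related.get((prev_sense, sense), 0.2)

# Forward pass in the log domain.
prev_scores = {s: np.log(emission(sentence[0], s)) for s in senses[sentence[0]]}
back = []
for word in sentence[1:]:
    scores, ptr = {}, {}
    for s in senses[word]:
        best_prev = max(prev_scores,
                        key=lambda p: prev_scores[p] + np.log(transition(p, s)))
        scores[s] = (prev_scores[best_prev] + np.log(transition(best_prev, s))
                     + np.log(emission(word, s)))
        ptr[s] = best_prev
    back.append(ptr)
    prev_scores = scores

# Backtrack to recover the most probable sense sequence.
best = max(prev_scores, key=prev_scores.get)
path = [best]
for ptr in reversed(back):
    path.append(ptr[path[-1]])
print(list(zip(sentence, reversed(path))))
```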
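As an illustration of the third contribution, the sketch below shows a single attention-based decoding step that produces the next hashtag from a set of pre-extracted multi-modal feature vectors. The module structure, dimensions, and names are assumptions chosen for the example; they follow the general encoder-decoder-with-attention pattern described above rather than the dissertation's exact architecture.

```python
# Illustrative sketch: one step of an attention-based hashtag decoder over
# multi-modal feature vectors (e.g., image regions and encoded text tokens).
import torch
import torch.nn as nn

class HashtagDecoder(nn.Module):
    def __init__(self, n_hashtags, feat_dim=512, hid_dim=256, emb_dim=128):
        super().__init__()
        self.embed = nn.Embedding(n_hashtags, emb_dim)
        self.attn = nn.Linear(hid_dim + feat_dim, 1)   # additive-style attention score
        self.gru = nn.GRUCell(emb_dim + feat_dim, hid_dim)
        self.out = nn.Linear(hid_dim, n_hashtags)

    def forward(self, feats, prev_tag, hidden):
        # feats: (batch, n_features, feat_dim) multi-modal features from the encoder
        # prev_tag: (batch,) previously generated hashtag ids
        # hidden: (batch, hid_dim) decoder state
        h = hidden.unsqueeze(1).expand(-1, feats.size(1), -1)
        scores = self.attn(torch.cat([h, feats], dim=-1)).squeeze(-1)
        alpha = torch.softmax(scores, dim=-1)               # attention over features
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)  # weighted feature summary
        hidden = self.gru(torch.cat([self.embed(prev_tag), context], dim=-1), hidden)
        return self.out(hidden), hidden                     # logits over the next hashtag

# One greedy decoding step on dummy inputs.
dec = HashtagDecoder(n_hashtags=1000)
feats = torch.randn(2, 9, 512)              # 2 posts, 9 feature vectors each
hidden = torch.zeros(2, 256)
prev = torch.zeros(2, dtype=torch.long)     # <start> hashtag id
logits, hidden = dec(feats, prev, hidden)
next_tag = logits.argmax(dim=-1)
```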
Keywords: text sequence, semantic analysis, representation learning, word sense understanding, text generation