
Research On Semantic Analysis And Generation Technology For Text Sequence Data

Posted on: 2022-05-21    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Q Yang    Full Text: PDF
GTID: 1488306572974759    Subject: Computer application technology
Abstract/Summary:
Text data is the most common data type in natural language processing tasks, with the word as the basic semantic unit. Since the words in a natural sentence can be regarded as observations arriving at discrete time steps, consecutive words constitute text sequence data. The words in a sequence are clearly not independent of one another; hidden correlations exist among them. Nevertheless, the bag-of-words (BOW) model, which treats words as independent items, remains the most widely used feature representation for text processing. Representation learning and semantic understanding methods built on the BOW model discard order-dependent features and therefore cannot accurately model the dependency structure of the context. In addition, the multiple hashtags attached to a social media post also form a text sequence, yet the semantic correlations between hashtags have received little analysis. Building on the semantic dependency features of sequence data, this dissertation studies distributed text representation, word sense understanding, and text generation patterns and their applications.

The standard learning process of word vectors ignores the latent semantic correlations between words. Given the dependencies between the latent topics of words in sequence data, a natural choice is to learn enhanced word embeddings that exploit this implicit semantic dependency information to improve representational power, with a sequence topic model capturing the dependency features; an enhanced word embedding learning method combined with topic correlation is therefore proposed. A hidden topic model is then applied to word sense disambiguation: owing to its scalability and its modeling ability for text sequences, it can model sense dependencies under a first-order Markov chain assumption. Finally, hashtags in social networks also exhibit semantic correlations. Because hashtags are closely tied to the data content, the multiple hashtags of a post are treated as the target sequence of a text generation process, so that a hashtag generation model for multi-modal social network data captures the semantic correlations among hashtags; a text sequence generation architecture oriented to multi-modal features is proposed accordingly. The three contributions are detailed below.

First, a word vector representation learning method that incorporates topic dependency is proposed. The typical word vector learning procedure relies on the co-occurrence of words within a context window and ignores topic dependency information. The proposed model integrates this information without changing the original learning procedure: it first obtains the topic dependency information from a sequence topic model and then incorporates the latent topic dependency weights between words into the training of the vector representations within the original learning framework. Only a small amount of additional space and computation is required to inject the semantic information and improve the quality of the word vectors. The method is validated on word similarity and word analogy tasks, and the experimental results show that the model trained with latent topic dependency information outperforms the baseline methods. The performance advantage is more significant when the amount of training data is smaller.
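For illustration only, the sketch below shows one way such topic dependency weights could enter a skip-gram-style embedding update. The toy vocabulary, the per-word topic assignments, the topic transition matrix, and the pair weight 1 + P(topic_context | topic_center) are all assumptions made for the example; they stand in for the output of the sequence topic model and are not the dissertation's actual formulation.

```python
# Illustrative sketch: a skip-gram-with-negative-sampling update in which each
# (center, context) pair is re-weighted by a latent topic dependency score.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["stock", "market", "price", "river", "bank", "water"]
word2id = {w: i for i, w in enumerate(vocab)}

# Assumed outputs of a sequence topic model: a dominant topic per word and a
# topic-to-topic dependency (transition) matrix; both are toy values.
word_topic = np.array([0, 0, 0, 1, 1, 1])
topic_trans = np.array([[0.8, 0.2],
                        [0.3, 0.7]])

dim, lr = 16, 0.05
W_in = 0.1 * rng.standard_normal((len(vocab), dim))    # center-word vectors
W_out = 0.1 * rng.standard_normal((len(vocab), dim))   # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(center, context, label):
    """One SGNS update, re-weighted by the topic dependency between the pair."""
    # The weight grows when the context word's topic is likely to follow the
    # center word's topic under the (assumed) sequence topic model.
    weight = 1.0 + topic_trans[word_topic[center], word_topic[context]]
    v_c, v_o = W_in[center].copy(), W_out[context].copy()
    grad = weight * (label - sigmoid(v_c @ v_o))
    W_in[center] += lr * grad * v_o
    W_out[context] += lr * grad * v_c

corpus = [["stock", "market", "price"], ["river", "bank", "water"]]
for sent in corpus:
    ids = [word2id[w] for w in sent]
    for i, center in enumerate(ids):
        for j in range(max(0, i - 1), min(len(ids), i + 2)):
            if j != i:
                train_pair(center, ids[j], label=1.0)                    # observed pair
                train_pair(center, rng.integers(len(vocab)), label=0.0)  # negative sample
```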
Second, hidden sense correlations are modeled with a knowledge-based method built on a sequence topic model. The same word expresses different senses in different contexts, and the choice of sense depends not only on the surrounding words but also on the senses of those words. Given the characteristics of sequence data, the word sense dependency is modeled with a sequence topic model and word sense disambiguation is solved on the basis of these sense correlations. A hidden dependency assumption is introduced that uses a first-order Markov chain over the word order to model the correlation between word senses. The approach combines contextual dependency information with global sense correlations in a hidden topic model and makes full use of the prior knowledge provided by WordNet; several information enrichment strategies are further adopted to mitigate the sparsity problem. Experimental results show that the approach achieves competitive performance on different evaluation datasets and a significant improvement over existing knowledge-based methods.

Third, a text sequence generation architecture based on multi-modal features is presented and applied to the hashtag recommendation task. Hashtags are a common form of auxiliary information in social networks; as words or short phrases they extract topics and summarize content, so they are strongly correlated with the data content and weakly semantically related to one another. Exploiting the sequential nature of hashtags, a sequence generation architecture is proposed that models these correlations and recommends hashtags jointly. The model uses an attention-based neural network to extract multi-modal features from the data and an encoder-decoder architecture to generate a hashtag sequence corresponding to the content: attention in the encoder selects the features most relevant to the tag sequence, and a recurrent neural network in the decoder generates the hashtag sequence. Traditional methods cast the problem as multi-label or multi-class classification and ignore the correlations within the label sequence; the sequence generation formulation models the correlations between hashtags and overcomes the limited flexibility and single-result recommendations of previous methods. The model is verified on several public datasets and performs well in different experimental settings. The results demonstrate the effectiveness of text generation for the hashtag recommendation task, confirm the semantic correlation between hashtags, and show that the model performs well in a multi-modal data environment.
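As an illustration of the second contribution, the following sketch performs first-order Viterbi decoding over candidate word senses. The toy sense inventory, emission scores, and transition scores are stand-ins for what a real system would derive from WordNet glosses and the hidden topic model; none of them reflects the dissertation's actual scoring functions.

```python
# Illustrative sketch: first-order Viterbi decoding over word senses, where the
# transition score stands in for the sense-dependency correlation and the
# emission score stands in for context fit. All values are toy assumptions.
import numpy as np

sentence = ["deposit", "money", "bank"]
senses = {"deposit": ["deposit.v.01"],
          "money": ["money.n.01"],
          "bank": ["bank.n.finance", "bank.n.river"]}

def emission(word, sense):
    # Assumed context-fit score; a real system would score the sense's
    # WordNet gloss against the surrounding words.
    return {("bank", "bank.n.finance"): 0.9,
            ("bank", "bank.n.river"): 0.1}.get((word, sense), 1.0)

def transition(prev_sense, sense):
    # Assumed sense-to-sense dependency under the first-order Markov assumption.
    related = {("money.n.01", "bank.n.finance"): 0.8}
    return related.get((prev_sense, sense), 0.2)

# Forward pass in the log domain.
prev_scores = {s: np.log(emission(sentence[0], s)) for s in senses[sentence[0]]}
back = []
for word in sentence[1:]:
    scores, ptr = {}, {}
    for s in senses[word]:
        best_prev = max(prev_scores,
                        key=lambda p: prev_scores[p] + np.log(transition(p, s)))
        scores[s] = (prev_scores[best_prev] + np.log(transition(best_prev, s))
                     + np.log(emission(word, s)))
        ptr[s] = best_prev
    back.append(ptr)
    prev_scores = scores

# Backtrack to recover the most probable sense sequence.
best = max(prev_scores, key=prev_scores.get)
path = [best]
for ptr in reversed(back):
    path.append(ptr[path[-1]])
print(list(zip(sentence, reversed(path))))
```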
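As an illustration of the third contribution, the sketch below shows a single attention-based decoding step that produces the next hashtag from a set of pre-extracted multi-modal feature vectors. The module structure, dimensions, and names are assumptions chosen for the example; they follow the general encoder-decoder-with-attention pattern described above rather than the dissertation's exact architecture.

```python
# Illustrative sketch: one step of an attention-based hashtag decoder over
# multi-modal feature vectors (e.g., image regions and encoded text tokens).
import torch
import torch.nn as nn

class HashtagDecoder(nn.Module):
    def __init__(self, n_hashtags, feat_dim=512, hid_dim=256, emb_dim=128):
        super().__init__()
        self.embed = nn.Embedding(n_hashtags, emb_dim)
        self.attn = nn.Linear(hid_dim + feat_dim, 1)   # additive-style attention score
        self.gru = nn.GRUCell(emb_dim + feat_dim, hid_dim)
        self.out = nn.Linear(hid_dim, n_hashtags)

    def forward(self, feats, prev_tag, hidden):
        # feats: (batch, n_features, feat_dim) multi-modal features from the encoder
        # prev_tag: (batch,) previously generated hashtag ids
        # hidden: (batch, hid_dim) decoder state
        h = hidden.unsqueeze(1).expand(-1, feats.size(1), -1)
        scores = self.attn(torch.cat([h, feats], dim=-1)).squeeze(-1)
        alpha = torch.softmax(scores, dim=-1)               # attention over features
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)  # weighted feature summary
        hidden = self.gru(torch.cat([self.embed(prev_tag), context], dim=-1), hidden)
        return self.out(hidden), hidden                     # logits over the next hashtag

# One greedy decoding step on dummy inputs.
dec = HashtagDecoder(n_hashtags=1000)
feats = torch.randn(2, 9, 512)              # 2 posts, 9 feature vectors each
hidden = torch.zeros(2, 256)
prev = torch.zeros(2, dtype=torch.long)     # <start> hashtag id
logits, hidden = dec(feats, prev, hidden)
next_tag = logits.argmax(dim=-1)
```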
Keywords: text sequence, semantic analysis, representation learning, word sense understanding, text generation