Research On Technologies And Methods Of User-oriented Short Text

Posted on: 2019-01-14
Degree: Master
Type: Thesis
Country: China
Candidate: G Chen
Full Text: PDF
GTID: 2428330566473964
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development and spread of mobile Internet technology, text has become an important part of daily life, work, and social interaction, and most of the text produced on the Internet is short text. Short text comes from many sources, such as chat systems, social software, and question-answering systems. The rapid growth in the volume of short text makes it harder to extract the main information quickly. This matters most in systems that must respond quickly: a question-answering system, for example, has to identify the core issue in a user's consultation statement and then reply within a short time. Both requirements are challenging. It is therefore of great significance to mine and analyze short texts by computer. Clustering is an important means of organizing, summarizing, and navigating text effectively, and it can also uncover relationships between texts, which helps further processing. A short text has few characters, carries little information, is strongly affected by noise, and lacks context, so its features are sparse. As a result, short texts cannot be modeled well with the methods commonly used for long texts, which poses many challenges for short text research. At present, short text clustering faces the following problems: how to mitigate the influence of irrelevant information, how to represent the sparse features of short text, how to improve clustering quality, and how to improve clustering efficiency.
To address these problems, this thesis proposes a clustering method for user consultation short texts. The main work is as follows:
1. A second-order Hidden Markov Model is used to identify irrelevant words in user-oriented short text, and a dictionary of irrelevant words is built so that these words can be filtered out.
2. To alleviate feature sparsity, short texts are represented with word vectors, based on an analysis of their characteristics. A selective weighting method is used to construct the text vector, and the similarity between word vectors is used to express the similarity between short texts.
3. To adapt the clustering algorithm to incremental data sets and improve its efficiency, the clustering process is divided into two steps: offline clustering and online clustering.
Clustering experiments were carried out on user consultation short texts. The results support the validity of the similarity calculation method adopted here: the accuracy of the clustering results is 82% and the recall is 73%. The experiment on incremental data sets shows that combining offline clustering with online clustering greatly improves the efficiency of short text clustering.
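The abstract does not give the weighting scheme, the similarity threshold, or the clustering algorithm, so the following is only a minimal sketch of how steps 2 and 3 might fit together, not the thesis's implementation. It assumes a weighted average of word vectors as the text vector, cosine similarity between text vectors, and a single-pass online step that assigns each new text to the most similar existing centroid or opens a new cluster. The offline step is not shown. The names `TOY_VECTORS`, `IRRELEVANT`, `text_vector`, `assign_online`, the toy 2-dimensional vectors, and the threshold of 0.8 are all hypothetical.

```python
import math

# Toy 2-d word vectors, purely illustrative; a real system would load
# trained word2vec vectors instead.
TOY_VECTORS = {
    "refund": [1.0, 0.0],
    "money": [0.9, 0.1],
    "delivery": [0.0, 1.0],
    "shipping": [0.1, 0.9],
}
# Hypothetical irrelevant-word dictionary (step 1 would produce this).
IRRELEVANT = {"hello", "please", "thanks"}


def text_vector(tokens, weights=None):
    """Weighted average of word vectors, skipping irrelevant/unknown words.

    Returns None when no usable word remains.
    """
    weights = weights or {}
    dim = len(next(iter(TOY_VECTORS.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok in IRRELEVANT or tok not in TOY_VECTORS:
            continue
        w = weights.get(tok, 1.0)  # assumed default weight of 1.0
        for i, x in enumerate(TOY_VECTORS[tok]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return None
    return [x / weight_sum for x in total]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def assign_online(vec, centroids, counts, threshold=0.8):
    """Assign vec to the most similar centroid, or open a new cluster.

    Updates centroids and counts in place and returns the cluster index.
    """
    best, best_sim = -1, -1.0
    for idx, centroid in enumerate(centroids):
        sim = cosine(vec, centroid)
        if sim > best_sim:
            best, best_sim = idx, sim
    if best >= 0 and best_sim >= threshold:
        n = counts[best]
        # Running mean keeps the centroid equal to the average of its members.
        centroids[best] = [(c * n + v) / (n + 1)
                           for c, v in zip(centroids[best], vec)]
        counts[best] = n + 1
        return best
    centroids.append(list(vec))
    counts.append(1)
    return len(centroids) - 1


# Usage: four tokenized consultation texts arriving one at a time.
centroids, counts = [], []
texts = [["hello", "refund", "please"], ["money", "thanks"],
         ["delivery"], ["shipping", "please"]]
labels = [assign_online(text_vector(t), centroids, counts) for t in texts]
```

In this sketch each incoming text is compared only against cluster centroids rather than against every earlier text, which is one plausible reason an online step would help on incremental data. In practice the initial centroids would come from the offline clustering pass instead of starting empty.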
Keywords/Search Tags:Short Text, Irrelevant Words, Clustering, Word Vector, word2vec