
Research On Short Text Similarity Measure Based On Semantic Coupling

Posted on: 2020-03-24
Degree: Master
Type: Thesis
Country: China
Candidate: W Liu
Full Text: PDF
GTID: 2428330572985934
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of Internet technology, the volume of available information has grown explosively. With the emergence of various social media, short texts such as microblogs, instant messages, chat-software messages, and question-answering exchanges have become prevalent on today's websites. Text similarity measures play a vital role in text-related applications such as natural language processing (NLP), information retrieval, text classification, document clustering, text filtering, topic tracking, question answering, machine translation, and text summarization. Measuring the similarity of short texts is complex and is influenced by numerous factors, including text representation, term weighting strategy, semantic relation modeling, and the similarity algorithm itself. After analyzing the limitations of traditional short text similarity algorithms, this thesis presents approaches that measure the relationship between terms by capturing both the intra-relation (explicit) and the inter-relation (implicit), using modified definitions of both relations between texts. In addition, it takes into account how well terms discriminate and indicate categories, that is, their strong category features, and designs a corresponding similarity function based on strong classification features. Finally, the two kinds of similarity are combined to obtain the final short text similarity. The major contributions of the thesis are summarized as follows:

(1) We propose a novel short text similarity measure based on coupled semantic relations. First, the method uses co-occurrence information and the distance between terms to compute a co-occurrence correlation degree. The related weights of the terms are calculated from this correlation degree, and the intra-relation and inter-relation of the terms are then calculated from the related weights. The intra-relation is defined by combining the related weights with the generalized Jaccard measure. The inter-relation is defined as the shared entropy of the path formed between the two terms on the intra-path graph: the greater the shared entropy, the stronger the inter-relation and hence the stronger the relationship between the terms. The intra-relation and inter-relation of a pair of terms are combined to define their coupled semantic relation. Finally, an improved coupled-relation similarity is derived from the coupled semantic relation of terms.

(2) We design a similarity function based on strong classification features. An improved expected cross entropy is used to extract the strong category features of each class from a labeled data set. The terms are sorted in descending order of expected cross entropy, and the top K are selected to form a dictionary of strong classification features. In addition, we propose a novel term sense disambiguation method that uses the similarity of term contexts. The basic idea of strong classification feature similarity is that the more similar two texts are, the more strong classification features they share.

(3) We design a similarity algorithm based on both the coupling relation and strong classification features. Building on the first two algorithms, a more efficient similarity algorithm is designed that considers the coupling relation of terms together with their strong category features.

To verify the validity of the short text similarity measure, a clustering task is performed on the DBLP, 20 Newsgroups, and Sogou corpus data sets. The experimental results show that the proposed method achieves better clustering performance than the benchmark methods.
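
As a rough illustration of the strong classification feature idea in contribution (2), the sketch below ranks terms by the standard expected cross entropy, ECE(t) = P(t) * sum over classes c of P(c|t) * log(P(c|t) / P(c)), keeps the top K as a dictionary, and scores two texts by how many dictionary terms they share. The thesis uses an improved variant of expected cross entropy and selects features per class, and the abstract does not give those details. The standard formula, the single global top-K cut, the Jaccard-style overlap score, and all function names here are therefore assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def expected_cross_entropy(docs, labels):
    """Return {term: score} using the standard (not the thesis's improved) ECE.

    docs   -- list of token lists
    labels -- class label of each document
    """
    n_docs = len(docs)
    class_count = Counter(labels)            # documents per class
    term_count = Counter()                   # documents containing each term
    term_class_count = defaultdict(Counter)  # term -> class -> document count
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_count[term] += 1
            term_class_count[term][label] += 1

    scores = {}
    for term, df in term_count.items():
        p_t = df / n_docs
        total = 0.0
        for label, joint in term_class_count[term].items():
            p_c_given_t = joint / df
            p_c = class_count[label] / n_docs
            total += p_c_given_t * math.log(p_c_given_t / p_c)
        scores[term] = p_t * total
    return scores


def strong_feature_dictionary(docs, labels, k):
    """Keep the k highest-scoring terms (global cut; an assumed simplification)."""
    scores = expected_cross_entropy(docs, labels)
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:k])


def strong_feature_similarity(tokens_a, tokens_b, dictionary):
    """Jaccard overlap of the strong features found in each text (assumed form)."""
    a = set(tokens_a) & dictionary
    b = set(tokens_b) & dictionary
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Toy labeled corpus, invented purely for demonstration.
    docs = [
        ["neural", "network", "training"],
        ["neural", "model", "training"],
        ["stock", "market", "price"],
        ["stock", "price", "index"],
    ]
    labels = ["cs", "cs", "fin", "fin"]
    dictionary = strong_feature_dictionary(docs, labels, k=4)
    print(sorted(dictionary))
    print(strong_feature_similarity(["neural", "training"], ["neural", "price"], dictionary))
```

Terms that occur often and concentrate in a single class receive the highest scores, which matches the intuition that strong category features both discriminate and indicate a class. Two texts that share no dictionary terms score 0.0 under this sketch, so in practice such a score would be combined with another measure, as contribution (3) does with the coupled-relation similarity.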
Keywords/Search Tags: Text similarity computing, Term weighting, Coupled semantic relation, Strong classification, Term sense disambiguation