Research On Chinese Text Similarity Detection Technology Based On Word Weight Analysis

Posted on:2022-02-11

Degree:Master

Type:Thesis

Country:China

Candidate:J Y Chen

GTID:2518306326966119

Subject:Master of Engineering

Abstract/Summary:

In the information age, the volume of information on the Internet keeps growing dramatically as more and more users take part in producing and disseminating it. Much of this information, textual data in particular, shares similar or even identical core content, because documents are widely copied, modified, reformatted, and rewritten with synonyms as they spread. The resulting waste of computation, file storage, and retrieval effort degrades both the quality and the dissemination of Internet information. Reducing the waste caused by duplicate documents requires an efficient and accurate technique for retrieving similar documents.

Simhash is a common technique for similar-document detection that maps a text to a low-dimensional token. The similarity between any two documents can be judged by comparing their Simhash tokens. However, Simhash was originally designed to detect near-duplicates in web crawling and does not consider the semantic information of each word when representing a text, so the original Simhash token represents a text with limited accuracy. To improve the accuracy of similar-text detection, this thesis studies and improves the word weight calculation strategy and the similar-text detection algorithm. The main contributions are as follows:

(1) This thesis proposes a word similarity algorithm that integrates HowNet and CiLin, to address the incomplete information of Chinese word similarity algorithms that rely on a single knowledge base. For use in the word weight calculation strategy, the existing well-performing word similarity algorithms based on HowNet IC and CiLin IC are combined dynamically according to the distribution of words. This makes full use of the hierarchical structure information in HowNet and CiLin and overcomes the limitations of existing methods.

(2) To address the poor accuracy with which traditional TF-IDF reflects the importance of words in a text, this thesis proposes an improved word weight calculation method. The traditional TF-IDF algorithm considers only the frequency of words in the text and in the document set, and neglects the characteristics of the words themselves. Based on an analysis of the expression habits of Chinese users and the semantic information contained in words, this thesis proposes an algorithm that integrates the length, part of speech, position, and title matching of each word.

(3) To address the shortcomings of Simhash in similar-text detection, a multi-feature word weight algorithm is applied to improve the generation of Simhash tokens. The method uses the new word weights to generate a Simhash token that maps each text in the document set to a low-dimensional token, then compares the Simhash tokens of different texts to produce the final similar-document detection result. Compared with traditional Simhash, this method improves precision and recall in Chinese text similarity detection tasks.
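
The role that word weights play in Simhash token generation can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names `simhash` and `hamming`, the use of MD5 as the per-word hash, the 64-bit token width, and the example weights are all assumptions made for illustration. The weights are passed in as a plain dictionary, so any weighting scheme (plain TF-IDF or a multi-feature weight of the kind described above) could supply them.

```python
import hashlib
from typing import Dict


def simhash(weights: Dict[str, float], bits: int = 64) -> int:
    """Fold weighted word hashes into a single `bits`-wide token.

    `weights` maps each word to its weight; how the weights are
    computed is left to the caller (an assumption of this sketch).
    `bits` must be a multiple of 8 and at most 128 (MD5 digest size).
    """
    acc = [0.0] * bits
    for word, w in weights.items():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # A set bit votes +w for this position, a clear bit votes -w.
            acc[i] += w if (h >> i) & 1 else -w
    token = 0
    for i, total in enumerate(acc):
        if total > 0:
            token |= 1 << i
    return token


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two tokens differ."""
    return bin(a ^ b).count("1")


# Hypothetical word weights for two nearly identical texts.
doc_a = {"文本": 2.0, "相似度": 3.0, "检测": 1.5}
doc_b = {"文本": 2.0, "相似度": 3.0, "检测": 1.5, "方法": 0.5}
distance = hamming(simhash(doc_a), simhash(doc_b))
```

Two texts are then treated as similar when `distance` falls below a chosen threshold; the threshold is a tunable parameter that this sketch does not fix. Because each word contributes to every bit position in proportion to its weight, changing the weighting scheme directly changes which words dominate the token.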

Related items

1	Research Of Comprehensive Weighted Word Semantic Similarity Computation
2	Research On Text Similarity Algorithm Based On VSM Combined With Word Semantics
3	Research On Word Similarity Computation Method Based On Non-IID Learning
4	Study On Chinese Text Similarity Computing Based On Word Segmentation
5	Research On Text Similarity Detection Algorithm Based On Simhash
6	Study On Chinese Text Replication Detection Based On Sentence Similarity
7	Research And Implementation Of Subjective Question Scoring System Based On Chinese Word Segmentation And Text Similarity
8	Exploring Dialogue Text Classification Based On Word Mixture Vectors
9	Study Of Chinese Text Similarity Research Based On Markov Word Order Gene
10	Research And Application Of Word Similarity Based On Context