| Text similarity detection is widely used in large-scale natural language processing, including duplicate checking of scientific papers, removal of duplicated web pages, and abstract generation for scientific papers. Fingerprints produced by the simhash algorithm are locality-sensitive, so the distance between two fingerprints reflects the similarity of the underlying content. Retrieval and matching are also efficient because they rely on an index, which allows simhash-based detection to work efficiently on large file systems. Verification experiments by researchers over the years have shown that simhash performs well in text similarity detection. However, unlike simple copy checking, computing the semantic similarity of text is made difficult by the complexity of natural language. Simhash was designed to remove duplicates among large numbers of web pages, a task that only requires filtering out fully or partially identical content. It does not use the semantic information of the text and therefore cannot yet handle synonymy and polysemy. To address the fact that simhash cannot identify the semantic similarity of synonyms, this paper studies a semantic similarity detection algorithm based on simhash, retaining both its "dimension reduction" of text and its high retrieval efficiency. First, this paper analyzes the characteristics of common text similarity algorithms. Through this comparative analysis, it explains why simhash was chosen as the foundation for the research, and it identifies the existing problems and the directions for improvement. Second, to address the shortcomings of simhash in capturing text semantics, this paper examines existing synonym extension schemes and proposes a semantic coding design based on the CiLin synonym thesaurus and context. Further, according to the granularity of text blocks, this paper modifies how fingerprint weights are determined and proposes using part of speech as the weight. On this basis, it puts forward a new semantic fingerprint generation algorithm based on synonym information, which solves the problem that semantically similar texts cannot be identified. In addition, detecting a large number of texts produces a huge number of fingerprints. To improve the efficiency of matching and retrieval, this paper proposes segmenting each fingerprint and building a segmented index that carries position information. In theory, this reduces redundant comparisons and improves detection speed. Finally, a prototype system is developed and compared with other text similarity detection algorithms. The proposed algorithm can handle the synonym recognition and polysemy problems that simhash cannot support. It also improves detection efficiency and can perform well in larger-scale text similarity detection systems. |
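
The ideas summarized in the abstract (synonym-aware fingerprints, part-of-speech weights, and a segmented index with position information) can be illustrated with a minimal Python sketch. It is not the implementation described in the paper: the `SYNONYM_CODE` and `POS_WEIGHT` tables are tiny made-up stand-ins for a real thesaurus such as CiLin and a real part-of-speech tagger, and the names `semantic_simhash`, `hamming`, and `SegmentedIndex` are hypothetical. The context-based disambiguation mentioned in the abstract is not modeled here.

```python
import hashlib
from collections import defaultdict

# Assumption: toy synonym codes and POS weights, invented for illustration only.
SYNONYM_CODE = {"car": "C001", "automobile": "C001", "quick": "C002", "fast": "C002"}
POS_WEIGHT = {"n": 3, "v": 2, "a": 1}


def _hash64(token: str) -> int:
    """Map a token to a stable 64-bit integer."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def semantic_simhash(tagged_tokens, bits=64):
    """Simhash over (word, pos) pairs, hashing synonym codes instead of raw words."""
    counts = [0] * bits
    for word, pos in tagged_tokens:
        code = SYNONYM_CODE.get(word, word)   # synonyms share one code
        weight = POS_WEIGHT.get(pos, 1)       # part of speech acts as the weight
        h = _hash64(code)
        for i in range(bits):
            counts[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


class SegmentedIndex:
    """Index fingerprints by (segment position, segment value)."""

    def __init__(self, bits=64, segments=4):
        self.segments = segments
        self.seg_bits = bits // segments
        self.table = defaultdict(set)
        self.fingerprints = {}

    def _keys(self, fp):
        mask = (1 << self.seg_bits) - 1
        return [(pos, (fp >> (pos * self.seg_bits)) & mask)
                for pos in range(self.segments)]

    def add(self, doc_id, fp):
        self.fingerprints[doc_id] = fp
        for key in self._keys(fp):
            self.table[key].add(doc_id)

    def query(self, fp, max_distance=3):
        # Only documents sharing at least one identical segment are compared.
        candidates = set()
        for key in self._keys(fp):
            candidates |= self.table.get(key, set())
        return sorted(d for d in candidates
                      if hamming(self.fingerprints[d], fp) <= max_distance)
```

In this sketch, `[("car", "n"), ("quick", "a")]` and `[("automobile", "n"), ("fast", "a")]` yield the same fingerprint because each synonym pair maps to one code. With four segments and `max_distance=3`, the pigeonhole principle guarantees that any fingerprint within the threshold shares at least one identical segment with the query, so restricting comparison to candidates from the index does not miss matches as long as `max_distance` stays below the number of segments.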