
Research And Implementation Of Text Similarity Computing Based On Semantic Understanding

Posted on: 2016-07-01
Degree: Master
Type: Thesis
Country: China
Candidate: R Z Sun
Full Text: PDF
GTID: 2308330479976765
Subject: Computer Science and Technology
Abstract/Summary:
Text similarity computing measures how similar two or more texts are in content, syntax, and structure by means of an algorithmic model, and it is a key technology behind many important applications in text information processing. Most approaches rely on word frequency statistics. The most representative is the vector space model (VSM), which represents each text as a vector of feature terms and measures similarity by the cosine of the angle between the vectors. Other approaches include the GVSM algorithm based on the generalized vector space model, latent semantic indexing (LSI), string matching algorithms, and fingerprint algorithms. Text similarity computing based on semantic understanding draws on a knowledge base to take word semantics, sentence semantics, paragraph semantics, and other factors into account, so its results are better suited to practical applications.

The traditional HowNet-based text similarity algorithm is built on VSM: the feature term vectors of a text are represented in the HowNet sememe vector space, which brings word semantics into the calculation. This thesis improves on that algorithm in two ways. First, it refines the computation of HowNet sememe similarity by exploiting the sememe hierarchy and adding depth and density as semantic factors. Second, it adds paragraph similarity, which the original algorithm lacks, so that paragraph-level similarity contributes to the overall text similarity. A text clustering experiment verifies the effectiveness of the modified algorithm and shows that it performs better than the original.

Building on this theoretical work, the thesis implements a text similarity system on the J2EE platform with open source technology. The system is divided by function into four modules: HowNet data processing, text preprocessing, text vector construction, and combined similarity computation. A design and implementation is given for each module. The system performs sememe vector representation of text and similarity computation using NLPIR, Lucene, SSH, and other open source software. Finally, the system was applied in a real engineering project and performed well.
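As an illustration of the VSM baseline mentioned in the abstract, the following is a minimal sketch of cosine similarity between two term-frequency vectors. It is not the thesis's implementation: the whitespace tokenization, the raw term counts, and the function names `term_frequency_vector` and `cosine_similarity` are assumptions made for this example, and the HowNet sememe representation and NLPIR segmentation described in the abstract are not shown.

```python
import math
from collections import Counter


def term_frequency_vector(text):
    """Map a text to raw term counts.

    Assumption: lowercase whitespace tokenization stands in for a real
    word segmenter and feature selection step.
    """
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(count * vec_b[term] for term, count in vec_a.items() if term in vec_b)
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty text has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc_a = term_frequency_vector("text similarity based on word frequency")
    doc_b = term_frequency_vector("text similarity based on semantic understanding")
    print(cosine_similarity(doc_a, doc_b))
```

Because this measure only counts exact term matches, two texts that use different words for the same concept score low, which is the gap that knowledge-base approaches such as the HowNet-based method aim to close.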
Keywords/Search Tags: Text similarity, Semantic understanding, VSM, HowNet, Similarity computing