
Research And Implementation Of Text Similarity Computing Based On HowNet Sememe Space

Posted on: 2014-11-17 | Degree: Master | Type: Thesis
Country: China | Candidate: K Zhang
GTID: 2268330392471761 | Subject: Computer system architecture
Abstract/Summary:
Text similarity computing is a key technique in fields such as intellectual property protection, machine translation, natural language processing, copy detection, question answering, text classification, and information retrieval. Current approaches to text similarity fall into two types: those based on statistical information about the texts, and those based on semantic understanding. Statistical methods perform well when computing similarity between paragraphs, full texts, and other coarse-grained units. The most typical statistical methods are VSM (Vector Space Model) and GVSM (Generalized Vector Space Model). GVSM relaxes VSM's assumption that terms are orthogonal by using term co-occurrence. Semantic understanding methods usually rely on some kind of knowledge base to calculate the similarity of words and sentences. Statistical methods are usually simple and efficient but lack semantic information, so they cannot solve the problems of polysemy and synonymy in natural language. Semantic understanding methods are often too complex to apply to large collections of text.

Inspired by GVSM, this paper proposes a text semantic similarity computing method based on the HowNet sememe space, the Sememe Vector Space Model (SVSM). SVSM combines statistical methods and semantic understanding methods by transforming texts into vectors in the sememe space. Text similarity is calculated from the included angle between text sememe vectors. To verify SVSM, this paper runs a text clustering experiment comparing it with classic VSM and GVSM. The results show that SVSM achieves better performance in text semantic similarity computing than VSM and GVSM.

Based on SVSM, this paper designs and implements a text duplicate checking system on the J2EE platform. In this system, the sememes, concepts, and words in HowNet, the similarities between sememes, and the words' sememe vectors are stored as relational tables in a database. In this way, the required data can be retrieved directly during text similarity computing, which avoids repeated computation and improves efficiency. This paper uses open source tools such as Lucene, ICTCLAS, and Hibernate Search to construct text sememe vectors and compute similarity. The text duplicate checking system has been adopted as part of an actual engineering application and performs well.
Keywords/Search Tags: text similarity, VSM, GVSM, semantic similarity, HowNet, text duplicate checking system
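
The idea in the abstract of mapping texts into a sememe space and comparing them by the angle between their vectors can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the word-to-sememe table `WORD_SEMEMES`, its weights, and the function names are hypothetical stand-ins for real HowNet data, and the sketch ignores the sememe-to-sememe similarities that the abstract says the system also stores.

```python
import math
from collections import Counter

# Hypothetical toy table: each word maps to weighted sememes.
# Real HowNet entries and weights differ; these are assumptions for illustration.
WORD_SEMEMES = {
    "doctor": {"human": 1.0, "medical": 1.0},
    "physician": {"human": 1.0, "medical": 1.0},
    "hospital": {"place": 1.0, "medical": 1.0},
    "car": {"vehicle": 1.0},
}


def term_vector(tokens):
    """Classic VSM: one dimension per term, raw term frequency as weight."""
    return Counter(tokens)


def sememe_vector(tokens):
    """Sum the sememe weights of every known token into one sparse vector."""
    vec = Counter()
    for tok in tokens:
        for sememe, weight in WORD_SEMEMES.get(tok, {}).items():
            vec[sememe] += weight
    return vec


def cosine(u, v):
    """Cosine of the included angle between two sparse vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


a, b = ["doctor"], ["physician"]
# Term space: the two synonyms share no dimension.
print(cosine(term_vector(a), term_vector(b)))
# Sememe space: in this toy table they share both sememes.
print(round(cosine(sememe_vector(a), sememe_vector(b)), 3))
```

In this toy setting the term-space similarity of the two synonyms is zero while their sememe-space similarity is close to one, which is the kind of synonymy problem the abstract says purely statistical methods cannot solve.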