Research And Implementation Of Text Similarity Computing Based On HowNet Sememe Space

Posted on:2014-11-17

Degree:Master

Type:Thesis

Country:China

Candidate:K Zhang

Full Text:PDF

GTID:2268330392471761

Subject:Computer system architecture

Abstract/Summary:

Text similarity computing is a key technique in fields such as intellectual property protection, machine translation, natural language processing, copy detection, question answering, text classification, and information retrieval. Current approaches to text similarity fall into two types: those based on statistical information about texts, and those based on semantic understanding. Statistical methods perform well when computing similarity for paragraphs, whole texts, and other coarse-grained units. The most typical statistical methods are VSM (Vector Space Model) and GVSM (Generalized Vector Space Model). GVSM relaxes VSM's assumption that terms are orthogonal by using term co-occurrence. Semantic understanding methods usually rely on some kind of knowledge base to calculate the similarity of words and sentences. Statistical methods are usually simple and efficient but lack semantic information, so they cannot solve the problems of polysemy and synonymy in natural language. Semantic understanding methods are often too complex to apply to large collections of texts.

Inspired by GVSM, this paper proposes a text semantic similarity computing method based on the HowNet sememe space, the Sememe Vector Space Model (SVSM). SVSM combines statistical methods and semantic understanding methods by transforming texts into vectors in the sememe space. Text similarity is calculated from the angle between the texts' sememe vectors. To verify SVSM, this paper runs a comparative text clustering experiment against classic VSM and GVSM. The results show that SVSM achieves better performance in text semantic similarity computing than VSM and GVSM.

Based on SVSM, this paper designs and implements a text duplicate checking system on the J2EE platform. In this system, the sememes, concepts, and words in HowNet, the similarities between sememes, and the words' sememe vectors are stored as relational tables in a database. In this way, the required data can be retrieved directly during text similarity computing, which avoids repeated computation and improves efficiency. This paper uses open source tools such as Lucene, ICTCLAS, and Hibernate Search to construct text sememe vectors and compute similarity. The text duplicate checking system has been adopted as part of an actual engineering application and performs well.
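The idea of mapping texts into a sememe space and comparing them by the angle between their vectors can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the lexicon `WORD_SEMEMES`, its sememe names, and its weights are invented toy values rather than HowNet data, the function names are hypothetical, and the sketch assumes sememes are mutually orthogonal, whereas the abstract indicates that the actual system also stores similarities between sememes.

```python
import math
from collections import Counter

# Hypothetical toy lexicon: word -> {sememe: weight}.
# These entries are illustrative assumptions, not real HowNet data.
WORD_SEMEMES = {
    "car": {"vehicle": 1.0, "land": 0.5},
    "automobile": {"vehicle": 1.0, "land": 0.5},
    "truck": {"vehicle": 1.0, "transport": 0.5},
    "banana": {"fruit": 1.0, "edible": 0.5},
}


def sememe_vector(tokens):
    """Project a tokenized text into sememe space.

    Each word contributes its sememe weights, scaled by how often
    the word occurs. Words missing from the lexicon are ignored.
    """
    vec = Counter()
    for word, count in Counter(tokens).items():
        for sememe, weight in WORD_SEMEMES.get(word, {}).items():
            vec[sememe] += count * weight
    return dict(vec)


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(key, 0.0) for key, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def term_similarity(tokens_a, tokens_b):
    """Classic VSM baseline: cosine over raw term counts."""
    return cosine(dict(Counter(tokens_a)), dict(Counter(tokens_b)))


def sememe_similarity(tokens_a, tokens_b):
    """Sememe-space similarity: cosine over sememe vectors."""
    return cosine(sememe_vector(tokens_a), sememe_vector(tokens_b))
```

With this toy lexicon, `term_similarity(["car"], ["automobile"])` is 0 because the two texts share no terms, while `sememe_similarity(["car"], ["automobile"])` is 1 up to floating-point rounding because both words map to the same sememes. That is the synonym problem the abstract says purely statistical methods cannot solve.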

Related items

1	The Text Similarity Study Base On Hownet
2	Research On Algorithm Of Chinese Text Similarity Based On Semantics
3	Research On Chinese Text Similarity Computing Based On Semantic Weighted
4	A Chinese Text Similarity Algorithm Based On Semantic Networks
5	Research On Text Similarity Measure Method Of Combining New Word Analysis And Semantic Analysis
6	Research On Text Similarity Algorithm Based On WMD Distance
7	Research On Text Clustering Based On Hownet
8	The Study Of Measures And Applications Of Short Text Semantic Similarity
9	Research On Semantic Similarity Measurement For Text
10	Research And Implementation Of The Text Cluster Based On Text Similarity Calculation