Research On Similarity Computing Method For Domain Texts

Posted on: 2011-06-25
Degree: Master
Type: Thesis
Country: China
Candidate: Y B Luo
Full Text: PDF
GTID: 2178330305960422
Subject: Computer software and theory
Abstract/Summary:
With the accumulation of domain textual data, more and more unstructured or semi-structured data, in formats such as doc and pdf, is appearing in every domain, including education, finance, dining, and tourism. Such data is harder to manage than ordinary structured data. In recent years many domain-oriented applications have appeared, such as the ticket information provided by KoXoo and the real estate information provided by SOFUN, and people find these information services convenient in daily life. Because most information processing and services within domains are built on structured data, this thesis focuses on information processing for unstructured data, and in particular on text similarity computing.

Text similarity computing is one of the hot and important techniques in many NLP applications, such as text clustering and information recommendation. Traditional text similarity computing is based on a vector space model of keywords. It considers only the surface form of keywords and ignores the semantic relations between keywords in the text, so it fails to capture the subject of the text, which degrades the similarity calculation. This thesis therefore addresses the extraction of domain knowledge from domain texts and the use of that knowledge to acquire semantic features of texts for computing text similarity. The main contributions of this thesis are as follows:

(1) An approach to identifying new words. Based on the characteristics of new words, we design and verify a method that searches for candidate strings using statistics over a large-scale corpus and then filters the candidate new words by threshold.

(2) A model for domain knowledge acquisition. The model uses a signed chi-square statistic (chi-square with positive and negative signs) to compute the correlation between terms and specific domains, adds highly relevant words to a domain dictionary, and combines the domain dictionary with "is-a" relation patterns to identify word pairs in a hypernym-hyponym (upper-lower) semantic relation.

(3) An approach to extracting domain features and semantic features of texts. First, we extract domain keyword features using the domain dictionary to reduce interference with the text's topic. Second, the system uses words in a semantic (upper-lower) relation to expand the vector model of domain keywords.

(4) A new method for computing the semantic similarity of domain texts. It uses the vector model of domain keywords, expanded with the words of upper concepts, and computes text similarity with a domain similarity computing method.

The experimental results show that the text semantic similarity computing method based on domain knowledge performs better than the traditional methods. It can extract the semantic features of texts and measure the semantic similarity between domain texts.
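As a rough illustration of contributions (2) to (4), the following Python sketch shows one way a signed chi-square score and a hypernym-expanded keyword vector could be computed. It is not the thesis's implementation. The 2x2 contingency formulation, the expansion weight of 0.5, the use of cosine similarity in place of the unspecified "domain similarity computing method", and all names and sample data (signed_chi_square, expand_with_hypernyms, the toy dictionary) are assumptions made for this example.

```python
import math
from collections import Counter


def signed_chi_square(a, b, c, d):
    """Signed chi-square score for a 2x2 term/domain table (assumed form).

    a: in-domain documents containing the term
    b: out-of-domain documents containing the term
    c: in-domain documents without the term
    d: out-of-domain documents without the term
    A positive score means the term is positively associated with the domain.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    diff = a * d - b * c
    chi2 = n * diff * diff / denom
    return chi2 if diff >= 0 else -chi2


def expand_with_hypernyms(tokens, domain_dict, hypernyms, weight=0.5):
    """Build a domain-keyword vector and add each keyword's upper concept.

    Tokens outside the domain dictionary are dropped. The weight given to
    an upper concept is a hypothetical parameter, not taken from the thesis.
    """
    vec = Counter()
    for tok in tokens:
        if tok not in domain_dict:
            continue
        vec[tok] += 1.0
        parent = hypernyms.get(tok)
        if parent is not None:
            vec[parent] += weight
    return vec


def cosine(u, v):
    """Cosine similarity of two sparse vectors (a stand-in measure)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy data, invented for illustration only.
domain_dict = {"hotel", "hostel", "flight"}
hypernyms = {"hotel": "lodging", "hostel": "lodging"}
doc1 = ["cheap", "hotel"]
doc2 = ["cheap", "hostel"]

plain = cosine(expand_with_hypernyms(doc1, domain_dict, hypernyms, weight=0.0),
               expand_with_hypernyms(doc2, domain_dict, hypernyms, weight=0.0))
expanded = cosine(expand_with_hypernyms(doc1, domain_dict, hypernyms),
                  expand_with_hypernyms(doc2, domain_dict, hypernyms))
print(round(plain, 3), round(expanded, 3))
```

In this toy case the two documents share no domain keyword, so the unexpanded vectors have zero similarity, while the shared upper concept "lodging" gives the expanded vectors a non-zero similarity. This is the effect the abstract attributes to semantic expansion.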
Keywords/Search Tags: Domain Text, Text Similarity, Domain Words, Semantic Relation, Semantic Extension