Research On Similarity Computing Method For Domain Texts

Posted on:2011-06-25

Degree:Master

Type:Thesis

Country:China

Candidate:Y B Luo

Full Text:PDF

GTID:2178330305960422

Subject:Computer software and theory

Abstract/Summary:

As domain textual data accumulates, more and more unstructured or semi-structured data, such as doc and pdf files, appears in every domain, including education, finance, dining, and tourism. Such data is harder to manage than ordinary structured data. In recent years, many domain-oriented applications have emerged, such as the ticket information provided by KoXoo and the real estate information provided by SOFUN, and people find these information services convenient in daily life. However, most information processing and services within domains are based on structured data. The main research goal of this thesis is therefore the processing of unstructured data, and in particular text similarity computing in textual information processing.

Text similarity computing is one of the most active and important techniques in many NLP applications, such as text clustering and information recommendation. Traditional text similarity computing is based on the vector space model of keywords. It considers only the surface form of keywords and ignores the semantic relations between keywords in the text, so it captures the text subject poorly, which degrades the similarity calculation. This thesis therefore addresses the extraction of domain knowledge from domain texts and the application of that knowledge to acquire textual semantic features for computing text similarity. The main contributions of this thesis are as follows:

(1) An approach to identifying new words. Based on the characteristics of new words, we design and verify a method that uses statistics to search for strings in a large-scale corpus and applies threshold filtering to the candidate new words.

(2) A model for domain knowledge acquisition. The model uses a signed (positive or negative) chi-square statistic to compute the correlation between terms and specific domains, adds highly relevant words to a domain dictionary, and combines the domain dictionary with "is-a" relation patterns to identify word pairs in a hypernym-hyponym (upper and lower) semantic relation.

(3) An approach to extracting domain features and semantic features of texts. First, we extract domain keyword features using the domain dictionary to reduce interference with the text topic. Second, the system uses words in a hypernym-hyponym relation to expand the vector model of domain keywords.

(4) A new method for computing the semantic similarity of domain texts. It uses the vector model of domain keywords, expanded with the words of upper concepts, and computes text similarity with a domain similarity computing method.

The experimental results show that the text semantic similarity computing method based on domain knowledge performs better than the traditional methods. It can extract semantic features of texts and measure the semantic similarity between domain texts.
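The ideas in contributions (2) to (4) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract gives no formulas, so the sketch assumes the standard 2x2 chi-square statistic with the sign of the association attached, a fixed hypernym weight of 0.5, and cosine similarity as the comparison measure. The function names, the toy domain dictionary, and the hypernym table are all hypothetical.

```python
import math
from collections import Counter


def signed_chi_square(a, b, c, d):
    """Signed chi-square between a term and a domain (assumed formulation).

    a: in-domain documents containing the term
    b: out-of-domain documents containing the term
    c: in-domain documents without the term
    d: out-of-domain documents without the term
    A positive value means the term is associated with the domain,
    a negative value means it is associated with the other domains.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    diff = a * d - b * c
    if denom == 0 or diff == 0:
        return 0.0
    return math.copysign(n * diff * diff / denom, diff)


def expand_with_hypernyms(tokens, domain_dict, hypernyms, weight=0.5):
    """Build a domain keyword vector and add each keyword's hypernym.

    Tokens outside the domain dictionary are dropped. The weight given
    to hypernyms is an illustrative assumption.
    """
    vec = Counter()
    for tok in tokens:
        if tok not in domain_dict:
            continue
        vec[tok] += 1.0
        parent = hypernyms.get(tok)
        if parent is not None:
            vec[parent] += weight
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy tourism-domain data (hypothetical).
domain_dict = {"hotel", "hostel", "lodging", "flight"}
hypernyms = {"hotel": "lodging", "hostel": "lodging"}

doc1 = ["cheap", "hotel"]
doc2 = ["cheap", "hostel"]

plain = cosine(expand_with_hypernyms(doc1, domain_dict, {}),
               expand_with_hypernyms(doc2, domain_dict, {}))
expanded = cosine(expand_with_hypernyms(doc1, domain_dict, hypernyms),
                  expand_with_hypernyms(doc2, domain_dict, hypernyms))
```

In this toy case the two documents share no domain keyword, so the keyword-only vectors have zero similarity, while the shared hypernym "lodging" gives the expanded vectors a non-zero similarity. This is the effect that semantic expansion is meant to achieve.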

Keywords/Search Tags:

Domain Text, Text Similarity, Domain Words, Semantic Relation, Semantic Extension

Related items

1	Ontology-based Domain-specific Semantic Similarity Analysis and Application
2	Research Of Text Mining About Semantic Relation Recognition
3	Research On Short Text Similarity Measure Based On Semantic Coupling
4	A Research On Semantic Relevancy Computational Method For Text Based On Hypertension Domain Ontology
5	Research On Text Similarity Measure Method Of Combining New Word Analysis And Semantic Analysis
6	Chinese Text Similarity Matching Based On Domain Dictionary
7	Chinese Text Similarity Research Based On Semantic And Text Structure
8	Research On Ontology-Based Semantic Text Categorization
9	Research Of Word Semantic Similarity Based On Domain Knowledge
10	The Study Of Measures And Applications Of Short Text Semantic Similarity