Research On Concept Extraction Of Ontology Learning For Chinese Text

Posted on: 2011-10-22
Degree: Master
Type: Thesis
Country: China
Candidate: J Guan
GTID: 2178360305955319
Subject: Computer application technology
Abstract/Summary:
As a conceptualization model for knowledge sharing and reuse, ontology has been widely used in many areas, such as the Semantic Web, knowledge management and e-commerce. However, constructing ontologies in these areas is difficult: building a domain ontology by hand requires many domain experts, easily leads to a knowledge acquisition bottleneck, and makes timely, dynamic updating very hard. Ontology learning technology emerged in response. It constructs ontologies automatically or semi-automatically by using machine learning and natural language processing techniques to acquire knowledge from data sources. Concept extraction and relation extraction are the important tasks of ontology learning. Concept extraction is the foundation: it determines the quality of concept relation extraction and is essential to ontology construction.

There are three kinds of concept extraction methods: linguistic, statistical and hybrid. The statistical method is the main one. Chinese differs from Western languages in its basic linguistic units and is more complex than Western languages, so statistical methods and linguistic methods each have deficiencies when used alone for concept extraction from Chinese text. This paper focuses on concept extraction for ontology learning from Chinese text and introduces a multi-strategy concept extraction method based on both linguistics and statistics.

Unlike text in Western languages, Chinese text cannot be segmented into words by spaces, because the character is its basic unit. Characters combine into words, and the word is the basic unit that expresses semantic information, so Chinese text must be segmented in a pre-processing step before concepts can be extracted. We use ICTCLAS as the text pre-processing tool in this paper. Pre-processing includes segmentation, POS tagging and stop-word removal.
Punctuation is treated as a special word during pre-processing, so after pre-processing the content of a document is segmented into a string of symbols, each composed of a lexical item and its POS tag. We also use a stop-word list to filter out the conjunctions and adverbs that are common in the domain documents, which improves the accuracy of domain concept extraction.

We analyze the commonly used statistical concept extraction methods in this paper. Because statistical methods cannot extract low-frequency or multi-word concepts, we introduce a multi-strategy concept extraction method based on statistics and linguistics:

(1) Statistical method. This is an improved TFIDF method, which we call the TFIDFE method. The traditional TFIDF method extracts domain concepts using TF and IDF, but it does not consider how a word is distributed across the documents of the domain document set. It may therefore extract words that occur only in individual documents of the set, and such words cannot express domain knowledge. We improve the traditional TFIDF method with entropy: we use entropy to examine the distribution of a term across the domain document set, which improves the accuracy of concept extraction.

(2) Linguistic template matching method. Because statistical methods cannot handle multi-word or low-frequency concepts, we use template matching to extract them. We use the KMP matching algorithm to match the templates against the pre-processed documents and record the number of successful matches for each multi-word phrase. We then set a threshold; if the count for a phrase is greater than the threshold, we choose that phrase as a domain concept.
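The two strategies above can be illustrated with a minimal Python sketch. The abstract does not give the TFIDFE formula or the template rules, so several parts are assumptions made only for illustration: the score is taken to be TF × IDF × normalized entropy, IDF is computed against a separate background corpus, and templates are written as sequences of POS tags (for example two consecutive nouns, tagged "n"). The function names `distribution_entropy`, `tfidfe`, `kmp_find_all` and `extract_multiword` are hypothetical and do not come from the thesis.

```python
import math
from collections import Counter


def distribution_entropy(term, domain_docs):
    """Normalized entropy (0..1) of a term's frequency distribution over documents.

    Close to 1 when the term is spread evenly over the domain documents,
    0 when it occurs in a single document only.
    """
    counts = [doc.count(term) for doc in domain_docs]
    total = sum(counts)
    if total == 0 or len(domain_docs) < 2:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return h / math.log(len(domain_docs))


def tfidfe(term, domain_docs, background_docs):
    """Assumed TFIDFE score: TF * IDF * entropy (not the thesis's exact formula)."""
    tf = sum(doc.count(term) for doc in domain_docs)
    df = sum(1 for doc in background_docs if term in doc)
    idf = math.log((1 + len(background_docs)) / (1 + df))
    return tf * idf * distribution_entropy(term, domain_docs)


def kmp_find_all(sequence, pattern):
    """Return the start index of every occurrence of pattern in sequence (KMP)."""
    if not pattern:
        return []
    # Failure table: length of the longest proper prefix that is also a suffix.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    hits = []
    k = 0
    for i, item in enumerate(sequence):
        while k > 0 and item != pattern[k]:
            k = fail[k - 1]
        if item == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits


def extract_multiword(tagged_docs, templates, threshold):
    """Keep phrases whose POS tags match a template more than `threshold` times.

    tagged_docs: list of documents, each a list of (word, pos_tag) pairs.
    templates:   list of POS-tag sequences (an assumed template format).
    """
    counts = Counter()
    for doc in tagged_docs:
        tags = [tag for _, tag in doc]
        for template in templates:
            for start in kmp_find_all(tags, template):
                words = doc[start:start + len(template)]
                counts["".join(word for word, _ in words)] += 1
    return {phrase for phrase, count in counts.items() if count > threshold}
```

Under these assumptions, a term that appears in only one domain document gets an entropy factor of 0 and is filtered out regardless of its TF, which mirrors the motivation given for TFIDFE. Calling `extract_multiword(tagged_docs, [("n", "n")], threshold)` on segmented, POS-tagged documents returns the noun-noun phrases matched more than `threshold` times.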
In this way we can extract the multi-word concepts and low-frequency concepts that satisfy the matching rules.

We build a multi-strategy concept extraction system to examine the effect of the multi-strategy concept extraction method by experiment. We use Recall, Precision and F1 value as the evaluation measures and choose 100 articles from Artificial Intelligence-related disciplines as the domain corpus. First, we use the multi-strategy method to extract domain concepts; then we use the traditional TFIDF method, the TFIDFE method and the template matching method separately to extract concepts from the same corpus. Comparing the results shows that the TFIDFE method extracts concepts more accurately than the traditional TFIDF method, and that the multi-strategy concept extraction method obtains a higher F1 value.

Concept extraction is the foundation of ontology learning from Chinese text and a crucial step in constructing a domain ontology. The work in this paper can form part of a complete ontology learning process for Chinese text. We introduce a multi-strategy concept extraction method and build a multi-strategy concept extraction system, which can be used as the concept extraction module of an ontology learning system.

Our work still has shortcomings. For example, only a few conditions were considered when building the matching rules for the templates. In addition, to build a domain ontology learning system we need to extract the relations between concepts, such as taxonomic and non-taxonomic relations. We will improve on this work in the future.
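The evaluation measures named above (Recall, Precision and F1 value) can be computed as in the following sketch. It assumes that the extracted concepts are compared against a manually prepared reference set of domain concepts; the abstract does not describe how the reference set was built, and the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(extracted, reference):
    """Compare extracted concepts with a reference set of correct concepts."""
    extracted, reference = set(extracted), set(reference)
    correct = len(extracted & reference)
    # Precision: share of extracted concepts that are correct.
    precision = correct / len(extracted) if extracted else 0.0
    # Recall: share of reference concepts that were extracted.
    recall = correct / len(reference) if reference else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because F1 is the harmonic mean of the two measures, a method is rewarded only when precision and recall are both reasonably high, which makes it a natural single figure for comparing the four extraction methods.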
Keywords/Search Tags: Ontology Learning, Concept Extraction, TFIDF, Template Matching