
The Study Of Domain Dictionary Construction Based On Web

Posted on: 2009-01-26
Degree: Master
Type: Thesis
Country: China
Candidate: R Gao
Full Text: PDF
GTID: 2178360278964555
Subject: Computer Science and Technology
Abstract/Summary:
Domain-specific terms, which represent the characteristics of their domains, can be extracted from corpora automatically. Automatic domain-specific term extraction is an important task in natural language processing, with applications in domain ontology construction, vertical search, text classification, class-based language models and so on. At the same time, domain-specific resources on the Internet are abundant, so extracting a domain dictionary from large-scale domain corpora gathered from the Internet is both challenging and of practical value.

This thesis introduces the techniques for constructing a domain dictionary and explains how they are implemented. Based on an analysis of the system's functions, we divide the system into four parts: gathering domain texts, preprocessing domain resources, detecting new words, and extracting domain terms.

Unlike a general domain term extractor, our system has to collect web pages itself. In the gathering part, we collect web pages from specified catalogs using a breadth-first algorithm. We study focused crawling, which is the key point of this part, and design and implement a topic filter module based on the Vector Space Model.

In the preprocessing part, we adopt a statistical approach to extracting text content. The method represents a web page as a tree built from its HTML tags and then selects the node that contains the text content according to the number of Chinese characters in each node of the tree. The method is simple and widely used, and it meets the system's requirements for accuracy and efficiency.

In the new word detection part, we adopt a method that combines statistics and rules. Whether a repeated string is filtered out is decided by parameters such as the independent word probability. At present, the F-measure of this module is over 70%.

In the term extraction part, we use a strategy based on normalized distribution entropy and also introduce the inside word probability parameters into the method.

Overall, the author proposes a solution for Web-based domain dictionary construction, implements it, and obtains good results in the experiments.
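As an illustration of the topic filter mentioned in the abstract, the following is a minimal sketch of a Vector Space Model filter. The abstract does not give the thesis's weighting scheme, similarity measure, or threshold, so everything here is an assumption: raw term frequencies as weights, cosine similarity as the measure, a threshold of 0.3, and the hypothetical function names `term_vector`, `cosine_similarity` and `is_on_topic`. Pages are assumed to be tokenized already.

```python
import math
from collections import Counter


def term_vector(tokens):
    """Build a raw term-frequency vector (assumed weighting, not the thesis's)."""
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_a * norm_b)


def is_on_topic(page_tokens, topic_vector, threshold=0.3):
    """Keep a page if it is similar enough to the topic vector.

    The threshold value is an illustrative assumption.
    """
    return cosine_similarity(term_vector(page_tokens), topic_vector) >= threshold


# Hypothetical topic description and two candidate pages.
topic = term_vector(["cpu", "memory", "cpu", "disk"])
related_page = ["cpu", "memory", "cache"]
unrelated_page = ["football", "goal"]

print(is_on_topic(related_page, topic))
print(is_on_topic(unrelated_page, topic))
```

In a focused crawler, a filter of this kind would typically be applied to each fetched page before its outgoing links are added to the breadth-first queue, so that off-topic branches are not expanded.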
Keywords/Search Tags:Term, Terminology, Automatic domain-specific term extraction, New word identification