The Study Of Domain Dictionary Construction Based On Web

Posted on:2009-01-26

Degree:Master

Type:Thesis

Country:China

Candidate:R Gao

Full Text:PDF

GTID:2178360278964555

Subject:Computer Science and Technology

Abstract/Summary:

Domain-specific terms, which represent the characteristics of their domains, can be extracted automatically from corpora. Automatic domain-specific term extraction is an important task in natural language processing, with applications in domain ontology construction, vertical search, text classification, class-based language models, and so on. At the same time, domain-specific resources on the Internet are abundant, so extracting a domain dictionary from large-scale domain corpora gathered from the Internet is both challenging and of practical value.

This thesis introduces the techniques for constructing a domain dictionary and explains how to implement them. Based on an analysis of the system's functions, we divide the system into four parts: gathering domain texts, preprocessing domain resources, detecting new words, and extracting domain terms.

Unlike a general domain term extractor, our system must collect web pages itself. In this part, we collect the web pages of a specific catalog using a breadth-first algorithm. We study focused crawling, which is the key technique of this part, and design and implement a topic filter module based on the Vector Space Model.

For resource preprocessing, we adopt a statistical approach to extracting text content. The method represents a web page as a tree built from its HTML tags and then selects the node that contains the text content according to the number of Chinese characters in each node of the tree. The method is simple and widely used, and it meets the system's requirements for accuracy and efficiency.

For new domain word detection, we adopt a method that combines statistics and rules: whether a repeated string is filtered out is decided according to parameters such as the independent word probability. At present, the F-measure of this module exceeds 70%.

For domain term extraction, we use a strategy based on normalized distribution entropy and also introduce inside-word probability parameters into the method. Overall, based on an analysis of the experimental results, the author proposes a solution for Web domain dictionary construction, implements it, and obtains good results in testing.
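The content-extraction step described in the abstract (build a tree from the HTML tags, then pick the node by its number of Chinese characters) might look roughly like the sketch below. It is an illustration under stated assumptions, not the thesis's implementation: the names `Node`, `TreeBuilder`, `own_chinese`, and `extract_main_text`, the set of tags treated as block containers, and the rule of counting only characters that are not inside a nested block container are all choices made here for the example.

```python
import re
from html.parser import HTMLParser

CJK = re.compile(r"[\u4e00-\u9fff]")  # basic CJK Unified Ideographs block
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # never have children
SKIP_TAGS = {"script", "style"}                           # never count as content
BLOCK_TAGS = {"div", "td", "article", "section", "body"}  # assumed candidate containers


class Node:
    """One element of the page tree; items holds strings and child Nodes in order."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.items = tag, parent, []

    def text(self):
        parts = []
        for item in self.items:
            if isinstance(item, str):
                parts.append(item)
            elif item.tag not in SKIP_TAGS:
                parts.append(item.text())
        return "".join(parts)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML tags, tolerating unclosed or stray tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.cur)
        self.cur.items.append(node)
        self.cur = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore unmatched end tags.
        node = self.cur
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.cur = node.parent

    def handle_data(self, data):
        self.cur.items.append(data)


def own_chinese(node):
    """Count Chinese characters under node, excluding nested block containers."""
    total = 0
    for item in node.items:
        if isinstance(item, str):
            total += len(CJK.findall(item))
        elif item.tag not in BLOCK_TAGS and item.tag not in SKIP_TAGS:
            total += own_chinese(item)
    return total


def walk(node):
    yield node
    for item in node.items:
        if isinstance(item, Node):
            yield from walk(item)


def extract_main_text(html):
    """Return the text of the block container with the most Chinese characters."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    candidates = [n for n in walk(builder.root) if n.tag in BLOCK_TAGS]
    if not candidates:
        return ""
    best = max(candidates, key=own_chinese)
    return " ".join(best.text().split())
```

Calling `extract_main_text(page_source)` on a page whose body text sits in one container and whose navigation and footer hold only a few characters each should return the body text. Counting per container, rather than over whole subtrees, keeps an outer wrapper such as `body` from always winning; a real system would likely need further heuristics, for example for link density.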

Keywords/Search Tags:

Term, Terminology, Automatic domain-specific term extraction, New word identification

Related items

1	Research On Domain-Specific Term Extraction Based On Semi-Supervised Learning
2	Research On Latent Semantic Analysis For Domain-specific Chinese Textual Information Processing
3	Design Of Automatic Term Extraction System And Study Of Key Techniques
4	Chinese Term Extraction In Specific Domain
5	Research Of Chinese Word Segmentation Oriented To Special Domain
6	Ccd-based Terminology Extraction Study
7	Chinese Terminology System Design And Implementation Based On Maximum Entropy
8	Research On Automatic Extraction Of Chinese Terms
9	The Study Of Automatic Chinese Term Extraction
10	Domain Term Automatic Acquisition From Unstructured Texts