Domain Knowledge Acquisition

Posted on:2009-12-17

Degree:Doctor

Type:Dissertation

Country:China

Candidate:W Li

GTID:1118360245470120

Subject:Signal and Information Processing

Abstract/Summary:

A knowledge base is the "brain" of a natural language processing system: it enables the system to "understand" and process natural language. This dissertation explores new techniques for domain knowledge acquisition. The main contributions are as follows:

1. To remove redundant web information during the acquisition of domain knowledge sources, a web document duplicate removal algorithm based on keyword sequences (KSM) is presented. Drawing on comprehensive information theory, KSM uses the keyword sequences of a web document to represent its structure features and intension features, then judges information redundancy by comparing the keyword sequences of similar documents. Across the various obscure duplicate detection experiments, the overall precision and recall of KSM are 99.2% and 97.7% respectively.

2. To improve the recall of low-frequency terms, an automatic Chinese term extraction algorithm based on language cognition theory is presented. Making use of discourse markers in research papers, the algorithm introduces a "weighed frequency" factor into the C-Value and SCP_f measures, then proposes the MC-SCP measure to evaluate both the "unithood" and the "termhood" of candidate terms. In term extraction for the "License Plate Recognition" domain, the overall recall and precision are 96.5% and 77.8% respectively, while the recall and precision for low-frequency terms are 96.2% and 79.3% respectively.

3. To acquire the various relations between terms, a multi-strategy relation acquisition model is designed, including a) rule-based synonymy relation acquisition, b) hierarchical relation acquisition based on the morphological similarity of terms, c) non-hierarchical relation acquisition based on all-weighted association rules, and d) PSO-based term clustering.

4. To alleviate the conflict between the flood of multi-domain research papers and the limits of editors' knowledge, a domain-knowledge-guided first review assistant system is presented. Based on the editors' experience, the first review is refined into four judgments. In an experiment on 2365 research papers, the system assisted editors in filtering out 35% unqualified manuscripts.
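
For readers unfamiliar with the C-Value measure named in contribution 2, the following is a minimal sketch of the standard published C-Value score only. It is not the dissertation's MC-SCP measure and does not include the "weighed frequency" factor, whose definition the abstract does not give. The function names `contains` and `c_value`, the word-tuple representation of candidates, and the sample frequencies are assumptions made for illustration.

```python
import math
from collections import defaultdict


def contains(longer, shorter):
    """Return True if `shorter` occurs as a contiguous run inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def c_value(freq):
    """Standard C-Value for multi-word candidate terms.

    freq maps each candidate (a tuple of words) to its corpus frequency.
    A candidate nested in longer candidates has the mean frequency of
    those longer candidates subtracted before length weighting.
    """
    nested_in = defaultdict(list)
    for a in freq:
        for b in freq:
            if len(b) > len(a) and contains(b, a):
                nested_in[a].append(b)

    scores = {}
    for a, f in freq.items():
        longer = nested_in[a]
        if longer:
            f = f - sum(freq[b] for b in longer) / len(longer)
        # log2 of the term length rewards longer candidates;
        # single-word candidates score 0 under this weighting.
        scores[a] = math.log2(len(a)) * f
    return scores


# Hypothetical frequencies, not data from the dissertation.
sample = {
    ("plate", "recognition"): 10,
    ("license", "plate", "recognition"): 6,
}
scores = c_value(sample)
```

With these sample counts, the nested candidate ("plate", "recognition") is discounted by the frequency of the longer candidate that contains it, so only its independent occurrences contribute to its score. For Chinese text the length could plausibly be counted in characters or segmented words; the sketch assumes segmented words.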

Keywords/Search Tags:

term extraction, terms' relation acquisition, document duplicate removal, All-Weighted Association Rules Mining, first review assistant system for research papers

Related items

1	Research On A Model And Mining Algorithm Of Weighted Association Rules
2	Distributed Network Alarm Weighted Association Rules Mining System Research And Design
3	Research On Weighted Association Rules
4	The Research And Application Of Weighted Association Rules Mining Algorithm
5	The Research On Association Rules Of The Products
6	Research And Application Of Weighted Association Rules Algorithm Based On Cluster And Compression Matrix
7	Optimization Algorithm Of Weighted Association Rules Mining
8	Studies On Query Expansion Based On Item-All-Weighted Association Rules Mining
9	The Field Of Term Extraction And The Relationship Between The Classification Study
10	The Application And The Research Of The Association Rules In The Search Engine