
Domain Knowledge Acquisition

Posted on: 2009-12-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Li
Full Text: PDF
GTID: 1118360245470120
Subject: Signal and Information Processing
Abstract/Summary:
A knowledge base is the "brain" of a natural language processing system, enabling it to "understand" and process natural language. This dissertation explores new techniques for domain knowledge acquisition. The main contributions are as follows:

1. To handle redundant web information during the acquisition of domain knowledge sources, a web document duplicate removal algorithm based on keyword sequences (KSM) is presented. Drawing on comprehensive information theory, KSM uses the keyword sequence of a web document to represent its structural and intension features, and then judges information redundancy by comparing the keyword sequences of similar documents. In experiments on various kinds of obscure duplicates, the overall precision and recall of KSM are 99.2% and 97.7%, respectively.

2. To improve the recall of low-frequency terms, an automatic Chinese term extraction algorithm based on language cognition theory is presented. Making use of discourse markers in research papers, the algorithm introduces a "weighted frequency" factor into the C-Value and SCP_f measures, and then proposes the MC-SCP measure to evaluate both the "unithood" and the "termhood" of candidate terms. In term extraction for the "License Plate Recognition" domain, the overall recall and precision are 96.5% and 77.8%, respectively, while the recall and precision for low-frequency terms are 96.2% and 79.3%, respectively.

3. To acquire various relations between terms, a multi-strategy relation acquisition model is designed, comprising a) rule-based synonymy relation acquisition, b) hierarchical relation acquisition based on terms' morphological similarities, c) non-hierarchical relation acquisition based on all-weighted association rules, and d) PSO-based term clustering.

4. To alleviate the conflict between the flood of multi-domain research papers and the limits of editors' knowledge, a domain-knowledge-guided first review assistant system is presented. Following editors' experience, the first review is refined into four judgments. In an experiment on 2365 research papers, the system helped editors filter out 35% of unqualified manuscripts.
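For readers unfamiliar with the C-Value measure mentioned in item 2, the following is a minimal sketch of the standard C-value score for multi-word candidate terms. It is not the dissertation's MC-SCP measure, whose weighted-frequency factor is not specified in this abstract. The function names, the tuple-of-words input format, and the sample frequencies are assumptions made for illustration only.

```python
import math
from collections import defaultdict


def _contains(longer, shorter):
    """Return True if `shorter` occurs as a contiguous sub-sequence of `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def c_value(freq):
    """Standard C-value for each candidate term.

    `freq` maps a candidate term (a tuple of words) to its corpus frequency.
    Single-word candidates score 0 because log2(1) == 0; the measure is
    intended for multi-word terms.
    """
    terms = list(freq)
    # For each candidate, collect the longer candidates that contain it.
    containers = defaultdict(list)
    for a in terms:
        for b in terms:
            if len(b) > len(a) and _contains(b, a):
                containers[a].append(b)

    scores = {}
    for a in terms:
        f = freq[a]
        if containers[a]:
            # Discount by the mean frequency of the longer terms containing `a`.
            f -= sum(freq[b] for b in containers[a]) / len(containers[a])
        scores[a] = math.log2(len(a)) * f
    return scores


# Hypothetical frequencies, for illustration only.
sample = {
    ("license", "plate"): 10,
    ("license", "plate", "recognition"): 6,
}
scores = c_value(sample)
```

In this sketch the shorter candidate is penalized for the occurrences it owes to the longer term that contains it, which is the nesting correction that distinguishes C-value from raw frequency.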
Keywords/Search Tags: term extraction, terms' relation acquisition, document duplicate removal, All-Weighted Association Rules Mining, first review assistant system for research papers