Research On Domain-Specific Term Extraction Based On Semi-Supervised Learning

Posted on: 2010-11-21
Degree: Master
Type: Thesis
Country: China
Candidate: D N Shi
GTID: 2178360278466405
Subject: Control theory and control engineering
Abstract/Summary:
Terms are a concentrated expression of the core knowledge in a subject domain. To some extent, changes in terminology reflect the development of that domain. Automatic term extraction is an important topic in Chinese information processing and is widely applied in areas such as text classification, syntactic analysis, and lexicography.

Traditional term extraction relies mainly on manual work, which demands considerable money and manpower. As subjects develop, their terminology is updated rapidly, so automatic term extraction is an inevitable trend. This thesis therefore proposes a term extraction algorithm that combines statistics and knowledge. By analyzing the linguistic features and document structure of scientific papers, we identify the fields of interest and the sensitive fields. Based on this analysis, we add a word-frequency weighting factor to the SCP and C-value measures to improve them. We then use the improved SCP to assess the unithood of candidate terms and the improved C-value to assess their termhood. The algorithm improves the recall of low-frequency terms while preserving overall performance. Experiments on scientific papers about license plate recognition show that the algorithm improves the extraction of low-frequency terms and that its F-measure reaches 85.7%.

We also compute the relevance between the extracted terms. Word relevance reflects the degree of correlation between words and is widely used in natural language processing, information retrieval, text classification, and related tasks. Using context information, the thesis generates local and global co-occurrence word pairs. By analyzing the semantic information implied by punctuation, we then add a word-frequency weighting factor and a co-occurrence distance impact factor to the log-frequency and global entropy weighting method. The algorithm improves the precision of word relevance calculation.
Keywords/Search Tags: Term Extraction, Unithood, Termhood, Word Relevance, Word Co-occurrence Information
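
For readers unfamiliar with the two measures named in the abstract, the following is a minimal sketch of the commonly cited baseline forms of SCP (unithood) and C-value (termhood). It does not reproduce the thesis's improved versions: the abstract does not give the word-frequency weighting factor, so none is applied here. The function names, the tokenized toy input, and the choice to estimate every n-gram probability as count divided by the total token count are assumptions made for illustration only.

```python
import math
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every n-gram of length 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def scp(ngram, counts, total):
    """Baseline SCP with fair dispersion (unithood).

    SCP = p(w1..wn)^2 / avg over split points i of p(w1..wi) * p(wi+1..wn).
    Assumption: each probability is estimated as count / total tokens.
    """
    n = len(ngram)
    if n < 2:
        raise ValueError("SCP is defined for n-grams with n >= 2")
    p_whole = counts[ngram] / total
    avg_split = sum(
        (counts[ngram[:i]] / total) * (counts[ngram[i:]] / total)
        for i in range(1, n)
    ) / (n - 1)
    return p_whole ** 2 / avg_split if avg_split > 0 else 0.0


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous sub-sequence of `longer`."""
    m = len(shorter)
    return any(longer[i:i + m] == shorter for i in range(len(longer) - m + 1))


def c_value(candidate, freq):
    """Baseline C-value (termhood) for one candidate.

    `freq` maps candidate terms (tuples of words) to their frequencies.
    If the candidate is nested in longer candidates, the mean frequency
    of those longer candidates is subtracted from its own frequency.
    """
    nesting = [b for b in freq
               if len(b) > len(candidate) and contains(b, candidate)]
    f = freq[candidate]
    if nesting:
        f -= sum(freq[b] for b in nesting) / len(nesting)
    return math.log2(len(candidate)) * f


# Toy usage on a hypothetical, already tokenized sentence.
tokens = ("license plate recognition uses license plate location "
          "and license plate recognition").split()
counts = ngram_counts(tokens, max_n=3)
candidates = {
    ("license", "plate"): counts[("license", "plate")],
    ("license", "plate", "recognition"):
        counts[("license", "plate", "recognition")],
}
unithood = scp(("license", "plate"), counts, total=len(tokens))
termhood = {c: c_value(c, candidates) for c in candidates}
```

In this baseline form, a nested candidate such as "license plate" is penalized when most of its occurrences are explained by a longer candidate, which is one reason low-frequency terms can be hard to recall without an additional weighting factor of the kind the thesis describes.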