Research On Domain-Specific Term Extraction Based On Semi-Supervised Learning

Posted on:2010-11-21

Degree:Master

Type:Thesis

Country:China

Candidate:D N Shi

Full Text:PDF

GTID:2178360278466405

Subject:Control theory and control engineering

Abstract/Summary:

Terms are concentrated expressions of the core knowledge in a subject domain, and to some extent changes in terminology reflect the development of that domain. Automatic term extraction is an important topic in Chinese information processing and is widely applied in areas such as text classification, syntactic analysis, and lexicography. Traditional term extraction relies mainly on manual work, which costs a great deal of money and labor. As disciplines develop, terms are updated rapidly, so automatic term extraction is becoming necessary. This thesis therefore proposes a term extraction algorithm that combines statistics and knowledge. By analyzing the linguistic features and document structure of scientific papers, we identify the fields of interest and the sensitive fields. Based on this analysis, we add a word-frequency weighting factor to the SCP and C-value measures to improve them. We then use the improved SCP to evaluate the unithood of candidate terms and the improved C-value to evaluate their termhood. The algorithm improves the recall of low-frequency terms while maintaining overall performance. Experiments on scientific papers about license plate recognition confirm that the algorithm improves the extraction of low-frequency terms, and its F-measure reaches 85.7%.

We also compute the relevance between the extracted terms. Word relevance reflects the degree of correlation between words and is widely used in natural language processing, information retrieval, text classification, and related tasks. Using context information, the thesis generates local and global co-occurrence word pairs. By analyzing the semantic information implied by punctuation, we then add a word-frequency weighting factor and a co-occurrence distance impact factor to the log-frequency and global entropy weighting method. The algorithm improves the precision of the word relevance calculation.
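For readers unfamiliar with the two measures named in the abstract, the following is a minimal sketch of the standard, unmodified forms of SCP (unithood) and C-value (termhood). The abstract does not give the thesis's word-frequency weighting factor, so it is not reproduced here. The function names, the toy sentence, and the choice to use raw counts in place of probabilities (assuming one shared normalizer, which cancels in the SCP ratio) are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every n-gram (as a tuple) for n = 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def scp(gram, counts):
    """Unithood: SCP averaged over every split point of the n-gram.

    Assumption: all probabilities share one normalizer, so raw counts
    can be used directly because the normalizer cancels.
    """
    n = len(gram)
    if n < 2:
        raise ValueError("SCP needs an n-gram of length >= 2")
    # Average of count(left part) * count(right part) over all splits.
    avg = sum(counts[gram[:i]] * counts[gram[i:]] for i in range(1, n)) / (n - 1)
    return counts[gram] ** 2 / avg if avg else 0.0


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    m = len(shorter)
    return any(longer[i:i + m] == shorter for i in range(len(longer) - m + 1))


def c_value(gram, candidates):
    """Termhood: standard C-value; `candidates` maps tuple -> frequency."""
    freq = candidates[gram]
    # Frequencies of longer candidate terms that contain this one.
    nested_in = [f for c, f in candidates.items()
                 if len(c) > len(gram) and contains(c, gram)]
    if nested_in:
        freq -= sum(nested_in) / len(nested_in)
    return math.log2(len(gram)) * freq


# Hypothetical toy input, not data from the thesis.
tokens = ("license plate recognition uses license plate images "
          "and license plate recognition").split()
counts = ngram_counts(tokens, 3)
bigram = ("license", "plate")
trigram = ("license", "plate", "recognition")
candidates = {bigram: counts[bigram], trigram: counts[trigram]}

print(scp(bigram, counts), scp(trigram, counts))
print(c_value(bigram, candidates), c_value(trigram, candidates))
```

In this toy input the shorter candidate is discounted by the frequency of the longer candidate that contains it, which is the nesting behavior C-value is designed to capture. A frequency-weighted variant like the one the thesis describes would adjust these scores further.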

Keywords/Search Tags:

Term Extraction, Unithood, Termhood, Word Relevant, Word co-occurrence Information

Related items

1	A Study On The Chinese Term Extraction
2	Research On Terminology Extraction Of Academic Paper Based On Multi-Strategy Method
3	Research On Keyword Extraction And Improved LSA Based On Co-occurrence Word
4	The Description Of Text's Feature Based On Semanteme Concept
5	Research On The Language Model Information Retrieval Method Based On Word Co-occurrence
6	Hot Topics Detected From Micro-bloggings Based On Word Co-occurrence Model
7	Research Of Chinese Word Segmentation Oriented To Special Domain
8	Bilingual Term Extraction Based On Parallel Corpus
9	Design And Implementation Of The Uighur Word Frequency's Statistics System
10	Term Co-occurrence Analysis And Opinion Leader Recognition Of Micro-blog