Parallel automatic term extraction from large Web corpora

Posted on:2005-05-28

Degree:M.C.Sc

Type:Thesis

University:Dalhousie University (Canada)

Candidate:Zhang, Lingyan

GTID:2458390008489671

Subject:Computer Science

Abstract/Summary:

Automatic term extraction using linguistic and statistical measures has been shown to be effective for specialized text corpora. For large text corpora (on the order of tens of gigabytes), however, sequential computation is prohibitively expensive. We are investigating the feasibility of parallelizing automatic term extraction on a cluster of distributed-memory workstations. The corpus is partitioned among the nodes of the cluster, which perform part-of-speech tagging and candidate term extraction in parallel. The candidate term lists from the nodes are then merged into a single list and sorted according to a "termhood" measure. The approach is being tested on the Neural Network data set and the 18 GB .GOV Web collection.
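
The partition, extract, merge and sort pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: threads stand in for the cluster nodes, a hypothetical stop-word filter over bigrams stands in for part-of-speech tagging, and raw frequency stands in for the "termhood" measure. All function names are assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stop list; a stand-in for a real part-of-speech filter.
STOPWORDS = {"the", "a", "an", "of", "is", "are", "and", "for", "in", "on", "to"}


def extract_candidates(docs):
    """Count candidate terms (stop-word-free bigrams) in one partition."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for first, second in zip(tokens, tokens[1:]):
            if first not in STOPWORDS and second not in STOPWORDS:
                counts[f"{first} {second}"] += 1
    return counts


def partition(docs, n_parts):
    """Split the corpus into n_parts round-robin slices, one per worker."""
    return [docs[i::n_parts] for i in range(n_parts)]


def parallel_term_extraction(docs, n_workers=2):
    """Extract candidates per partition, merge the lists, rank the result."""
    parts = partition(docs, n_workers)
    # Threads stand in for cluster nodes (assumption, for portability).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(extract_candidates, parts))

    # Merge the per-worker candidate lists into a single list.
    merged = Counter()
    for counts in partial_counts:
        merged.update(counts)

    # Rank by frequency (stand-in for termhood), ties broken alphabetically.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    corpus = [
        "neural network training",
        "the neural network is deep",
        "training of neural network",
    ]
    for term, score in parallel_term_extraction(corpus):
        print(score, term)
```

Because each partition is processed independently and the merge only sums counts, the ranking does not depend on the number of workers. On a real distributed-memory cluster the merge step would require communication between nodes, which this sketch does not model.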

Keywords/Search Tags:

Term, Large

Related items

1	Study of document retrieval using Latent Semantic Indexing (LSI) on a very large data set
2	Large Trucks Long-term On Illegal Lane Discriminant Model Research
3	Study And Implementation Of Content-based Mandarin Spoken Term Detection System
4	The Study Of Automatic Chinese Term Extraction
5	Design And Implementation Of Deep Learning Based Area Term Recognition System
6	Research On Personalized Recommendation Based On Long-term And Short-term Preference Depth Joint Modeling
7	Research On Extracting Term Relationships Based On Semantic Grammar
8	The Research On A Term Weight Calculation Method Based On The Term Mathematical Expectation
9	Research On Confidence Measure For Chinese Spoken Term Detection
10	A Study On The Chinese Term Extraction