Parallel automatic term extraction from large Web corpora

Posted on: 2005-05-28
Degree: M.C.Sc
Type: Thesis
University: Dalhousie University (Canada)
Candidate: Zhang, Lingyan
Full Text: PDF
GTID: 2458390008489671
Subject: Computer Science
Abstract/Summary:
Automatic term extraction using linguistic and statistical measures has been shown to be effective for specialized text corpora. For large text corpora (on the order of tens of gigabytes), however, sequential computation is prohibitively expensive. We are investigating the feasibility of parallelizing automatic term extraction on a cluster of distributed-memory workstations. The large text corpus is divided among the nodes of the cluster, which perform part-of-speech tagging and candidate term extraction in parallel. The candidate term lists from the nodes are then merged into a single list and sorted according to a "termhood" measure. The approach is being tested on the Neural Network data set and the 18GB .GOV web collection.
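
The divide, extract, merge, and sort steps described above can be illustrated with a minimal Python sketch. It is not the thesis implementation, and everything in it is an assumption made for illustration: a thread pool stands in for the distributed-memory cluster nodes, counting word bigrams stands in for part-of-speech tagging and candidate term extraction, and raw frequency stands in for the "termhood" measure. The function names (`partition`, `extract_candidates`, `merge`, `rank_terms`) are hypothetical.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(documents, n_nodes):
    """Split the corpus round-robin into one shard per (simulated) node."""
    return [documents[i::n_nodes] for i in range(n_nodes)]


def extract_candidates(shard):
    """Assumed placeholder for POS tagging plus candidate extraction:
    count adjacent word pairs (bigrams) within each document of the shard."""
    counts = Counter()
    for doc in shard:
        tokens = re.findall(r"[a-z]+", doc.lower())
        counts.update(zip(tokens, tokens[1:]))
    return counts


def merge(partial_counts):
    """Merge the per-node candidate lists into a single list of counts."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def rank_terms(documents, n_nodes=2):
    """Run extraction on every shard concurrently, merge, then sort.
    The sort key is an assumed termhood score: frequency, descending,
    with ties broken alphabetically so the result is deterministic."""
    shards = partition(documents, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(extract_candidates, shards))
    merged = merge(partials)
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    corpus = [
        "neural network training",
        "a neural network model",
        "network model",
    ]
    for term, score in rank_terms(corpus, n_nodes=3):
        print(" ".join(term), score)
```

Because candidates are extracted per document and the merge only sums counts, the ranked list in this sketch does not depend on how many shards the corpus is split into. Whether that holds for a real termhood measure depends on the statistics it needs.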
Keywords/Search Tags:Term, Large