
Extracting Useful Semantic Information from Large Scale Corpora of Text

Posted on: 2013-01-31
Degree: Ph.D
Type: Dissertation
University: University of California, Irvine
Candidate: Mendoza, Ray Padilla, Jr.
Full Text: PDF
GTID: 1458390008466838
Subject: Language
Abstract/Summary:
Extracting and representing semantic information from large-scale corpora is at the crux of computer-assisted knowledge generation. Semantic information depends on collocation extraction methods, the mathematical models used to represent distributional information, and the weighting functions that transform the space. This dissertation provides a solution to the problem of extracting useful collocations, improves the standard vector space model, and posits non-frequency-based transformations of the vector space.

First, existing collocation extraction methods are based on either linear proximity or syntactic structure. Syntactic structure can be generated by parsers or provided by treebanks; however, using it is computationally expensive and does not scale well to large corpora. Two algorithms are proposed that approximate the collocations extracted by parser-based methods. They are computationally inexpensive, exclude semantically irrelevant collocations, produce collocations that are more statistically significant than those based on linear proximity, and yield tighter, better-separated clusters.

Second, the problem of embedding collocations in a useful mathematical model is commonly addressed with the Vector Space Model (Salton, 1975). However, that model implicitly assumes an orthonormal basis, which contradicts the reality that the words associated with the basis dimensions can themselves be related. A general solution is provided that partially relaxes the orthonormality assumption. The generalized vector space shows improved separation for known semantic categories.

Lastly, the weighting functions used on vector spaces are generally frequency based. Weighting is necessary because relationships between points in the vector space do not immediately reflect their distributional relatedness, as they should (Harris, 1954); the discrepancy is due to frequency effects in language use (Zipf, 1932).
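As a concrete baseline for the frequency-based weighting described above, positive pointwise mutual information (PPMI) is one standard transform of a word-by-context co-occurrence matrix. The sketch below is illustrative only: it is not the dissertation's weighting function, and the confidence-based weighting proposed here is a separate, complementary function.

```python
import numpy as np

def ppmi(C):
    """Positive pointwise mutual information weighting of a
    word-by-context co-occurrence count matrix C.

    A standard frequency-based transform of the vector space,
    shown as a baseline; not the dissertation's own method.
    """
    total = C.sum()
    row = C.sum(axis=1, keepdims=True)   # word marginal counts
    col = C.sum(axis=0, keepdims=True)   # context marginal counts
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((C * total) / (row * col))
    # Clip negative (and -inf) values to zero: "positive" PMI.
    return np.maximum(pmi, 0.0)

# Toy co-occurrence matrix: two words, two contexts.
C = np.array([[4.0, 0.0],
              [0.0, 4.0]])
W = ppmi(C)  # diagonal entries become log(2), off-diagonal 0
```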
The correlation between word frequency and the features relevant to specific semantic categories (Overschelde, 2004) is explored. This dissertation proposes a weighting function based on a quantitative measure of confidence that the asymptotic limit of the collected distribution has been reached. When combined with frequency-based weighting, it improves cluster separation for known semantic classes.

In summary, this work provides an efficient and accurate collocation extraction method, generalizes the vector space model, and offers alternative, non-frequency-based weighting functions for transforming the vector space.
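The relaxation of orthonormality summarized above can be sketched as a bilinear similarity u^T G v, where G is a Gram matrix encoding correlations between the basis words; when G is the identity, it reduces to the standard Vector Space Model dot product. The matrix values below are invented for illustration and are not taken from the dissertation.

```python
import numpy as np

def generalized_similarity(u, v, G):
    """Inner product u^T G v in a (possibly) non-orthonormal basis.

    G[i, j] encodes the relatedness of basis words i and j.
    With G = I this reduces to the ordinary dot product of the
    standard Vector Space Model.
    """
    return u @ G @ v

u = np.array([1.0, 0.0, 2.0])
v = np.array([0.0, 1.0, 1.0])

# Orthonormal basis: plain dot product.
sim_orth = generalized_similarity(u, v, np.eye(3))

# Correlated basis (illustrative values): dimensions 0 and 1
# are treated as related, so u and v become more similar.
G = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
sim_corr = generalized_similarity(u, v, G)
```

With an orthonormal basis the example vectors share only their last dimension; the off-diagonal entry of G lets their activity on related dimensions contribute as well, which is the intuition behind the improved category separation reported above.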
Keywords/Search Tags: Semantic information, Vector space, Corpora, Scale, Collocation extraction, Weighting functions, Useful