Research On Latent Semantic Analysis For Domain-specific Chinese Textual Information Processing

Posted on: 2011-12-08    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y H Cen    Full Text: PDF
GTID: 1488304802468934    Subject: Information Science
Abstract/Summary:
With the rapid development of computer and network technologies, domain-specific textual information based on natural language is growing explosively, and it is an important knowledge source for scientific research and for enterprise competitive intelligence. Chinese textual information has likewise increased greatly in recent years. Processing Chinese text, however, faces many difficulties. The first task is to automatically segment Chinese sentences into meaningful words and terms, during which ambiguity must be properly resolved. The next problem is how to correctly recognize and extract newly emerging named entities, concepts, and terms. Faced with these difficulties, traditional Chinese textual information processing lacks the capability to understand the semantics underlying how the text is organized.

On the other hand, most previous research on and applications of textual information processing take the document-term mapping as the basis for information organization modeling; for example, each document is transformed into a word vector. A strong assumption, namely word/term independence, underlies these models. Although this assumption makes the design of information organization, retrieval, and processing models much easier, strict independence among the words and terms of a language does not hold. What effect does this untenable assumption have on text information processing? This question has long been overlooked, and it is one this thesis attempts to answer.

In detail, the semantics of a document is built up from its words, and the words should conversely be understood in the context of the document; there is a dual probabilistic relationship between documents and terms. A document is a point in a space whose dimensions are words (and a word can likewise be considered a point in a space whose dimensions are documents). The distribution of a document is by no means random; it is governed by a certain semantic structure. This semantic structure underlies the text and latently shapes term occurrences and document composition. However, because of inconsistencies in word usage and the uncertainty of document topics, this structure may be buried in noise. Traditional text information processing bypasses it and therefore may misrepresent knowledge units such as documents and terms. A more reasonable mechanism is to consider the semantic relationships among the semantic units (terms or words) inside the text and, on that basis, to carry out semantic representation and processing of documents, concepts, authors, organizations, and so forth. These semantic relationships may be formal relations such as inclusion, membership, equivalence, synonymy, and antonymy; semantic features such as the properties, functions, axioms, and instances of concepts, from the perspective of ontology; or latent relationships that objectively exist but are hard to capture in a formal definition. Every kind of semantic relationship is worth discovering for intelligent domain-specific information processing. Furthermore, the high-dimensional sparse-matrix document representation of traditional text information processing is a heavy obstacle to efficient retrieval, clustering, classification, and similarity measurement over large-scale domain-specific text.
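To make the criticized representation concrete, the following minimal Python sketch (not from the thesis; the toy corpus and the scikit-learn calls are illustrative assumptions) builds exactly the kind of high-dimensional sparse document-term matrix described above, in which every term is treated as an independent dimension:

```python
# Minimal sketch of the bag-of-words representation discussed above:
# each document becomes a vector of term counts, and terms are
# (implausibly) treated as independent dimensions.
from sklearn.feature_extraction.text import CountVectorizer

docs = [  # toy English stand-ins; the thesis works on Chinese patent text
    "latent semantic analysis of patent documents",
    "semantic analysis reveals latent topic structure",
    "conditional random fields extract domain terms",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)           # sparse document-term counts
print(X.shape)                               # (3 documents, vocabulary size)
print(vectorizer.get_feature_names_out())    # the term "dimensions"
```

For a real domain corpus the second dimension runs to hundreds of thousands, which is the efficiency obstacle the thesis addresses.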
In response to the above challenges, this dissertation probes the following problems: (1) Chinese text segmentation and POS tagging. (2) Term recognition and extraction based on CRF (Conditional Random Fields). (3) Term domain-specificity measurement and term weighting schemes. (4) Latent Semantic Analysis (LSA) based on matrix decomposition: theories, algorithms, experiments, and comparisons of LSA based on SVD (Singular Value Decomposition), SDD (Semi-Discrete Decomposition), and NMF (Non-negative Matrix Factorization). An improved λ-SVD method is proposed and emphasized, in which a singular value rescaling vector λ is applied to the decomposition results and estimated by supervised machine learning. In addition, an NMF-based LSA with sparseness constraints, together with the corresponding update rules and algorithms, is proposed and examined. (5) LSA based on topic models: theories, model inference, and comparisons of PLSA, LDA (Latent Dirichlet Allocation), CTM (Correlated Topic Model), PAM (Pachinko Allocation Model), and hLDA (hierarchical LDA). A hierarchical topic model based on term-weighted Gibbs sampling is highlighted. (6) Applications of LSA in text information retrieval, clustering, term similarity measurement, and scientific topic discovery; with the help of social network analysis techniques, term correlations and topic correlations are also visualized.

The conclusions of this research are as follows.

(1) A Chinese domain-specific term boundary tagging and extraction method based mainly on CRF is proposed. It considers the grammatical features (e.g., POS), domain-specific language constructions, contextual features (mutual information, conditional probabilities), and internal statistical features (frequency, independent probability) of the sequential pieces of a potential term that rough Chinese segmentation has split incorrectly, and it performs well when applied to a relatively large collection of patent documents from different fields.

(2) Since the vocabulary of a domain-specific corpus may comprise several hundred thousand words, the computational cost of working in the full word dimensionality is enormous. It is therefore important to select the most meaningful terms to represent the documents, reducing the term dimensionality and increasing processing efficiency; such terms should be able to distinguish the semantic differences between documents. To this end, a term quadrant is put forward, composed of an inter-domain general terms area, an inner-domain general terms area, a subject-specific terms area, and a novel terms area, with four measuring indicators: domain relevancy, inter-domain consensus, inner-domain consensus, and novelty. By scoring each term on these four indicators, term selection is converted into a classification problem. This selection scheme proves highly applicable.

(3) Much work on text retrieval, clustering, and the like takes raw occurrence frequency directly as the weight of a term in a document; this can distort the nature of term occurrence and impair performance. This research instead measures term significance with a combined weighting scheme consisting of a local term weight (a boosted, normalized term frequency), a global term weight (evaluated by information gain ratio), a global document weight (also obtained by information gain ratio), and a normalization step, as sketched below. This approach markedly increases the accuracy of LSA and other text processing applications.
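The abstract does not give the exact formulas of this combined scheme, so the following Python sketch substitutes a classic log-entropy-style analogue: a dampened local term frequency, an entropy-based global term weight standing in for the information gain ratio, and cosine normalization. The function name and all formula details are assumptions for illustration only.

```python
import numpy as np

def combined_term_weights(tf, eps=1e-12):
    """Sketch of a combined local * global weighting with normalization.

    tf: (n_docs, n_terms) raw term-frequency matrix.
    A log-entropy stand-in for the thesis's scheme (boosted normalized TF
    * information-gain-ratio weights), whose formulas the abstract omits.
    """
    n_docs = tf.shape[0]
    local = np.log1p(tf)  # local weight: dampened term frequency
    # Global term weight: 1 + normalized entropy of each term's distribution
    # over documents (concentrated terms score near 1, uniform ones near 0).
    p = tf / (tf.sum(axis=0, keepdims=True) + eps)
    entropy = (p * np.log(p + eps)).sum(axis=0) / np.log(n_docs)
    w = local * (1.0 + entropy)
    # Normalization: unit-length document vectors.
    return w / (np.linalg.norm(w, axis=1, keepdims=True) + eps)
```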
(4) Building on previous studies, this thesis further shows that, as a method for discovering the hidden semantic structure of terms and documents by statistical learning, SVD-based LSA is in essence an encapsulation of higher-order term co-occurrences. This can be taken as a theoretical explanation of SVD-based LSA.

(5) Given the same k, the storage required by k-SVD may be much greater than that required by k-SDD, while the accuracy of k-SVD is much better. When a larger k is set for k-SDD, its accuracy becomes comparable to k-SVD, and k-SDD is then more efficient than k-SVD.

(6) An improved λ-SVD method is proposed and emphasized in this research, in which a singular value rescaling vector λ=(λ1,λ2,…,λk) is applied to the decomposition results. This rescaling vector λ can be estimated through supervised machine learning. When Vn×kΣk×k(λ) and Um×kΣk×k(1-λ) are taken, respectively, as the document semantic representation matrix and the term semantic representation matrix, the accuracy of subsequent clustering tasks increases markedly.

(7) The underpinnings of NMF-based LSA are analyzed from the perspective of noise and probabilistic models, from which more general update rules are deduced. In particular, an NMF-based LSA with sparseness constraints is highlighted, and the corresponding update rules and algorithms are proposed. Experiments show that in both accuracy and efficiency this sparse NMF-based LSA greatly exceeds SVD-based LSA.

(8) As for LSA based on topic models, this research gives an intensive examination of the basic ideas and model inference of PLSA, LDA (Latent Dirichlet Allocation), CTM (Correlated Topic Model), PAM (Pachinko Allocation Model), and hLDA (hierarchical LDA). As an early generative model, PLSA led LSA into the field of probability and statistics. However, it provides no generative account of the documents' topic mixing proportions, its number of parameters grows linearly with the size of the corpus, and it does not generalize to new documents. By placing Dirichlet priors on the documents' topic mixing proportions and the topics' distributions over terms, LDA gives a more expressive model of document generation; the conjugacy of the Dirichlet to the multinomial makes model inference much easier via variational algorithms or Gibbs sampling. The primary limitation of LDA is that it does not explicitly model correlations among the latent topics. CTM is an alternative model that not only represents topic correlations but also learns them: rather than drawing topic mixture proportions from a Dirichlet, it draws them from a logistic normal distribution whose parameters include a covariance matrix in which each entry specifies the correlation between a pair of topics. Yet despite its large parameter space and serious inference cost, CTM cannot discover hierarchical correlations among topics; this can be addressed with PAM or hLDA. Comparatively, PAM samples a topic path from the hierarchy for each word in a document, rather than drawing a single topic path for all words in a document as hLDA does; PAM can thus model a document that covers multiple super-topics, at a substantial cost in efficiency. By combining the GEM distribution and the nCRP (nested Chinese Restaurant Process), hLDA can be inferred with a Bayesian nonparametric approach, which gives hLDA additional flexibility.
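Before turning to the sampling strategy of conclusion (9), here is a concrete illustration of the SVD-based LSA of conclusions (4)-(6). The thesis learns the rescaling vector λ by supervised machine learning; this sketch simply takes λ as given, and reading Σ(λ) as diag(σ_i^λ_i) is an assumption of the sketch, since the abstract does not define the operator exactly.

```python
import numpy as np
from scipy.sparse.linalg import svds

def lambda_svd_lsa(X, k, lam):
    """Sketch of SVD-based LSA with a singular-value rescaling vector.

    X   : (m_terms, n_docs) weighted term-document matrix.
    k   : number of latent dimensions.
    lam : length-k rescaling vector (learned in the thesis; given here).
    """
    U, s, Vt = svds(X.astype(float), k=k)   # truncated SVD, X ~ U S V^T
    order = np.argsort(s)[::-1]             # svds returns ascending order
    U, s, Vt = U[:, order], s[order], Vt[order, :]
    doc_repr = Vt.T * (s ** lam)            # V_{n x k} Sigma(lam)
    term_repr = U * (s ** (1.0 - lam))      # U_{m x k} Sigma(1 - lam)
    return term_repr, doc_repr
```

With lam = np.ones(k) this reduces to the usual LSA document representation VΣ; intermediate values split the singular-value weight between the document and term sides before clustering.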
(9) Most topic models sample a topic for each word instance in a document (each occurrence of a word in a text counts as one instance). Thus general words (e.g., "the", "of", "for", "a") also influence the generation process; in PAM or hLDA these words especially tend to dominate the top-level topics, making those topics less meaningful. In this research, therefore, only the selected terms satisfying the domain-specificity criteria are sampled. Even for these selected terms, the number of term instances in a document is usually much higher than the number of distinct terms, so if the frequency of a term in a document is treated as a weight on its sampling, the number of sampling operations collapses heavily. Under the assumption that all instances of an individual meaningful term in a document are drawn from a single topic, this term-weighted sampling method is more efficient and more widely applicable; the assumption is well supported once general words are removed. Furthermore, since the raw frequency of a term in a document cannot strictly be considered its significance in that document, sampling becomes more effective when the combined term weighting scheme proposed above is used. Motivated by this, a hierarchical topic model combining characteristics of nonparametric hLDA and PAM, inferable by a term-weighted Gibbs sampling method, is proposed and emphasized in this research; the model draws a topic path from the hierarchy for each weighted term in a document according to the nCRP distribution (a sketch of term-weighted sampling, in a simplified setting, is given below).

(10) Application experiments of LSA in text information retrieval and clustering show that, by revealing the hidden semantic structure or topics, LSA based on matrix decomposition or topic models can improve processing performance compared with traditional methods.
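The thesis's term-weighted Gibbs sampler targets a hierarchical hLDA/PAM hybrid; reproducing that model is beyond an abstract, so the sketch below illustrates only the weighting idea, applied to flat LDA as a simplifying assumption: each distinct (document, term) pair gets a single topic assignment, and the counts are incremented by the term's weight rather than by one per instance.

```python
import numpy as np

def weighted_lda_gibbs(doc_terms, n_topics, n_words,
                       alpha=0.1, beta=0.01, iters=200, seed=0):
    """Term-weighted collapsed Gibbs sampling for flat LDA (illustration).

    doc_terms: one list per document of (word_id, weight) pairs, one pair
               per DISTINCT term -- all instances of a term in a document
               share one topic, per the assumption discussed above.
    """
    rng = np.random.default_rng(seed)
    n_docs = len(doc_terms)
    ndk = np.zeros((n_docs, n_topics))   # weighted document-topic counts
    nkw = np.zeros((n_topics, n_words))  # weighted topic-word counts
    nk = np.zeros(n_topics)              # weighted topic totals
    z = [rng.integers(n_topics, size=len(d)) for d in doc_terms]

    for d, doc in enumerate(doc_terms):  # initialize counts
        for i, (w, wt) in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += wt; nkw[k, w] += wt; nk[k] += wt

    for _ in range(iters):
        for d, doc in enumerate(doc_terms):
            for i, (w, wt) in enumerate(doc):
                k = z[d][i]              # remove the current assignment
                ndk[d, k] -= wt; nkw[k, w] -= wt; nk[k] -= wt
                # One draw per distinct term: the weight enters the counts,
                # not the number of sampling operations.
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + n_words * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += wt; nkw[k, w] += wt; nk[k] += wt
    return z, ndk, nkw
```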
Keywords/Search Tags: Latent Semantic Analysis, Domain-specific Chinese Textual Information Processing, Chinese Word Segmentation, Term Boundary Labeling & Recognition, Term Extraction, Conditional Random Fields, Domain-specific Term Selecting, Term Document Weighting