Research On Text Structural Information Extraction And Clustering Based On XML

Posted on: 2015-12-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Pu
Full Text: PDF
GTID: 1228330467485945
Subject: Computer application technology
Abstract/Summary:
Faced with a tremendous volume of semi-structured XML and unstructured free text, network information processing has become one of the major research hotspots for handling such data more efficiently, precisely, and uniformly. Compared with ordinary free text, XML documents carry structural and semantic information, which gives XML-based data integration and deep utilization a more precise description and more versatile expression; at the same time, however, traditional natural language processing (NLP) and data mining (DM) methods cannot be applied to them directly. This dissertation focuses on a series of scientific issues in XML information extraction and clustering, as follows:

(1) Statistical models for extracting hierarchical structural information from free text are provided. Many traditional information extraction methods ignore text semantics, and their labeling results usually have only one level and lack contextual expression. To address these problems, a general-purpose HMM-based model and a domain-specific CRF-based model for extracting hierarchical structural information are studied separately; both label free text using semantics and reproduce it in XML form. The two kinds of models are built separately and implement structure labeling by taking advantage of XML paths, making semantic expression more versatile.

(2) A structural similarity calculation using the element frequencies and positions of paths is proposed. Because most existing similarity calculation methods cannot be applied to XML documents directly owing to their structural character, a path matching method based on the longest common subsequence is proposed, which captures more context and improves the sensitivity of context identification. Together with frequency and position weights, it improves performance in calculating the structural similarity of XML documents.

(3) Feature dimension reduction and general similarity of XML based on tensor analysis are discussed.
Considering the correlation between XML’s structure and content, a tensor-based method for describing XML documents and an MMI method for XML dimension reduction are presented. Since structure and content are not independent of each other, a tensor-based algorithm that calculates general similarity from a non-linear angle is designed to show their relationship and their effects on the precision of the result.

(4) An XML clustering algorithm using the similarity calculated in the preceding points is provided, with the intention of designing a fast and simple clustering method that takes XML’s natural features into account. An effective neighbor center clustering method is proposed, with characteristics such as low sensitivity to initial cluster centers, the ability to find non-spherical clusters, and noise filtering.

The content of the dissertation can be adapted to many research areas, such as web mining, social network analysis, and Internet of Things mining, where data often consist of both some kind of structure and content information; hence it has good application prospects.
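
As a rough illustration of the idea in point (2), the sketch below compares two XML documents by their root-to-node tag paths, matches paths with the longest common subsequence, and weights each path by its frequency. It is not the dissertation's algorithm: the function names, the normalisation by the longer path, and the symmetric averaging are all assumptions made for this example, and the position weights mentioned in the abstract are left out because the abstract does not define them.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def root_paths(xml_text):
    """Count every root-to-node tag path in the document (path = tuple of tags)."""
    root = ET.fromstring(xml_text)
    paths = Counter()

    def walk(node, prefix):
        path = prefix + (node.tag,)
        paths[path] += 1
        for child in node:
            walk(child, path)

    walk(root, ())
    return paths


def lcs_length(a, b):
    """Length of the longest common subsequence of two tag sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def path_similarity(p, q):
    """LCS length normalised by the longer path (assumed normalisation)."""
    return lcs_length(p, q) / max(len(p), len(q))


def structural_similarity(xml_a, xml_b):
    """Frequency-weighted best-match path similarity, averaged over both directions."""
    pa, pb = root_paths(xml_a), root_paths(xml_b)

    def directed(src, dst):
        total = sum(src.values())
        score = sum(freq * max(path_similarity(p, q) for q in dst)
                    for p, freq in src.items())
        return score / total

    return (directed(pa, pb) + directed(pb, pa)) / 2


if __name__ == "__main__":
    doc_a = "<book><title/><author/></book>"
    doc_b = "<book><title/><year/></book>"
    print(structural_similarity(doc_a, doc_b))
```

Under these assumptions, two documents with identical path sets score 1.0 and documents sharing no tags score 0.0. A pairwise score of this kind could then be passed to a clustering step such as the one described in point (4).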
Keywords/Search Tags: Structure information extraction, XML structural similarity, XML general similarity, feature extraction, XML clustering