Research On Text Structural Information Extraction And Clustering Based On XML

Posted on:2015-12-07

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Y Pu

Full Text:PDF

GTID:1228330467485945

Subject:Computer application technology

Abstract/Summary:

Faced with enormous volumes of semi-structured XML and unstructured free text, network information processing has become a major research focus for handling such data more efficiently, precisely, and uniformly. Compared with ordinary free text, XML documents carry structural and semantic information, which gives XML-based data integration and deeper data utilization a more precise description and a more versatile means of expression. At the same time, traditional NLP and DM methods cannot be applied to XML directly. This dissertation addresses a series of research problems in XML information extraction and clustering, as follows:

(1) Statistical models for extracting hierarchical structural information from free text are presented. Many traditional information extraction methods ignore text semantics, usually produce labels at only a single level, and lack any expression of context. To address these problems, a general-purpose HMM-based model and a domain-specific CRF-based model for extracting hierarchical structure information are studied separately; each labels free text according to its semantics and reproduces it in XML form. Both models implement structure labeling by taking advantage of XML paths, which makes semantic expression more versatile.

(2) A structural similarity calculation that uses the element frequencies and positions of paths is proposed. Because the structural character of XML documents prevents most existing similarity calculation methods from being applied to them directly, a path matching method based on the longest common subsequence is proposed. It captures more context and is more sensitive in identifying context; combined with the frequency and position weights, it improves performance in calculating the structural similarity of XML documents.

(3) Feature dimension reduction and general similarity of XML based on tensor analysis are discussed. Considering the correlation between the structure and the content of XML, a tensor-based method for describing XML documents and an MMI method for XML dimension reduction are presented. Since structure and content are not independent of each other, a tensor-based algorithm that calculates general similarity from a non-linear perspective is designed to show their relationship and their effect on the precision of the results.

(4) An XML clustering algorithm that uses the similarity calculated in the preceding parts is provided, with the aim of designing a fast and simple clustering method that takes the natural features of XML into account. An effective neighbor center clustering method is proposed, characterized by low sensitivity to the initial cluster centers, the ability to find non-spherical clusters, and noise filtering.

The content of the dissertation can be adapted to many research areas, such as web mining, social network analysis, and Internet of Things mining, where data often combine some kind of structure with content information; it therefore has good application prospects.
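
The path matching idea in point (2) can be illustrated with a minimal sketch. It is not the dissertation's algorithm: the abstract gives no formulas, so the frequency and position weights are omitted, and the function names, the normalisation by the longer path, and the symmetric best-match averaging are all assumptions chosen for illustration. The sketch only shows how root-to-leaf tag paths can be compared with a longest common subsequence.

```python
import xml.etree.ElementTree as ET


def root_to_leaf_paths(xml_text):
    """Return every root-to-leaf tag path in the document as a tuple of tags."""
    root = ET.fromstring(xml_text)
    paths = []

    def walk(node, prefix):
        current = prefix + (node.tag,)
        children = list(node)
        if not children:
            paths.append(current)
        for child in children:
            walk(child, current)

    walk(root, ())
    return paths


def lcs_length(a, b):
    """Length of the longest common subsequence of two tag sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def path_similarity(p, q):
    """Assumed normalisation: LCS length divided by the longer path length."""
    return lcs_length(p, q) / max(len(p), len(q))


def structural_similarity(xml_a, xml_b):
    """Assumed aggregation: mean best-match path similarity, symmetrised."""
    paths_a = root_to_leaf_paths(xml_a)
    paths_b = root_to_leaf_paths(xml_b)

    def one_way(src, dst):
        return sum(max(path_similarity(p, q) for q in dst) for p in src) / len(src)

    return (one_way(paths_a, paths_b) + one_way(paths_b, paths_a)) / 2
```

Because a subsequence match tolerates inserted or missing intermediate elements, two paths such as `book/author/name` and `book/name` still score above zero, which is one plausible reading of the "context capturing" property the abstract mentions.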

Keywords/Search Tags:

Structure information extraction, XML structural similarity, XML general similarity, feature extraction, XML clustering

Related items

1	Objective Image/Video Quality Assessment Based On Structural Similarity
2	Research On Clustering Ensemble Method For Fusing Structural Information
3	The Algorithms Of Image Super Resolution Recovery Based On Structural Similarity
4	Research On Feature Selection Algorithms Based On Structure Information Of Samples And Features
5	The Study Of Feature Extraction And Clustering On Chinese Websites Product Reviews Based On The Improved Pruning Algorithm
6	Clustering system and clustering support vector machine for local protein structure prediction
7	Research On Clustering Algorithms For Large-Scale Social Networks Based On Structural Similarity
8	Research On Document Clustering Based On Semantic Similarity Of Hownet
9	Research On Feature Extraction Algorithm Of Text Classification
10	Research On Feature Extraction And Feature Selection Algorithms Based On Effective Distance