Research On Information Extraction Based On Prediction And Classification Model Of Imbalanced Data Sets

Posted on:2013-10-31

Degree:Master

Type:Thesis

Country:China

Candidate:L Yue

GTID:2248330395972418

Subject:Computer application technology

Abstract/Summary:

With progress in food supervision, an overwhelming number of textual complaints is accumulating on web pages, and clustering these documents effectively has prompted a vast amount of research. Because the complaint documents contain many domain-specific terms, they are difficult to manage manually. However, the development of Semantic Web technologies enables users to structure their domain knowledge into ontologies, so more and more user-specified ontologies have been constructed by domain experts to represent specific domain knowledge. Because a general ontology is a formal, explicit specification of a shared conceptualization for many domains of interest, recent work has applied it to various clustering methods to improve their performance. As a comprehensive semantic lexical ontology for the Chinese language, Hownet has been widely used for document clustering.

We propose a novel and effective classification model for imbalanced data sets (IDS) that requires neither modeling the routine data and imbalanced data nor any prior knowledge. The methodology consists of the following parts. First, we generalize some concepts with a controlled vocabulary in our food ontology, which is based on Hownet, to extract the document features; notably, the food ontology is extended during feature extraction. Then, we present a document clustering method and a series of algorithms for IDS induction and IDS reclassification based on latent semantic analysis (LSA) and term clustering techniques. The experimental results show the effectiveness of these IDS processing algorithms. Moreover, the methodology improves the overall classification performance.

Building on the previous step, we develop an effective methodology for clustering food complaint documents using fuzzy set theory and the guidance of an ontology. First, we extend the controlled vocabulary in a user-specified ontology, using Hownet as background knowledge, to explore better ways of representing documents semantically for clustering. Then, the similarity scores between every pair of documents in the k-dimensional semantic space are calculated to obtain the fuzzy compatibility relation, from which a fuzzy equivalence relation can be constructed. Finally, a cluster validation index is used as guidance to determine the best number of clusters and a suitable λ-cut value. With fuzzy analysis, the documents’ fuzzy assessments, with their various degrees of membership, can be taken into account in the aggregation process to obtain more convincing, non-overlapping clusters. Clustering an overwhelming number of multi-topic documents into the most suitable clusters is required to solve the problem of large-scale overlap among clusters.
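
The clustering steps described in the abstract (projection into a k-dimensional semantic space, a fuzzy compatibility relation, a fuzzy equivalence relation, and a λ-cut) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. It assumes cosine similarity as the similarity score, the standard max-min transitive closure as the way to build the equivalence relation, and a toy term-document matrix. The function names are hypothetical, and the ontology-based feature extraction and the cluster validation index are left out.

```python
import numpy as np

def lsa_embed(term_doc, k):
    """Project documents (columns of term_doc) into a k-dimensional LSA space."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep the k largest singular values; each row of the result is one document.
    return (np.diag(s[:k]) @ vt[:k]).T

def fuzzy_compatibility(doc_vecs):
    """Cosine similarity clipped to [0, 1] (assumed score): reflexive and symmetric."""
    norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    unit = doc_vecs / np.where(norms == 0, 1.0, norms)
    sim = np.clip(unit @ unit.T, 0.0, 1.0)
    np.fill_diagonal(sim, 1.0)
    return sim

def max_min_closure(r):
    """Max-min transitive closure: turns a compatibility relation into an equivalence relation."""
    while True:
        # composed[i, j] = max over m of min(r[i, m], r[m, j])
        composed = np.max(np.minimum(r[:, :, None], r[None, :, :]), axis=1)
        nxt = np.maximum(r, composed)
        if np.array_equal(nxt, r):
            return nxt
        r = nxt

def lambda_cut_clusters(equiv, lam):
    """Crisp partition: documents i and j share a cluster iff equiv[i, j] >= lam."""
    n = equiv.shape[0]
    labels = [-1] * n
    next_label = 0
    for i in range(n):
        if labels[i] == -1:
            for j in range(n):
                if equiv[i, j] >= lam:
                    labels[j] = next_label
            next_label += 1
    return labels

# Toy term-document matrix (rows: terms, columns: documents); values are made up.
term_doc = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

doc_vecs = lsa_embed(term_doc, k=2)
equiv = max_min_closure(fuzzy_compatibility(doc_vecs))
labels = lambda_cut_clusters(equiv, lam=0.5)
print(labels)  # [0, 0, 1, 1]
```

Because the λ-cut of a fuzzy equivalence relation is an ordinary equivalence relation, every document receives exactly one label, which is what makes the resulting clusters non-overlapping. Raising `lam` splits the data into more, tighter clusters; the abstract chooses the value with a cluster validation index, which this sketch does not model.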

Keywords/Search Tags:

ontology, singular value decomposition (SVD), latent semantic analysis (LSA), fuzzy compatibility relation, fuzzy equivalence relation, imbalanced data sets (IDS)

Related items

1	Development Of Practical Software For Bad Data Identification In Power System
2	New Approaches For Fuzzy Classification And Their Applications
3	Research On Deep Structure Learning Algorithms Based On Dynamic Fuzzy Relation
4	The Vague Sets Theory And Its Applications In Clustering Analysis
5	Traditional Chinese Medical Ontology Based Semantic Relation Discovering And Verification
6	Research On Formal Concept Analysis Method Based On Fuzzy Relation
7	The Research Of Fuzzy Ontology-based Grid Database Integration
8	Research And Application On Dynamic Fuzzy Dependent Relation Of Data
9	The Application And Research Of Latent Semantic Analysis In The Field Of Internet Data Mining
10	GrC-Based Research On Rapid Optimization Algorithms For Sequential Logic Circuits