
Research On Information Extraction Based On Prediction And Classification Model Of Imbalanced Data Sets

Posted on: 2013-10-31
Degree: Master
Type: Thesis
Country: China
Candidate: L Yue
Full Text: PDF
GTID: 2248330395972418
Subject: Computer application technology
Abstract/Summary:
Concurrent with progress in food supervision, an overwhelming number of textual complaints are accumulating on web pages. Clustering these documents effectively has led to a vast amount of research. Because these complaint documents contain many domain-specific terms, they are difficult to manage manually. However, the development of Semantic Web technologies enables users to structure their domain knowledge into ontologies, so more and more user-specified ontologies have been constructed by domain experts to represent specific domain knowledge. In recent work, since a general ontology is a formal, explicit specification of a shared conceptualization for many domains of interest, it has been applied to various clustering methods to improve their performance. As a comprehensive semantic lexical ontology for the Chinese language, Hownet has been widely used for document clustering.

We propose a novel and effective classification model for imbalanced data sets (IDS) that requires neither modeling of the routine data and imbalanced data nor any prior knowledge. The core idea of the methodology consists of the following parts. First, we generalize some concepts with a controlled vocabulary in our food ontology, based on Hownet, to extract the document features. Notably, the food ontology is extended during the feature extraction procedure. Then, we present a document clustering method and a series of algorithms for IDS induction and IDS reclassification based on latent semantic analysis (LSA) and term clustering techniques. The experimental results show the effectiveness of this series of IDS processing algorithms. Moreover, the methodology in this paper improves the overall classification performance.

Based on the last step, we develop an effective methodology to cluster food complaint documents utilizing fuzzy set theory and the guidance of ontology. First, we extend the controlled vocabulary in a user-specified ontology, using Hownet as background knowledge, to explore better ways of representing documents semantically for clustering. Then, the similarity scores between every two documents in the k-dimensional semantic space are calculated to obtain the fuzzy compatibility relation, from which a fuzzy equivalence relation can be constructed. Finally, a cluster validation index is used as guidance to determine the best number of clusters and a suitable λ-cut value. By utilizing fuzzy analysis, the documents' fuzzy assessments with various degrees of membership can be taken into account in the aggregation process to obtain more convincing and non-overlapping clusters. Clustering an overwhelming number of multi-topic documents into the most suitable clusters is required to solve the problem of large-scale overlap among clusters.
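
The fuzzy clustering steps described in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's actual algorithms, so this is only one common way to realize the idea under stated assumptions: documents are taken to be already projected into a k-dimensional semantic space (for example by LSA), cosine similarity is assumed as the similarity score, and the fuzzy equivalence relation is assumed to be obtained by max-min transitive closure. All function names and the sample vectors are hypothetical, and the cluster validation index used to select the λ-cut value is not shown.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two vectors (assumed similarity score)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def compatibility_relation(vectors):
    """Reflexive, symmetric fuzzy relation built from pairwise similarities.

    Negative similarities are clamped to 0 so every entry lies in [0, 1].
    """
    n = len(vectors)
    return [
        [1.0 if i == j else min(1.0, max(0.0, cosine(vectors[i], vectors[j])))
         for j in range(n)]
        for i in range(n)
    ]


def max_min_compose(r, s):
    """Max-min composition of two square fuzzy relations."""
    n = len(r)
    return [
        [max(min(r[i][k], s[k][j]) for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def transitive_closure(r):
    """Square the relation repeatedly until it stops changing.

    For a reflexive, symmetric relation the fixed point is a fuzzy
    equivalence relation (reflexive, symmetric, max-min transitive).
    """
    while True:
        nxt = max_min_compose(r, r)
        if nxt == r:
            return r
        r = nxt


def lambda_cut_clusters(equiv, lam):
    """Partition document indices: i and j share a cluster if equiv[i][j] >= lam."""
    clusters, assigned = [], set()
    for i in range(len(equiv)):
        if i in assigned:
            continue
        members = [j for j in range(len(equiv)) if equiv[i][j] >= lam]
        assigned.update(members)
        clusters.append(members)
    return clusters


# Hypothetical 2-dimensional document vectors, standing in for LSA output.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
equiv = transitive_closure(compatibility_relation(docs))
print(lambda_cut_clusters(equiv, 0.5))  # [[0, 1], [2, 3]]
```

Because the λ-cut of a fuzzy equivalence relation is an ordinary equivalence relation, the resulting clusters never overlap, and lowering λ merges clusters while raising it splits them. In this sketch a validation index would be evaluated over candidate λ values to choose among those partitions.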
Keywords/Search Tags: ontology, singular value decomposition (SVD), latent semantic analysis (LSA), fuzzy compatible relation, fuzzy equivalence relation, imbalanced data sets (IDS)