Research On Retrieval And Mining On Probabilistic Data And Hierarchical Text Classification

Posted on: 2014-03-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F Gao
Full Text: PDF
GTID: 1228330434973079
Subject: Computer application technology
Abstract/Summary:
Information retrieval and data mining form a comprehensive interdisciplinary field that deals with storage, indexing, searching, querying, and analysis. This dissertation studies mining and retrieval over probabilistic data as well as hierarchical text classification, and the results are as follows.

First, we study the problem of textual retrieval on probabilistic spatial data. We continuously track and analyze the geographical features of Twitter data and find that combined spatial and textual retrieval is in demand. By analyzing GPS-tagged Twitter data that we collected continuously from November 2011 to May 2012, we find that there is considerable uncertainty associated with geographical locations, which leads to very poor results for spatial and textual retrieval on such data. This study aims to obtain results that have high spatial confidence and strong textual similarity. To this end, we first define a top-(c,k) retrieval mode based on the possible world semantic model in order to unify two types of semantics: textual similarity and spatial confidence. Next, after analyzing existing indexes for spatial textual data, we index probabilistic spatial textual data using the IRTree index and present an incremental scoring algorithm (ISA) for calculating textual similarity and spatial confidence, which visits spatial textual objects in decreasing order of their textual similarity. We then design a parameterized probabilistic ranking algorithm (PRankc) that cooperates with ISA and can finish calculating the top-c confidence for all objects in linear time. In addition, we design an optimization strategy for PRankc that avoids visiting all objects, and a statistical model to estimate a reasonable value of the parameter c of PRankc.
Finally, we conduct experiments on real Twitter data. The results show that our top-(c,k) retrieval mode produces excellent results compared with other retrieval modes, and that PRankc with the optimization strategy can finish the retrieval task in a reasonable time.

Second, we study the problem of mining frequent itemsets from probabilistic data. By analyzing the semantics of frequent itemsets on probabilistic data, we identify the semantic corruption of the existing expectation-based frequent itemset definition and therefore define the probabilistic frequent itemset based on the possible world semantic model, a definition under which the Apriori property holds. In addition, we design a polynomial-time determination algorithm for candidate itemsets. Based on the classical Apriori algorithm, we develop the P-Apriori algorithm for mining probabilistic frequent itemsets, which incrementally reports probabilistic frequent itemsets in decreasing order of their confidence. We conduct extensive experiments to examine the sensitivity of P-Apriori to the distribution of the items' existence probabilities and to test its performance under different configurations of the mining parameters. The results show that P-Apriori obtains probabilistic frequent itemsets in reasonable time and space, and that its running time is linear in the size of the dataset.

Third, we study the problem of hierarchical text classification, focusing on two key problems: training data skewness and error propagation. We define a path-based semantic representation for classes in the hierarchy to capture their exact semantics. We then design a training sample enhancement strategy for classes with few training samples. In addition, we aim to reduce and correct classification errors using prior information in the training data. We conduct experiments on real Open Directory Project (ODP) data to validate our methods.
The results show that the Bayes classifier and SVM combined with the proposed strategies produce better classification quality under the Mi-F1 measure.
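The abstract does not spell out how an itemset is judged frequent under the possible world semantic model. As a rough illustration only, and not the dissertation's own determination algorithm, the sketch below uses a common dynamic-programming approach: assuming each transaction contains the itemset independently with a known probability, it computes the probability that the support reaches a threshold in O(n x min_sup) time. The function name `frequentness_probability` and its parameters `probs` and `min_sup` are hypothetical.

```python
from typing import Sequence


def frequentness_probability(probs: Sequence[float], min_sup: int) -> float:
    """Return P(support >= min_sup) over all possible worlds.

    Assumption (not from the source): transaction i contains the itemset
    independently with probability probs[i].
    """
    if min_sup <= 0:
        return 1.0
    if min_sup > len(probs):
        return 0.0
    # dist[j] for j < min_sup: probability that exactly j of the transactions
    # processed so far contain the itemset.
    # dist[min_sup]: pooled probability that at least min_sup of them do.
    dist = [0.0] * (min_sup + 1)
    dist[0] = 1.0
    for p in probs:
        # Worlds that cross the threshold stay above it (absorbing state).
        dist[min_sup] += dist[min_sup - 1] * p
        # Update in decreasing j so dist[j - 1] still holds the old value.
        for j in range(min_sup - 1, 0, -1):
            dist[j] = dist[j] * (1.0 - p) + dist[j - 1] * p
        dist[0] *= 1.0 - p
    return dist[min_sup]


if __name__ == "__main__":
    # Three uncertain transactions; how likely is a support of at least 2?
    print(frequentness_probability([0.9, 0.8, 0.5], 2))
```

Under this assumed model, an itemset would be reported as a probabilistic frequent itemset when the returned probability meets a user-chosen confidence threshold. An expectation-based definition would instead compare `sum(probs)` against `min_sup`, which discards the distribution of the support across possible worlds.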
Keywords/Search Tags: Probabilistic Data, Spatial Data, Text Retrieval, Frequent Itemset, Text Classification