Research On Retrieval And Mining On Probabilistic Data And Hierarchical Text Classification

Posted on:2014-03-01

Degree:Doctor

Type:Dissertation

Country:China

Candidate:F Gao

GTID:1228330434973079

Subject:Computer application technology

Abstract/Summary:

Information retrieval and data mining form a comprehensive interdisciplinary field that deals with the storage, indexing, searching, querying, and analysis of data. This dissertation studies mining and retrieval over probabilistic data as well as hierarchical text classification. The results are as follows.

Firstly, we study the problem of textual retrieval on probabilistic spatial data. We continuously track and analyze the geographical features of Twitter data and find that spatial and textual retrieval is in demand. By analyzing GPS-tagged Twitter data that we collected continuously from November 2011 to May 2012, we find that there is considerable uncertainty associated with geographical locations, which leads to very poor results for spatial and textual retrieval on such data. This study aims to obtain results that have both high spatial confidence and strong textual similarity. To this end, we define the top-(c,k) retrieval mode, based on the possible-world semantic model, to unify two types of semantics: textual similarity and spatial confidence. Next, after analyzing existing indexes for spatial textual data, we index probabilistic spatial textual data with an IRTree index and present an incremental scoring algorithm (ISA) for calculating textual similarity and spatial confidence, which visits spatial textual objects in decreasing order of their textual similarity. We then design a parameterized probabilistic ranking algorithm (PRankc) that cooperates with ISA and calculates the top-c confidence of all objects in linear time. In addition, we design an optimization strategy for PRankc that avoids visiting all objects, and a statistical model that estimates a reasonable value for the parameter c of PRankc.
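The abstract does not spell out how top-c confidence is computed, so the following is only a minimal sketch of one standard possible-world formulation, not the dissertation's PRankc algorithm. It assumes that objects exist independently, that each object's spatial confidence can be treated as an existence probability, and that objects arrive already sorted by decreasing textual similarity (as ISA is said to deliver them). The function name `top_c_confidence` and its parameters are hypothetical.

```python
from typing import List


def top_c_confidence(probs: List[float], c: int) -> List[float]:
    """Illustrative sketch (assumed formulation, not the thesis's PRankc).

    probs[i] is the assumed existence probability of the i-th object, with
    objects sorted by decreasing textual similarity. For each object, returns
    the probability that it exists and that fewer than c higher-ranked objects
    exist, i.e. that it appears in the top c of a random possible world.
    """
    if c < 1:
        raise ValueError("c must be at least 1")
    # dist[j] = probability that exactly j of the objects seen so far exist,
    # tracked only for j < c; mass at counts >= c is dropped because it can
    # never contribute to a top-c position again.
    dist = [1.0] + [0.0] * (c - 1)
    confidences = []
    for p in probs:
        # The object is in the top c if it exists and fewer than c
        # higher-ranked objects exist.
        confidences.append(p * sum(dist))
        # Fold this object into the count distribution.
        updated = [0.0] * c
        for j in range(c):
            updated[j] = dist[j] * (1.0 - p)
            if j > 0:
                updated[j] += dist[j - 1] * p
        dist = updated
    return confidences


if __name__ == "__main__":
    # Three objects in decreasing order of textual similarity.
    print(top_c_confidence([0.9, 0.5, 0.8], c=2))
```

Each object costs O(c) work, so for a fixed c the total time grows linearly with the number of objects, which is consistent with the linear-time claim above under these assumptions.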
Finally, we conduct experiments on real Twitter data. The results show that the top-(c,k) retrieval mode produces excellent results compared with other retrieval modes, and that PRankc with the optimization strategy completes the retrieval task in a reasonable time.

Secondly, we study the problem of mining frequent itemsets from probabilistic data. By analyzing the semantics of frequent itemsets on probabilistic data, we identify a semantic flaw in the existing expectation-based definition of frequent itemsets. We therefore define probabilistic frequent itemsets under the possible-world semantic model; this definition preserves the Apriori property. In addition, we design a polynomial-time algorithm that determines whether a candidate itemset is frequent. Building on the classical Apriori algorithm, we develop the P-Apriori algorithm for mining probabilistic frequent itemsets, which incrementally reports probabilistic frequent itemsets in decreasing order of their confidence. We conduct extensive experiments to examine the sensitivity of P-Apriori to the distribution of the items' existence probabilities and to test its performance under different configurations of the mining parameters. The results show that P-Apriori obtains probabilistic frequent itemsets in reasonable time and space, and that its running time is linear in the size of the dataset.

Thirdly, we study the problem of hierarchical text classification and focus on two key problems: training data skewness and error propagation. We define a path-based semantic representation for the classes in the hierarchy to capture their exact semantics. We then design a training sample enhancement strategy for classes with few training samples. In addition, we aim to reduce and correct classification errors by using prior information in the training data. We conduct experiments on real Open Directory Project (ODP) data to validate our methods.
The results show that the Bayes classifier and SVM achieve better classification quality with the proposed strategies, as measured by Mi-F1 (micro-averaged F1).
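For the frequent itemset part of the abstract, the difference between an expectation-based and a possible-world definition can be made concrete with a small sketch. It is not the dissertation's determination algorithm or P-Apriori. It assumes that each transaction contains the itemset independently with a known probability, and that an itemset counts as probabilistically frequent when the probability of its support reaching `minsup` is at least a confidence threshold. The names `frequentness_probability`, `is_probabilistic_frequent`, `minsup`, and `tau` are hypothetical.

```python
from typing import List


def frequentness_probability(probs: List[float], minsup: int) -> float:
    """Illustrative sketch (assumed formulation, not the thesis's algorithm).

    probs[t] is the assumed probability that transaction t contains the
    itemset, with transactions independent. Returns P(support >= minsup)
    over all possible worlds.
    """
    if minsup <= 0:
        return 1.0
    # dist[j] = probability that exactly j processed transactions contain
    # the itemset, tracked only for j < minsup.
    dist = [1.0] + [0.0] * (minsup - 1)
    for p in probs:
        updated = [0.0] * minsup
        for j in range(minsup):
            updated[j] = dist[j] * (1.0 - p)
            if j > 0:
                updated[j] += dist[j - 1] * p
        dist = updated
    # Whatever mass is left below minsup is the probability of being infrequent.
    return 1.0 - sum(dist)


def is_probabilistic_frequent(probs: List[float], minsup: int, tau: float) -> bool:
    """True if the itemset reaches minsup with probability at least tau."""
    return frequentness_probability(probs, minsup) >= tau


if __name__ == "__main__":
    # Itemset that appears in each of three transactions with probability 0.9.
    print(frequentness_probability([0.9, 0.9, 0.9], minsup=2))
    print(is_probabilistic_frequent([0.9, 0.9, 0.9], minsup=2, tau=0.9))
```

The dynamic program runs in O(n * minsup) time for n transactions, so the check is polynomial. Under this assumed definition, adding items to an itemset can only lower each transaction's containment probability, so the frequentness probability cannot increase, which is what allows Apriori-style pruning.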

Keywords/Search Tags:

Probabilistic Data, Spatial Data, Text Retrieval, Frequent Itemset, Text Classification

Related items

1	Text Classification Based On The Extending Of Core Words
2	Text Classification Method Based On The Longest Closed Frequent Sequential Patterns
3	Text Classification Using Sentential Frequent Itemsets
4	Research On Frequent Itemsets Mining Algorithm In Data Stream
5	The Research And Application Of Association Rules Mining Algorithms Based On Directed Itemset Graph
6	Study On The Key Methods Over Uncertain Database
7	The Research And Application Of Unstructured Data Processing Technology
8	An Application Research Of Probabilistic Topic Model On Text Classification
9	Research And Application Of PVI Algorithm On Spatial Data Mining
10	Text Emotional Classification Based On Text Mining