Research On Information Retrieval Of Heterogeneous Information Networks

Posted on:2015-01-15

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Y F Liu

Full Text:PDF

GTID:1268330425986898

Subject:Computer Science and Technology

Abstract/Summary:

Heterogeneous information networks, composed of multiple types of objects and links, are ubiquitous in real life. It turns out that this level of abstraction has great power not only in representing and storing the essential information about the real world, but also in providing a useful tool for mining knowledge from it by exploiting the power of links. Effective analysis of large-scale heterogeneous information networks has therefore recently attracted substantial interest. Following a discussion of the development history and research of heterogeneous information networks, this dissertation focuses on some key topics in information retrieval that are addressed by constructing heterogeneous information networks, namely semi-supervised learning, document clustering, cluster description and query suggestion. The main results and contributions of this dissertation are as follows.

(1) We consider the semi-supervised classification problem on a query-document heterogeneous information network, which incorporates the bipartite graph with the content information from both sides. To strengthen the network structure, we introduce the class information of sample nodes. We investigate semi-supervised learning algorithms based on two frameworks: the graph-based regularization framework and the iterative framework. In the regularization framework, we develop a cost function that considers both the direct relationship between the two entity sets and the content information from both sides, which leads to a significant improvement over the baseline methods.

(2) The semi-supervised classification problem is also considered on heterogeneous information networks with an arbitrary schema consisting of a number of object and link types. By applying graph regularization to preserve consistency over each relation graph, one for each type of link, separately, we develop a classifying function that is sufficiently smooth with respect to the intrinsic structure collectively revealed by the known labeled and unlabeled points. An iterative framework on heterogeneous information networks is also proposed, in which the information of the labeled data is spread to adjacent nodes iteratively until a steady state is reached. The class memberships of unlabeled data can then be inferred from those of labeled data according to their proximities in the network. Some classic semi-supervised learning algorithms can be regarded as special cases of this algorithm.

(3) Two different topic propagation models, TP-TS and TP-Unify, are proposed for rich-text query-document heterogeneous information networks. TP-TS combines topic modeling and a random walk process as two independent stages: PLSA provides a simplified solution for modeling the topics of documents and queries, and the topic information is then propagated over the query-document bipartite graph. TP-Unify uses a joint regularization framework that directly incorporates the heterogeneous information network into topic modeling by regularizing a statistical topic model. Its improvement over TP-TS is due to the direct optimization of heterogeneous information analysis and topic modeling in a unified regularization framework.

(4) A new method for extracting category labels is proposed. The basic idea is to convert cluster description into the ranking of queries within a cluster, which avoids extracting keywords from web documents. We present a ranking algorithm that combines the query-document click graph, the document affinity graph and the web link graph, and can thereby effectively integrate the evaluations of users, web page creators and web page writers.

(5) A Term-Query bipartite graph is built by extracting semantic relationships from the snippets clicked for each query. By combining it with the Query-URL graph and the Query-Flow graph, a heterogeneous Term-Query-URL information network is constructed. Random walk with restart (RWR) is performed on this information network for query suggestion. The relevance of suggestions for long-tail queries can be greatly improved by taking both semantic information and log information into account. For new queries, a term vector of the query is constructed from a probabilistic language model and used for query suggestion. The experimental results clearly show that our approach outperforms three baseline methods.
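The random walk with restart mentioned in contribution (5) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the toy Term-Query-URL edges, the `t:`/`q:`/`u:` node prefixes, the unit edge weights and the restart probability of 0.15 are all assumptions chosen for illustration, and the Query-Flow edges are omitted for brevity.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected weighted adjacency map from (node, node, weight) triples."""
    adjacency = defaultdict(dict)
    for u, v, w in edges:
        adjacency[u][v] = adjacency[u].get(v, 0.0) + w
        adjacency[v][u] = adjacency[v].get(u, 0.0) + w
    return dict(adjacency)


def random_walk_with_restart(adjacency, seed, restart=0.15, tol=1e-12, max_iter=1000):
    """Iterate the RWR update until the scores stop changing (or max_iter is hit).

    Each step moves a (1 - restart) share of every node's score to its
    neighbors in proportion to edge weight, and sends the restart share
    back to the seed node.
    """
    scores = {node: 0.0 for node in adjacency}
    scores[seed] = 1.0
    for _ in range(max_iter):
        updated = {node: 0.0 for node in adjacency}
        updated[seed] = restart  # restart mass always returns to the seed
        for node, neighbors in adjacency.items():
            total = sum(neighbors.values())
            if total <= 0:
                # No usable outgoing weight: send the walking mass back to the seed.
                updated[seed] += (1.0 - restart) * scores[node]
                continue
            for neighbor, weight in neighbors.items():
                updated[neighbor] += (1.0 - restart) * scores[node] * weight / total
        change = sum(abs(updated[n] - scores[n]) for n in adjacency)
        scores = updated
        if change < tol:
            break
    return scores


def suggest_queries(scores, query, k=5):
    """Rank the other query nodes that the walk reached, highest score first."""
    candidates = [
        (node, score)
        for node, score in scores.items()
        if node.startswith("q:") and node != query and score > 0.0
    ]
    return sorted(candidates, key=lambda item: -item[1])[:k]


# Hypothetical toy network: queries share a term and a clicked URL.
edges = [
    ("q:cheap flights", "t:flights", 1.0),
    ("q:flight deals", "t:flights", 1.0),
    ("q:cheap flights", "u:example.com/fares", 1.0),
    ("q:flight deals", "u:example.com/fares", 1.0),
    ("q:python tutorial", "t:python", 1.0),
    ("q:python tutorial", "u:example.org/python", 1.0),
]
graph = build_graph(edges)
scores = random_walk_with_restart(graph, "q:cheap flights")
suggestions = suggest_queries(scores, "q:cheap flights")
```

In this toy network the walk started at `q:cheap flights` can reach `q:flight deals` through either the shared term or the shared URL, while `q:python tutorial` sits in a disconnected component and receives no score. This mirrors, in a very reduced form, why adding term edges may help long-tail queries that have few clicks of their own.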

Keywords/Search Tags:

heterogeneous information networks, information retrieval, semi-supervised learning, text clustering, cluster description, query suggestion

Related items

1	Research On Semi-supervised Learning Methods On Heterogeneous Information Networks
2	Semi-Supervised Clustering Based On Attributed Heterogeneous Information Networks
3	Research Of Machine Learning Models And Algorithms For Information Filtering And Information Retrieval
4	Research On The Methods Of Web Text Mining For Information Retrieval
5	Semi-supervised Learning On Text Data
6	Web Information Retrieval Based On Semi-supervised Manifold Learning
7	The Research Of Collaborative Recommendation Algorithm Based On Semi-Supervised Learning
8	Research Of Image Recognition Techniques Based On The Semi-supervised Clustering And Generalized Distance Function Learning
9	Research On Semi-supervised Clustering And Classification Algorithm
10	Research On Cluster Ensemble Approaches With Semi-supervised Information And Large Scale Dataset