
Research On Information Retrieval Of Heterogeneous Information Networks

Posted on: 2015-01-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y F Liu
Full Text: PDF
GTID: 1268330425986898
Subject: Computer Science and Technology
Abstract/Summary:
Heterogeneous information networks, composed of multiple types of objects and links, are ubiquitous in real life. This level of abstraction is powerful not only for representing and storing the essential information about the real world, but also as a tool for mining knowledge from it by exploiting the power of links. Effective analysis of large-scale heterogeneous information networks has therefore recently attracted substantial interest. Following a discussion of the development history and research of heterogeneous information networks, this dissertation focuses on several key topics in information retrieval, which it addresses by constructing heterogeneous information networks: semi-supervised learning, document clustering, cluster description, and query suggestion. The main results and contributions of this dissertation are as follows.

(1) We consider the semi-supervised classification problem on a query-document heterogeneous information network, which incorporates the bipartite graph together with the content information from both sides. To strengthen the network structure, we introduce the class information of sample nodes. We investigate semi-supervised learning algorithms based on two frameworks: the graph-based regularization framework and the iterative framework. In the regularization framework, we develop a cost function that accounts for both the direct relationship between the two entity sets and the content information from both sides, which leads to a significant improvement over the baseline methods.

(2) We consider the semi-supervised classification problem on heterogeneous information networks with an arbitrary schema consisting of a number of object and link types. By applying graph regularization to preserve consistency over each relation graph separately, with one relation graph per link type, we develop a classifying function that is sufficiently smooth with respect to the intrinsic structure collectively revealed by the labeled and unlabeled points.
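To make the idea of regularizing over each relation graph concrete, the following is a minimal sketch and not the dissertation's actual algorithm. It assumes, for simplicity, that all nodes share one index space and that each link type contributes one symmetric affinity matrix. It minimizes the sum over relation graphs of tr(F^T (I - S_k) F) plus mu * ||F - Y||^2, whose closed-form minimizer is F = mu * (L + mu I)^{-1} Y with L the sum of the (I - S_k). The function names, the weight mu, and the toy graphs are all hypothetical.

```python
import numpy as np

def normalize_affinity(W):
    """Symmetric normalization S = D^{-1/2} W D^{-1/2}; isolated nodes keep zero rows."""
    degree = W.sum(axis=1)
    inv_sqrt = np.zeros_like(degree, dtype=float)
    connected = degree > 0
    inv_sqrt[connected] = 1.0 / np.sqrt(degree[connected])
    return W * inv_sqrt[:, None] * inv_sqrt[None, :]

def graph_regularized_scores(relation_graphs, Y, mu=0.1):
    """Closed-form minimizer of sum_k tr(F^T (I - S_k) F) + mu * ||F - Y||^2."""
    n = Y.shape[0]
    L = np.zeros((n, n))
    for W in relation_graphs:
        # One smoothness term per relation graph (hypothetical equal weighting).
        L += np.eye(n) - normalize_affinity(W)
    return np.linalg.solve(L + mu * np.eye(n), mu * Y)

def symmetric_graph(n, edges):
    """Build a symmetric 0/1 affinity matrix from an undirected edge list."""
    W = np.zeros((n, n))
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0
    return W

# Toy data: two link types over six nodes. Neither graph alone connects
# node 2 to the labeled node 0, but the two graphs together do.
W_a = symmetric_graph(6, [(0, 1), (3, 4)])
W_b = symmetric_graph(6, [(1, 2), (4, 5)])

Y = np.zeros((6, 2))
Y[0, 0] = 1.0  # node 0 is labeled with class 0
Y[3, 1] = 1.0  # node 3 is labeled with class 1

F = graph_regularized_scores([W_a, W_b], Y)
predicted = F.argmax(axis=1)
print(predicted)
```

In this toy setup the label of node 0 reaches node 2 only because the smoothness terms of both relation graphs are combined in one objective, which is the intuition behind regularizing each link type separately yet solving for a single classifying function.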
An iterative framework on heterogeneous information networks is also proposed, in which the information of labeled data is spread to adjacent nodes iteratively until a steady state is reached. The class memberships of unlabeled data can then be inferred from those of labeled data according to their proximities in the network. Some classic semi-supervised learning algorithms can be recovered as special cases of this algorithm.

(3) Two topic propagation models, TP-TS and TP-Unify, are proposed for rich-text query-document heterogeneous information networks. TP-TS treats topic modeling and the random walk process as two independent stages: PLSA provides a simplified solution for modeling the topics of documents and queries, and the topic information is then propagated on the query-document bipartite graph. TP-Unify uses a joint regularization framework that incorporates the heterogeneous information network directly into topic modeling by regularizing a statistical topic model. Its improvement over TP-TS is owed to the direct optimization of heterogeneous information analysis and topic modeling in a unified regularization framework.

(4) A new method for extracting category labels was proposed. The basic idea is to convert cluster description into the ranking of queries within a cluster, thus avoiding keyword extraction from web documents. We presented a ranking algorithm that combines the query-document click graph, the document affinity graph, and the web link graph, and can thereby effectively integrate the evaluations of users, web page creators, and web page writers.

(5) A Term-Query bipartite graph was trained by extracting semantic relationships from the snippets clicked for each query. By combining it with the Query-URL graph and the Query-Flow graph, a heterogeneous Term-Query-URL information network was constructed. Random walk with restart (RWR) was performed on this information network for query suggestion. The relevance of suggestions for long-tail queries can be greatly improved by taking both semantic information and log information into account.
For suggestions on new queries, a term vector of the query was constructed based on a probabilistic language model. The experimental results clearly show that our approach outperforms three baseline methods.
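The random walk with restart mentioned in contribution (5) can be sketched as follows. This is an illustration under stated assumptions rather than the dissertation's implementation: the toy Term-Query-URL network, the node names (including the placeholder URLs), the unweighted edges, and the restart probability of 0.15 are all hypothetical. The walk follows a random edge at each step and, with probability `restart`, jumps back to the seed query; the steady-state scores rank the other queries as suggestions.

```python
import numpy as np

def random_walk_with_restart(A, seed, restart=0.15, tol=1e-10, max_iter=1000):
    """Iterate r <- (1 - c) * P r + c * e until the L1 change falls below tol."""
    col_sums = A.sum(axis=0)
    # Column-normalize so each column is a probability distribution over neighbors.
    P = A / np.where(col_sums > 0, col_sums, 1.0)
    e = np.zeros(A.shape[0])
    e[seed] = 1.0
    r = e.copy()
    for _ in range(max_iter):
        r_next = (1.0 - restart) * (P @ r) + restart * e
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

# Hypothetical toy network: "t:" terms, "q:" queries, "u:" URLs.
nodes = [
    "t:jaguar", "t:price",
    "q:jaguar price", "q:jaguar dealer", "q:weather today",
    "u:cars.example", "u:weather.example",
]
edges = [
    ("t:jaguar", "q:jaguar price"),
    ("t:price", "q:jaguar price"),
    ("t:jaguar", "q:jaguar dealer"),
    ("q:jaguar price", "u:cars.example"),
    ("q:jaguar dealer", "u:cars.example"),
    ("q:weather today", "u:weather.example"),
]

index = {name: i for i, name in enumerate(nodes)}
A = np.zeros((len(nodes), len(nodes)))
for a, b in edges:
    A[index[a], index[b]] = A[index[b], index[a]] = 1.0

seed_query = "q:jaguar price"
scores = random_walk_with_restart(A, index[seed_query])

# Rank the other query nodes by their steady-state score.
suggestions = sorted(
    ((name, scores[index[name]]) for name in nodes
     if name.startswith("q:") and name != seed_query),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in suggestions:
    print(f"{name}: {score:.4f}")
```

Here the query that shares a term and a clicked URL with the seed ranks first, while the query in a disconnected part of the network receives no score. This illustrates why adding term nodes can help long-tail queries that have few clicks of their own.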
Keywords/Search Tags: heterogeneous information networks, information retrieval, semi-supervised learning, text clustering, cluster description, query suggestion