Information Extraction And Visualization Analysis For Personal Documents

Posted on:2018-02-25

Degree:Master

Type:Thesis

Country:China

Candidate:C Y Lv

Full Text:PDF

GTID:2348330512490262

Subject:Computer Science and Technology

Abstract/Summary:

With the spread of network access, the amount of information on the Internet is exploding. Beyond the growth in scale, the types of information are becoming increasingly diverse. Among these data types, one contains public personal information, e.g. online resumes and personal homepages. Here, we call them "personal documents". Such data makes it possible to infer social relationships between two people. For example, if two people studied at the same university over overlapping periods of time, they are likely to be classmates. The social network obtained through this analysis has great value and can be applied to multiple research problems, such as influence analysis and community discovery in social network analysis.

This paper introduces a system that performs information extraction and visualization analysis on personal documents, together with its main algorithms. The system extracts the necessary information from personal documents and builds an entity-linked network, then calculates PageRank to analyze the importance of each person node. Establishing this network takes two steps. First, we create a relational network composed of multiple types of entities, which can be regarded as a heterogeneous information network for a specific domain. This step involves entity recognition, event extraction and other related tasks. Based on the characteristics of the data, we cluster the parse trees and combine them with generated rules to extract events. The second step builds the relationships between person nodes by analyzing meta-paths. Before predicting the relationships between person nodes, we need to supply the links between entity nodes of other types so as to obtain complete meta-path information. Because the network is heterogeneous, we use different methods to determine the links between different types of nodes.

The visualization analysis involves ranking the importance or influence of people by PageRank. In the context of visualization, we hold that the relative ordering of nodes matters more than exact PageRank values, so the calculation should stop as soon as the result meets the needs of visualization. Previous research on improving PageRank falls into two categories. Some work tries to speed up the convergence of the traditional power method from the perspectives of mathematics and graph theory. The objective of these studies is to compute values as accurately as possible, which is contrary to the early-stop idea. Other work focuses on the Monte Carlo method and tries to approximate the ranking result of PageRank. This method gives a fast estimate but does not guarantee accuracy, especially for the relative ranking order of the top nodes. Therefore, the second part of this paper proposes the Early-stop algorithm, which can be regarded as two steps: grouping and parallel updating. Grouping determines the ranking ranges of nodes, while parallel updating adjusts the ranks of neighboring nodes. Experimental results show that it dramatically improves the accuracy of the relative ordering of top nodes.

The main contributions of this paper can be summarized as follows. We introduce a system that extracts information from personal documents and performs data analysis on them. We note that visualized results lower the precision requirement and so provide an opportunity to improve the efficiency of PageRank computation; based on this observation, we propose the Early-stop algorithm to quickly estimate the relative ordering of nodes. The proposed algorithm outperforms a state-of-the-art Monte Carlo method in terms of accuracy.
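The classmate example and the PageRank ranking described in the abstract can be illustrated with a minimal Python sketch. It is not the system or the Early-stop algorithm from the thesis: it assumes education records have already been extracted as (person, school, start year, end year) tuples, links two people only by the overlapping-study rule, and ranks them with the standard power method. The function names, the record layout and the sample data are hypothetical.

```python
from itertools import combinations


def overlaps(a, b):
    """Return True if two (start_year, end_year) periods share at least one year."""
    return a[0] <= b[1] and b[0] <= a[1]


def classmate_edges(records):
    """Infer undirected 'likely classmates' edges.

    records: iterable of (person, school, start_year, end_year) tuples
    (an assumed layout, not the thesis's data model).
    """
    edges = set()
    for (p1, s1, a1, b1), (p2, s2, a2, b2) in combinations(records, 2):
        if p1 != p2 and s1 == s2 and overlaps((a1, b1), (a2, b2)):
            edges.add(frozenset((p1, p2)))
    return edges


def pagerank(edges, damping=0.85, tol=1e-10, max_iter=200):
    """Standard power-method PageRank on an undirected graph.

    Nodes are taken from the edges, so every node has at least one neighbor.
    """
    nodes = sorted({n for e in edges for n in e})
    neighbors = {n: set() for n in nodes}
    for e in edges:
        u, v = tuple(e)
        neighbors[u].add(v)
        neighbors[v].add(u)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iter):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for u in nodes:
            share = damping * rank[u] / len(neighbors[u])
            for v in neighbors[u]:
                new[v] += share
        delta = sum(abs(new[n] - rank[n]) for n in nodes)
        rank = new
        if delta < tol:  # stop once the values have stopped changing
            break
    return rank


# Hypothetical sample records for illustration only.
records = [
    ("Ann", "U1", 2000, 2004),
    ("Bob", "U1", 2003, 2007),
    ("Cai", "U1", 2006, 2010),
    ("Dee", "U2", 2003, 2007),
]
edges = classmate_edges(records)
ranks = pagerank(edges)
for person in sorted(ranks, key=ranks.get, reverse=True):
    print(person, round(ranks[person], 4))
```

In this toy data, Bob overlaps with both Ann and Cai at the same school, so he ends up as the most central person, while Dee attended a different school and receives no edge. The stopping test here is on the change in PageRank values; the abstract's point is that a visualization only needs the relative ordering, which may settle earlier than the values do.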

Keywords/Search Tags:

information network, information extraction, PageRank

Related items

1	Research And Application Of PageRank Algorithm To Community Detection
2	Knowledge Acquisition From Text
3	Study On Web Information Credibility Evaluation Method Based On Improved PageRank
4	Study Of Information Propagation Model For Large-Scale Social Networks And Its Applications
5	Design And Implementation Of Web Information Extraction Rules
6	Research On GMRES-type Algorithms For PageRank Problem
7	Research On Online Social Network Information Propagation Model Based On PageRank
8	The Design And Implementation Of Web Information Extraction System
9	Neural Network-based Open Information Extraction And Its Application
10	The Information Leakage Detection Based On Text Information Extraction