Information Extraction And Visualization Analysis For Personal Documents

Posted on: 2018-02-25
Degree: Master
Type: Thesis
Country: China
Candidate: C Y Lv
GTID: 2348330512490262
Subject: Computer Science and Technology
Abstract/Summary:
With the spread of the Internet, the amount of information online is exploding. Beyond the growth in scale, the types of information are becoming increasingly diverse. Among this wide variety of data types is one that contains public personal information, e.g. online resumes and personal homepages. Here, we call them "personal documents". Such data makes it possible to infer social relationships between two people. For example, if two people studied at the same university over an overlapping period of time, they are likely to be classmates. The social network obtained through this analysis has great value and can be applied to multiple research problems in social network analysis, such as influence analysis and community discovery.

This paper introduces a system that performs information extraction and visualization analysis on personal documents, together with its main algorithms. The system extracts the necessary information from personal documents and builds an entity-linked network, then calculates PageRank to analyze the importance of each person node. Establishing this network takes two steps. First, we create a relational network composed of multiple types of entities, which can be regarded as a heterogeneous information network for a specific domain. This step involves entity recognition, event extraction and other related tasks. According to the characteristics of the data, we cluster the parse trees and combine them with generated rules to extract events. The second step builds the relationships between person nodes by analyzing meta-paths. Before predicting the relationships between person nodes, we need to supply the links between entity nodes of other types so as to obtain complete meta-path information. Because the network is heterogeneous, we use different methods to determine the links between different types of nodes.

The visualization analysis ranks the importance, or influence, of people by PageRank. In the context of visualization, we hold that the relative ordering of nodes matters more than the exact PageRank values, so the calculation should stop as soon as the result meets the needs of the visualization. Previous research on improving PageRank falls into two categories. Some studies try to speed up the convergence of the traditional power method from the perspectives of mathematics and graph theory. Their objective is to compute values as accurately as possible, which runs contrary to the early-stop idea. Others focus on the Monte Carlo method and try to approximate the ranking produced by PageRank. This method gives a fast estimate, but its accuracy is not guaranteed, especially for the relative ordering of the top nodes. Therefore, the second part of this paper proposes the Early-stop algorithm, which consists of two steps: grouping and parallel updating. Grouping determines the ranking ranges of nodes, while parallel updating adjusts the ranks of neighboring nodes. Experimental results show that it dramatically improves the accuracy of the relative ordering of the top nodes.

The main contributions of this paper can be summarized as follows. We introduce a system that extracts information from personal documents and performs data analysis on them. We note that visualized results lower the precision requirement and thus provide an opportunity to improve the efficiency of PageRank computation; based on this observation, we propose the Early-stop algorithm to quickly estimate the relative ordering of nodes. The proposed algorithm outperforms a state-of-the-art Monte Carlo method in terms of accuracy.
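The idea of stopping once the ordering, rather than the exact values, has settled can be illustrated with a minimal Python sketch. This is not the Early-stop algorithm proposed in the thesis (its grouping and parallel-updating steps are not detailed in this abstract); it is plain power iteration with a simple order-based stopping rule. The function name, the `top_k` and `patience` parameters, the damping factor of 0.85 and the toy graph are all assumptions made for illustration.

```python
def pagerank_until_top_k_stable(out_links, damping=0.85, top_k=3,
                                patience=3, max_iter=100):
    """Power-iteration PageRank with an order-based stopping rule.

    Hypothetical helper: iteration stops once the top_k ordering has been
    unchanged for `patience` consecutive iterations, or after max_iter.
    `out_links` maps every node to a list of its successors; every
    successor must also appear as a key.
    """
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    prev_top, stable = None, 0
    for iteration in range(1, max_iter + 1):
        # Rank held by dangling nodes (no out-links) is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n
                    for v in nodes}
        for v in nodes:
            targets = out_links[v]
            for w in targets:
                new_rank[w] += damping * rank[v] / len(targets)
        rank = new_rank
        # Compare orderings, not values; ties are broken by node name.
        top = sorted(nodes, key=lambda v: (-rank[v], v))[:top_k]
        stable = stable + 1 if top == prev_top else 0
        prev_top = top
        if stable >= patience:
            break
    return rank, top, iteration


# Toy person graph (assumed data): an edge u -> v means u links to v.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks, top_nodes, iterations_used = pagerank_until_top_k_stable(graph)
print(top_nodes, iterations_used)
```

A rule of this kind trades numerical precision for speed: the returned values may still be some distance from the converged PageRank vector, and a stable ordering over a few iterations does not guarantee that the ordering is final.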
Keywords/Search Tags: information network, information extraction, PageRank