Research On Data Fusion Of Web Entity Activities

Posted on: 2013-02-20
Degree: Master
Type: Thesis
Country: China
Candidate: C Gao
GTID: 2248330374981409
Subject: Computer software and theory
Abstract/Summary:
The internet is developing rapidly. It has gradually penetrated into people's lives and become an important channel for obtaining information and spreading news. With the explosive growth and rapid spread of information on the web, the web is becoming an important information source, and the information it contains has great research value. Analyzing, mining and processing the huge amount of data on the web can yield a wealth of valuable information, which can be integrated into market intelligence, public opinion analysis, e-commerce and other applications to provide people with further information services.

The web can be divided into the Surface Web and the Deep Web according to the depth of the information it contains. Generally, the Surface Web consists of pages that traditional search engines can reach through URL links, while the Deep Web refers to sites backed by an online database: the data is stored in the database, and the page content is generated and returned by the web server only when a visitor submits a query through the query interface.

The object of study of this thesis is web entity activities. We define a web entity activity as a certain activity carried out by a certain web entity at a certain time and a certain location. Several activities of a web entity constitute its trace. Web entity activities have significant analytical value; for example, in market intelligence, the trace of an enterprise is a valuable reference for job seekers.

Unlike traditional integration systems, whose objects of study come from the structured part of a web page, the web entity activities integration system studies objects that come from unstructured text fragments. The system obtains web entity activity information from natural-language sentences through web entity activity extraction and translates that information into a structured form.

In this thesis, we focus on the key technologies of web entity activity fusion. As the last step of web entity activity integration, fusion integrates the different activity records that represent the same web entity activity and produces a complete and accurate web entity activity record.

Web entity activity fusion has two main parts: duplication detection and truth discovery. The former groups together the different activity records that represent the same web entity activity and prepares the input for truth discovery. The latter finds the true values among the different records and creates a complete and accurate record by resolving data conflicts and filling in missing data. This thesis studies both parts and proposes a strategy for each. The main work and achievements are as follows:

1. We solve web entity activity duplication detection with K-means clustering and SVM classification. We transform the duplication detection problem into a vector classification problem by building, for every pair of records, a comparison vector that consists of their similarity on each dimension. We then obtain sample sets by clustering, which are used to train the SVM classifier. Based on a large number of observations, this thesis combines traditional calculation methods with particular characteristics of web entity activities, uses the structural features of the sentence to compute the comparison vectors, and improves the clustering results by using a weighted Euclidean distance. Finally, we classify the comparison vectors with an iterative classification method.

2. We propose an approach to the truth discovery problem for web entity activities based on Markov Logic Networks. The approach exploits the ability of Markov Logic Networks to handle uncertain, incomplete and even contradictory knowledge. We integrate the semantic relationships among dimensions with traditional data fusion features and formulate inference rules to find the true values.
Keywords/Search Tags: Web entity activities, Web entity activities integration, Web entity activities duplication detection, Web entity activities truth discovery
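
As a rough illustration of the comparison vector and the weighted Euclidean distance mentioned in item 1 of the abstract, the following minimal Python sketch may help. The field names, the token-overlap similarity, the sample records and the weights are all assumptions made for illustration; the abstract does not state which similarity measures or weights the thesis actually uses, and the K-means and SVM steps are not shown.

```python
import math

# Hypothetical field names: the abstract only says an activity involves
# an entity, an activity, a time and a location.
DIMENSIONS = ("entity", "activity", "time", "location")


def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity of lowercase token sets (a stand-in measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    if not union:
        # Both values are empty: treat as no evidence of similarity.
        return 0.0
    return len(ta & tb) / len(union)


def comparison_vector(r1: dict, r2: dict) -> list:
    """One similarity score per dimension for a pair of activity records."""
    return [token_jaccard(r1[d], r2[d]) for d in DIMENSIONS]


def weighted_euclidean(u, v, weights) -> float:
    """Euclidean distance with a non-negative weight on each dimension."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, u, v)))


# Made-up sample records, not data from the thesis.
a = {"entity": "Acme Corp", "activity": "campus recruitment talk",
     "time": "2012-10-12", "location": "Jinan"}
b = {"entity": "Acme Corp", "activity": "recruitment talk",
     "time": "2012-10-12", "location": "Jinan"}
c = {"entity": "Globex", "activity": "product launch",
     "time": "2012-11-01", "location": "Beijing"}

vec_ab = comparison_vector(a, b)  # likely duplicates: scores near 1
vec_ac = comparison_vector(a, c)  # unrelated records: scores near 0

# Assumed weights that emphasize the entity dimension.
weights = [2.0, 1.0, 1.0, 1.0]
all_match = [1.0] * len(DIMENSIONS)

# Distance of each comparison vector from a perfect match.
dist_ab = weighted_euclidean(vec_ab, all_match, weights)
dist_ac = weighted_euclidean(vec_ac, all_match, weights)
print(vec_ab, dist_ab)
print(vec_ac, dist_ac)
```

In a pipeline of the kind the abstract describes, vectors such as `vec_ab` and `vec_ac` would be clustered to obtain training samples for a classifier; the sketch stops at the distance computation.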