Research On Key Technologies Of Audio And Video Data Acquisition And Homology Analysis

Posted on: 2019-01-15
Degree: Master
Type: Thesis
Country: China
Candidate: Q Y Fan
GTID: 2348330563954324
Subject: Software engineering
Abstract/Summary:
The Internet in China is developing vigorously, and the way people acquire content has changed tremendously. More and more Internet users prefer to get information through audio and video. At the same time, major online video sites have proposed building a new ecology of pan-entertainment content, which places new demands on audio and video data mining. This thesis studies two key problems in audio and video data mining. The first is the acquisition of text data associated with audio and video; data acquisition is the cornerstone of web mining. The second is homology analysis of audio and video data, which aims to discover potential user relationships on online video sites by analyzing the similarities between entities in the real world.

Data acquisition based on distributed web crawlers is currently the mainstream of research. This thesis examines existing distributed web crawler systems in depth. To address the weak distributed support in existing open-source crawler frameworks, it designs a distributed web crawler system and presents a distributed task scheduling algorithm. To improve crawling efficiency on massive data, it focuses on URL deduplication and web content deduplication. For URL deduplication, the advantages and disadvantages of the traditional Bloom filter are analyzed first, and improvements are then made to reduce its high misjudgment (false-positive) rate. For web content deduplication, the thesis proposes segmenting the content of a web page before applying the SimHash algorithm to determine whether the current page already exists. Experiments show that, on massive data, the proposed URL deduplication has a lower misjudgment rate, and the proposed web content deduplication method has a clear speed advantage over other algorithms.

The thesis also studies existing work on user relationship mining in social networks and proposes a SimRank-based homology analysis method for audio and video data. The method computes the similarity between audio and video sharers, and this similarity is used to measure the homology among them. On massive data, the original SimRank computation is very time-consuming. The thesis implements and analyzes a MapReduce-based distributed version of the original SimRank and finds that, in a distributed environment, it suffers from slow computation and a large communication volume. The thesis then improves the original distributed SimRank algorithm. To verify the improvement, experiments were conducted on three real network datasets and one dataset obtained by a web crawler; they show that the improved distributed SimRank is better suited to computation on massive data.

Finally, the thesis designs and implements an audio and video data acquisition and homology analysis system, which is validated on data from a real online video website. The analysis shows that the distributed web crawler system designed in this thesis can obtain audio and video data completely and quickly, and that the proposed homology analysis method can uncover user association information in an intuitive form. The entire system can provide comprehensive and accurate data support for the construction of a new ecology of pan-entertainment content.
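As a rough illustration of the similarity measure named above, the following is a minimal single-machine sketch of the original (naive, iterative) SimRank. It is not the thesis's MapReduce implementation or its improved algorithm. The function name `simrank`, the `decay` and `iterations` parameters, and the toy sharer/video graph are all assumptions made for this example. The graph is given as a mapping from each node to its in-neighbors; here a sharer's in-neighbors are the videos they shared, and a video's in-neighbors are the sharers who shared it.

```python
from itertools import product


def simrank(in_neighbors, decay=0.8, iterations=5):
    """Naive iterative SimRank over a graph given as node -> list of in-neighbors.

    Every node that appears as an in-neighbor must also be a key of the mapping.
    Returns a dict mapping (a, b) pairs to similarity scores in [0, 1].
    """
    nodes = list(in_neighbors)
    # Iteration 0: a node is fully similar to itself and to nothing else.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            in_a, in_b = in_neighbors[a], in_neighbors[b]
            if not in_a or not in_b:
                # By definition, similarity is 0 if either node has no in-neighbors.
                new_sim[(a, b)] = 0.0
                continue
            # Average similarity of all in-neighbor pairs, scaled by the decay factor.
            total = sum(sim[(i, j)] for i in in_a for j in in_b)
            new_sim[(a, b)] = decay * total / (len(in_a) * len(in_b))
        sim = new_sim
    return sim


# Hypothetical toy data: alice and bob shared the same two videos, carol a different one.
graph = {
    "alice": ["v1", "v2"],
    "bob": ["v1", "v2"],
    "carol": ["v3"],
    "v1": ["alice", "bob"],
    "v2": ["alice", "bob"],
    "v3": ["carol"],
}
scores = simrank(graph, decay=0.8, iterations=5)
print(scores[("alice", "bob")], scores[("alice", "carol")])
```

In this toy graph, alice and bob receive a positive score because they share videos, while alice and carol score zero. Each iteration touches every pair of nodes and every pair of their in-neighbors, which hints at why the original algorithm becomes costly on massive data.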
Keywords: data acquisition, video website, relation mining, homology analysis, distributed computing