Research On Embedded Subtitles Based Near-duplicate Video Web-page Removal And The Implementation

Posted on: 2016-07-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y Yuan
GTID: 2308330503450637
Subject: Computer Science and Technology
Abstract/Summary:
As the Internet has grown, the number of web pages has grown with it, and a large share of those pages are similar to one another. Over the past decade the quantity of online video has increased exponentially, and more and more users take part in video-related activities. The time users spend collecting, editing, uploading, searching for, and reviewing video has reached an unprecedented level. Large-scale publishing and sharing of video further increases the already very large amount of near-duplicate content, so Near-duplicate Video Retrieval (NDVR) has become key to many new tasks. Many methods have been proposed for removing near-duplicate web pages, but few address near-duplicate video websites. This dissertation therefore proposes a new approach to removing near-duplicate video websites, based on embedded subtitles, which closely match the content of the video. The main research work of this dissertation can be summarized as follows:

(1) Capturing web video. Because the removal of duplicate web video in this dissertation is based on embedded subtitles, extracting those subtitles is the most basic work and also a critical step. We therefore first studied the automatic extraction of embedded subtitles, so as to obtain the subtitles as text.

(2) Computing the similarity of embedded subtitles. The objects being compared are embedded subtitles, whose text must appear in a consistent order, and the LCS (Longest Common Subsequence) algorithm is strictly order-preserving, so we implemented LCS and used it to compute the similarity value.

(3) Setting the near-duplicate criterion used for removal. In theory the criterion should be 100%, because the embedded subtitles of two videos with the same content should be identical. However, the subtitles of web videos are obtained through OCR, which introduces small differences when the resolution differs. Moreover, some videos are excerpts of other videos, and when the excerpt is large enough the two need to be identified as near-duplicates. We therefore allowed some differences when judging whether two videos are near-duplicates, and used a large amount of statistical data together with mathematical statistics to determine as accurate a critical value for near-duplication as possible.

(4) Designing and implementing a near-duplicate video website removal system, following the proposed idea of removing near-duplicate video based on embedded subtitles. In the overall outline design, we illustrated the overall structure of the system; in the detailed design, we explained the functions and sub-modules of the system. Finally, we described the functions of the important modules, their specific processing, and the implementation details.

(5) Verifying the effectiveness of the proposed method. We searched for and collected the top-10 most popular films, applied two methods to remove near-duplicate video websites, and compared the results to determine which is more effective. One was the method most commonly used at present, which is based on the generated title, generated tags, and description of the video. The other was our method, which is based on embedded subtitles.

The experimental results show that near-duplicate video website removal based on embedded subtitles is more effective and performs better than the existing method.
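Steps (2) and (3) can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function names, the normalisation by the length of the shorter subtitle (chosen so that an excerpt of a longer video can still score high), and the 0.8 threshold are all assumptions made for illustration. The abstract does not state the critical value the author derived.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Standard dynamic programme, keeping only one previous row of the table.
    """
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            if ch == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def subtitle_similarity(a: str, b: str) -> float:
    """LCS length divided by the shorter subtitle's length (assumed normalisation)."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return lcs_length(a, b) / shorter


def is_near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Compare similarity against a critical value.

    The default 0.8 is a placeholder, not the value from the dissertation.
    """
    return subtitle_similarity(a, b) >= threshold
```

With this normalisation, two subtitle texts that differ only by a few OCR misreads stay above the threshold, while unrelated texts fall well below it. A different normalisation (for example, by the longer subtitle) would penalise excerpts and is an equally plausible reading of the abstract.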
Keywords/Search Tags: near duplicate video website, embedded subtitles, remove duplicate, LCS