Design And Implementation Of The Tudou Video Search Engine System

Posted on: 2011-11-03
Degree: Master
Type: Thesis
Country: China
Candidate: B Xie
Full Text: PDF
GTID: 2178360308451224
Subject: Software engineering
Abstract/Summary:
Tudou.com, a premier Chinese online video-sharing website, is one of the earliest such sites created in China. After a few years of development, it receives 100 million page views per day, and more than 40,000 new videos are published daily. Tudou's video inventory currently stands at around 20 million. Videos are viewed more than 80 million times a day, which makes video search the most important function for users who want to find their favorite videos on the site.
This thesis surveys online video search technology and introduces the concept of the vertical search engine. It focuses mainly on Lucene, an open-source search engine framework, and briefly describes the basic process and related APIs for implementing indexing and search with Lucene. Chinese word segmentation is the most important part of the search engine technology; this study analyzes forward-backward maximum matching word segmentation and statistics-based word segmentation.
The core of this study is the design and implementation of the video search engine architecture. The system is divided into three modules: the video search portal, video search query, and video index. The architecture has five layers: the database layer, index layer, querying layer, portal layer, and web cache layer. The web cache layer uses the open-source server Squid for request processing and caching, and runs Squid in sibling mode for load balancing. This study analyzes and designs the core index, querying, and portal layers, and specifies the communication and data exchange formats between them. For the implementation, the main functional modules are developed in Java, with Tomcat as the web container, memcached as the in-memory cache server, and MySQL as the database.
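As an illustration of the maximum matching segmentation discussed in this abstract, the following is a minimal sketch of the forward variant in Java. It is not the thesis's implementation: the class name `ForwardMaxMatch`, the method `segment`, and the `maxWordLen` parameter are hypothetical, and the sketch assumes the lexicon is a plain in-memory set and that the text contains no characters outside the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class ForwardMaxMatch {
    /**
     * Scans the text left to right. At each position it tries the longest
     * candidate (up to maxWordLen characters) and shortens it one character
     * at a time until the candidate is in the lexicon. A character that
     * starts no lexicon entry is emitted as a single-character token.
     * Assumes maxWordLen >= 1 (hypothetical parameter, not from the thesis).
     */
    public static List<String> segment(String text, Set<String> lexicon, int maxWordLen) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLen);
            // Shrink the window until it matches a lexicon entry
            // or only one character is left.
            while (end > pos + 1 && !lexicon.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }
}
```

For example, with a lexicon containing 土豆, 土豆网, 视频, 搜索 and 搜索引擎 and `maxWordLen` set to 4, segmenting 土豆网视频搜索引擎 yields 土豆网, 视频, 搜索引擎: the longer entry wins whenever both a word and its prefix are in the lexicon.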
The indexing and querying functions rely on Chinese word segmentation and use the forward maximum matching algorithm. The study also performs the necessary conversions for dates, Chinese numerals, traditional Chinese characters, and special characters. Building on forward maximum matching, it investigates the use of a streamlined custom lexicon. For video ranking, the study designs and uses a multi-index comprehensive weighted sorting algorithm whose indicators include playback count, upload time, and comment count, among others. With these strategies combined, the system architecture meets the requirements of large data volume and high user concurrency and provides a powerful video search function for Tudou.
After several rounds of design changes and implementation, the overall video search engine system has become stable. The Chinese word segmentation toolkit and the video sorting algorithm have given the whole search engine a high level of quality. Tudou now handles 15 million searches and over 10 million video playbacks daily. The search engine is not only a convenient tool for users but also brings the website considerable traffic and video playback volume.
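The abstract names the ranking signals (playback count, upload time, comment count) but not how they are combined. The sketch below shows one plausible way to fold them into a single weighted score in Java. Everything specific in it is an assumption for illustration: the class `VideoRanker`, the weights, the logarithmic damping of counts, and the 30-day freshness scale are hypothetical and do not come from the thesis.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class VideoRanker {
    /** Minimal video record holding only the signals named in the abstract. */
    public static final class Video {
        public final String title;
        public final long playCount;
        public final long commentCount;
        public final long ageDays; // days since upload

        public Video(String title, long playCount, long commentCount, long ageDays) {
            this.title = title;
            this.playCount = playCount;
            this.commentCount = commentCount;
            this.ageDays = ageDays;
        }
    }

    // Hypothetical weights; a real system would tune these.
    static final double W_PLAYS = 1.0;
    static final double W_COMMENTS = 0.5;
    static final double W_FRESHNESS = 5.0;

    /** Combines the three signals into one score; higher ranks first. */
    public static double score(Video v) {
        double plays = Math.log1p(v.playCount);          // damp very large counts
        double comments = Math.log1p(v.commentCount);
        double freshness = 1.0 / (1.0 + v.ageDays / 30.0); // 1.0 when new, decays with age
        return W_PLAYS * plays + W_COMMENTS * comments + W_FRESHNESS * freshness;
    }

    /** Returns a new list sorted by descending score; the input is not modified. */
    public static List<Video> rank(List<Video> videos) {
        List<Video> sorted = new ArrayList<>(videos);
        Comparator<Video> byScore = Comparator.comparingDouble(VideoRanker::score);
        sorted.sort(byScore.reversed());
        return sorted;
    }
}
```

Taking the logarithm of the counts keeps a handful of extremely popular videos from dominating every result list, while the freshness term lets recent uploads compete; whether the thesis makes similar choices is not stated in the abstract.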
Keywords/Search Tags: Video Sharing, Video Search, Distributed Architecture, Lucene, Chinese Word Segmentation