
On The Research And Development Of A Video Search Engine For Chinese Web

Posted on: 2013-02-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D Guo
Full Text: PDF
GTID: 1118330371477953
Subject: Communication and Information System
Abstract/Summary:
The Internet has undoubtedly become the largest repository of information and knowledge that humans have ever created. Notably, since the advent of Web 2.0, ordinary users have increasingly become contributors of that information and knowledge. By combining classic information retrieval theory with Internet technology, search engines have turned this huge repository into a truly valuable gold mine by making it equally findable, accessible, and usable for all users. With the rapid growth of multimedia information on the Web, especially video content, the need for a search engine for such content has become pressing. Thus, building on classic research into search engine technologies, industry and academia are increasingly drawn to the research and development of video search engines with high efficiency, scalability, and freshness. This dissertation summarizes Baidu's work on the research and development of a video search engine for the Chinese Internet, which is the first publicly available video search engine in China.

The major work of this dissertation began in 2006. Drawing on the largest web database of Baidu.com, it presents the first large-scale study of video content distribution on the Chinese Internet and of user behavior logs on search engines. By analyzing web users' requirements for video search, studying the principles of classic search engines, and establishing evaluation standards for a video search engine, it concludes that classic search engines cannot meet the needs of today's users for video search and that a dedicated video search engine is necessary. On this basis, the dissertation proposes an architecture for a video search engine and analyzes the key technical issues to be resolved: focused crawling and content extraction for video-sharing websites, mining and content extraction for web videos, and ranking strategies for video search.
Algorithms to address these issues are proposed in this dissertation. Based on these core algorithms, the first video search engine for the Chinese Internet was built and launched to the public, and it soon became the most influential video search engine in China. The major work and innovations of this dissertation are as follows:
1) A focused crawling algorithm for video-sharing websites is proposed to solve the deep crawling and content extraction problems of these websites. As part of it, web pages are classified according to the site structure and page framework of these websites, and a differentiated deep crawling strategy is applied to each type of page. Content extraction wrappers are applied to the pages containing video content, and the wrapper rules are studied to ensure the quality of deep crawling and the accuracy of content extraction.
2) An algorithm is proposed to mine and extract information about pages containing video content from the web page database of a classic search engine. Based on a detailed analysis of the URL prioritization algorithm of the classic search engine's spider system, a DOM-tree method for extracting video information is proposed to ensure both the accuracy of content extraction and the breadth of crawling. By combining focused crawling with mining of the classic web page database, the video search engine obtains its fundamental data resources and text indexing information and strikes a balance between accuracy and coverage.
3) Based on an analysis of user requirements for video search, an aggregative ranking algorithm for a Chinese video search engine is proposed. Using online evaluation and experimentation, the algorithm's parameters are tuned and its effectiveness is verified.
This algorithm takes into account both the text relevancy of a video and the quality of the video and of its hosting website, in order to improve both the relevancy of search results and the viewing experience. Subsequent research further suggests that online evaluation is very effective for large-scale performance evaluation of Internet applications.
4) Building on the above core technologies, a fully functional video search engine system is constructed and comprehensively evaluated in terms of video content coverage, freshness, and search relevancy. Using third-party data, a comparison with other video search engines that followed is conducted; it confirms the effectiveness of the algorithms and strategies proposed in this dissertation, as well as the contribution of this video search engine to China's online video industry.
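
The abstract does not give the formula behind the aggregative ranking in contribution 3). As a rough illustration only, the sketch below assumes a simple weighted sum of the three factors it names (text relevancy, video quality, website quality). The class name, field names, score ranges, and weights are all hypothetical and are not taken from the dissertation; in the dissertation the parameters are tuned through online evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoResult:
    """One candidate result; all scores are assumed to lie in [0, 1]."""
    url: str
    text_relevancy: float  # how well the video's text matches the query
    video_quality: float   # hypothetical per-video quality signal
    site_quality: float    # hypothetical per-website quality signal


# Hypothetical weights, chosen only for illustration.
WEIGHTS = {"text_relevancy": 0.6, "video_quality": 0.25, "site_quality": 0.15}


def aggregate_score(result, weights=WEIGHTS):
    """Combine the three factors into one score with a weighted sum."""
    return (
        weights["text_relevancy"] * result.text_relevancy
        + weights["video_quality"] * result.video_quality
        + weights["site_quality"] * result.site_quality
    )


def rank(results, weights=WEIGHTS):
    """Return the results ordered from highest to lowest aggregate score."""
    return sorted(results, key=lambda r: aggregate_score(r, weights), reverse=True)


if __name__ == "__main__":
    candidates = [
        VideoResult("http://example.com/a", 0.9, 0.2, 0.2),
        VideoResult("http://example.com/b", 0.7, 0.9, 0.9),
        VideoResult("http://example.com/c", 0.3, 1.0, 1.0),
    ]
    for r in rank(candidates):
        print(f"{aggregate_score(r):.3f}  {r.url}")
```

With weights like these, a highly relevant result hosted on a low-quality site can rank below a somewhat less relevant result from a high-quality site, which is the kind of trade-off between search relevancy and viewing experience that the abstract describes.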
Keywords/Search Tags: Search engine, information retrieval, video search engine, focused crawling, content extraction, ranking aggregation, performance measurement