Research And Implementation On The Key Technologies Of The Vertical Search Engine

Posted on: 2014-02-24
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Jia
Full Text: PDF
GTID: 2298330467464919
Subject: Computer technology
Abstract/Summary:
In recent years, general search engine technology has matured and is widely used in many fields. However, general search engines cannot meet the professional requirements of certain Internet users. Enterprise users in particular need more specialized and more comprehensive domain data, because domain data is the foundation of business operation and analysis. Vertical search engine technology emerged to address this problem and has attracted great attention from academia. A vertical search engine is a topic-specific search engine that targets a certain domain. It returns domain-specific retrieval results by using topic detection and targeted data extraction. Moreover, a vertical search engine avoids the massive noise data, inaccurate query results, and insufficient search depth that are inherent weaknesses of general search engines, which makes it especially attractive to enterprise users. This thesis focuses on two key techniques of the vertical search engine, topic detection and structured data extraction, and implements the relevant technologies.

Topic detection of Web pages is a key technology of the vertical search engine and an important prerequisite for structured data extraction. It has therefore attracted great attention from academia and is widely applicable in industry. In this thesis, we focus on Web pages that contain rich structured data and propose a classification framework that reuses structured data extraction templates. The framework avoids the dependency of focused crawlers on URL format and achieves higher precision than traditional text classification.
We verify the effectiveness of the framework in classifying Web pages that contain rich structured data.

Because enterprise users want a vertical search engine to report public sentiment to support their decision making, this thesis studies topic detection for Chinese short messages as a foundation for further sentiment analysis. Based on the similarity between news messages and Chinese short messages, we propose the 5WTAG algorithm under the 5W model of news (When, Where, Who, What, hoW). The 5WTAG algorithm first splits a Chinese short message into statements. It then extracts the 5W keywords from each statement and creates candidate topic hashtags from them. Finally, the algorithm uses statistical and semantic analysis to compute a recommendation degree for each hashtag. The thesis uses real datasets collected from Sina Weibo to evaluate the accuracy of the 5WTAG algorithm in terms of hashtag semantics, recommendation degree, and other aspects.

Finally, to solve the problem of structured data extraction in the vertical search engine, an automatic structured data extraction technology for Web pages is proposed. This technology extracts data from Web pages that contain rich structured data by using structured data extraction templates. To automate the process, this thesis improves the MDR algorithm, proposes a data region detection algorithm for Web pages that contain a large amount of structured data, and uses that algorithm to generate extraction templates automatically. The effectiveness and accuracy of the automatic structured data extraction technology are verified by experiment.
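
The abstract does not give the details of the improved MDR algorithm, so the following is only a minimal sketch of the general idea behind MDR-style data region detection: adjacent sibling subtrees with similar tag structure are grouped into a candidate data region. The function names, the similarity threshold of 0.8, the minimum of three records, and the sample page are all assumptions made for illustration, not the thesis's method.

```python
from difflib import SequenceMatcher
import xml.etree.ElementTree as ET


def tag_string(node):
    """Flatten a subtree into its pre-order sequence of tag names."""
    return [el.tag for el in node.iter()]


def similar(a, b, threshold=0.8):
    """Treat two subtrees as similar if their tag sequences mostly match."""
    return SequenceMatcher(None, tag_string(a), tag_string(b)).ratio() >= threshold


def find_data_regions(root, min_records=3, threshold=0.8):
    """Return (parent, start, end) for runs of similar adjacent siblings.

    Simplified illustration: only single-node records are compared,
    whereas full MDR also considers combinations of several siblings.
    """
    regions = []
    for parent in root.iter():
        children = list(parent)
        start = 0
        for i in range(1, len(children) + 1):
            # Close the current run at the last child or when similarity breaks.
            if i == len(children) or not similar(children[i - 1], children[i], threshold):
                if i - start >= min_records:
                    regions.append((parent, start, i))
                start = i
    return regions


# Hypothetical, well-formed sample page with one repeated list of records.
page = ET.fromstring(
    "<body><div><h1>Shop</h1></div>"
    "<ul>"
    "<li><a>A</a><span>1</span></li>"
    "<li><a>B</a><span>2</span></li>"
    "<li><a>C</a><span>3</span></li>"
    "</ul></body>"
)
regions = find_data_regions(page)
for parent, start, end in regions:
    print(parent.tag, start, end)  # prints: ul 0 3
```

In this sketch, the children of a detected region could then serve as the examples from which an extraction template is generalized; how the thesis actually derives its templates is not stated in the abstract.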
Keywords/Search Tags: vertical search, data extraction, topic detection, Chinese short message, extraction template