The Design And Implementation Of A Content-Aware System For News Webpages

Posted on:2017-02-10

Degree:Master

Type:Thesis

Country:China

Candidate:P C Tian

GTID:2348330518994040

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of the Internet, network resources are growing ever richer. As a result, online news plays an important part in presenting and understanding events in every field, and it shapes how Internet users perceive those events. News websites and CDN suppliers hold huge volumes of logs recording user requests. These logs indicate the popularity of news items and news topics, reflect trends in public opinion, and show users’ news preferences. Websites and CDN suppliers are eager to learn the content of news webpages from the URLs in their access logs, so that they can identify hot topics and provide better services. In short, it is valuable to study content awareness for news pages, which covers analyzing page content and detecting topics with the help of the URLs’ characteristics.

Based on news topic detection and news content analysis, this thesis studies and implements a technical scheme for news content awareness. The main work is as follows:

(1) Extracting the text from news pages. The thesis improves the tree path matching algorithm according to the characteristics of news pages, generates a tree path template for the body text of a news page, and sets threshold values for the features in the template.

(2) Studying and using the structural characteristics of news page URLs. The thesis proposes a method to obtain, from a news page’s URL, the name of the website and the category the page belongs to, as well as a way to classify pages into content pages and non-content pages.

(3) Detecting news topics. After preprocessing the texts, the thesis models them with the LDA topic model and determines initialization parameters suited to its application scenario. Furthermore, it combines the K-means clustering algorithm with the hierarchical clustering algorithm into a two-layer hybrid clustering strategy and improves the way the initial cluster centers are chosen, so that news texts are clustered rapidly and accurately.

Based on these results, the thesis implements the path-template method for extracting news text from news pages, the method for obtaining a news page’s website name and category, and the topic detection model. Both the methods and the model have been evaluated experimentally. The content-aware system for news pages is built on these methods and this model. The system extracts home page information, detects news topics, and records the popularity of news items and keywords. It thereby provides a basis for grasping trends in public opinion and a prerequisite for improving the services of news websites and CDN suppliers.
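
Point (2) of the abstract, reading the website name and category from a news URL and separating article pages from other pages, can be illustrated with a minimal Python sketch. The abstract does not give the actual rules, so everything here is an assumption made for illustration: the names `analyze_news_url` and `UrlInfo`, the label sets, the file suffixes, and the date-or-numeric-id pattern are all hypothetical, and the example URLs use the reserved `example.com` domain.

```python
import re
from typing import NamedTuple, Optional
from urllib.parse import urlparse

# All rules below are illustrative assumptions, not the rules used in the thesis.
SUFFIX_LABELS = {"com", "cn", "net", "org"}   # trailing host labels to strip
GENERIC_SUBDOMAINS = {"www", "news", "m"}     # subdomains that carry no category
ARTICLE_SUFFIXES = (".html", ".htm", ".shtml")
# A date such as 2017-02-10 or a numeric id of six or more digits.
ARTICLE_HINT = re.compile(r"\d{4}[-/]?\d{2}[-/]?\d{2}|\d{6,}")


class UrlInfo(NamedTuple):
    site: Optional[str]
    category: Optional[str]
    is_content_page: bool


def analyze_news_url(url: str) -> UrlInfo:
    parsed = urlparse(url)

    # Site name: the last host label left after stripping suffix labels.
    labels = [p for p in (parsed.hostname or "").lower().split(".") if p]
    while labels and labels[-1] in SUFFIX_LABELS:
        labels.pop()
    site = labels[-1] if labels else None
    subdomains = labels[:-1]

    # Split the path into directories and an optional trailing file name.
    segments = [s for s in parsed.path.split("/") if s]
    last = segments[-1].lower() if segments else ""
    dirs = segments[:-1] if "." in last else segments

    # Category: first purely alphabetic directory, else a non-generic subdomain.
    category = next((d.lower() for d in dirs if d.isalpha()), None)
    if category is None:
        category = next((s for s in subdomains if s not in GENERIC_SUBDOMAINS), None)

    # Content page: an HTML-like file whose path carries a date or a long id.
    is_content = last.endswith(ARTICLE_SUFFIXES) and bool(ARTICLE_HINT.search(parsed.path))
    return UrlInfo(site, category, is_content)


for example_url in (
    "http://news.example.com/world/2017-02-10/doc-1234567.shtml",
    "http://sports.example.com/",
    "http://www.example.com/tech/index.html",
):
    print(example_url, analyze_news_url(example_url))
```

Hand-written rules like these are only a starting point. Real news sites vary widely in URL layout, so a practical system would likely need per-site patterns or rules learned from the access logs.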

Keywords/Search Tags:

news pages, content awareness, topic detection, content extraction, URL structure analysis

Related items

1	Research On Chinese Blog Pages Recognition And Content Extraction
2	Research On Content Extraction In HTML Web Pages Based Multi-Features
3	Research And Realization Of Web Information Mining Model Based On Topic Features
4	Research On Digital Video Semantic Content Extraction
5	Research On WEB Page Structure And Data Extraction Technology
6	Tag Tree Template In The Pages Of Critical Information Extraction And Topic Identification
7	Research On The Technology Of Web Data Extraction
8	Research On Detection Of Content-Aware Image Resizing
9	Research On Key Technologies Of Information Lifecycle Management In Content Aware Storage System
10	The Research And Implementation On Content Extraction In Web Pages Based Page Segmentation