
Detection And Simple Use Of Time Information In Real-time Search Engine

Posted on: 2013-09-21
Degree: Master
Type: Thesis
Country: China
Candidate: W Z Li
GTID: 2248330371985899
Subject: Computer software and theory
Abstract/Summary:
In recent years, social networking services (SNS) and microblogs have developed rapidly and become popular worldwide within a short time of their appearance. These web communities attract very large numbers of users, who can freely post any kind of information at any time and from anywhere. Meanwhile, traditional newspapers are moving to electronic news media, so reports on recent events are published to the public quickly over the web. How can such information be obtained quickly and accurately? Information on the Internet is usually found through a search engine such as Google or Bing. The user chooses some keywords and submits them to the search engine, which looks up relevant web pages in its index database, ranks the results according to certain rules, and returns the ordered results to the user. Can information from SNS, microblogs, and the latest news reports be obtained through a traditional search engine in this way? The answer is no, because this kind of information appears in real time. When something has happened only a short time ago, a traditional search engine has not yet fetched the new content, so it cannot index it or add the index entries to its database. Therefore, even if the user submits relevant keywords, a traditional search engine cannot return results about something that has only just happened. Traditional search engines are not suited to these searches and cannot meet this need.
Real-time search engines emerged in response to these retrieval needs. A real-time search engine is designed to solve the search problem for SNS, microblogs, and news reports, and provides search services for them. With the rapid development of SNS and microblogs, real-time search engines have also developed quickly. The core problem in a real-time search engine is obtaining the time information of web pages on the Internet: the creation time of a page, its update time, and its update cycle. To obtain this time information, a page must first be processed to extract its main content. After the parts unrelated to the content are removed, the time information can be obtained by processing the main content and the characteristics the page exhibits. In analyzing pages that carry real-time information, we found that they generally consist of a single independent block structure, and that the words of their main content, once processed with natural language processing techniques, show certain semantic and part-of-speech (POS) features. Based on the DOM tree model and taking the visibility of HTML tags into account, we use these semantic and POS features to derive an algorithm called SemV, which extracts the main content of a web page and reconstructs the page. Experimental results show that SemV extracts the main content of a page effectively and efficiently, and that it also reduces the storage space needed for the page. Based on the extracted main content and the reconstructed page, and taking the semantic and POS features into account, we can obtain the time information contained in the content and estimate the time of the page.
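As a rough illustration of the two steps described above (dropping non-visible markup, then reading time information from the remaining text), the following Python sketch may help. It is not the SemV algorithm: it uses no semantic or POS features, and the tag list `INVISIBLE_TAGS`, the date pattern `DATE_RE`, and the rule "take the earliest timestamp as the page time" are all assumptions made for this example.

```python
import re
from datetime import datetime
from html.parser import HTMLParser

# Assumed, non-exhaustive list of tags whose text is never shown to the reader.
INVISIBLE_TAGS = {"script", "style", "noscript", "template"}

# Assumed date format: YYYY-MM-DD with an optional HH:MM part.
DATE_RE = re.compile(r"\b(\d{4})-(\d{1,2})-(\d{1,2})(?:[ T](\d{1,2}):(\d{2}))?")


class VisibleTextParser(HTMLParser):
    """Collects text that is not nested inside an invisible tag."""

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in INVISIBLE_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in INVISIBLE_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if self._hidden_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def find_timestamps(text):
    """Return every well-formed timestamp matching DATE_RE, in text order."""
    found = []
    for m in DATE_RE.finditer(text):
        year, month, day = int(m.group(1)), int(m.group(2)), int(m.group(3))
        hour = int(m.group(4)) if m.group(4) else 0
        minute = int(m.group(5)) if m.group(5) else 0
        try:
            found.append(datetime(year, month, day, hour, minute))
        except ValueError:
            continue  # e.g. month 13: looks like a date but is not one
    return found


def estimate_page_time(html):
    """Assumed heuristic: the earliest visible timestamp is the page time."""
    stamps = find_timestamps(visible_text(html))
    return min(stamps) if stamps else None
```

A real system would need many more date formats and, as the abstract notes, semantic cues to tell a publication time apart from dates that merely appear in the text.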
We also analyzed the links between news reports and derived the EOM algorithm, which estimates the time information of a page on the basis of an event object model. The results demonstrate the feasibility and accuracy of this algorithm and model. After obtaining the time information of pages, we analyze how the crawler in a real-time search engine can fetch the updated content of a page, and propose an algorithm based on a greedy strategy. The algorithm makes full use of the update cycle to plan how the crawler detects and fetches updated content. Under the greedy strategy, a page with a shorter update cycle receives a higher priority. This strategy makes full use of limited hardware and bandwidth resources and helps the crawler fetch updated content in a timely and efficient manner. Finally, the thesis presents some open problems in the field of real-time search engines and outlines further research to be done in the future.
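The greedy rule stated above (a shorter update cycle means a higher crawl priority) can be sketched in a few lines of Python. This is only a minimal reading of that one sentence, not the thesis's actual scheduling algorithm; the function name `plan_crawl`, the `budget` parameter (how many pages can be fetched in one round with the available bandwidth), and the example URLs are hypothetical.

```python
import heapq


def plan_crawl(update_cycles, budget):
    """Pick up to `budget` pages to fetch, shortest update cycle first.

    update_cycles: dict mapping URL -> estimated update cycle in seconds
                   (assumed to come from an earlier time-estimation step).
    budget:        assumed cap on fetches per round.
    """
    if budget <= 0:
        return []
    # Greedy choice: the pages that change most often are served first.
    ranked = heapq.nsmallest(budget, update_cycles.items(),
                             key=lambda item: item[1])
    return [url for url, _cycle in ranked]


if __name__ == "__main__":
    cycles = {
        "http://example.com/news": 60,       # updates about every minute
        "http://example.com/blog": 3600,     # updates about every hour
        "http://example.com/forum": 600,     # updates about every ten minutes
    }
    print(plan_crawl(cycles, budget=2))
```

With a budget of two fetches, the sketch selects the news page and then the forum page, leaving the slowly changing blog for a later round.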
Keywords/Search Tags: real-time search engine, page’s time information, web crawler, content extraction, page reconstruction