The Research On Topic Extraction From Web Pages Based On Semantic

Posted on: 2012-09-19
Degree: Master
Type: Thesis
Country: China
Candidate: W Zhang
Full Text: PDF
GTID: 2178330332495819
Subject: Computer application technology
Abstract/Summary:
When people search online, they often get several pages about the same object that are related to each other through web links. Identifying and maintaining these closely associated pages matters, because it lets people find useful information precisely and move quickly to other related topics that interest them. At present such pages are usually maintained manually, so an intelligent recognition technology is needed.

This thesis focuses on topic extraction based on semantic information, introducing an ontology library. It presents an algorithm that calculates the semantic similarity of texts extracted from the web, using the WordNet ontology library, and then summarizes the semantic information related to the words so that topic extraction yields more accurate results. The work proceeds in the following steps.

First, information filtering removes navigation information, images, and advertising from the web page. This series of operations keeps unrelated noise from affecting information extraction, so the final result is more accurate. After filtering, the data is transformed from page form into text form, because clustering directly on the page form is difficult.

Second, the text is processed further. A coreference resolution system is introduced, which both avoids unnecessarily high pronoun frequencies and improves the accuracy of information extraction. Lucene is used to analyze the structure of the sentences in the text, so that many noun phrases, such as names and proper nouns, are interpreted accurately. Lucene can also create an index for each word, which improves the efficiency and accuracy of information extraction.

Third, an improved Lesk similarity algorithm over WordNet is used to calculate semantic similarity, and the pages are then clustered into classes. Pages on different topics fall into different clusters, which reduces the noise from pages with different themes. Moreover, computing the semantic similarity of themes, rather than simply taking the inner product of keywords, retains the semantic information among web pages, which also improves the accuracy of information extraction.

Finally, an improved TF*IDF algorithm is used to extract information. The improved algorithm makes up for a deficiency of the original, which ignores how features are distributed within and between classes.

In conclusion, according to the experimental data, the algorithm is feasible and effective.
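
The abstract does not spell out the improved Lesk algorithm, so the following is only a minimal sketch of the basic idea behind Lesk-style similarity: two words count as related when their dictionary glosses share content words. The `GLOSSES` dictionary, the `STOPWORDS` set, and the function names are hypothetical stand-ins invented for illustration; a real system would read glosses from WordNet.

```python
import re

# Hypothetical mini-glossary standing in for WordNet glosses (an assumption;
# a real system would look the glosses up in WordNet itself).
GLOSSES = {
    "car": "a motor vehicle with four wheels usually propelled by an engine",
    "truck": "a motor vehicle with wheels designed for carrying heavy loads",
    "banana": "an elongated curved tropical fruit with yellow skin",
}

# Small illustrative stopword list; not taken from the thesis.
STOPWORDS = {"a", "an", "the", "with", "by", "for", "of", "usually"}


def gloss_tokens(word):
    """Return the set of content words in the gloss of `word`."""
    tokens = re.findall(r"[a-z]+", GLOSSES[word].lower())
    return {t for t in tokens if t not in STOPWORDS}


def lesk_overlap(word_a, word_b):
    """Count the content words shared by the two glosses (basic Lesk overlap)."""
    return len(gloss_tokens(word_a) & gloss_tokens(word_b))


if __name__ == "__main__":
    for pair in [("car", "truck"), ("car", "banana")]:
        print(pair, lesk_overlap(*pair))
```

With these toy glosses, "car" and "truck" share several gloss words while "car" and "banana" share none, which is the kind of signal a clustering step could use in place of exact keyword matches.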
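
For reference, the standard TF*IDF weighting that the thesis sets out to improve can be sketched as below. This is the textbook baseline only, not the improved variant, whose within-class and between-class terms the abstract does not specify. The function names `tf_idf` and `top_terms` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Baseline TF*IDF. `docs` is a list of token lists.

    Returns one {term: weight} dict per document, where
    weight = (term count / document length) * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def top_terms(doc_weights, k=3):
    """Return the k highest-weighted terms, ties broken alphabetically."""
    ranked = sorted(doc_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


if __name__ == "__main__":
    docs = [["web", "page", "topic", "topic"], ["web", "page", "index"]]
    for w in tf_idf(docs):
        print(top_terms(w, k=1))
```

Note that a term appearing in every document gets weight zero regardless of which class those documents belong to; this indifference to class distribution is the deficiency the abstract refers to.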
Keywords/Search Tags: information extraction, WordNet, TF*IDF, semantic similarity