Research On Web Filtering Method Of People Information

Posted on: 2019-07-26
Degree: Master
Type: Thesis
Country: China
Candidate: C Y Zhou
Full Text: PDF
GTID: 2348330569487727
Subject: Information and Communication Engineering
Abstract/Summary:
With the rapid development of information technology and the popularity of smartphones, the Internet has revolutionized the way we communicate and has changed daily practices. People increasingly express their opinions on social networks. These activities leave valuable text data on the network, and this data contains a large amount of information about the people it describes. Accurate access to this people information is of great significance for person profiling (portraits) and other fields. In addition, the informal nature of webpage text presents new challenges at all levels of natural language processing. Therefore, this thesis takes the filtering of people-information webpages as its application, combines it with web data preprocessing technology, and focuses on extracting webpage body text and filtering the relevant webpages. The specific contributions are as follows:

(1) A method for extracting the body text of potential people-information webpages based on DOM node path features. Based on the relationship between a webpage's DOM node paths and its text content, DOM node path feature sets are established for the main body text and for the noise text. The node path features are then clustered to obtain the node paths of the main body. Finally, the body text of the webpage is extracted through the node paths in the main-content cluster. The experimental results show that this method is suitable for extracting text from different types of webpages and achieves high accuracy and speed.

(2) A webpage filtering method based on trigger-word features of people information. This method first uses a topic crawler to crawl webpages that potentially contain personal information from the Internet as the data source, and these pages are then manually labeled. Observation of the text of a large number of people-information webpages shows that many trigger words often appear near the phrases that state a person's attributes. Therefore, this thesis summarizes a set of trigger-word features that describe people's attribute information. Some structural features of the webpage are also extracted during feature extraction. Finally, an SVM classifier is built, trained, and applied to filter people-information webpages. The experimental results show that this method achieves a better filtering effect on people-information webpages and can address the current problem of obtaining such webpages.
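The DOM node path idea in contribution (1) can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not give the feature set or the clustering algorithm, so this sketch assumes the tag path alone is the feature, and it replaces clustering with a simple stand-in that groups text by identical path and keeps the group with the most text. The names `PathTextCollector`, `extract_main_text`, and the `SAMPLE` page are hypothetical and made up for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumption for this sketch: text inside these tags is never page content.
SKIP_TAGS = {"script", "style"}
# Void elements have no closing tag, so they are not pushed on the path stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathTextCollector(HTMLParser):
    """Collect text fragments keyed by the tag path of their DOM node."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # currently open tags, outermost first
        self.texts = defaultdict(list)   # "html/body/div/p" -> [text, ...]

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not SKIP_TAGS.intersection(self.stack):
            self.texts["/".join(self.stack)].append(text)


def extract_main_text(html):
    """Return the text of the node-path group that carries the most text."""
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    if not collector.texts:
        return ""
    # Stand-in for the clustering step: score each path by total text length.
    main_path = max(
        collector.texts,
        key=lambda path: sum(len(t) for t in collector.texts[path]),
    )
    return "\n".join(collector.texts[main_path])


# Made-up page: a navigation list, two body paragraphs, and a footer.
SAMPLE = """
<html><body>
<div><ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul></div>
<div><p>Zhang Wei was born in 1980 in Chengdu.</p>
<p>He graduated from a university in 2002 and works as an engineer.</p></div>
<div><span>Copyright notice</span></div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE))
```

On the sample page, the two paragraphs share the path `html/body/div/p` and outweigh the navigation and footer paths, so only the paragraph text is returned. A real page would likely need richer path features (for example class attributes or text density) and a proper clustering step, as the abstract describes.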
Keywords/Search Tags: people information, social network, text data preprocessing, webpage content extraction, webpage filtering