Research and Implementation of a Commercial Vertical Search Engine Based on Lucene

Posted on: 2016-09-26
Degree: Master
Type: Thesis
Country: China
Candidate: L N Pan
GTID: 2298330452966277
Subject: Information and Communication Engineering
Abstract/Summary:
With the rapid growth of online information resources, people are increasingly concerned with how to extract potentially valuable information from this mass of data quickly and efficiently, so that it can play an effective role in management and decision-making. A vertical search engine is specialized, refined, and deep. It collects only the network information relevant to a given field, user group, or need, then aggregates and indexes it to provide relevant, valuable services and information, improving the accuracy of users' searches.

Online commodity trading is growing rapidly. For this application we design a product search engine that lets users find and purchase the goods they need in a short time, which requires implementing a commercial vertical search engine.

This paper presents a commercial vertical search engine for an electronic commerce system. First, we use a Python crawler to collect commodity data from existing B2C websites such as Jingdong and Tmall, and we develop a back-end module through which commodity data can also be added to the database manually. Second, we discuss a duplicate-elimination algorithm based on MD5 digital signatures. Tests show that it meets practical requirements for precision, recall, and response time. Third, we create Autoword, an automatic word-formation algorithm that defines Chinese words using the theory of association rules. The algorithm dynamically constructs a vocabulary from a large Chinese corpus and supports Chinese text mining.

Based on the characteristics of structured commodity information in the electronic commerce system and on the existing TF-IDF algorithm, we propose an improved ranking algorithm. We also use full-text retrieval and database query technology in the system.
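For orientation, the baseline TF-IDF weighting that the improved ranking algorithm starts from can be sketched as follows. This is only the textbook formula, not the improved algorithm of the thesis, whose details are not given in this abstract. The function names and the sample product titles are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    tf  = term count / document length
    idf = log(N / df), where df is the number of documents containing the term
    """
    n_docs = len(docs)
    # Document frequency: count each term at most once per document.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def rank(query_terms, docs):
    """Rank document indices by the summed TF-IDF weight of the query terms."""
    weights = tf_idf(docs)
    scores = [sum(w.get(t, 0.0) for t in query_terms) for w in weights]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Hypothetical, already-tokenized product titles (not data from the thesis).
products = [
    ["red", "cotton", "shirt"],
    ["blue", "cotton", "jeans"],
    ["red", "leather", "shoes", "red", "laces"],
]
print(rank(["red", "shoes"], products))  # -> [2, 0, 1]
```

A ranking tailored to structured commodity data would presumably weight fields such as title or category differently, but that goes beyond what this sketch shows.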
Combining full-text search with database queries supports relevance ranking, improves retrieval speed, and allows flexible searching over both the structured and the real-time information of goods. A side-by-side comparison of results with other e-commerce websites verifies the superiority of the algorithm.

Most existing search engines adopt a simple input-output response model with no user feedback. In this paper we propose an adaptive algorithm based on user interaction and discuss intelligent sorting, which uses user data to optimize the ranking of results. Finally, the paper completes the overall framework and the implementation of the system.

The main work of this paper is as follows:

1. Design and implementation of the crawler and the duplicate-elimination module.
In this system, the data come from two sources. One is data collected from the existing B2C sites Jingdong and Tmall by a crawler script written in Python, which starts from a seed URL and traverses pages breadth-first. The other is data added to the database manually, for which we developed a dedicated back-end data-entry module. A good vertical search engine requires high-quality data sources, and the quality of search results depends on the quality of the data. The accuracy of the data is therefore crucial if repeated, similar, or incomplete information is to be kept out of the search results. This paper introduces the design of a duplicate-elimination algorithm based on MD5 digital signatures. Experiments show that it meets practical requirements for precision, recall, and response time.

2. Research and application of Autoword vocabulary building.
Words are the basic elements of Chinese text, and the Chinese language model plays a key role in Chinese text mining. Text classification is a data mining technique that operates on high-dimensional data, and most classification algorithms are sensitive to the number of dimensions.
As a result, classification depends on the size of the vocabulary. In addition, most current Chinese language models are based on statistical theory, such as the N-gram model and its improved variants. However, these statistical models suffer from high computational complexity. To improve both quantity and efficiency, this paper gives Chinese words a new definition based on association rules and proposes the Autoword algorithm, by which a vocabulary is constructed automatically and used for Chinese text mining. Finally, the efficiency of the Autoword algorithm is demonstrated by experiment.

3. Improvement of the ranking algorithm.
Based on the characteristics of structured commodity information in the electronic commerce system and on the existing TF-IDF algorithm, we propose an improved ranking algorithm. We also use full-text retrieval and database query technology in the system. Combining full-text search with database queries supports relevance ranking, improves retrieval speed, and allows flexible searching over both the structured and the real-time information of goods. A side-by-side comparison of results with other e-commerce websites verifies the superiority of the algorithm.
Most existing search engines adopt a simple input-output response model with no user feedback. In this paper we propose an adaptive algorithm based on user interaction and discuss intelligent sorting, which uses user data to optimize the ranking of results.

4. Construction and implementation of the overall framework.
We analysed the main framework of Lucene and each of its parts and built a complete development environment. We also studied the indexing module and the retrieval module in detail.
Combining search engine techniques with Lucene's own particular formulation, we designed a commercial vertical search engine based on Lucene. It has the following features:
(1) it accepts data from the Python crawler and also has a back-end module for adding data manually;
(2) it supports word-segmentation queries;
(3) it uses the Lucene toolkit to index web page content;
(4) it uses Ajax to implement the web interaction of the search service, creating dynamic web pages and returning the search results to the user;
(5) it uses the Spring framework for the back-end management system and JSP for the front end;
(6) it supports full-text search;
(7) it highlights the search keywords;
(8) it displays the query time;
(9) it shows the search history and filters keywords;
(10) it can clear the query history.
Word segmentation, text search, and ranking are implemented with the Lucene class library and the algorithms studied in this paper, while keyword highlighting needs only Lucene's Highlighter. The database stores the data persistently.
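As an illustration of the MD5 digital-signature duplicate elimination summarized in item 1, here is a minimal sketch using Python's standard `hashlib`. It is not the algorithm from the thesis, whose details are not given in this abstract. The record fields (`title`, `brand`, `price`), the normalization step, and the function names are all assumptions. Exact-match hashing of this kind only catches records that are identical after normalization, so detecting merely similar records would need something more.

```python
import hashlib

# Hypothetical fields assumed to identify a commodity record.
KEY_FIELDS = ("title", "brand", "price")


def signature(record):
    """Return the MD5 hex digest of a record's normalized key fields."""
    # Lowercase and collapse whitespace so trivially different copies hash alike.
    parts = [" ".join(str(record.get(k, "")).lower().split()) for k in KEY_FIELDS]
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record seen for each signature, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        sig = signature(record)
        if sig not in seen:
            seen.add(sig)
            unique.append(record)
    return unique


# Hypothetical crawled records (not data from the thesis).
crawled = [
    {"title": "Cotton Shirt", "brand": "Acme", "price": "99.00"},
    {"title": "cotton   shirt", "brand": "ACME", "price": "99.00"},  # same item, different formatting
    {"title": "Leather Shoes", "brand": "Acme", "price": "299.00"},
]
print(len(deduplicate(crawled)))  # -> 2
```

Storing only the fixed-length digests in the `seen` set keeps memory use independent of record size, which is one common reason for using a hash signature rather than comparing full records.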
Keywords/Search Tags: association rules, automatic formation, full text retrieval, user feedback, intelligent scheduling