Publish In
International Journal of Advance Computational Engineering and Networking (IJACEN)
Issue
Volume-5, Issue-10 (Oct, 2017)
Paper Title
Modified N-Gram based Model for Identifying and Filtering Near-Duplicate Documents Detection
Author Name
Farheen Naaz, Farheen Siddiqui
Affiliation
School of Engineering Sciences and Technology (SEST), Jamia Hamdard, Hamdard University, New Delhi, India
Pages
55-59
Abstract
Over the last three decades, the World Wide Web (WWW) has expanded exponentially. A great deal of the web consists of duplicate or near-duplicate content. Documents served on the web come in different formats such as PDF, HTML, Excel and plain text. Our proposed solution is built on a publicly available dataset whose files are tagged as duplicates. The work in this paper addresses duplicate and near-duplicate document detection using an n-gram based, low-dimensional representation (LSI-SVD) approach, implemented in C#.NET.
Keywords - Duplicate document, N-gram, SVD (Singular Value Decomposition), LSI (Latent Semantic Indexing), Cosine similarity.
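
The n-gram and cosine-similarity ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' C#.NET implementation: it is written in Python, uses character trigrams, omits the LSI-SVD dimensionality-reduction step, and the function names (`char_ngrams`, `cosine_similarity`, `is_near_duplicate`) and the 0.9 threshold are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams after lowercasing and
    collapsing whitespace (an illustrative normalization choice)."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is not similar to anything
    return dot / (norm_a * norm_b)


def is_near_duplicate(doc_a, doc_b, n=3, threshold=0.9):
    """Flag a pair as near-duplicate when similarity reaches the
    threshold (0.9 is a hypothetical value, not from the paper)."""
    return cosine_similarity(char_ngrams(doc_a, n), char_ngrams(doc_b, n)) >= threshold
```

Two documents that differ only in letter case or spacing produce the same n-gram counts under this normalization, so they are flagged as near-duplicates, while documents sharing no n-grams score 0. A fuller pipeline along the lines the abstract describes would project the n-gram vectors into a lower-dimensional space with SVD before comparing them.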