DOIONLINE

DOIONLINE NO - IJACEN-IRAJ-DOIONLINE-9663

Published In
International Journal of Advance Computational Engineering and Networking (IJACEN)
Issue
Volume 5, Issue 10 (Oct 2017)
Paper Title
Modified N-Gram based Model for Identifying and Filtering Near-Duplicate Documents Detection
Author Name
Farheen Naaz, Farheen Siddiqui
Affiliation
School of Engineering Sciences and Technology (SEST), Jamia Hamdard, Hamdard University, New Delhi, India
Pages
55-59
Abstract
Over the last three decades, the World Wide Web (WWW) has expanded exponentially, and a great deal of it consists of duplicate or near-duplicate content. Documents are served on the web in different formats such as PDF, HTML, Excel, and plain text. Our proposed solution is built on a publicly available dataset consisting of files that are tagged as duplicates. This paper addresses duplicate and near-duplicate document detection using an n-gram based approach with a low-dimensional representation (LSI-SVD), implemented in C#.NET.
Keywords
Duplicate document, N-gram, SVD (Singular Value Decomposition), LSI (Latent Semantic Indexing), Cosine similarity
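To make the keywords concrete, the following is a minimal sketch of how word n-grams and cosine similarity can be combined to flag near-duplicate documents. It is not the authors' implementation (theirs is in C#.NET and is not reproduced on this page); Python is used here for brevity, the LSI-SVD dimensionality-reduction step is omitted, and the function names, the trigram size n=3, and the 0.8 threshold are all illustrative assumptions.

```python
import math
from collections import Counter


def word_ngrams(text, n=3):
    """Count the word n-grams in a text (hypothetical helper, n=3 is an assumption)."""
    tokens = text.lower().split()
    # Each n-gram is a tuple of n consecutive tokens; short texts yield no n-grams.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    # Counter returns 0 for n-grams missing from b, so only shared n-grams contribute.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_near_duplicate(doc_a, doc_b, n=3, threshold=0.8):
    """Flag two documents as near-duplicates (threshold is an illustrative assumption)."""
    return cosine_similarity(word_ngrams(doc_a, n), word_ngrams(doc_b, n)) >= threshold
```

In an LSI-style pipeline, the n-gram count vectors would typically be stacked into a term-document matrix and projected to a lower-dimensional space with a truncated SVD before the cosine comparison; this sketch compares the raw count vectors directly.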