DOIONLINE

DOIONLINE NO - IJAECS-IRAJ-DOIONLINE-1578

Publish In
International Journal of Advances in Electronics and Computer Science-IJAECS
Issue
Volume-1,Issue-2  ( Dec, 2014 )
Paper Title
Entity Conflation In Large Structured Datasets & Post Conflation Database Reconciliation
Author Name
Kumar Amit, Rachna Shivangi
Affiliation
Microsoft, India Development Center, Hyderabad, India; Polaris Financial Technologies Limited, Chennai, India
Pages
11-14
Abstract
Large real-world databases often store a single entity (a person, a place, a country, etc.) as two or more separate entities. The resulting duplication and redundancy can introduce irrelevant or undesired information when these datasets are processed to produce meaningful results. For instance, a database that stores country names can hold ‘South Africa’ and ‘Republic of South Africa’ as two separate entities. This paper proposes an approach to map such entities, purge the duplicates, and reconcile the database so that all foreign key references to the purged entities are updated to point to the entities that are retained. Our experiments on large real-world databases with more than a million entries yielded results with high coverage and precision.
Keywords- Conflation, Confidence Score, Fuzzy String Match, Dice’s Coefficient, Reconciliation, Purging
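The map, purge, and reconcile steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the confidence score is computed, so the sketch assumes Dice's coefficient over character bigrams is used directly as the score. The 0.6 threshold, the sample tables, the "keep the lowest id" rule, and all function names are hypothetical.

```python
from collections import Counter


def bigrams(text):
    """Multiset of character bigrams of the lower-cased, whitespace-normalised text."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice(a, b):
    """Dice's coefficient: 2 * |X & Y| / (|X| + |Y|) over bigram multisets."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:  # both strings are shorter than two characters
        return 1.0 if a.strip().lower() == b.strip().lower() else 0.0
    return 2.0 * sum((x & y).values()) / total


def build_merge_map(entities, threshold=0.6):
    """Map each duplicate id to the id that is kept (assumption: lowest id survives)."""
    merge = {}
    ids = sorted(entities)
    for i, keep in enumerate(ids):
        if keep in merge:  # already marked as a duplicate of an earlier entity
            continue
        for drop in ids[i + 1:]:
            if drop not in merge and dice(entities[keep], entities[drop]) >= threshold:
                merge[drop] = keep
    return merge


def reconcile(entities, rows, merge):
    """Purge duplicate entities and repoint foreign keys to the surviving ids."""
    kept = {k: v for k, v in entities.items() if k not in merge}
    fixed = [(name, merge.get(fk, fk)) for name, fk in rows]
    return kept, fixed


# Hypothetical sample data: a country table and a city table that references it.
countries = {1: "South Africa", 2: "Republic of South Africa", 3: "India"}
cities = [("Cape Town", 2), ("Durban", 1), ("Chennai", 3)]

merge_map = build_merge_map(countries)
kept_countries, fixed_cities = reconcile(countries, cities, merge_map)
```

With this sample data, ‘Republic of South Africa’ is folded into ‘South Africa’ and the Cape Town row is repointed to the surviving id. A pairwise scan like this is quadratic in the number of entities, so a dataset with more than a million entries would need some form of candidate blocking or indexing before scoring; that part is outside the sketch.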