A Novel Approach of Deduplication on Indian Demographic Variation for Large Structured Data
Bhattacharjee K., Garg C., Shivakarthik S., Mehta S., Kumar A., Bhide S., Kulkarni K., Ratnaparkhi S., Agarwal K.
Published in Springer Science and Business Media Deutschland GmbH
2022
Volume: 334
Pages: 343 - 355
Abstract
In the era of Big Data Analytics, information dissemination, data integrity, and the identification of unique records in large pools of data pose a major challenge for analysts in entity matching and linking scenarios. Data ingested from multiple sources about the same real-world entity exhibits several data quality issues, such as redundancy, incorrectness, and variations. There are also data input errors, such as typographical or spelling mistakes and missing fields. Deduplication is the solution for achieving entity resolution and uniqueness, eradicating data redundancy, and improving data quality. Because India is a multilingual and multicultural country with vast demographic variations, there is a need to develop an India-centric model for deduplicating the structured data held by various Indian authorities. This research proposes a novel approach that caters to India-centric demographic variations, region-specific naming conventions, and address standardization. It is a highly customizable and scalable deep learning approach that customizes the DeepMatcher algorithm and pairs it with a synthetic data generation tool that accounts for Indian variations of names and addresses in a region-specific manner. © 2022, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About the journal
Journal: Lecture Notes in Networks and Systems
Publisher: Springer Science and Business Media Deutschland GmbH
ISSN: 2367-3370
Open Access: No