District cleaner #310

Open
wants to merge 11 commits into
base: master

Conversation

@yashy3nugu yashy3nugu commented Dec 10, 2022

Updates the 'DISTRICT' field by fuzzy-matching each value to the closest standardized district name. The district names are taken from here.
The changes made to the dataset can be seen here.
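
For readers unfamiliar with the approach, the following is a minimal sketch of closest-name fuzzy matching. It is not the code from this PR: the function name `closest_district`, the short `STANDARD_DISTRICTS` sample, and the `cutoff` value are all assumptions made for illustration, and it uses Python's standard `difflib` rather than whatever library the PR actually relies on.

```python
from difflib import get_close_matches

# Hypothetical sample of standardized names; the real list is the one linked in the PR.
STANDARD_DISTRICTS = ["Mumbai", "Pune", "Thane", "Kullu", "Shamli"]


def closest_district(raw, choices=STANDARD_DISTRICTS, cutoff=0.6):
    """Return the closest standardized name, or None if nothing clears the cutoff."""
    # Compare in upper case so that "MUMBAI" and "Mumbai" are treated alike.
    lookup = {name.upper(): name for name in choices}
    hits = get_close_matches(raw.strip().upper(), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None
```

With a low `cutoff`, almost every input gets assigned some district, so the cutoff is the main knob that trades coverage against accuracy.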

@captn3m0
Contributor

Deduplicate the diff please. Something like:

-GREATER MUMBAI
+Mumbai

Don't need to know the remaining fields, just a unique list of districts impacted by the change
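
A deduplicated diff of this kind could be produced roughly as follows. This is only a sketch: `unique_changes` and `format_diff` are hypothetical helper names, and the input is assumed to be a sequence of (old, new) district pairs taken from the before and after versions of the dataset.

```python
def unique_changes(pairs):
    """Collapse (old, new) district pairs into a sorted list of distinct changes."""
    # Pairs where nothing changed are dropped; duplicates collapse via the set.
    return sorted({(old, new) for old, new in pairs if old != new})


def format_diff(changes):
    """Render each distinct change as a '-old' line followed by a '+new' line."""
    return "\n".join(f"-{old}\n+{new}" for old, new in changes)
```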

@captn3m0
Contributor

What's the coverage of the change? (How many districts are matched, and left unmatched?)

@yashy3nugu
Author

yashy3nugu commented Dec 14, 2022

> What's the coverage of the change? (How many districts are matched, and left unmatched?)

Only 23 of the 16320 districts were left unmatched and required manual patches. The rest of the districts matched the list.
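
For reference, coverage figures like these could be computed along the following lines. This is a sketch under assumptions: `coverage` is a hypothetical helper, `match` stands in for whatever matching function is used (returning `None` when no district is found), and counting distinct values rather than rows is a guess at how the numbers above were derived.

```python
def coverage(raw_values, match):
    """Summarize how many distinct raw district values a matcher resolves."""
    distinct = sorted(set(raw_values))
    # Anything the matcher cannot resolve is listed for manual patching.
    unmatched = [value for value in distinct if match(value) is None]
    return {
        "total": len(distinct),
        "matched": len(distinct) - len(unmatched),
        "unmatched": unmatched,
    }
```

Note that a count of matched values says nothing about whether the matches are correct, only that the matcher returned something.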

@yashy3nugu
Author

> Deduplicate the diff please. Something like:
>
> -GREATER MUMBAI
> +Mumbai
>
> Don't need to know the remaining fields, just a unique list of districts impacted by the change

Updated the gist

@captn3m0
Contributor

The changes are too aggressive.

-NEAR NEW MONDHA (ANAJ MANDI) HINGOLI
+Gandhinagar
-IN FRONT OF KANYASHALA
+Kalahandi
-RAVI STEEL CHOWK, KAMRE, RATU ROAD
+Amravati
-BLOCK- KANDHLA, DIST - SHAMLI
-NAI BAZAR, BHARWARI
+Hazaribagh
-PATTI, PAKHWANIA
+Panipat
-PO-AKHAR, DUDHER
+Dhar
-LEFT BANK, ALEU, NEW MANALI, DISTT - KULLU
+Dibang Valley
-TAL  JAWHAR  DISTT THANE
+Jalandhar
-NIKETAN ASHRAM, DISTT. PAURI
+Amristar

Don't think we can merge this till we're sure about the accuracy of the data.
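
One way to make the matching less aggressive is sketched below, purely as an illustration and not as the method used in this PR. Several of the raw values above appear to contain a district name as a separate word (for example HINGOLI, SHAMLI, KULLU), so the sketch first looks for exactly one standardized name among the words of the raw value, then falls back to fuzzy matching with a strict threshold, and otherwise flags the value for manual review. The name `conservative_match`, the `min_score` of 0.85, and the sample names in the usage note are assumptions.

```python
import re
from difflib import SequenceMatcher


def conservative_match(raw, choices, min_score=0.85):
    """Return (district, reason), or (None, "review") when no match is trustworthy."""
    upper = raw.upper()
    # Step 1: a standardized single-word name appearing as a whole word in the raw value.
    # Multi-word district names are not handled by this step.
    tokens = set(re.findall(r"[A-Z]+", upper))
    contained = [name for name in choices if name.upper() in tokens]
    if len(contained) == 1:
        return contained[0], "token"
    # Step 2: fuzzy match, accepted only when the similarity score is high.
    best, best_score = None, 0.0
    for name in choices:
        score = SequenceMatcher(None, upper, name.upper()).ratio()
        if score > best_score:
            best, best_score = name, score
    if best is not None and best_score >= min_score:
        return best, "fuzzy"
    # Step 3: leave the value untouched and queue it for manual review.
    return None, "review"
```

With a hypothetical list such as `["Hingoli", "Gandhinagar", "Kullu", "Kalahandi"]`, the value `NEAR NEW MONDHA (ANAJ MANDI) HINGOLI` resolves through the token step, while `IN FRONT OF KANYASHALA` is returned for review instead of being forced onto an unrelated district.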

In the meantime, I've found a nice source for an official list of districts in India, with district codes, that we can perhaps use: https://lgdirectory.gov.in/. Here's a cleaned-up version: https://github.com/planemad/india-local-government-directory/blob/main/administrative/2-district.csv

It's missing a few districts; I've filed a PR for that.

2 participants