-
-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cataloguing data errors #96
Comments
The industry-specific data: sic and NAICS does are often missing and sometimes wrong. so looking at in enforcement in a single industry is very very labor intensive. We've found upwards of 12 FRS id numbers for a single prison. they aren't cross referenced, we just had to find them all and compile them --is that what you mean by disaggregations? once we get the prison data retrieval from the SBU database working we'll be able to have validated lat/long next to EPA lat/long, so we could quantify the inaccuracies. |
"1" is a valid zip code according to the database Would be interesting to create a notebook just for identifying data errors |
Adding this, from Leif: "However, while increased scrutiny has been given to the quality of data in the national systems in recent years, the data quality before November 2000 has not been assessed and should be considered unknown." https://echo.epa.gov/tools/data-downloads <- We should really consider modifying the notebooks to start in 2001 -> "State/Local penalties were not tracked until Docket was replaced by ICIS in 2002." https://echo.epa.gov/tools/data-downloads/icis-fec-download-summary |
This is an important resource: https://echo.epa.gov/resources/echo-data/about-the-data |
From the above ^^^ |
See also: https://echo.epa.gov/resources/echo-data/known-data-problems Perhaps we are trying to add to this/make it more accessible. |
Relevant to the CD LA-2 report: |
More on geocoding errors....
|
There are 52 EPA facilities with FAC_COUNTY listed as METROPOLITAN BOSTON and tracked in the ECHO database, but there is no METROPOLITAN BOSTON county in reality. |
This one made me LOL |
cc @mraisle @leiffredrickson that this collection of data errors exists. There's also this document Principles for Open Environmental Data and I think some things are tracked in the EEW Tracker |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions. |
Still want to do this I think! |
Would this be a place to list discrepancies between SBU Data and ECHO web interface or would that be somewhere else? |
@shapironick I think here would be the best place - not sure there's anywhere else! |
okie great! this is just the one i mentioned in meeting last week: |
@shapironick Hmm...interesting that that facility has no address too... Ok, so I think I know what the problem is. Here we go: First, I can confirm that when I do my own query of the SBU database, with a different notebook, I don't see this formal action either. Note that the formal action was led by US EPA and is being pulled from the ICIS data system, not the SDWIS data system. So that's why it's not showing up in the notebook/database. Because it's not in the SDWA_Enforcements table in the SBU database, which is based on SDWIS. Instead, I suspect the formal enforcement action record resides in the ICIS FEC tables: https://echo.epa.gov/tools/data-downloads/icis-fec-download-summary We have these in the database, but haven't done much with them. I will check this table to confirm. Another great example of the fragmented data infrastructure we have to work with. |
Part of the challenge with working with the ICIS FEC tables - which are kind of like, a bunch of random other enforcement actions not described elsewhere - is that they are not organized by facility. There is no way to look at the CASE_ENFORCEMENTS table by facility. You have to query another table to get the enforcement action IDs associated with the facility, then ping the actual CASE_ENFORCEMENTS table for the detailed records. |
Indeed, I found the formal action in the CASE_ENFORCEMENTS table:
Seems like we need to figure out ways of querying this table. I think it would involve pinging the CASE_FACILITIES table for the REGISTRY_ID field (which links to the main ECHO_EXPORTER table), and then grabbing the ACTIVITY_ID (s) or CASE_NUMBER (s) I tested this approach:
This returns TWO actions, the one from 2016 and also one from 2011 (remember that ECHO.EPA.GOV only displays enforcement actions from the previous 5 years....our database has the full archive) To get the full record then we grab the ACTIVITY_ID (s)
|
@shapironick - I've lost the link to the notebook Steve wrote for you, but if you share it with me, I could try adding in a section that looks for these other enforcement actions. |
Amazing!! thank you, Eric! This is so helpful. Here is shane's notebook https://colab.research.google.com/github/shansen5/CA_Prisons/blob/master/CA_Prisons2.ipynb#scrollTo=MhTd2hLBpUJT do you think that the ICIS FEC table is mutually exclusive with other violations? or would there be the possibility of duplication? |
@shapironick Yes, I think there could be duplicates. According to the metadata: "ICIS may contain duplicates of actions and penalties shown in the program databases (i.e., ICIS-Air, ICIS-NPDES, and RCRAInfo)." https://echo.epa.gov/tools/data-downloads/icis-fec-download-summary |
Ok, I've made some edits to @shansen5's notebook - the principles are there and it'll work for a small number of facilities at this point, but hopefully Steve can adjust for looking at a greater number |
Thank you so much, Eric!
…On Wed, Feb 3, 2021 at 2:44 PM Eric Nost ***@***.***> wrote:
Ok, I've made some edits to @shansen5 <https://github.com/shansen5>'s
notebook - the principles are there and it'll work for a small number of
facilities at this point, but hopefully Steve can adjust for looking at a
greater number
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#96 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABZH33ZRB6NAX4TWD4LVSB3S5HGQVANCNFSM4NFHFRIA>
.
--
Nicholas Shapiro
Assistant Professor
UCLA Institute for Society and Genetics
Office: (310) 206-2366
Book my office hours here
<https://calendly.com/n-s/office-hours-monday-1-30-3-30?>
|
One thing I have tried to do is to look at the percent of facilities that have no available information (blank, null or NaN values in our database). I tested this on the summary Clean Water Act information available in the main ECHO Exporter table. I compared facilities in majority white neighbourhoods with those in majority minority neighborhoods: https://colab.research.google.com/drive/1Ej7d2ovZZx-mQz3wuPzrKWlPa5ZBY-l7 This shows that, across the US, the vast majority of CWA-regulated facilities have no recorded information about # of inspections, violations, enforcement actions, etc. It could be that's because they have no violations / enforcement actions. But reporting null is different from reporting a verified 0! This also shows that for the CWA, facilities located in majority minority neighbourhoods tend to have less certain information. This is especially true for inspections: Percent of facilities with no info on CWA_INSPECTION_COUNT 87.68 (minority) 80.79 (white) Percent of facilities with no info on CWA_DAYS_LAST_INSPECTION 77.87 (minority) 68.86 (white) (edited) |
Incredible work, Eric! |
Watershed errors: The Niagara River HUC8 watershed is listed as 04270101 see here: https://www.sciencebase.gov/catalog/item/5d6f6573e4b0c4f70cf90848 This gets rendered as 4270101 in our database. But looking at the database, the only HUCs listed for facilities in Buffalo are 4120103, 4120104, nan, 4140101, and 2010008. For Tonawanda, it's just 4120104 and nan The Niagara River HUC8 is listed as 04120104 here https://water.usgs.gov/wsc/cat/04120104.html So, unfortunately, it appears there are at least two HUC8 classification schemes in circulation. Seems worth digging into....I think the 427 is more recent and EPA has failed to update yet. |
Linking in @ericnost 's ideas for visualizing missing reports https://edgi.slack.com/archives/CKN6FMX42/p1615479261188300 |
We've mentioned cataloguing data errors before but I don't think we have a place to list them. For now, collecting here:
┆Issue is synchronized with this Trello card
The text was updated successfully, but these errors were encountered: