These fixes are independent of #200 ...
Pull Request Overview
This PR reconfigures the Los Angeles Sheriff’s Department scraper by updating authentication endpoints, modifying key URLs, and improving the rebuild documentation.
- Updated authentication endpoints and request headers.
- Modified disclosure URL and index JSON URL identifiers.
- Expanded inline documentation for reconfiguration steps.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| clean/ca/los_angeles_sheriff.py | Updated URLs and added detailed reconfiguration instructions. |
| clean/ca/config/los_angeles_sheriff.py | Revised request headers and payload details for new authentication. |
Comments suppressed due to low confidence (1)
clean/ca/config/los_angeles_sheriff.py:21
- The index payload now uses a timezoneOffset of 300 (previously 240). Confirm that this change is intentional since it may affect date/time processing in downstream code.
"timezoneOffset":300,
| Content-Length. With your text editor in regex mode: | ||
| -- Search for ^ and replace with " | ||
| -- Search for :space and replace with ": " | ||
| -- Search for $ and replace with ", |
There is an inconsistency in the reconfiguration instructions regarding the pageSize value (one step mentions 9999 while a later step specifies 9990). Please clarify the intended value to avoid confusion during future reconfigurations.
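The regex steps quoted in the diff excerpt above turn copied request headers into Python dict entries by hand. A minimal sketch of doing the same conversion in code follows. The function name `raw_headers_to_dict` and its `skip` parameter are hypothetical and not part of this PR, and whether any header such as Content-Length should be dropped is an assumption left to the caller.

```python
def raw_headers_to_dict(raw: str, skip=()) -> dict:
    """Turn 'Name: value' lines copied from browser dev tools into a dict.

    `skip` is a hypothetical option listing header names to leave out.
    """
    skip_lower = {name.lower() for name in skip}
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only; values such as URLs contain more colons.
        name, sep, value = line.partition(":")
        name = name.strip()
        if not sep or not name:
            continue  # not a 'Name: value' line
        if name.lower() in skip_lower:
            continue
        headers[name] = value.strip()
    return headers


if __name__ == "__main__":
    copied = "Host: lasdsb1421.powerappsportals.us\nX-Requested-With: XMLHttpRequest\n"
    print(raw_headers_to_dict(copied))
```

This avoids hand-editing quotes and commas, which is where a stray character is most likely to slip in.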
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| "caseindex", | ||
| ] # What cached JSON files aren't page-level JSONs? | ||
| self.base_url = "https://lasd.org/" | ||
| self.disclosure_url = "https://lasdsb1421.powerappsportals.us/" | ||
| self.disclosure_url = "https://lasdsb1421.powerappsportals.us/page/" | ||
| self.data_dir = data_dir |
_get_detail_json() still hard-codes the detail Referer to the old /disfiles/?id=... path and overwrites the detail_request_headers['Referer'] value from config on every request. With the portal landing page now /page/ and the config Referer updated to the new _portal/modal-form-template-path/... URL, this mismatch is likely to break auth/403 for detail downloads. Update the Referer logic to match the new portal flow (or stop overriding the config Referer and build it using the new pattern).
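One way to follow the second option (building the Referer from the new pattern) is sketched below. The helper names `build_detail_referer` and `detail_headers` are hypothetical. The template-path and entity-form GUIDs are copied from the Referer in this PR's config, and the idea that only the `id` parameter varies per record is an assumption that has not been checked against the portal.

```python
from urllib.parse import urlencode

PORTAL = "https://lasdsb1421.powerappsportals.us"
# GUIDs copied from the config Referer in this PR; their stability is an assumption.
TEMPLATE_PATH = "/_portal/modal-form-template-path/7ebea772-1fab-4aa3-9c03-f3b767f83247"
ENTITY_FORM_ID = "a0879e75-da5a-49a0-873f-4920ed8ece3f"


def build_detail_referer(record_id: str) -> str:
    """Build a detail-page Referer in the new portal URL pattern (hypothetical helper)."""
    query = urlencode(
        {"id": record_id, "entityformid": ENTITY_FORM_ID, "languagecode": "1033"}
    )
    return f"{PORTAL}{TEMPLATE_PATH}?{query}"


def detail_headers(base_headers: dict, record_id: str) -> dict:
    """Return a copy of the configured headers with a per-record Referer."""
    headers = dict(base_headers)  # copy so the shared config dict is not mutated
    headers["Referer"] = build_detail_referer(record_id)
    return headers
```

Copying the config dict before setting the Referer keeps one record's value from leaking into the next request.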
| "Host": "lasdsb1421.powerappsportals.us", | ||
| "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0", | ||
| "Accept": "application/json, text/javascript, */*; q=0.01", | ||
| "Accept-Language": "en-US,en;q=0.5", | ||
| "Accept-Encoding": "gzip, deflate, br, zstd", | ||
| "Content-Type": "application/json; charset=utf-8", | ||
| "__RequestVerificationToken": "pz0J_b8H_ZlQAmE4hKMgYSLkebSTQlDLIlPOyRoFXP6MFUjCdCQX_pmccAdUE15bN32GryudwMLRfTTXAu7TwOm5NN_KmLyNjrgYte-tjHY1", | ||
| "X-Requested-With": "XMLHttpRequest", | ||
| "Request-Id": "|1dd3b06b17554eea9ca1ebc06f284889.d6781a0999504358", | ||
| "traceparent": "00-1dd3b06b17554eea9ca1ebc06f284889-d6781a0999504358-01", | ||
| "Origin": "https://lasdsb1421.powerappsportals.us", | ||
| "Connection": "keep-alive", | ||
| "Referer": "https://lasdsb1421.powerappsportals.us/page/", | ||
| "Cookie": "ASP.NET_SessionId=2k2vrqpb53tklzcqz0ftqqyy; ARRAffinity=e7388a2c22a416c2e33c2d3ee7de7b82a0b126fabf6b7f4e0b3c496ebe3cb797; ARRAffinitySameSite=e7388a2c22a416c2e33c2d3ee7de7b82a0b126fabf6b7f4e0b3c496ebe3cb797; timezoneoffset=300; isDSTSupport=true; isDSTObserved=false; ContextLanguageCode=en-US; timeZoneCode=35; __RequestVerificationToken=3QKB6wD1TiaX8JabqblCF1zbQmCAE9CW9wd4YXQztEmKpGoV_OFFR-XCSzhdRwnXaZLaY0Qz_MftVbDhScq4eEHYkQ7y_rHzzDVpyJcwKAY1; Dynamics365PortalAnalytics=wZC3-VnEG89IMsWsA6RciIpVlgEmzXyhX6xs75yoMr_M-cKOkmObnb1O8Au2n91ervjoXINj9HlM-PiMEwp3yNIrlT4x2D1f8Ogk_dij8ZTgqe-DuiogJGxgRdWY4h4h1csQkLv5l0rzuhD3kucsWQ2", | ||
| "Sec-Fetch-Dest": "empty", | ||
| "Sec-Fetch-Mode": "cors", | ||
| "Sec-Fetch-Site": "same-origin", |
This adds hard-coded per-session authentication material (Cookie with ASP.NET_SessionId/affinity/analytics and __RequestVerificationToken, plus Request-Id/traceparent) to the repo. These values are ephemeral, make the scraper brittle (it will break when they expire), and committing them is a security/privacy risk. Prefer establishing a fresh session with an initial GET to the portal page, extracting the verification token dynamically, and letting requests.Session() manage cookies; keep only stable headers that are actually required.
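A rough sketch of that flow is below, using only the standard library so it runs without extra dependencies; in the scraper itself `requests.Session()` would fill the same role as the cookie-aware opener. It assumes the portal page embeds the token in a hidden `<input>` named `__RequestVerificationToken`, which is the usual ASP.NET convention but is not confirmed by this PR. The function names are hypothetical.

```python
import http.cookiejar
import urllib.request
from html.parser import HTMLParser

TOKEN_FIELD = "__RequestVerificationToken"


class _TokenParser(HTMLParser):
    """Collects the value of the first <input> named TOKEN_FIELD."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        if tag != "input" or self.token is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("name") == TOKEN_FIELD:
            self.token = attr_map.get("value")


def extract_verification_token(html: str):
    """Return the token found in the page HTML, or None if it is absent."""
    parser = _TokenParser()
    parser.feed(html)
    return parser.token


def open_portal_session(page_url: str, user_agent: str = "Mozilla/5.0"):
    """GET the portal page once; cookies set by the server land in the jar.

    Assumption: the token is present in the HTML of `page_url`.
    """
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", user_agent)]
    with opener.open(page_url, timeout=30) as response:
        html = response.read().decode("utf-8", errors="replace")
    token = extract_verification_token(html)
    if token is None:
        raise ValueError(f"{TOKEN_FIELD} not found at {page_url}")
    return opener, token
```

Later requests made through the returned opener reuse the session cookies, and the token would be sent in the `__RequestVerificationToken` header instead of a value committed to the repo.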
| "Host": "lasdsb1421.powerappsportals.us", | ||
| "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0", | ||
| "Accept": "application/json, text/javascript, */*; q=0.01", | ||
| "Accept-Language": "en-US,en;q=0.5", | ||
| "Accept-Encoding": "gzip, deflate, br, zstd", | ||
| "Content-Type": "application/json", | ||
| "__RequestVerificationToken": "eHcTqQbCi1LqT2xhe50AZS-IY_4JPB6S-WOyeZ_43BorhlZfHO77Q69jKWO3bctuMtKNHSjY_SxQmKCmC0G2N8vhr-3KKu8cOa4GJ15NgOE1", | ||
| "__RequestVerificationToken": "ojoq5FYqn-m9nMvVN2SdlSsx61Rs3QOVQAHa08dQ4mDD9IKHOpw9DvsHqztMhqGK-cfrD6dDgBT9vRA5ey2G4ZMMZOYwz5bTOZlgLLW2KiA1", | ||
| "X-Requested-With": "XMLHttpRequest", | ||
| "Request-Id": "^|180fc898383b4cdea9562818e9ccb2f0.6971a699587e4f02", | ||
| "traceparent": "00-180fc898383b4cdea9562818e9ccb2f0-6971a699587e4f02-01", | ||
| "Request-Id": "|3a9ddf16fa214e86ad1fc1e4423882bd.c68d3b1ed6dd4b62", | ||
| "traceparent": "00-3a9ddf16fa214e86ad1fc1e4423882bd-c68d3b1ed6dd4b62-01", | ||
| "Origin": "https://lasdsb1421.powerappsportals.us", | ||
| "Connection": "keep-alive", | ||
| "Referer": "https://lasdsb1421.powerappsportals.us/disfiles/?id=13434aab-ab8b-ed11-81ad-001dd830a125", | ||
| "Cookie": "Dynamics365PortalAnalytics=I96I2Tvt4N-gPaURejqoFAgdfpCOkV7mfdXsXEgZZq8CooQCFX8ewO5C6tTxgHKGjV8Nqh30acufK6AFfDtdV_SivR7HLAZg5f476jxkzB394E5aPLo8PDI_xXsBmLWgXb5Sf28dZJ2CxuI4re7ZEA2; ASP.NET_SessionId=2k2vrqpb53tklzcqz0ftqqyy; ARRAffinity=254b55dea5200c22439ddc2bd303a9f6d5189518bb2c795f872095b53e417c82; ARRAffinitySameSite=254b55dea5200c22439ddc2bd303a9f6d5189518bb2c795f872095b53e417c82; timezoneoffset=240; isDSTSupport=true; isDSTObserved=true; ContextLanguageCode=en-US; timeZoneCode=35; __RequestVerificationToken=Y4mVGr7Dq1OfgQav9ztK4nDJNNtdU450gGRn6puub7-qbXeiwIiFBzyn-ZFIiwLgFTh13dMhEtTlTXdIUiXIlVaAKO9XENzlm-qMbNC5Egg1", | ||
| "Referer": "https://lasdsb1421.powerappsportals.us/_portal/modal-form-template-path/7ebea772-1fab-4aa3-9c03-f3b767f83247?id=e2c722aa-d0e0-ee11-904d-001dd809c772&entityformid=a0879e75-da5a-49a0-873f-4920ed8ece3f&languagecode=1033", | ||
| "Cookie": "ASP.NET_SessionId=2k2vrqpb53tklzcqz0ftqqyy; ARRAffinity=e7388a2c22a416c2e33c2d3ee7de7b82a0b126fabf6b7f4e0b3c496ebe3cb797; ARRAffinitySameSite=e7388a2c22a416c2e33c2d3ee7de7b82a0b126fabf6b7f4e0b3c496ebe3cb797; timezoneoffset=300; isDSTSupport=true; isDSTObserved=false; ContextLanguageCode=en-US; timeZoneCode=35; __RequestVerificationToken=3QKB6wD1TiaX8JabqblCF1zbQmCAE9CW9wd4YXQztEmKpGoV_OFFR-XCSzhdRwnXaZLaY0Qz_MftVbDhScq4eEHYkQ7y_rHzzDVpyJcwKAY1; Dynamics365PortalAnalytics=wZC3-VnEG89IMsWsA6RciIpVlgEmzXyhX6xs75yoMr_M-cKOkmObnb1O8Au2n91ervjoXINj9HlM-PiMEwp3yNIrlT4x2D1f8Ogk_dij8ZTgqe-DuiogJGxgRdWY4h4h1csQkLv5l0rzuhD3kucsWQ2", | ||
| "Sec-Fetch-Dest": "empty", | ||
| "Sec-Fetch-Mode": "cors", | ||
| "Sec-Fetch-Site": "same-origin", |
Same issue for the detail request headers: committing a full Cookie header and request-specific tracing IDs makes this scraper depend on a captured browser session and likely to fail once the session/token expires. It also increases the risk of leaking sensitive session identifiers. Refactor to fetch/refresh these values at runtime (session + token extraction) rather than storing them in source control.
| URL. Copy that URL into JSONINDEXURL in the code below. | ||
| -- Open config/ca/los_angeles_sheriff.py. |
The new troubleshooting instructions reference paths/names that don’t match the codebase: the config file is clean/ca/config/los_angeles_sheriff.py (not config/ca/...), and the index URL is stored in indexjsonurl (not JSONINDEXURL). Updating these references will make the rebuild steps actionable for the next person.
| URL. Copy that URL into JSONINDEXURL in the code below. | |
| -- Open config/ca/los_angeles_sheriff.py. | |
| URL. Copy that URL into indexjsonurl in the code below. | |
| -- Open clean/ca/config/los_angeles_sheriff.py. |
Description
Reconfigure Los Angeles Sheriff's Department
Summary of Changes
-- All the authentication stuff changed, as did the main URLs.
-- Added significant documentation showing how to rebuild this next time.
Related Issues
This is flagged in #162
How to Review
See if it still runs for you.
python -m clean.cli scrape-meta ca_los_angeles_sheriff -l debug
See if the documentation seems sufficient and appropriate.
Notes