
Reconfig LASD #162 #181

Open
stucka wants to merge 6 commits into dev from lasd-162


@stucka stucka commented Dec 27, 2024

Description

Reconfigure Los Angeles Sheriff's Department

Summary of Changes

-- The authentication details changed, as did the main URLs.

-- Added detailed documentation showing how to rebuild the configuration next time.

Related Issues

This is flagged in #162

How to Review

See if it still runs for you.

python -m clean.cli scrape-meta ca_los_angeles_sheriff -l debug

See if the documentation seems sufficient and appropriate.

Notes

  • URLs changed, as did the authentication details

@stucka stucka requested a review from tarakc02 February 7, 2025 18:22

stucka commented Apr 7, 2025

These fixes are independent of #200 ...

@newsroomdev newsroomdev requested a review from Copilot April 23, 2025 20:19
Copilot AI left a comment

Pull Request Overview

This PR reconfigures the Los Angeles Sheriff’s Department scraper by updating authentication endpoints, modifying key URLs, and improving the rebuild documentation.

  • Updated authentication endpoints and request headers.
  • Modified disclosure URL and index JSON URL identifiers.
  • Expanded inline documentation for reconfiguration steps.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| clean/ca/los_angeles_sheriff.py | Updated URLs and added detailed reconfiguration instructions. |
| clean/ca/config/los_angeles_sheriff.py | Revised request headers and payload details for new authentication. |

Comments suppressed due to low confidence (1)

clean/ca/config/los_angeles_sheriff.py:21

  • The index payload now uses a timezoneOffset of 300 (previously 240). Confirm that this change is intentional since it may affect date/time processing in downstream code.
"timezoneOffset":300,

Content-Length. With your text editor in regex mode:
-- Search for ^ and replace with "
-- Search for :space and replace with ": "
-- Search for $ and replace with ",

Copilot AI Apr 23, 2025

There is an inconsistency in the reconfiguration instructions regarding the pageSize value (one step mentions 9999 while a later step specifies 9990). Please clarify the intended value to avoid confusion during future reconfigurations.
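
The find-and-replace steps quoted in the diff context above turn header lines copied from the browser into Python dict entries. The same conversion could be sketched in code, which avoids hand-editing. The function name `headers_from_raw` and its `skip` parameter are hypothetical and not part of this PR; dropping Content-Length in the usage line is also an assumption, since the quoted instruction is cut off before it says what to do with that header.

```python
def headers_from_raw(raw, skip=()):
    """Convert 'Name: value' lines copied from browser dev tools into a dict.

    `skip` lists header names to leave out (hypothetical parameter).
    """
    headers = {}
    for line in raw.strip().splitlines():
        # Split on the first colon only, so URLs in values stay intact.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a header line
        name = name.strip()
        if name in skip:
            continue
        headers[name] = value.strip()
    return headers


raw_headers = """
Host: lasdsb1421.powerappsportals.us
X-Requested-With: XMLHttpRequest
Referer: https://lasdsb1421.powerappsportals.us/page/
Content-Length: 123
"""

index_request_headers = headers_from_raw(raw_headers, skip={"Content-Length"})
```

The resulting dict can be pasted into the config file or, with care about session-specific values, built at runtime.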

Copilot uses AI. Check for mistakes.
Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.



Comment on lines 79 to 83
  "caseindex",
  ] # What cached JSON files aren't page-level JSONs?
  self.base_url = "https://lasd.org/"
- self.disclosure_url = "https://lasdsb1421.powerappsportals.us/"
+ self.disclosure_url = "https://lasdsb1421.powerappsportals.us/page/"
  self.data_dir = data_dir

Copilot AI Mar 26, 2026

_get_detail_json() still hard-codes the detail Referer to the old /disfiles/?id=... path and overwrites the detail_request_headers['Referer'] value from config on every request. With the portal landing page now /page/ and the config Referer updated to the new _portal/modal-form-template-path/... URL, this mismatch is likely to break auth/403 for detail downloads. Update the Referer logic to match the new portal flow (or stop overriding the config Referer and build it using the new pattern).
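
One way to read the second option (build the Referer using the new pattern) is sketched below. The helper names `modal_referer` and `detail_headers` are hypothetical, and the assumption that the `id` query parameter identifies the record being requested is inferred only from the shape of the Referer URL in the config diff, not confirmed by the scraper code.

```python
from urllib.parse import urlencode


def modal_referer(base_url, template_id, record_id, entityform_id, language_code=1033):
    """Build a Referer in the _portal/modal-form-template-path/... shape."""
    query = urlencode(
        {
            "id": record_id,
            "entityformid": entityform_id,
            "languagecode": language_code,
        }
    )
    return f"{base_url.rstrip('/')}/_portal/modal-form-template-path/{template_id}?{query}"


def detail_headers(config_headers, referer=None):
    """Copy the config headers; override Referer only when one is passed in."""
    headers = dict(config_headers)
    if referer is not None:
        headers["Referer"] = referer
    return headers


referer = modal_referer(
    "https://lasdsb1421.powerappsportals.us/",
    template_id="7ebea772-1fab-4aa3-9c03-f3b767f83247",
    record_id="e2c722aa-d0e0-ee11-904d-001dd809c772",
    entityform_id="a0879e75-da5a-49a0-873f-4920ed8ece3f",
)
headers = detail_headers({"X-Requested-With": "XMLHttpRequest"}, referer=referer)
```

Copying the config dict before modifying it also keeps one request's Referer from leaking into the next.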

Comment on lines +2 to +18
"Host": "lasdsb1421.powerappsportals.us",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0",
"Accept": "application/json, text/javascript, */*; q=0.01",
"Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate, br, zstd",
"Content-Type": "application/json; charset=utf-8",
"__RequestVerificationToken": "pz0J_b8H_ZlQAmE4hKMgYSLkebSTQlDLIlPOyRoFXP6MFUjCdCQX_pmccAdUE15bN32GryudwMLRfTTXAu7TwOm5NN_KmLyNjrgYte-tjHY1",
"X-Requested-With": "XMLHttpRequest",
"Request-Id": "|1dd3b06b17554eea9ca1ebc06f284889.d6781a0999504358",
"traceparent": "00-1dd3b06b17554eea9ca1ebc06f284889-d6781a0999504358-01",
"Origin": "https://lasdsb1421.powerappsportals.us",
"Connection": "keep-alive",
"Referer": "https://lasdsb1421.powerappsportals.us/page/",
"Cookie": "ASP.NET_SessionId=2k2vrqpb53tklzcqz0ftqqyy; ARRAffinity=e7388a2c22a416c2e33c2d3ee7de7b82a0b126fabf6b7f4e0b3c496ebe3cb797; ARRAffinitySameSite=e7388a2c22a416c2e33c2d3ee7de7b82a0b126fabf6b7f4e0b3c496ebe3cb797; timezoneoffset=300; isDSTSupport=true; isDSTObserved=false; ContextLanguageCode=en-US; timeZoneCode=35; __RequestVerificationToken=3QKB6wD1TiaX8JabqblCF1zbQmCAE9CW9wd4YXQztEmKpGoV_OFFR-XCSzhdRwnXaZLaY0Qz_MftVbDhScq4eEHYkQ7y_rHzzDVpyJcwKAY1; Dynamics365PortalAnalytics=wZC3-VnEG89IMsWsA6RciIpVlgEmzXyhX6xs75yoMr_M-cKOkmObnb1O8Au2n91ervjoXINj9HlM-PiMEwp3yNIrlT4x2D1f8Ogk_dij8ZTgqe-DuiogJGxgRdWY4h4h1csQkLv5l0rzuhD3kucsWQ2",
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "same-origin",

Copilot AI Mar 26, 2026

This adds hard-coded per-session authentication material (Cookie with ASP.NET_SessionId/affinity/analytics and __RequestVerificationToken, plus Request-Id/traceparent) to the repo. These values are ephemeral, make the scraper brittle (it will break when they expire), and committing them is a security/privacy risk. Prefer establishing a fresh session with an initial GET to the portal page, extracting the verification token dynamically, and letting requests.Session() manage cookies; keep only stable headers that are actually required.
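
A rough sketch of that suggestion follows. It assumes the portal page embeds the token as a hidden `<input name="__RequestVerificationToken">` element, which has not been verified against this site, and the function names and the choice of "stable" headers are illustrative only. The `session` argument is expected to behave like a `requests.Session()`, so cookies set by the initial GET are carried on later requests automatically.

```python
from html.parser import HTMLParser


class _TokenParser(HTMLParser):
    """Collect the first __RequestVerificationToken hidden input value."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (
            self.token is None
            and tag == "input"
            and attributes.get("name") == "__RequestVerificationToken"
        ):
            self.token = attributes.get("value")


def extract_verification_token(html):
    """Return the token found in the page HTML, or None if absent."""
    parser = _TokenParser()
    parser.feed(html)
    return parser.token


def fresh_index_headers(session, portal_url):
    """GET the portal page with the session and build per-run headers.

    `session` is any object with a requests.Session-style get() method.
    """
    response = session.get(portal_url, timeout=30)
    response.raise_for_status()
    token = extract_verification_token(response.text)
    if token is None:
        raise ValueError("verification token not found; page layout may differ")
    # Which headers the portal truly requires is an assumption here.
    return {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Content-Type": "application/json; charset=utf-8",
        "X-Requested-With": "XMLHttpRequest",
        "Referer": portal_url,
        "__RequestVerificationToken": token,
    }
```

With `requests` installed, usage would look like `fresh_index_headers(requests.Session(), "https://lasdsb1421.powerappsportals.us/page/")`, with the same session then reused for the index and detail POSTs.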

Comment on lines +24 to 40
  "Host": "lasdsb1421.powerappsportals.us",
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0",
  "Accept": "application/json, text/javascript, */*; q=0.01",
  "Accept-Language": "en-US,en;q=0.5",
  "Accept-Encoding": "gzip, deflate, br, zstd",
  "Content-Type": "application/json",
- "__RequestVerificationToken": "eHcTqQbCi1LqT2xhe50AZS-IY_4JPB6S-WOyeZ_43BorhlZfHO77Q69jKWO3bctuMtKNHSjY_SxQmKCmC0G2N8vhr-3KKu8cOa4GJ15NgOE1",
+ "__RequestVerificationToken": "ojoq5FYqn-m9nMvVN2SdlSsx61Rs3QOVQAHa08dQ4mDD9IKHOpw9DvsHqztMhqGK-cfrD6dDgBT9vRA5ey2G4ZMMZOYwz5bTOZlgLLW2KiA1",
  "X-Requested-With": "XMLHttpRequest",
- "Request-Id": "^|180fc898383b4cdea9562818e9ccb2f0.6971a699587e4f02",
- "traceparent": "00-180fc898383b4cdea9562818e9ccb2f0-6971a699587e4f02-01",
+ "Request-Id": "|3a9ddf16fa214e86ad1fc1e4423882bd.c68d3b1ed6dd4b62",
+ "traceparent": "00-3a9ddf16fa214e86ad1fc1e4423882bd-c68d3b1ed6dd4b62-01",
  "Origin": "https://lasdsb1421.powerappsportals.us",
  "Connection": "keep-alive",
- "Referer": "https://lasdsb1421.powerappsportals.us/disfiles/?id=13434aab-ab8b-ed11-81ad-001dd830a125",
- "Cookie": "Dynamics365PortalAnalytics=I96I2Tvt4N-gPaURejqoFAgdfpCOkV7mfdXsXEgZZq8CooQCFX8ewO5C6tTxgHKGjV8Nqh30acufK6AFfDtdV_SivR7HLAZg5f476jxkzB394E5aPLo8PDI_xXsBmLWgXb5Sf28dZJ2CxuI4re7ZEA2; ASP.NET_SessionId=2k2vrqpb53tklzcqz0ftqqyy; ARRAffinity=254b55dea5200c22439ddc2bd303a9f6d5189518bb2c795f872095b53e417c82; ARRAffinitySameSite=254b55dea5200c22439ddc2bd303a9f6d5189518bb2c795f872095b53e417c82; timezoneoffset=240; isDSTSupport=true; isDSTObserved=true; ContextLanguageCode=en-US; timeZoneCode=35; __RequestVerificationToken=Y4mVGr7Dq1OfgQav9ztK4nDJNNtdU450gGRn6puub7-qbXeiwIiFBzyn-ZFIiwLgFTh13dMhEtTlTXdIUiXIlVaAKO9XENzlm-qMbNC5Egg1",
+ "Referer": "https://lasdsb1421.powerappsportals.us/_portal/modal-form-template-path/7ebea772-1fab-4aa3-9c03-f3b767f83247?id=e2c722aa-d0e0-ee11-904d-001dd809c772&entityformid=a0879e75-da5a-49a0-873f-4920ed8ece3f&languagecode=1033",
+ "Cookie": "ASP.NET_SessionId=2k2vrqpb53tklzcqz0ftqqyy; ARRAffinity=e7388a2c22a416c2e33c2d3ee7de7b82a0b126fabf6b7f4e0b3c496ebe3cb797; ARRAffinitySameSite=e7388a2c22a416c2e33c2d3ee7de7b82a0b126fabf6b7f4e0b3c496ebe3cb797; timezoneoffset=300; isDSTSupport=true; isDSTObserved=false; ContextLanguageCode=en-US; timeZoneCode=35; __RequestVerificationToken=3QKB6wD1TiaX8JabqblCF1zbQmCAE9CW9wd4YXQztEmKpGoV_OFFR-XCSzhdRwnXaZLaY0Qz_MftVbDhScq4eEHYkQ7y_rHzzDVpyJcwKAY1; Dynamics365PortalAnalytics=wZC3-VnEG89IMsWsA6RciIpVlgEmzXyhX6xs75yoMr_M-cKOkmObnb1O8Au2n91ervjoXINj9HlM-PiMEwp3yNIrlT4x2D1f8Ogk_dij8ZTgqe-DuiogJGxgRdWY4h4h1csQkLv5l0rzuhD3kucsWQ2",
  "Sec-Fetch-Dest": "empty",
  "Sec-Fetch-Mode": "cors",
  "Sec-Fetch-Site": "same-origin",

Copilot AI Mar 26, 2026

Same issue for the detail request headers: committing a full Cookie header and request-specific tracing IDs makes this scraper depend on a captured browser session and likely to fail once the session/token expires. It also increases the risk of leaking sensitive session identifiers. Refactor to fetch/refresh these values at runtime (session + token extraction) rather than storing them in source control.

Comment on lines +31 to +32
URL. Copy that URL into JSONINDEXURL in the code below.
-- Open config/ca/los_angeles_sheriff.py.

Copilot AI Mar 26, 2026

The new troubleshooting instructions reference paths/names that don’t match the codebase: the config file is clean/ca/config/los_angeles_sheriff.py (not config/ca/...), and the index URL is stored in indexjsonurl (not JSONINDEXURL). Updating these references will make the rebuild steps actionable for the next person.

Suggested change
- URL. Copy that URL into JSONINDEXURL in the code below.
- -- Open config/ca/los_angeles_sheriff.py.
+ URL. Copy that URL into indexjsonurl in the code below.
+ -- Open clean/ca/config/los_angeles_sheriff.py.


3 participants