
Write JSON state #997

Merged: 22 commits merged into master from u/kkasp/TRON-2237-write-json-state on Oct 29, 2024
Conversation

KaspariK (Member)

What

We store job and job run state as pickles. We would like to stop doing that. This PR is part of that path: it writes state data as JSON alongside our current pickles. Restoring from JSON will follow in another PR.

Why

In DAR-2328, the removal of the Mesos-related code reset Tron's job state, causing job runs to start again from "0". The pickles could no longer be unpickled correctly because unpickling depends on classes that were deleted along with the Mesos code.

KaspariK marked this pull request as ready for review on September 17, 2024 16:59
nemacysts (Member) left a comment:

ty for adding all these types as well!

Comment on lines 204 to 211:

    @staticmethod
    def to_json(state_data: dict) -> str:
        return json.dumps(
            {
                "status_path": state_data["status_path"],
                "exec_path": state_data["exec_path"],
            }
        )
Member:

should these to_json() functions be normal methods so that it's easier to add additional data to be serialized in the future? as-is, we'd need to track down any calls of to_json() for any modified classes and ensure that the state_data dict that we build for those calls has any new fields

e.g.,

Suggested change:

    @staticmethod
    def to_json(state_data: dict) -> str:
        return json.dumps(
            {
                "status_path": state_data["status_path"],
                "exec_path": state_data["exec_path"],
            }
        )

becomes:

    def to_json(self) -> str:
        return json.dumps(
            {
                "status_path": self.status_path,
                "exec_path": self.exec_path,
            }
        )

i'm also debating whether or not we should have this return a dict and we call json.dumps() before saving, but that's probably not too big a change if we wanna do that later - and the current approach means that if something cannot be serialized to json, we'll get a better traceback :)
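For illustration, the dict-returning variant being debated here might look roughly like the sketch below. It is not what the PR implements; the class and attribute names are borrowed from the snippet under discussion and elsewhere in this thread, and to_json_data is a hypothetical name.

    import json

    class SubprocessActionRunnerFactory:
        def __init__(self, status_path: str, exec_path: str) -> None:
            self.status_path = status_path
            self.exec_path = exec_path

        def to_json_data(self) -> dict:
            # hypothetical variant: return plain data and leave serialization to the caller
            return {
                "status_path": self.status_path,
                "exec_path": self.exec_path,
            }

    # at save time the caller would do json.dumps(obj.to_json_data()); keeping json.dumps()
    # inside to_json() (as in the PR) means a non-serializable field fails closer to its source,
    # which is the "better traceback" trade-off mentioned above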

Member:

oh, i see - a lot of these have state_data properties! i think that makes this less pressing since the calls will (i assume) look like `SomeClass.to_json(some_object.state_data)`

but it might still be nicer to switch since for things that do implement state_data() this would look like

Suggested change:

    @staticmethod
    def to_json(state_data: dict) -> str:
        return json.dumps(
            {
                "status_path": state_data["status_path"],
                "exec_path": state_data["exec_path"],
            }
        )

becomes:

    def to_json(self) -> str:
        state_data = self.state_data
        return json.dumps(
            {
                "status_path": state_data["status_path"],
                "exec_path": state_data["exec_path"],
            }
        )

and for classes that don't have a state_data property we'd be in the scenario described above and would benefit from this no longer being a staticmethod :)

Member:

(another benefit would also be avoiding any potential circular imports in the future since we won't need to import classes just to call to_json() :)

Member:

oh, i see - when this gets called in tron/serialize/runstate/dynamodb_state_store.py we just have a key and a dict ;_;

Member:

hmm, maybe this is better for a post-pickle-deletion refactor where we update the simplified code to pass around the actual objects rather than state_data dicts

Member Author:

lol, yeah basically my train of thought while writing all this. Your last comment is where I landed as well. It would have made this work a lot easier if we just had the objects. Ultimately it shouldn't be too terrible to refactor. Added a TODO/ticket.

    def get_type_from_key(self, key: str) -> str:
        return key.split()[0]

    def _serialize_item(self, key: str, state: Dict[str, Any]) -> str:
Member:

if this only accepts two values for key, i think we can write key: Literal[runstate.JOB_STATE, runstate.JOB_RUN_STATE] rather than key: str

KaspariK (Member Author), Oct 18, 2024:

I did have that, but was getting "Variable not allowed in type expression". I guess in this case I should have done Literal["job_state", "job_run_state"], or maybe just pretended it's an enum and added # type: ignore?

At any rate, I added the ignore, but let me know if you have a preference.
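For context, a minimal sketch of the two options discussed here: mypy only accepts literal values inside Literal[...], and module-level constants like runstate.JOB_STATE count as variables, which is what produces "Variable not allowed in type expression". The function below is hypothetical.

    from typing import Literal

    # accepted by mypy: Literal[...] takes literal string values directly
    def table_for_key(key: Literal["job_state", "job_run_state"]) -> str:
        return f"tron-{key}"

    # rejected by mypy unless silenced with a `# type: ignore`, which is the route taken above:
    #   def table_for_key(key: Literal[runstate.JOB_STATE, runstate.JOB_RUN_STATE]) -> str: ...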

Comment on lines 67 to 68:

    return sorted_groups  # type: ignore
    return sorted_groups  # type: ignore
Member:

could we add comments for these ignores?

Member Author:

Added types. This file was one of those dominoes that came up when adding types elsewhere. I tried to use my whole brain to do this one because it's kind of a spooky change, but I think I got it. Infrastage testing looks good, but it's not exhaustive.

    "json_val": {
        "S": json_val[index * OBJECT_SIZE : min(index * OBJECT_SIZE + OBJECT_SIZE, len(json_val))]
    },
    "num_json_val_partitions": {
Member:

i always thought that this num_partitions stuff was something that dynamo required 🤣 - i guess this is just for our usage (to know how many partitions we're using per-item?)?

Member Author:

Yeah, exactly. We have key+index for our compound key, but we use num_partitions in a funny multi-step restore thing where we get all the first partitions, then for every one of those we try _get_remaining_partitions (which annoyingly will need to handle JSON having more partitions).

I debated at least adding a check before this to see if num_partitions was greater than 1, but just never got around to it. This might be worth a refactor after we've depickled things.
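A sketch of the partitioning being discussed: only the slicing expression comes from the diff above, while OBJECT_SIZE's value and the helper names are assumptions.

    import math

    OBJECT_SIZE = 400_000  # assumed chunk size; DynamoDB caps items at 400 KB

    def num_partitions_for(json_val: str) -> int:
        # num_json_val_partitions: how many chunks a later restore has to stitch back together
        return max(1, math.ceil(len(json_val) / OBJECT_SIZE))

    def partition(json_val: str, index: int) -> str:
        # the slice from the diff above: chunk `index` of the serialized state
        return json_val[index * OBJECT_SIZE : min(index * OBJECT_SIZE + OBJECT_SIZE, len(json_val))]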

nemacysts (Member) left a comment:

generally lgtm - we can hopefully refactor things a bit to make this even nicer after all the pickle logic is gone :p

i think my only remaining question is: do we need to handle json failures in any specific way?

(and i'll leave it up to you whether or not you'd like to merge the crontab changes)

Member:

we talked about this a bit in our 1:1, but i think it's fine to type: ignore things a bit and deal with fixing any type issues here for a later PR if we're not feeling particularly confident in these changes :)

KaspariK (Member Author), Oct 24, 2024:

For the sake of posterity, I kept this, but tried to maintain the logic as much as possible and added a few tests for good measure.

@@ -195,6 +202,15 @@ def __eq__(self, other):

    def __ne__(self, other):
        return not self == other

    @staticmethod
    def to_json(state_data: dict) -> str:
Member:

from our 1:1: should we (or do we need to) handle failures in to_json? e.g., non-serializable data, missing data in the state_data dict leading to a KeyError?

if we want the json writing to be optional until we're reading from the json fields, we probably also want to make sure that json serialization failures don't result in losing pickle'd writes :)
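One way to keep the JSON write best-effort so that a serialization failure can never block the pickled write, sketched below. The "val" attribute name and the helper are assumptions; "key", "index", "json_val", and "num_json_val_partitions" come from the discussion and diff in this thread.

    import json
    import logging
    import pickle
    from typing import Any, Dict

    log = logging.getLogger(__name__)

    def build_attributes(key: str, index: int, state_data: Dict[str, Any]) -> Dict[str, Any]:
        pickled = pickle.dumps(state_data)
        attributes: Dict[str, Any] = {
            "key": {"S": key},
            "index": {"N": str(index)},
            "val": {"B": pickled},  # assumed name of the existing pickle attribute
        }
        try:
            attributes["json_val"] = {"S": json.dumps(state_data)}
        except Exception:
            # best effort: log (and alert) on the JSON failure, but still write the pickle
            log.exception("Failed to serialize %s to JSON", key)
        return attributes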

KaspariK force-pushed the u/kkasp/TRON-2237-write-json-state branch from 6054272 to 2c41653 on October 24, 2024 16:38
nemacysts (Member) left a comment:

i have one last question/thought around the save queue handling, but otherwise i think this is good to start testing in our non-prod envs

we should probably set up some log-based alerting (or maybe use the prometheus integration?) for whenever json writing errors happen, since they're definitely things we'll want to handle ASAP as part of the pickle-removal process

Comment on lines 215 to 219:

    except KeyError as e:
        log.error(f"Missing key in state_data: {e}")
        return None
    except Exception as e:
        log.error(f"Error serializing SubprocessActionRunnerFactory to JSON: {e}")
Member:

doing something like the below suggestion here and in the other to_json()

Suggested change:

    except KeyError as e:
        log.error(f"Missing key in state_data: {e}")
        return None
    except Exception as e:
        log.error(f"Error serializing SubprocessActionRunnerFactory to JSON: {e}")

becomes:

    except KeyError:
        log.exception(f"Missing key in state_data:")
        return None
    except Exception:
        log.exception(f"Error serializing SubprocessActionRunnerFactory to JSON:")

might be nice since it'll include the full traceback

    log.error(error)
    # Add items back to the queue if we failed to save
Member:

thoughts on adding back to the save queue with just the pickled data? if there are any issues writing the json, a second json-less attempt might still succeed, right?

Member Author:

It's a solid idea. I don't think we see tron_dynamodb_save_failure very often so it's a reasonable fallback. Would be nice to catch any partitioning issues.
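A sketch of that fallback, under the assumption (suggested by the truncated commit message below about changing the setitem signature to take a value tuple) that queued values are (pickled, json) pairs; the class, queue, and lock names here are hypothetical.

    import threading
    from collections import OrderedDict

    class SaveQueueFallback:
        """Hypothetical sketch of the retry idea; the real store differs."""

        def __init__(self) -> None:
            self.save_lock = threading.Lock()
            self.save_queue: "OrderedDict[str, tuple]" = OrderedDict()

        def requeue_without_json(self, key: str, pickled_val: bytes) -> None:
            # retry with only the pickled payload so a JSON-specific failure
            # (e.g. extra partitions) can't keep blocking the pickle write
            with self.save_lock:
                self.save_queue[key] = (pickled_val, None)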

KaspariK (Member Author):

> we should probably set up some log-based alerting (or maybe use the prometheus integration?) for whenever json writing errors happen since they're definitely things we'll want to handle ASAP as part of the pickle-removal process

When you say prom integration are you thinking some counter like json_serialization_errors.inc() on failed serialization, then alerting on that?

nemacysts (Member):

> When you say prom integration are you thinking some counter like json_serialization_errors.inc() on failed serialization, then alerting on that?

@KaspariK yup - exactly!
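A minimal sketch of that counter idea with prometheus_client; the metric and function names are hypothetical.

    import json
    from typing import Optional

    from prometheus_client import Counter

    JSON_SERIALIZATION_ERRORS = Counter(
        "tron_json_serialization_errors_total",
        "Failures while serializing state_data to JSON",
    )

    def to_json_or_none(state_data: dict) -> Optional[str]:
        try:
            return json.dumps(state_data)
        except Exception:
            JSON_SERIALIZATION_ERRORS.inc()  # alert when this counter increases
            return None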

Commit messages (truncated):

- …d to_json methods for classes in DynamoDB restore flow. Write an additional attribute to DynamoDB to capture non-pickled state_data.
- …odb_state_store to something a little more explanatory now that we have 2 versions
- …saction limit, and change setitem signature for take value tuple
- …rom pickle write so that we maintain writing pickles if JSON fails
KaspariK force-pushed the u/kkasp/TRON-2237-write-json-state branch from 2ce62ad to b0186d1 on October 29, 2024 20:41
KaspariK merged commit 3c63aa3 into master on Oct 29, 2024
4 checks passed