serialization codegen: parse from data to code #1244

@milahu

so far we can use kaitai to parse from data to data
from low-level data (bytes) to high-level data (numbers, strings, ...)

it would be nice to have a way
to parse from data to code:
to "reverse engineer" high-level serialization code
which generates the same bytes as the input data

this would be helpful to create custom serialization functions
by refactoring the generated code

this would also be helpful for fuzz-testing the correctness of ksy files
by feeding known-good data into the pipeline
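
The round-trip idea can be sketched without any generated code. This is only an illustration: `parse`, `serialize` and `roundtrip_ok` are hypothetical hand-written stand-ins for what a ksy-generated parser and serializer would do, and the layout (a 1-byte key followed by a 4-byte UTF-8 string) is an assumption.

```python
import struct

# Assumed layout: 1-byte key followed by a 4-byte UTF-8 string.
LAYOUT = struct.Struct(">1s4s")


def parse(data):
    """Hypothetical stand-in for a generated parser: bytes -> fields."""
    key, raw_value = LAYOUT.unpack(data)
    return {"key": key, "value": raw_value.decode("utf-8")}


def serialize(fields):
    """Hypothetical stand-in for a generated serializer: fields -> bytes."""
    return LAYOUT.pack(fields["key"], fields["value"].encode("utf-8"))


def roundtrip_ok(data):
    """True if parsing and re-serializing reproduces the input bytes."""
    return serialize(parse(data)) == data


print(roundtrip_ok(b"\x00asdf"))  # True
```

A format description that drops or misreads a field would fail this check on known-good data, which is the kind of error the round trip is meant to catch.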

example:

# some_format.ksy
meta:
  id: some_format
  endian: be
seq:
  - id: key
    size: 1
  - id: value
    type: str
    size: 4
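    encoding: UTF-8

#!/usr/bin/env python3

import some_format

input_data = b"\x00asdf"

# generate serialization code that reproduces input_data
code = some_format.codegen_from_data(input_data)
with open("editme.py", "w") as f:
  f.write(code)

# exec, not eval: the generated code contains statements (def, import)
namespace = {}
exec(code, namespace)
_io = namespace["write_data"]()
_io.seek(0)
# KaitaiStream has no read(), read_bytes_full() reads to the end of the stream
output_data = _io.read_bytes_full()
assert input_data == output_data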

the generated code would look like

#!/usr/bin/env python3

data_size = 5

def create_data():
  import some_format
  root = some_format.SomeFormat()
  root.key = b"\x00"  # "size: 1" without a type is a byte array
  root.value = "asdf"
  root._check()
  return root

def write_data():
  import io
  import kaitaistruct
  _io = kaitaistruct.KaitaiStream(io.BytesIO(bytearray(data_size)))
  root = create_data()
  root._write(_io)
  return _io
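
One possible way to produce the assignment lines of such generated code is to walk the fields of the parsed object and emit each value as a Python literal. This is a minimal sketch under assumptions, not the approach of the linked draft: `SomeFormatStandIn` is a hypothetical dataclass standing in for a parsed `some_format` object, and `codegen_assignments` is a made-up helper name.

```python
from dataclasses import dataclass, fields


@dataclass
class SomeFormatStandIn:
    """Hypothetical stand-in for a parsed some_format object."""
    key: bytes
    value: str


def codegen_assignments(obj, var_name="root"):
    """Emit one `var.field = literal` line per field of a parsed object."""
    lines = []
    for field in fields(obj):
        value = getattr(obj, field.name)
        # repr() gives a valid Python literal for bytes, str and int values
        lines.append(f"{var_name}.{field.name} = {value!r}")
    return "\n".join(lines)


parsed = SomeFormatStandIn(key=b"\x00", value="asdf")
print(codegen_assignments(parsed))
```

This only covers flat structures with primitive values. Nested user types and repeated fields would presumably need a recursive walk that also emits constructor calls for the sub-objects.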

first draft in kaitai_serialize_codegen.py
example output: codegen_result.py
