-
Notifications
You must be signed in to change notification settings - Fork 27
/
Copy pathdisassembler.html
140 lines (139 loc) · 6.55 KB
/
disassembler.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link rel="stylesheet" href="pdfsyntax.css">
<title>PDFSyntax</title>
</head>
<body>
<header>
<a href="https://pdfsyntax.dev"><img src="logo.svg" width="50"/></a>
</header>
<main>
<h2>A PDF Disassembler</h2>
<p></p>
<h3>Introduction</h3>
<p>PDFSyntax is built from the ground up to become a PDF inspection and transformation library/tool. On this path, while debugging, it is sometimes useful to directly dump the <strong>structure</strong> - and <strong>not the content</strong> - of PDF files on the command line (either files to transform or file produced by a transformation).</p>
<p>This <em>"Disassembler"</em> function is exposed on the command line interface of the library. By design its output is terse and greppable.</p>
<p>As stated by the PDF Specification, <em>PDF is not a programming language, and a PDF file is not a program.</em> But there is an analogy: cross-references allow to jump to different portions of a file with absolute (regular indirect objects) or relative addressing (indirect objects embedded in object streams).</p>
<h3>Usage</h3>
<p>PDFSyntax can be installed from the GitHub repo (no dependency) or from PyPI:</p>
<pre> pip install pdfsyntax
</pre>
<p>Then the following command prints a dump on the standard output:</p>
<pre> python3 -m pdfsyntax disasm FILE
</pre>
<h3>Example</h3>
<p>Here is the output produced for the <a href='https://github.com/desgeeko/pdfsyntax/raw/main/samples/simple_text_string.pdf'><em>Simple Text String</em></a> example file from the PDF Specification:</p>
<pre>$ python3 -m pdfsyntax disasm samples/simple_text_string.pdf
+ 000 [8 ] comment %PDF-1.4
+ 008 [1 ] void
+ 009 [70 ] @- ind_obj 1,0 dict /Catalog
+ 079 [1 ] void
+ 080 [48 ] @- ind_obj 2,0 dict /Outlines
+ 128 [1 ] void
+ 129 [62 ] @- ind_obj 3,0 dict /Pages
+ 191 [1 ] void
+ 192 [183] @- ind_obj 4,0 dict /Page
+ 375 [1 ] void
+ 376 [121] 100% _ @- ind_obj 5,0 stream
+ 497 [1 ] void
+ 498 [27 ] @- ind_obj 6,0 array
+ 525 [1 ] void
+ 526 [119] @- ind_obj 7,0 dict /Font
+ 645 [1 ] void
+ 646 [198] xreftable /Root=1,0
- -@ xref 0,65535 000 free
- -@ xref 1,0 009 inuse
- -@ xref 2,0 080 inuse
- -@ xref 3,0 129 inuse
- -@ xref 4,0 192 inuse
- -@ xref 5,0 376 inuse
- -@ xref 6,0 498 inuse
- -@ xref 7,0 526 inuse
+ 844 [1 ] void
+ 845 [13 ] startxref 646
+ 858 [1 ] void
+ 859 [5 ] comment %%EOF
+ 864 [1 ] void
+ 865 [1 ] void
$
</pre>
<p>This example is very basic because of its xref table and single uncompressed (100%) content stream. Inside a more complex file you would probably find something like this:</p>
<pre>+ 053123 [461 ] 86% _Flate ind_obj 52,0 stream /XRef /Root=50,0
</pre>
<p></p>
<h3>Grep examples</h3>
<p>Ignore void spaces (separators) and focus on real objects:</p>
<pre> python3 -m pdfsyntax disasm FILE | grep -v void
</pre>
<p>List all xref entries of all tables or streams:</p>
<pre> python3 -m pdfsyntax disasm FILE | grep xref
</pre>
<p>Ignore detail lines:</p>
<pre> python3 -m pdfsyntax disasm FILE | grep "^+"
</pre>
<p>Search all mentions of an indirect object, both in itself and in xref:</p>
<pre> python3 -m pdfsyntax disasm FILE | grep 52,
</pre>
<p></p>
<h3>Columns</h3>
<p>Most of the columns are collapsable in order to save horizontal space. For example the position (<code>#2</code>) may have a maximum width of 10 digits but for small files the unecessary leading zeros are removed.</p>
<table><thead><tr><th> Column </th>
<th> Description </th>
</tr>
</thead>
<tbody><tr><td> <code>1</code> </td>
<td> <code>+</code> for a region with absolute positionning, <code>-</code> for a detail line (xref, /XRef, /ObjStm)</td>
</tr>
<tr><td> <code>2</code> </td>
<td> Position, absolute (<code>+</code>) or relative (<code>-</code>)</td>
</tr>
<tr><td> <code>3</code> </td>
<td> Size in bytes</td>
</tr>
<tr><td> <code>4</code> </td>
<td> Percentage compressed size / plain size</td>
</tr>
<tr><td> <code>5</code> </td>
<td> Applied filter(s)</td>
</tr>
<tr><td> <code>6</code> </td>
<td> Sequence number of embedded object</td>
</tr>
<tr><td> <code>7</code> </td>
<td> Indirect reference of <strong>envelope</strong> object</td>
</tr>
<tr><td> <code>8</code> </td>
<td> Index check, @ for OK and ? for NOK</td>
</tr>
<tr><td> <code>9</code> </td>
<td> Region type</td>
</tr>
<tr><td> <code>10</code> </td>
<td> Indirect reference of <strong>this</strong> object</td>
</tr>
<tr><td> <code>11</code> </td>
<td> Addressing mode of xref, envelope or absolute</td>
</tr>
<tr><td> <code>12</code> </td>
<td> Address of xref</td>
</tr>
<tr><td> <code>13</code> </td>
<td> Type of PDF object</td>
</tr>
<tr><td> <code>14</code> </td>
<td> Some detail : most important keys for dict, excerpt for comment</td>
</tr>
</tbody>
</table>
<p></p>
<blockquote><p> Warning: Encrypted files are not supported yet. </p>
</blockquote>
</main>
<footer>
© 2025 <a href="mailto:[email protected]">Martin D.</a> <[email protected]>
</footer>
</body>
</html>