Commit 3c08b95

committed
x86 complete
1 parent a6519b3 commit 3c08b95

57 files changed

+3568
-1529
lines changed

.gitignore

Lines changed: 5 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -1,3 +1,7 @@
11
compile-exe
22
execute-exe
3-
*.bin
3+
/*.bin
4+
x86-32_backend_generated.c
5+
x86-32_backend_generated.h
6+
x86-32_engine_asm.o
7+
__pycache__

.gitmodules

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
11
[submodule "forth_programs/lvgl/lvgl"]
22
path = forth_programs/lvgl/lvgl
3-
url = https://github.com/lvgl/lvgl.git
3+
url = https://github.com/liamHowatt/lvgl.git

Makefile

Lines changed: 15 additions & 6 deletions
@@ -1,13 +1,22 @@
11
all: compile-exe execute-exe
22

3-
compile-exe: mcp_forth.h compile.c compile-exe.c num.c vm_backend.c
4-
gcc -Wall -fsanitize=address -g compile.c compile-exe.c num.c vm_backend.c -o compile-exe
3+
x86-32_backend_generated.c: x86_32_backend_generator.py nasm.py
4+
python3 x86_32_backend_generator.py
55

6-
execute-exe: mcp_forth.h execute.c execute-exe.c num.c vm_engine.c
7-
gcc -m32 -Wall -fsanitize=address -g execute.c execute-exe.c num.c vm_engine.c -o execute-exe
6+
compile-exe: mcp_forth.h mcp_forth.c compile.c compile-exe.c vm_backend.c x86-32_backend.c x86-32_backend_generated.c x86-32_backend_generated.h
7+
gcc -m32 -Wall -fsanitize=address -g mcp_forth.c compile.c compile-exe.c vm_backend.c x86-32_backend.c x86-32_backend_generated.c -o compile-exe
8+
9+
x86-32_engine_asm.o: x86-32_engine_asm.s
10+
nasm -felf32 -o x86-32_engine_asm.o x86-32_engine_asm.s
11+
12+
execute-exe: mcp_forth.h mcp_forth.c execute-exe.c runtime_io.c runtime_time.c runtime_string.c runtime_process.c runtime_file.c vm_engine.c x86-32_engine.c x86-32_engine_asm.o
13+
gcc -m32 -Wall -fsanitize=address -g mcp_forth.c execute-exe.c runtime_io.c runtime_time.c runtime_string.c runtime_process.c runtime_file.c vm_engine.c x86-32_engine.c x86-32_engine_asm.o -o execute-exe
814

915
test-simple: all
10-
find forth_programs/simple -maxdepth 1 -type f | xargs -I{} ./compile-and-run.sh {}
16+
find forth_programs/simple -maxdepth 1 -type f | xargs -I{} ./compile-and-run.sh vm {}
17+
18+
test-simple-x86: all
19+
find forth_programs/simple -maxdepth 1 -type f | xargs -I{} ./compile-and-run.sh x86 {}
1120

1221
clean:
13-
rm -f compile-exe execute-exe *.bin
22+
rm -f compile-exe execute-exe x86-32_backend_generated.c x86-32_backend_generated.h x86-32_engine_asm.o

README.md

Lines changed: 107 additions & 2 deletions
@@ -10,6 +10,111 @@ make test-simple
1010
How to run `life.fs` and the hash_xors
1111

1212
```sh
13-
./compile-and-run.sh forth_programs/life/life.fs < forth_programs/life/starting_board.txt
14-
./compile-and-run.sh forth_programs/hash_xor/hash_xor.fs < forth_programs/hash_xor/hash_input.txt
13+
./compile-and-run.sh vm forth_programs/life/life.fs < forth_programs/life/starting_board.txt
14+
./compile-and-run.sh vm forth_programs/hash_xor/hash_xor.fs < forth_programs/hash_xor/hash_input.txt
1515
```
16+
17+
## Performance Measurements
18+
19+
| Test | Gforth | mcp-forth vm -O3 | mcp-forth x86 -O3 | C equivalent -O3 |
20+
| -------------------------------- | ------- | ---------------- | ----------------- | ---------------- |
21+
| SPI pixel data compression | 12.754s | 1m48.735s | 12.978s | 0.432s |
22+
23+
See the benchmarks directory for the test source files.
24+
25+
## Why
26+
27+
I need portable driver code.
28+
29+
An imaginary display module has a Forth program in its ROM which is retrieved by the host
30+
and executed to copy updated areas to the display. This display is not trivially controlled.
31+
It receives a compressed stream of data over SPI. The Forth program has a function with
32+
a host-expected signature like `( x1 y1 x2 y2 src -- )` which implements the compression
33+
algorithm and uses host-provided facilities to achieve the SPI transmission. It needs
34+
to be fast so mcp-forth must be capable of producing somewhat optimal machine code. Forth is
35+
the chosen language instead of C because I need the compiler to use a minimal amount of
36+
memory to do the compilation. The mcp-forth compiler is intended to run on hosts which are
37+
MCUs with RAM around the range of 100 kB to 10 MB. A secondary requirement is the host cross-compiling
38+
Forth programs to run on peripheral MCUs with as little as 8 kB of RAM.
39+
40+
## Supported Architectures
41+
42+
The bytecode VM is one of the available compiler backends. It exists for ease of development and
43+
also has freestanding merit, such as being portable to platforms for which there isn't a
44+
native backend yet. It will likely always boast the smallest binary size across architectures
45+
due to using a stream of variable-length numbers as its encoding for opcodes and operands.
46+
It is the default choice for testing new Forth code due to having over/underflow/run
47+
checks where the other backends may forgo safety in favor of speed. In summary, it is the default
48+
choice unless speed is a requirement.
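
The README does not spell out the exact number format, so the following is only a sketch of one common variable-length encoding (unsigned LEB128: 7 payload bits per byte, with the high bit set when more bytes follow). The names `varnum_encode` and `varnum_decode` are hypothetical and the real mcp-forth bytecode may be laid out differently. It does illustrate why small opcodes and operands cost a single byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (assumed LEB128 layout, not confirmed by the source):
   write `value` to `out` and return the number of bytes used (1 to 5). */
static size_t varnum_encode(uint32_t value, uint8_t *out)
{
    size_t n = 0;
    do {
        uint8_t byte = value & 0x7f;   /* take the low 7 bits */
        value >>= 7;
        if (value != 0) byte |= 0x80;  /* continuation bit: more bytes follow */
        out[n++] = byte;
    } while (value != 0);
    return n;
}

/* Hypothetical helper: decode one number from `in` into `*value` and
   return the number of bytes consumed. */
static size_t varnum_decode(const uint8_t *in, uint32_t *value)
{
    uint32_t result = 0;
    unsigned shift = 0;
    size_t n = 0;
    uint8_t byte;
    do {
        assert(shift < 35);            /* a 32 bit value needs at most 5 bytes */
        byte = in[n++];
        result |= (uint32_t)(byte & 0x7f) << shift;
        shift += 7;
    } while (byte & 0x80);
    *value = result;
    return n;
}
```

Under this assumed layout, any value below 128 takes one byte and a full 32 bit value takes five.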
49+
50+
Currently supported architectures:
51+
52+
- Interpreted bytecode VM (explained above)
53+
- x86-32
54+
55+
Planned:
56+
57+
- ARM Thumb (Cortex M0+)
58+
- Xtensa (LX6, i.e. ESP32)
59+
60+
## 32 Bits
61+
62+
For simplicity, mcp-forth only supports 32 bit. The C code that implements the runtimes assumes
63+
that both `int` and `void *` are 32 bits wide. The Forth cell size in mcp-forth is always 4 bytes. Pointers
64+
can be transparently handled as integers. The compiler can work on 64 bit machines so cross-compiling
65+
Forth programs on a 64 bit host is possible but running the output is only possible if
66+
the host supports some kind of 32 bit mode where pointers are 32 bits wide. This means that
67+
Apple M1, M2, M3, and M4 processors cannot execute the output of the mcp-forth compiler even with the
68+
VM runtime because they have no support for 32 bit programs. An emulator such as QEMU is required.
69+
70+
## Non-standard Quirks
71+
72+
- `C' <word>` works like `' <word>` except it creates a C function pointer from the word so that
73+
Forth words can be used as C callback functions. The number of parameters and optional
74+
return value is derived from the `( -- )` signature and an error is raised at compile time
75+
if the signature is missing. The `( -- )` signature has no other semantic meaning besides this.
76+
- Currently, defined words must only be used after they're defined or else a compile time
77+
error is raised.
78+
- Any word that was not found at compile time is a runtime dependency and must be provided by
79+
the runtime.
80+
- Gforth's "compile time only words" can be used outside of functions in mcp-forth.
81+
- `UNLOOP` is not required (and will be a no-op if added in the future)
82+
83+
## Minutiae
84+
85+
### Iterative "Fragment Solving"
86+
87+
Fragments, i.e. snippets of machine code, have variable sizes depending on their operands. If a jump
88+
instruction jumps somewhere nearby, it may only use 1 byte to encode the offset, otherwise 4 bytes.
89+
Literal values are similar. An immediate literal may be loaded into a register differently
90+
depending on its size. Some architectures require multiple instructions to load larger immediate
91+
literal values.
92+
93+
Given that the jump distance may not be known at the time of a jump fragment's creation, the
94+
collection of all fragments at the end of compilation must be solved in an iterative way to
95+
achieve optimal packing.
96+
97+
Question: will iterative solving ever cause the compiler to hang in an infinite loop that it can't solve?
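
One way to reason about the question above is a grow-only scheme, sketched here. This is not the solver in this commit. The `fragment_t` layout and `solve_fragments` are hypothetical, and the sizes assume x86 `jmp rel8` (2 bytes) and `jmp rel32` (5 bytes). Every jump starts short and is only ever promoted to long, never demoted, so each pass either changes nothing (done) or promotes at least one jump. With N jumps that bounds the loop at N + 1 passes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fragment record, not the layout mcp-forth uses. */
typedef struct {
    bool is_jump;    /* true for a jump, false for fixed-size code */
    size_t target;   /* jumps only: index of the fragment jumped to (n = end) */
    uint32_t size;   /* current encoded size in bytes */
} fragment_t;

enum { JUMP_SHORT = 2, JUMP_LONG = 5 }; /* opcode + rel8, opcode + rel32 */

/* Solves sizes in place and fills offsets[0..n] with each fragment's start
   (offsets[n] is the total size). Returns the number of passes made. */
static unsigned solve_fragments(fragment_t *frags, size_t n, uint32_t *offsets)
{
    unsigned passes = 0;
    bool changed;

    /* Optimistic start: assume every jump fits the short form. */
    for (size_t i = 0; i < n; i++)
        if (frags[i].is_jump) frags[i].size = JUMP_SHORT;

    do {
        changed = false;
        passes++;

        /* Recompute where every fragment starts under the current sizes. */
        offsets[0] = 0;
        for (size_t i = 0; i < n; i++)
            offsets[i + 1] = offsets[i] + frags[i].size;

        /* Promote any short jump whose displacement no longer fits in 8 bits.
           The displacement is measured from the end of the jump itself. */
        for (size_t i = 0; i < n; i++) {
            if (!frags[i].is_jump || frags[i].size == JUMP_LONG) continue;
            assert(frags[i].target <= n);
            int64_t disp = (int64_t)offsets[frags[i].target]
                         - (int64_t)offsets[i + 1];
            if (disp < INT8_MIN || disp > INT8_MAX) {
                frags[i].size = JUMP_LONG;
                changed = true;
            }
        }
    } while (changed);

    return passes;
}
```

A solver that is also allowed to shrink fragments between passes does not get this guarantee for free, since a shrink can undo the reason for an earlier growth and vice versa.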
98+
99+
### Optimizing Compiler Memory Usage
100+
101+
The compiler allocates a few arrays which it repeatedly appends elements to during compilation
102+
and resizes them when their capacities are exceeded. There are no small allocations since the overhead
103+
of N separate allocations of a small struct may be greater than that of one contiguous array of N small structs.
104+
105+
Strings referring to source code tokens are not allocated arrays of bytes.
106+
They are pointers into the source code and a length.
107+
108+
Since struct references are always being invalidated due to array resizing, struct references
109+
are stored as array indices instead of pointers.
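
The two ideas above (token strings as pointer plus length, and indices instead of pointers) can be sketched as follows. All names here (`word_t`, `word_array_t`, `word_array_push`, `word_array_get`) are made up for illustration and are not taken from the mcp-forth sources.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record: the name is a slice of the source text, not a copy. */
typedef struct {
    const char *name;   /* points into the source code buffer */
    uint32_t name_len;  /* length of the token in bytes */
} word_t;

/* Hypothetical growable array of records. */
typedef struct {
    word_t *items;
    uint32_t len;
    uint32_t cap;
} word_array_t;

/* Appends `w` and returns its index, or UINT32_MAX if allocation fails.
   The returned index stays valid after later pushes, whereas a pointer to
   the element would dangle as soon as realloc moves the array. */
static uint32_t word_array_push(word_array_t *a, word_t w)
{
    if (a->len == a->cap) {
        uint32_t new_cap = a->cap ? a->cap * 2 : 4;
        word_t *p = realloc(a->items, new_cap * sizeof(word_t));
        if (p == NULL) return UINT32_MAX;
        a->items = p;
        a->cap = new_cap;
    }
    a->items[a->len] = w;
    return a->len++;
}

/* Resolves an index to a pointer; use the pointer only until the next push. */
static word_t *word_array_get(word_array_t *a, uint32_t index)
{
    assert(index < a->len);
    return &a->items[index];
}
```

A caller would keep the `uint32_t` returned by `word_array_push` and call `word_array_get` each time it needs the record.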
110+
111+
Question: is it a good or bad idea to reduce memory usage by:
112+
113+
- Storing strings as only a pointer with no length. The strings are whitespace-terminated since they
114+
point inside the source code. There is a special case for a string that is the last token in the
115+
source with no following whitespace.
116+
- Extending the previous point, should 16 bit offsets into the source be used instead of 32 bit pointers?
117+
- For indices into arrays of structs, should 16 bit indices be used instead of 32 bit ints?
118+
- For a case where less-than-32-bit integer types are used to store offsets, can all the members of
119+
an array of offsets be dynamically promoted as needed? The first offset that exceeds 65535 would
120+
cause the array to be converted from an array of 16 bit offsets to an array of 32 bit offsets.
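
To make the last question concrete, here is a rough sketch of what dynamic promotion could look like. It is an illustration only: `offset_array_t`, `offset_array_push` and `offset_array_get` are hypothetical names and nothing like this exists in the commit.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical array that stores 16 bit offsets until one does not fit. */
typedef struct {
    void *data;     /* uint16_t[] while wide == 0, uint32_t[] afterwards */
    uint32_t len;
    uint32_t cap;
    int wide;       /* 0: 16 bit elements, 1: 32 bit elements */
} offset_array_t;

static uint32_t offset_array_get(const offset_array_t *a, uint32_t i)
{
    assert(i < a->len);
    return a->wide ? ((const uint32_t *)a->data)[i]
                   : ((const uint16_t *)a->data)[i];
}

/* Appends `value`; returns 0 on success and -1 if allocation fails. */
static int offset_array_push(offset_array_t *a, uint32_t value)
{
    if (!a->wide && value > UINT16_MAX) {
        /* First offset above 65535: widen every existing element. */
        uint32_t cap = a->cap ? a->cap : 1;
        uint32_t *widened = malloc(cap * sizeof(uint32_t));
        if (widened == NULL) return -1;
        for (uint32_t i = 0; i < a->len; i++)
            widened[i] = ((const uint16_t *)a->data)[i];
        free(a->data);
        a->data = widened;
        a->cap = cap;
        a->wide = 1;
    }
    if (a->len == a->cap) {
        uint32_t new_cap = a->cap ? a->cap * 2 : 4;
        size_t elem = a->wide ? sizeof(uint32_t) : sizeof(uint16_t);
        void *p = realloc(a->data, new_cap * elem);
        if (p == NULL) return -1;
        a->data = p;
        a->cap = new_cap;
    }
    if (a->wide) ((uint32_t *)a->data)[a->len] = value;
    else ((uint16_t *)a->data)[a->len] = (uint16_t)value;
    a->len++;
    return 0;
}
```

Two costs are visible in the sketch: every read branches on `wide`, and during promotion the old and new arrays are both alive, so peak memory briefly exceeds what a plain 32 bit array would need.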

benchmarks/pixel_compress/.gitignore

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
1+
c
2+
fs.vm.bin
3+
fs.x86.bin
4+
res_c
5+
res_gforth
6+
res_m4_vm
7+
res_m4_x86

benchmarks/pixel_compress/README.md

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
1+
The results of this benchmark are in the root README.md
2+
3+
This test compresses a pixel buffer and outputs
4+
the result to stdout. The test is run 1000 times
5+
redundantly, only outputting once.
6+
7+
The C code is based on this: https://bitbucket.org/4DPi/gen4-hats/src/236d9c064b06640f6a33be649ddd0aae147e0239/4d-hats.c#lines-781:830
8+
9+
```sh
10+
gcc -Wall -O3 c.c -o c
11+
time ./c < ~/Downloads/rgb565data.bin > res_c
12+
```
13+
14+
The Forth version reads the input file internally
15+
from a hardcoded path.
16+
17+
```sh
18+
time gforth fs.fs -e bye > res_gforth
19+
```
20+
21+
In the Makefile entry for `execute-exe`, replace
22+
`-fsanitize=address -g` with `-O3`.
23+
24+
```sh
25+
../../compile-exe vm fs.fs fs.vm.bin
26+
time ../../execute-exe vm fs.vm.bin > res_m4_vm
27+
28+
../../compile-exe x86 fs.fs fs.x86.bin
29+
time ../../execute-exe x86 fs.x86.bin > res_m4_x86
30+
```
31+
32+
Assert that all the outputs are the same.
33+
34+
```sh
35+
cmp res_c res_gforth
36+
cmp res_c res_m4_vm
37+
cmp res_c res_m4_x86
38+
```

benchmarks/pixel_compress/c.c

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
1+
#include <stdint.h>
2+
#include <stdio.h>
3+
4+
static struct {uint8_t buf[1024 * 1024];} lcdpi_spiblock_dma;
5+
6+
static void lcdpi_compress(int *ptr, unsigned short *pixel_buffer, int pixel_buffer_len, unsigned short *codes, unsigned short speedup)
7+
{
8+
int j;
9+
int x;
10+
unsigned short value, last_value, mask;
11+
int repeated;
12+
13+
mask = (codes[0] & 0x8000) ^ 0x8000;
14+
value = (((pixel_buffer[0] & 0xffc0) >> 1) + (pixel_buffer[0] & 0x001f)) | mask;
15+
lcdpi_spiblock_dma.buf[((*ptr) * 2) + 1] = (value >> 8);
16+
lcdpi_spiblock_dma.buf[((*ptr) * 2) + 2] = (value) & 0xff;
17+
(*ptr)++;
18+
last_value = value;
19+
repeated = 0;
20+
for (x = 1; x < pixel_buffer_len; x++) {
21+
value = (((pixel_buffer[x] & 0xffc0) >> 1) + (pixel_buffer[x] & 0x001f)) | mask;
22+
if (value != last_value) {
23+
for (j = 0; j < (repeated / speedup); j++) {
24+
lcdpi_spiblock_dma.buf[((*ptr) * 2) + 1] = codes[speedup] >> 8;
25+
lcdpi_spiblock_dma.buf[((*ptr) * 2) + 2] = codes[speedup] &0xff;
26+
(*ptr)++;
27+
}
28+
29+
if(repeated % speedup) {
30+
lcdpi_spiblock_dma.buf[((*ptr) * 2) + 1] = codes[repeated % speedup] >> 8;
31+
lcdpi_spiblock_dma.buf[((*ptr) * 2) + 2] = codes[repeated % speedup] &0xff;
32+
(*ptr)++;
33+
}
34+
35+
lcdpi_spiblock_dma.buf[((*ptr) * 2) + 1] = (value >> 8);
36+
lcdpi_spiblock_dma.buf[((*ptr) * 2) + 2] = (value) & 0xff;
37+
(*ptr)++;
38+
last_value = value;
39+
repeated = 0;
40+
} else {
41+
repeated++;
42+
}
43+
}
44+
for (j = 0; j < repeated / speedup; j++) {
45+
lcdpi_spiblock_dma.buf[((*ptr) * 2) + 1] = codes[speedup] >> 8;
46+
lcdpi_spiblock_dma.buf[((*ptr) * 2) + 2] = codes[speedup] &0xff;
47+
(*ptr)++;
48+
}
49+
50+
if(repeated % speedup) {
51+
lcdpi_spiblock_dma.buf[((*ptr) * 2) + 1] = codes[repeated % speedup] >> 8;
52+
lcdpi_spiblock_dma.buf[((*ptr) * 2) + 2] = codes[repeated % speedup] &0xff;
53+
(*ptr)++;
54+
}
55+
}
56+
57+
int main()
58+
{
59+
int ptr = 0;
60+
static uint8_t pb[1024 * 1024] __attribute__ ((aligned (2)));
61+
int pbl = 0;
62+
int c;
63+
while((c = getchar()) != EOF) {
64+
pb[pbl++] = c;
65+
}
66+
lcdpi_spiblock_dma.buf[0] = 0x42;
67+
unsigned short codes[] = {0xaaab, 0x80ff, 0x8cff, 0x8ccf, 0x98c9, 0xaabf, 0xaaaf, 0xaaab};
68+
for (int i = 0 ; i < 1000 ; i++) {
69+
ptr = 0;
70+
lcdpi_compress(&ptr, (unsigned short *) pb, pbl / 2, codes, 7);
71+
}
72+
fwrite(lcdpi_spiblock_dma.buf, 1, ptr * 2 + 1, stdout);
73+
}

benchmarks/pixel_compress/fs.fs

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
1+
: compute_value ( hw -- hw )
2+
dup 0xffc0 and 2/
3+
swap 0x001f and
4+
or
5+
;
6+
7+
: pre_inc2 ( addr -- addr+2 addr+2 )
8+
2 + dup
9+
;
10+
11+
: store ( dst val -- )
12+
2dup swap 1+ c!
13+
8 rshift swap c!
14+
;
15+
16+
variable codes
17+
align here codes !
18+
0xaaab , 0x80ff , 0x8cff , 0x8ccf , 0x98c9 , 0xaabf , 0xaaaf ,
19+
variable last_value
20+
variable repeated
21+
variable start
22+
23+
: helper
24+
repeated @ 7 / dup 0<> if 0 do
25+
pre_inc2 0xaaab store
26+
loop else drop then
27+
28+
repeated @ 7 mod dup 0<> if
29+
>r pre_inc2 r> cells codes @ + @ store
30+
else drop then
31+
;
32+
33+
: lcdpi_compress ( src dst src_len -- out_len )
34+
>r
35+
dup start !
36+
1+
37+
swap pre_inc2 w@ compute_value last_value ! swap
38+
pre_inc2 last_value @ store
39+
0 repeated !
40+
r> 1 do
41+
swap pre_inc2 w@ compute_value >r swap r>
42+
dup last_value @ <> if
43+
last_value !
44+
45+
helper
46+
47+
pre_inc2 last_value @ store
48+
49+
0 repeated !
50+
else
51+
drop
52+
1 repeated +!
53+
then
54+
loop
55+
56+
helper
57+
58+
swap drop
59+
start @ -
60+
;
61+
62+
: check_error ( wior -- )
63+
0<> if ." error" cr bye then
64+
;
65+
66+
variable byte_read
67+
: read_file
68+
s" /home/liam/Downloads/rgb565data.bin" r/o open-file check_error
69+
begin byte_read 1 2 pick read-file check_error 0> while byte_read @ c, repeat
70+
close-file check_error
71+
;
72+
73+
: main
74+
here 1024 dup * allot
75+
66 over c!
76+
dup align here read_file here over - 2/ >r 2 - swap 2 - r>
77+
-1
78+
1000 0 do
79+
drop
80+
2 pick 2 pick 2 pick lcdpi_compress
81+
loop
82+
4 pick swap
83+
type
84+
;
85+
86+
main

benchmarks/pixel_compress/rgb565data.bin

Lines changed: 6 additions & 0 deletions
Large diffs are not rendered by default.

compile-and-run.sh

Lines changed: 4 additions & 4 deletions
@@ -1,8 +1,8 @@
11
#!/bin/bash
22

3-
BASENAME=$(basename $1 .fs)
3+
BASENAME=$(basename $2 .fs)
44
BINNAME="${BASENAME}.bin"
5-
echo $1
6-
./compile-exe vm $1 $BINNAME
7-
./execute-exe vm $BINNAME
5+
echo $2
6+
./compile-exe $1 $2 $BINNAME
7+
./execute-exe $1 $BINNAME
88
echo
