Changes from all commits (611 commits)
439d44d
Update testbench structure to allow multiple iterations
Jul 9, 2023
6cddea1
Fix: multiple iterations
Jul 11, 2023
703746a
Fix bug in axis_pixels. Update alex_axis_adapter_any for verilator
Jul 21, 2023
9657a45
Change linux sim to verilator
Jul 22, 2023
eeb2770
Refactor
Jul 23, 2023
e9c8997
Get compiled verilator
Jul 23, 2023
82753e5
Add notebooks for pynq firmware & perfomance
Jul 25, 2023
3f635ab
Add new bundle from zhenghua
Jul 29, 2023
b53dbe1
Refactor
Jul 30, 2023
43b6459
Refactor
Aug 1, 2023
ea9350b
Add export
Aug 2, 2023
40c324b
Integerate new bundle into param_test
Aug 14, 2023
4a807a4
Feature: allow CI > WRAM/KH
Aug 21, 2023
e7f4af2
Entire model in testbench - sv structs
Aug 24, 2023
b73b23b
Replace AXI-Stream output with ping-pong buffer. Works for K!=1
Aug 25, 2023
a203b4d
Fix K=1, changed handshake signsls to toggle
Aug 26, 2023
38f9b33
Add basic DPI-C firmware for xsim & verilator
Aug 26, 2023
925a2e2
Change done_fill to single clock, for interrupt
Aug 31, 2023
c260a4b
Moved load_y logic & file writing to runtime.c
Aug 31, 2023
1a28c12
Refactor to decouple txt gen; fixed X_BITS error
Aug 31, 2023
928d710
Migrated to binary file based DMA testbench, with control logic in C.…
Sep 2, 2023
dbd4429
Github actions bitstring
Sep 2, 2023
5d268c1
Make verilator happy
Sep 2, 2023
9331920
Pack N-bit words into 8-bit little endian bytes for DMA
Sep 3, 2023
fa2887b
Fix: tkeep broadcasting for BITS<8
Sep 3, 2023
993459c
Update vivado scripts to match output BRAM & interrupt
Sep 4, 2023
053fcd4
Updated zcu104 script
abarajithan11 Sep 4, 2023
d0fdae3
Update counters in callback with all indices
Sep 5, 2023
9e3dcec
Update README
Sep 6, 2023
c0c27ff
Write to byte memory from C and read into files
Sep 11, 2023
7b5954f
Add all Ps together in runtime
Sep 12, 2023
cee32e6
Add new bundle for p=1 case
Sep 12, 2023
89dd249
Move all memory from SV to C. Works in Verilator
Sep 12, 2023
e4b7930
Mark output ports with m_ for ASIC pin placement
Sep 13, 2023
8b7d62b
Add bias; works for all conv2d, not dense
Sep 15, 2023
47822e2
Refactor runtime
Sep 16, 2023
5f039e9
Fix Dense: 1. CONFIG_BEATS=1 causes odd CM, modified w_rot FSM to all…
Sep 16, 2023
2a8a49e
Update py quantization to integer operations
Sep 16, 2023
3103116
Further optimize activation
Sep 16, 2023
6e96871
Merge the logic of quantize and q_leaky_relu into one
Sep 16, 2023
3adc71f
Quantize & LRELU in Runtime
Sep 16, 2023
3bca141
Add support for quantized_bits(keep_neg=False) and quantized_relu(slo…
Sep 17, 2023
0b4fe23
Add & test tiling in python
Oct 18, 2023
2f79ab8
Fix bits=8
Oct 19, 2023
760448d
Fix H-padding
Oct 24, 2023
da04103
Fix runtime - switch order of N and L
Oct 24, 2023
383f235
Fix tiling (py) cm/cmp0
Oct 26, 2023
6d14263
Python - tiling all layers
Oct 26, 2023
cdf8f37
Temp disable last layer (since hwc tiling)
Oct 26, 2023
38165eb
Refactor runtime
Oct 26, 2023
5d46e76
Seperate debug outputs for raw, summed and processed
Oct 26, 2023
820ee27
Prepare for C tiling
Oct 27, 2023
9f9c742
Add y_index
Oct 27, 2023
5274542
Add C tiling
Oct 27, 2023
aa18cb5
Refactor indexing
Oct 27, 2023
16438f7
Refactor runtime
Oct 28, 2023
73d0fb6
Simplify striding
Oct 29, 2023
fdf37f6
Simplify striding
Oct 29, 2023
4ff3667
Simplify striding
Oct 29, 2023
d32cc6b
Simplify striding
Oct 29, 2023
0f95569
Simplify striding
Oct 29, 2023
08afc40
Simplify striding
Oct 29, 2023
72a3d2c
Simplify striding
Oct 29, 2023
4691423
Simplify striding
Oct 29, 2023
59cbfe1
Simplify striding
Oct 30, 2023
1dee0d6
Change p_sum ordering and buffer to nhwc
Oct 30, 2023
fe669c2
Prepare bundle to support stride & pool
Oct 30, 2023
047cdae
Add conv striding
Oct 30, 2023
3187573
Fix output nhwc debug reshaping
Oct 30, 2023
806ea6c
Refactor: merge regular & h_padding cases of writing, using sweep
Oct 30, 2023
e4dee87
Fix xr_sweep bug
Oct 30, 2023
594dda7
Refactor: change x_indices to y_indices, preparing for pooling
Oct 30, 2023
2339582
Move checking above flatten
Oct 30, 2023
e6264df
Fix OH!=YH
Oct 30, 2023
323eba3
Simplify asserts
Oct 31, 2023
65a3d2c
Refactor: extract write_tile into a function
Oct 31, 2023
a2e6799
Add pooling - works for stride:(1,1)
Oct 31, 2023
983a601
Fix pooling with stride at the edges
Oct 31, 2023
b3aa2d8
Refactor runtime, enable all layers on py_test
Oct 31, 2023
2370ab0
Add conv stride + pooling
Oct 31, 2023
1d5a15d
Refactor: rename p_bundle, p_bo; change relu to conditional (for arm)
Oct 31, 2023
14ac970
Simplify assert macro
Oct 31, 2023
6f69af8
Change datatypes using stdint.h
Oct 31, 2023
c2515c6
Make const
Oct 31, 2023
103e503
Cleanup
Nov 1, 2023
166380a
Cleanup spacing in model.h
Nov 1, 2023
7f0eda1
Add bit-packing in C runtime
Nov 4, 2023
50e56ab
Add buffer allocation. Prepare for chaining
Nov 6, 2023
2820d01
Update runtime & tb to work in absolute addresses, not offsets
Nov 7, 2023
fba8c1e
C: chaining
Nov 7, 2023
957700c
Update readme
Nov 9, 2023
44e1dad
Remove config dataclass in param_test. prepare for residual add
Nov 10, 2023
d6df904
Fix avg pool & act with shift=0
Nov 11, 2023
a63a3b3
Add residual add
Nov 13, 2023
20fcc64
More complex residual add example
Nov 13, 2023
0f80a19
Fix bug in N_BUNDLES=1 case
Nov 13, 2023
36a206d
Python packaging & documentation
Nov 15, 2023
8a5eb2e
Reorganize dir struture
Nov 16, 2023
6290620
Major Restructure: Move RTL,TCL,TB to deepsocflow for deployment. Kee…
Nov 16, 2023
599c9d1
Move dependencies to TOML
Nov 17, 2023
5d6ba86
Integrate Hardware class into param_test.py
Nov 18, 2023
9349f3a
Fix vivado flow
Nov 18, 2023
0511986
Move hw params to pytest product
Nov 19, 2023
8d0dd35
Reorganize tcl files
Nov 19, 2023
2944622
Updated API
Nov 20, 2023
d304ec8
Add softmax
Nov 20, 2023
ea04128
Fix softmax
Nov 20, 2023
c1834f7
Replace OUT_RAM_SWITCH with regular S2MM DMA
Nov 21, 2023
5911092
Update firmware with for loop and wait
Nov 21, 2023
98b40d4
Clean up indent
Nov 21, 2023
3b1e4c3
Add FPGA firmware - single_transfer.c
Nov 21, 2023
c3120b0
Update firmware to compile on Vitis
Nov 22, 2023
213cfe5
Update model to export W+B+X as one bin
Nov 22, 2023
3f4c644
Refactor: p_mem-> mem.
Nov 22, 2023
9ddc5bc
Replace alex_axis_adapter.v with latest
Nov 29, 2023
b82bdf2
Update alex_axis_adapter.v to avoid reg=0 for asic
Nov 29, 2023
91066c8
Fix verilator warnings - all except w_m_ready loop
Nov 29, 2023
a088f92
Fix verilog errors, python warnings
Nov 29, 2023
c8bd913
Bring PE into proc_engine
Nov 29, 2023
450b646
Write MUL pipeline as raw registers to avoid Vivado DSP warning
Nov 29, 2023
38baa58
Fix verilator error for pipeline
Nov 30, 2023
064ab08
Fix all verilator warnings
Nov 30, 2023
9beb4cf
Update title
Dec 1, 2023
24dcb3f
Full network runs on FPGA - except softmax
Dec 3, 2023
7a8e903
Fix softmax: ResNet runs on FPGA
Dec 4, 2023
dca831a
Clean up C API; test with O3: 25ms
Dec 4, 2023
2920bd2
Update README with firmware API
Dec 4, 2023
4fe0d2f
Update AXI Width to 128
Dec 4, 2023
28eabae
Create a new branch for DMA controller development
zhenghuama Dec 4, 2023
34a91c7
Update vivado.tcl to have 3, 128-bit S_AXI ports
Dec 5, 2023
19b77b9
Rename example to sim.c to avoid confusion
Dec 5, 2023
2691dd5
Add resnet50
Dec 5, 2023
ca948f0
Add O3 to make verilator fast, make param_test print output earlier t…
Dec 6, 2023
44cd0ac
Remove 'self.bundle = self.layers[2:]' in model, since it takes too m…
Dec 6, 2023
71a591b
Fix bundle.idx!=ib - this causes incorrect connections
Dec 6, 2023
c2297d2
update example
Dec 7, 2023
afe3385
Fix width mismatch issue in zcy102.tcl
Dec 7, 2023
15696e1
Fix acc_width < y_bits issue
Dec 7, 2023
769cc37
Add support for add & pool quant
Dec 7, 2023
efa3375
Update README.md
abarajithan11 Dec 9, 2023
275c69a
Fix resnet: allow non-consequtive buffers for output and residual add
Dec 22, 2023
0b44d3c
Merge branch 'master' of https://github.com/abarajithan11/cnn-fpga
Dec 22, 2023
6e495c5
Fix edges count
Dec 26, 2023
e119222
Relative paths added to tcl files, SRAM generation tcl automation, Ve…
RaviduHM99 Feb 16, 2024
114d240
Decouple global & local resets in cyclic_bram
Mar 16, 2024
ba09e5e
Convert rst to rstn in alex modules
Mar 16, 2024
6a4599d
Decouple global & local resets in weights rotator & alex modules
Mar 16, 2024
da62561
Add optional async reset to all modules
Mar 19, 2024
52a5f34
Merge branch 'master' of https://github.com/abarajithan11/cnn-fpga
Mar 19, 2024
6b9c924
Fix package versions
Mar 19, 2024
dc4d1b9
Replace 10ps delay with clocking blocks
Mar 19, 2024
7391232
Add verilog define
Mar 19, 2024
896c3a4
Add reset option, period, io_delay into hardware.py
Mar 19, 2024
e496038
Fix package issue
Mar 19, 2024
4667419
Fix VCS errors & warnings
Mar 20, 2024
72b305a
Update delay mul to avoid VCS warning
Mar 21, 2024
73e417c
Remove all X: Add reset to all registers, add initial zeros to tb, re…
Mar 25, 2024
ac0ee78
Fix prob=1 bug
Apr 16, 2024
f5ca3b8
Avoid duplicate clocking
Apr 16, 2024
5bd9fb0
Seperate ocm in runtime
Apr 22, 2024
73e8c5d
Move cache-flush to after each write
Apr 25, 2024
3af4c24
Add signal m_bpt to give bytes per transfer
May 1, 2024
a6d1deb
Update Y_BITS to Y_OUT_BITS
May 1, 2024
b9806b4
Remove KH/2 padding for KH=0
May 5, 2024
6bc9915
Update readme & swicth testbench from clocking blocks
May 5, 2024
861ed56
Update README.md
abarajithan11 May 5, 2024
e807bae
pipelined array
awengz May 13, 2024
1d30c54
Merge branch 'master' of https://github.com/abarajithan11/deepsocflow
awengz May 13, 2024
ae13850
fixed compile
awengz May 13, 2024
fb0753f
cleanup
awengz May 17, 2024
573e98e
reversed pipeline
awengz May 17, 2024
3db5c37
cleanup for asic
awengz May 21, 2024
21c7a7e
stash commit
awengz Jun 17, 2024
3bb695b
Merge branch 'pipelined_array' of https://github.com/abarajithan11/de…
awengz Jun 17, 2024
18ef0bc
new output shifter working
awengz Jun 20, 2024
bef0a33
new_api: core+act float works
Jun 26, 2024
30af24f
new_api: pool+act float works
Jun 26, 2024
a2776bc
stash
awengz Jun 30, 2024
0f65308
new_api: add,softmax float works
Jul 2, 2024
bff9aee
new_api: dense float works
Jul 2, 2024
e244d16
move stuff around
awengz Jul 8, 2024
52a19c1
stash
awengz Jul 12, 2024
d5196a7
working?
awengz Jul 13, 2024
e3b795a
Update README.md
rck289 Jul 15, 2024
773b704
revert changes to debug axis_pipeline_reg
awengz Jul 15, 2024
601b62e
FIx verilator warning
Jul 17, 2024
3a5205c
fix verilator error
awengz Jul 17, 2024
aab580a
more verilator fixes
awengz Jul 17, 2024
c65dbb0
new api: conv + bias + stride output match
Jul 17, 2024
2579c40
passing -- cleanup needed
awengz Jul 18, 2024
149532a
new_api: Act works
Jul 18, 2024
64834f8
new_api: Pool works
Jul 18, 2024
109c6bc
new_api: export_inference() works
Jul 18, 2024
e05c7ac
cleanup, added comments
awengz Jul 18, 2024
3e75fda
Stable AXI-IP, working in xsim, verilator and FPGA
zhenghuama Jul 19, 2024
be17fb6
Update param_test.py
zhenghuama Jul 19, 2024
d7af089
Update xilinx_example.c
zhenghuama Jul 19, 2024
b665667
Update param_test.py
zhenghuama Jul 19, 2024
5db399e
new_api: fixed point works
Jul 19, 2024
1ef97ba
new_api: move into library
Jul 19, 2024
6f1331b
new_api: export() works
Jul 19, 2024
eedd937
Fix fp issue, lightning fast now
Jul 19, 2024
d610f96
Merge branch 'pipelined_array' into DMA_controller_dev
awengz Jul 19, 2024
63afb1c
smol issue
awengz Jul 19, 2024
3b38615
Fix some verilator warnings
Jul 20, 2024
aa505e6
Fix merge
Jul 20, 2024
5ba0f97
Rename load_y, demo_full
Jul 20, 2024
4ceb91b
Rename top
Jul 20, 2024
74c111d
Fix verilator issue
Jul 20, 2024
c4ac870
Update alex ips for asic
Jul 20, 2024
f28c821
Parametrize config & mem base addr
Jul 20, 2024
175e831
Merge with dma_controller
Jul 20, 2024
966a5e7
Zero verilator warnings
Jul 20, 2024
e53dde8
Take max_n_bundles as a param=64
Jul 20, 2024
c2007c1
New API matches param_test.py
Jul 21, 2024
7c99386
Merge new_api + dma_controller + pipelined
Jul 21, 2024
eeb1bcb
Make new api the default
Jul 21, 2024
de5b7bb
Update types in runtime to i8,u8
Jul 21, 2024
082a602
Move pixel header to DMA tuser
Jul 21, 2024
7602917
Move axis_header to DMA, change AXI_WIDTH=32
Jul 22, 2024
f53935b
Make non-debug mode default
Jul 22, 2024
17922b5
Make vivado happy
Jul 22, 2024
764ed45
Rename, remove debug buffers
Jul 22, 2024
ffc8820
synthesis on quartus
awengz Jul 22, 2024
cc961ce
Verified on FPGA
zhenghuama Jul 23, 2024
6d36dcf
Added restrict to firmware
Jul 23, 2024
c33558c
Minor cleanup
Jul 23, 2024
6523d43
Tested multiple runs
Jul 23, 2024
dbc19c0
Clean up Bundle_t
Jul 23, 2024
f1fb55f
Minimize flushing cache
zhenghuama Jul 25, 2024
3e3307c
Make config & mem baseaddr a function argument, to support virtual mem
Jul 25, 2024
3e3bda2
Add & fix resnet50
Jul 31, 2024
25c6d9b
Export performance
Aug 1, 2024
c7f7889
Pointnet works
Aug 1, 2024
3645ce1
stuck debug
Aug 1, 2024
28004ef
fix stuck.py issue
awengz Aug 2, 2024
c1288a8
ResNet-50 passed
Aug 2, 2024
e811855
Add jettagger
Aug 3, 2024
7b181d4
Fix example & readme
abarajithan11 Nov 22, 2024
d90924b
Minor changes
Feb 7, 2025
94710c5
Update package dependancy
Apr 28, 2025
e76dec5
Add vcd
Apr 28, 2025
7a9d8df
Add optional trace
abarajithan11 Sep 25, 2025
8703d30
Update resnet settings
Oct 17, 2025
d58e99f
Add matmul
Oct 21, 2025
c0c0cf2
Update zcu104
abarajithan11 Oct 21, 2025
a8787b6
add XUpSample layer for nearest neighbor upsampling functionality to…
STAmirr Nov 6, 2025
be37885
changed export_inference function to handle optional attributes safel…
STAmirr Nov 6, 2025
85c86ed
adjusting XBundle class to enhance export functionality for various …
STAmirr Nov 6, 2025
45bd9b5
enhanced dataflow for upsample2d compatibility
STAmirr Nov 6, 2025
10 changes: 8 additions & 2 deletions .github/workflows/verify.yml
@@ -8,7 +8,7 @@ jobs:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4

- name: Cache modules
id: cache-verify
@@ -23,6 +23,11 @@ jobs:
${{ runner.os }}-build-
${{ runner.os }}-

- name: Set up Python 3.11.5
uses: actions/setup-python@v4
with:
python-version: '3.11.5'

- name: Install Verilator
run: |
sudo apt-get install --only-upgrade python3
Expand All @@ -38,6 +43,7 @@ jobs:

- name: Install DeepSoCFlow
run: |
python -m pip install --upgrade pip
pip install .

- name: Verify Full Design
@@ -96,4 +102,4 @@

# mkdir -p run/work_resnet
# cd run/work_resnet
# python ../resnet_50.py
# python ../resnet_50.py
7 changes: 7 additions & 0 deletions .gitignore
@@ -4,11 +4,14 @@ __pycache__
temp/

run/fpga/*
run/work*

run/asic/*
!deepsocflow/asic/reports

*.pickle
*.h5
*.keras
deepsocflow/test/vectors
deepsocflow/test/xsim
deepsocflow/test/dnn_engine_tb.vcd
@@ -29,6 +32,10 @@ run/work_resnet
run/work_temp
run/work_ccd
run/work_dddd
run/work_llm
run/work_example
run/work_resnet18
run/work_pointnet
run/work/project_1

# Vivado and verilator sim
300 changes: 108 additions & 192 deletions README.md
@@ -1,192 +1,108 @@
<!-- https://github.com/abarajithan11/deepsocflow/assets/26372005/113bfd40-cb4a-4940-83f4-d2ef91b47c91 -->

# An Open Framework to Empower Scientific Edge Computing with Modern Neural Networks ![status](https://github.com/abarajithan11/dnn-engine/actions/workflows/verify.yml/badge.svg)

DeepSoCFlow is a Python library that helps researchers build, train, and implement their own deep ML models, such as ResNet CNNs, Autoencoders, and Transformers, on FPGAs and custom ASICs.

It takes several months of work to get such deep models running correctly on edge platforms, at their promised maximal performance. This painful work includes:

- Designing an optimal dataflow
- Building & verifying an accelerator, optimizing for high-frequency
- Building the System-on-Chip, verifying and optimizing data bottlenecks
- Writing C firmware to control the accelerator, verifying, optimizing

Often, after all that work, the models do not meet their expected performance due to memory bottlenecks and sub-optimal hardware implementation.

We present a highly flexible, high-performance accelerator system that can be adjusted to your needs through a simple Python API. The implementation is maintained as open source and bare-bones, allowing the user to modify the processing element to perform floating-point or binarized calculations, etc.

<p align="center"> <img src="docs/sys.PNG" width="600"> </p>

## User API

![System](docs/workflow.png)

```py
from deepsocflow import Bundle, Hardware, QModel, QInput

'''
0. Specify Hardware
'''
hw = Hardware ( # Alternatively: hw = Hardware.from_json('hardware.json')
processing_elements = (8, 96) , # (rows, columns) of multiply-add units
frequency_mhz = 250 , #
bits_input = 4 , # bit width of input pixels and activations
bits_weights = 4 , # bit width of weights
bits_sum = 16 , # bit width of accumulator
bits_bias = 16 , # bit width of bias
max_batch_size = 64 , #
max_channels_in = 2048 , #
max_kernel_size = 13 , #
max_image_size = 512 , #
ram_weights_depth = 20 , #
ram_edges_depth = 288 , #
axi_width = 64 , #
target_cpu_int_bits = 32 , #
valid_prob = 0.1 , # probability in which AXI-Stream s_valid signal should be toggled in simulation
ready_prob = 0.1 , # probability in which AXI-Stream m_ready signal should be toggled in simulation
data_dir = 'vectors', # directory to store generated test vectors
)
hw.export() # Generates: config_hw.svh, config_hw.tcl, config_tb.svh, hardware.json
hw.export_vivado_tcl(board='zcu104')


'''
1. Build Model
'''
XN = 1
input_shape = (XN,18,18,3) # (XN, XH, XW, CI)

QINT_BITS = 0
kq = f'quantized_bits({hw.K_BITS},{QINT_BITS},False,True,1)'
bq = f'quantized_bits({hw.B_BITS},{QINT_BITS},False,True,1)'
q1 = f'quantized_relu({hw.X_BITS},{QINT_BITS},negative_slope=0)'
q2 = f'quantized_bits({hw.X_BITS},{QINT_BITS},False,False,1)'
q3 = f'quantized_bits({hw.X_BITS},{QINT_BITS},False,True,1)'
q4 = f'quantized_relu({hw.X_BITS},{QINT_BITS},negative_slope=0.125)'

x = x_in = QInput(shape=input_shape[1:], batch_size=XN, hw=hw, int_bits=QINT_BITS, name='input')

x = x_skip1 = Bundle( core= {'type':'conv' , 'filters':8 , 'kernel_size':(11,11), 'strides':(2,1), 'padding':'same', 'kernel_quantizer':kq, 'bias_quantizer':bq, 'use_bias':True , 'act_str':q1}, pool= {'type':'avg', 'size':(3,4), 'strides':(2,3), 'padding':'same', 'act_str':f'quantized_bits({hw.X_BITS},0,False,False,1)'})(x)
x = x_skip2 = Bundle( core= {'type':'conv' , 'filters':8 , 'kernel_size':( 1, 1), 'strides':(1,1), 'padding':'same', 'kernel_quantizer':kq, 'bias_quantizer':bq, 'use_bias':True , 'act_str':q2}, add = {'act_str':f'quantized_bits({hw.X_BITS},0,False,True,1)'})(x, x_skip1)
x = Bundle( core= {'type':'conv' , 'filters':8 , 'kernel_size':( 7, 7), 'strides':(1,1), 'padding':'same', 'kernel_quantizer':kq, 'bias_quantizer':bq, 'use_bias':False, 'act_str':q3}, add = {'act_str':f'quantized_bits({hw.X_BITS},0,False,True,1)'})(x, x_skip2)
x = Bundle( core= {'type':'conv' , 'filters':8 , 'kernel_size':( 5, 5), 'strides':(1,1), 'padding':'same', 'kernel_quantizer':kq, 'bias_quantizer':bq, 'use_bias':True , 'act_str':q4}, add = {'act_str':f'quantized_bits({hw.X_BITS},0,False,True,1)'})(x, x_skip1)
x = Bundle( core= {'type':'conv' , 'filters':24, 'kernel_size':( 3, 3), 'strides':(1,1), 'padding':'same', 'kernel_quantizer':kq, 'bias_quantizer':bq, 'use_bias':True , 'act_str':q1},)(x)
x = Bundle( core= {'type':'conv' , 'filters':10, 'kernel_size':( 1, 1), 'strides':(1,1), 'padding':'same', 'kernel_quantizer':kq, 'bias_quantizer':bq, 'use_bias':True , 'act_str':q4}, flatten= True)(x)
x = Bundle( core= {'type':'dense', 'units' :10, 'kernel_quantizer':kq, 'bias_quantizer':bq, 'use_bias':True , 'act_str':q4}, softmax= True)(x)

model = QModel(inputs=x_in.raw, outputs=x)
model.compile()
model.summary()

'''
2. TRAIN (using qkeras)
'''
# model.fit(...)


'''
3. EXPORT FOR INFERENCE
'''
SIM, SIM_PATH = 'xsim', "F:/Xilinx/Vivado/2022.1/bin/" # For Xilinx Vivado
# SIM, SIM_PATH = 'verilator', "" # For Verilator

model.export_inference(x=model.random_input, hw=hw) # Runs forward pass in float & int, compares them. Generates: config_fw.h (C firmware), weights.bin, expected.bin
model.verify_inference(SIM=SIM, SIM_PATH=SIM_PATH) # Runs SystemVerilog testbench with the model & weights, randomizing handshakes, testing with actual C firmware in simulation

'''
4. IMPLEMENTATION

a. FPGA: Open vivado, source vivado_flow.tcl
b. ASIC: Set PDK paths, run syn.tcl & pnr.tcl
c. Compile C firmware with generated header (config_fw.h) and run on device
'''
```

## Execution API
```c
#define NDEBUG
#include "platform.h"
#include "deepsocflow_xilinx.h"

int main() {

hardware_setup();
xil_printf("Welcome to DeepSoCFlow!\n Store weights, biases & inputs at: %p; \n", &mem.w);

model_setup();
model_run(); // run model and measure time

// Print: outputs & measured time
Xil_DCacheFlushRange((INTPTR)&mem.y, sizeof(mem.y)); // force transfer to DDR, starting addr & length
for (int i=0; i<O_WORDS; i++)
printf("y[%d]: %f \n", i, (float)mem.y[i]);
printf("Done inference! time taken: %.5f ms \n", 1000.0*(float)(time_end-time_start)/COUNTS_PER_SECOND);

hardware_cleanup();
return 0;
}
```

## Motivation

[HLS4ML](https://github.com/fastmachinelearning/hls4ml) is an open-source Python framework widely adopted by the scientific community to generate FPGA & ASIC implementations of custom Deep Neural Networks. CERN has taped out chips with DNN compression algorithms, built with HLS4ML, for use in the LHC. However, deeper neural networks cannot be implemented with HLS4ML, since it instantiates one engine per layer in hardware. This project aims to solve that problem and enhance HLS4ML by creating a statically & dynamically reconfigurable AXI-Stream DNN engine.


## Quick Start

0. You need either [Verilator 5.014+](https://verilator.org/guide/latest/install.html#git-quick-install) or Xilinx Vivado for simulation.

1. Clone this repo and install deepsocflow
```bash
git clone https://github.com/abarajithan11/deepsocflow
cd deepsocflow
pip install .
```

2. Run the example
```bash
# Edit SIM and SIM_PATH in the file to match your simulator
cd run/work
python ../example.py
```

3. FPGA implementation:

3.1. Generate Bitstream from Vivado:
```bash
# Make sure correct fpga board was specified in the above script. Default is ZCU102
# Open Xilinx Vivado, cd into deepsocflow, and type the following in TCL console
cd run/work
source vivado_flow.tcl
```

3.2. Run on a ZYNQ FPGA:

- Open Xilinx Vitis
- Create an application project, using `.xsa` generated by running the `run/work/vivado_flow.tcl`
- Right click on application project -> Properties
- ARM v8 gcc compiler -> Directories -> Add Include Paths: Add absolute paths of `run/work` and `deepsocflow/c`
- ARM v8 gcc compiler -> Optimization -> Optimization most (-O3)
- ARM v8 gcc linker -> Libraries -> Add Library: `m` (math library)
- Build, Connect board & launch debug
- Add a breakpoint at `model_setup()`. When breakpoint hits, load `run/work/vectors/wbx.bin` to the address printed.
- Continue - This will run the model and print outputs & execution time

4. ASIC implementation with Cadence Genus & Innovus:
```bash
# First add your PDK to 'asic/pdk', change paths in the scripts and run:
cd run/work
genus -f ../../tcl/asic/run_genus.tcl
innovus
source ../../tcl/asic/pnr.tcl
```

## Framework Infrastructure

<p align="center"> <img src="docs/infra.png" width="600"> </p>


## Team Members

- Aba
- Zhenghua
<!-- https://github.com/abarajithan11/deepsocflow/assets/26372005/113bfd40-cb4a-4940-83f4-d2ef91b47c91 -->

# CGRA4ML: A Framework to Implement Modern Neural Networks for Scientific Edge Computing ![status](https://github.com/abarajithan11/dnn-engine/actions/workflows/verify.yml/badge.svg)

cgra4ml is a Python library that helps researchers build, train, and implement their own deep ML models, such as ResNet CNNs, Autoencoders, and Transformers, on FPGAs and custom ASICs.

It takes a lot of effort and expertise to implement highly optimized neural networks on edge platforms. The challenging aspects include:

- Designing an optimal dataflow architecture
- Building & verifying an accelerator, optimizing for high-frequency
- Building the System-on-Chip, verifying and optimizing data bottlenecks
- Writing C firmware to control the accelerator and verify its correctness

Often, after all that work, the models do not meet their expected performance due to memory bottlenecks and sub-optimal hardware implementation.

We present a highly flexible, high-performance accelerator system that can be adjusted to your needs through a simple Python API. The framework is maintained as open source, allowing users to customize the processing element for their desired data type, scale the architecture to meet performance targets, and implement new neural network models.

<p align="center"> <img src="docs/overview.png" width="800"> </p>
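This PR removes the old README's `User API` walkthrough, which showed how the accelerator is specified from Python via a `Hardware` object before export. As a reminder of what that configuration step looks like, here is a minimal self-contained sketch; `HardwareConfig` is a hypothetical stand-in for the library's `Hardware` class, and the parameter values are the illustrative ones from the old README:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class HardwareConfig:
    """Hypothetical stand-in for the library's Hardware class (illustrative only)."""
    processing_elements: tuple = (8, 96)  # (rows, columns) of multiply-add units
    frequency_mhz: int = 250
    bits_input: int = 4                   # bit width of input pixels and activations
    bits_weights: int = 4                 # bit width of weights
    bits_sum: int = 16                    # bit width of accumulator
    bits_bias: int = 16                   # bit width of bias
    axi_width: int = 64
    valid_prob: float = 0.1               # s_valid toggle probability in simulation
    ready_prob: float = 0.1               # m_ready toggle probability in simulation

    def export_json(self, path: str = 'hardware.json') -> None:
        # Serialize the configuration so downstream tools can reload it.
        with open(path, 'w') as f:
            json.dump(asdict(self), f, indent=2)

hw = HardwareConfig()
hw.export_json()
```

The real `hw.export()` in the package additionally generates `config_hw.svh`, `config_hw.tcl`, and `config_tb.svh` for the RTL and testbench; this sketch only writes the JSON.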


## Execution API
```c
#define NDEBUG
#include "platform.h"
#include "deepsocflow_xilinx.h"

int main() {

hardware_setup();
xil_printf("Welcome to DeepSoCFlow!\n Store weights, biases & inputs at: %p; \n", &mem.w);

model_setup();
model_run(); // run model and measure time

// Print: outputs & measured time
Xil_DCacheFlushRange((INTPTR)&mem.y, sizeof(mem.y)); // force transfer to DDR, starting addr & length
for (int i=0; i<O_WORDS; i++)
printf("y[%d]: %f \n", i, (float)mem.y[i]);
printf("Done inference! time taken: %.5f ms \n", 1000.0*(float)(time_end-time_start)/COUNTS_PER_SECOND);

hardware_cleanup();
return 0;
}
```

## Motivation

[HLS4ML](https://github.com/fastmachinelearning/hls4ml) is an open-source Python framework widely adopted by the scientific community to generate FPGA & ASIC implementations of custom Deep Neural Networks. CERN has taped out chips with DNN compression algorithms, built with HLS4ML, for use in the LHC. However, deeper neural networks cannot be implemented with HLS4ML, since it instantiates one engine per layer in hardware. This project aims to solve that problem and enhance HLS4ML by creating a statically & dynamically reconfigurable AXI-Stream DNN engine.


## Quick Start

0. You need either [Verilator 5.014+](https://verilator.org/guide/latest/install.html#git-quick-install) or Xilinx Vivado for simulation.

1. Clone this repo and install deepsocflow
```bash
git clone https://github.com/KastnerRG/cgra4ml
cd cgra4ml
pip install .
```

2. Run the example
```bash
# Edit SIM and SIM_PATH in the file to match your simulator
cd run/work
python ../example.py
```

3. FPGA implementation:

3.1. Generate Bitstream from Vivado:
```bash
# Make sure the correct FPGA board was specified in the above script. Default is ZCU102
# Open Xilinx Vivado, cd into the repo, and type the following in the TCL console
cd run/work
source vivado_flow.tcl
```

3.2. Run on a ZYNQ FPGA:

- Open Xilinx Vitis
- Create an application project, using the `.xsa` generated by running `run/work/vivado_flow.tcl`
- Right-click the application project -> Properties
- ARM v8 gcc compiler -> Directories -> Add Include Paths: add the absolute paths of `run/work` and `deepsocflow/c`
- ARM v8 gcc compiler -> Optimization -> Optimize most (-O3)
- ARM v8 gcc linker -> Libraries -> Add Library: `m` (math library)
- Build, connect the board & launch debug
- Add a breakpoint at `model_setup()`. When the breakpoint hits, load `run/work/vectors/wbx.bin` at the printed address.
- Continue: the model runs and prints its outputs & execution time

4. ASIC implementation with Cadence Genus & Innovus:
```bash
# First add your PDK to 'asic/pdk', change paths in the scripts and run:
cd run/work
genus -f ../../tcl/asic/run_genus.tcl
innovus
source ../../tcl/asic/pnr.tcl
```

## Framework Infrastructure

<p align="center"> <img src="docs/infra.png" width="600"> </p>


## Team Members

- Aba
- Zhenghua
8 changes: 6 additions & 2 deletions deepsocflow/__init__.py
@@ -1,2 +1,6 @@
from . import py
from .py import *
from deepsocflow.py.utils import *
from deepsocflow.py.dataflow import *
from deepsocflow.py.xbundle import *
from deepsocflow.py.xmodel import *
from deepsocflow.py.xlayers import *
from deepsocflow.py.hardware import *