Computer Vision CSCI 1430 Final Project
Structure:
  capture.py     -> records screen and raw keyboard inputs
  preprocess.py  -> preprocesses screen recordings and keyboard logs
  dataset.py     -> turns preprocessed data into a dataset for training
  model.py       -> neural network architecture
  train.py       -> trains the network
  test_model.py  -> offline tests for the model
  play.py        -> runs the neural network, interacts with the game
Dev Notes:
Apr 24 (Lixing)
Finished capture.py. Started writing function definitions for preprocess.py.
Need to record more data.
Note: the current video resolution is 960x544; we can consider reducing it by
half to speed up the neural network. We can also consider cropping the
video since we only need the center track...but in some cases the
track is not in the middle, e.g. on the calibration page.
Note: since this is going to be real-time neural network inference,
we really, really need LOW LATENCY. An RNN is not very computationally
heavy compared to a CNN, so the CNN is the part to keep small. We can
consider something like ResNet-50, or a shallower variant, to speed things up.
Apr 25 (Lixing)
Reduced video resolution to 480x272 (half of original).
Added more data.
Apr 26 (Lixing)
Started implementing preprocess.py. Implemented 1) a key mapping that
turns every relevant key into an integer index, and 2) key_log array
conversion and visualization, which encodes every key log as an
array of shape (total_frame_count, 6). Improved documentation.
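Roughly, the key-log encoding looks like this (a sketch only; the key layout
and function names below are illustrative, not the actual preprocess.py API):

    # Sketch of the per-frame key-state encoding described above.
    # Each of the 6 keys maps to a column; 1 means the key is held on that frame.
    import numpy as np

    KEY_TO_INDEX = {'s': 0, 'd': 1, 'f': 2, 'j': 3, 'k': 4, 'l': 5}  # assumed key layout

    def encode_key_log(events, total_frame_count, fps=30.0):
        """events: list of (timestamp_sec, key, is_press) tuples."""
        labels = np.zeros((total_frame_count, 6), dtype=np.float32)
        state = np.zeros(6, dtype=np.float32)
        prev_frame = 0
        for t, key, is_press in sorted(events):
            frame = min(int(t * fps), total_frame_count)
            labels[prev_frame:frame] = state          # hold the current state up to this event
            state[KEY_TO_INDEX[key]] = 1.0 if is_press else 0.0
            prev_frame = frame
        labels[prev_frame:] = state                   # fill the tail after the last event
        return labels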
Apr 27 (Lixing)
Added more data. Bug fixes for preprocess.py
May 1 (Jonie)
Finished preprocess.py (added validation and trim_video)
Implemented validation() to check that key logs have matching key press/release pairs.
Created trim_video() to trim videos and sync them with key logs.
Helped make the preprocessed data usable for model training.
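The press/release validation is conceptually something like this (an
illustrative sketch, not the exact validation() implementation):

    # A log is valid if every key press is eventually followed by a matching
    # release, with no double-press or stray release in between.
    def validate_key_log(events):
        """events: chronological list of (timestamp_sec, key, is_press)."""
        held = set()
        for _, key, is_press in events:
            if is_press:
                if key in held:        # pressed again before being released
                    return False
                held.add(key)
            else:
                if key not in held:    # release without a matching press
                    return False
                held.remove(key)
        return len(held) == 0          # nothing still held at the end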
May 1 (Lixing)
In capture.py, further reduced the video size to 128x224 and cropped
the videos to the tracks in the middle.
To accommodate videos already collected, trim_video() in preprocess.py
now automatically crops video frames before saving the video.
Added more data. Cleaned a lot of data with validation().
May 6 (Lixing)
Added more data. Finished dataset.py, which defines a Dataset
class and uses tf.data.Dataset. When calling model.fit(), we can do:
    dataset = Dataset(capture_path, key_log_path, batch_size=8)
    model.fit(dataset.batched_data, epochs=60)
or
    model.fit(dataset.data, epochs=60)
when batch_size is not specified.
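For reference, the wrapper is shaped roughly like this (the internals below
are assumptions; only the attribute names match the usage above, and the
.npy file format is assumed):

    import numpy as np
    import tensorflow as tf

    class Dataset:
        def __init__(self, capture_path, key_log_path, batch_size=None):
            frames = np.load(capture_path)    # (N, H, W) preprocessed frames
            labels = np.load(key_log_path)    # (N, 6) per-frame key states
            self.data = tf.data.Dataset.from_tensor_slices((frames, labels))
            if batch_size is not None:
                self.batched_data = self.data.batch(batch_size).prefetch(tf.data.AUTOTUNE)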
May 9 (Jonie)
Wrote and finalized play.py.
play.py: loads the trained model and runs real-time gameplay. The file
uses screen capture and simulates key presses.
Implemented the Player class with modular methods (sketch of the loop below):
get_loaded_model(): loads the trained model.
preprocess_frame(): captures and processes frames.
update_keys(): simulates key presses based on model output probabilities.
release_keys(): releases all currently pressed keys.
play(): main game loop. Loads the model, captures the screen, and
simulates gameplay with key presses in real time.
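The play() loop is roughly the following (a sketch; mss for screen capture and
pynput for key simulation are assumptions about the libraries play.py uses,
and KEYS/THRESHOLD are illustrative):

    import numpy as np
    import mss
    from pynput.keyboard import Controller

    KEYS = ['s', 'd', 'f', 'j', 'k', 'l']   # assumed 6-key layout
    THRESHOLD = 0.5                          # press a key when its probability exceeds this

    def play(model, preprocess_frame, region):
        keyboard = Controller()
        pressed = set()
        with mss.mss() as sct:
            while True:
                raw = np.array(sct.grab(region))      # capture the game window
                frame = preprocess_frame(raw)         # resize / crop / normalize
                probs = model(frame)                  # assumed to return 6 per-key probabilities
                for key, p in zip(KEYS, np.asarray(probs).ravel()):
                    if p > THRESHOLD and key not in pressed:
                        keyboard.press(key)
                        pressed.add(key)
                    elif p <= THRESHOLD and key in pressed:
                        keyboard.release(key)
                        pressed.discard(key)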
May 9 (Tiffany)
Finished writing, training, and testing the model.
Split the original model.py into three separate files for better
readability and modularity
model.py: Contains the definition of DJMaxModel, i.e. the neural network
architecture: the CNN feature extractor (ResNet18), the RNN sequence
model (GRU by default), and the fully connected output layers.
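The model is shaped roughly like this (a sketch; hidden sizes and the head
layout are assumptions, not the exact DJMaxModel):

    import torch
    import torch.nn as nn
    import torchvision

    class DJMaxModel(nn.Module):
        def __init__(self, num_keys=6, hidden_size=256):
            super().__init__()
            backbone = torchvision.models.resnet18(weights=None)
            backbone.fc = nn.Identity()               # keep the 512-dim pooled features
            self.cnn = backbone
            self.rnn = nn.GRU(512, hidden_size, batch_first=True)
            self.head = nn.Linear(hidden_size, num_keys)

        def forward(self, x):                         # x: (batch, time, C, H, W)
            b, t = x.shape[:2]
            feats = self.cnn(x.flatten(0, 1)).reshape(b, t, -1)
            out, _ = self.rnn(feats)                  # GRU over the frame sequence
            return self.head(out)                     # per-frame key logits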
train.py:
Implements the training loop for the model, including loss
calculation, backpropagation, and optimization.
Saves the model checkpoint at each epoch for later use
Also has some data processing (which may be redundant with preprocessing code)
VideoDataset class handles the video and label data processing:
- Loads preprocessed videos and key logs.
- Resizes video frames to a smaller resolution (112x64) for faster
processing.
- Converts the video and label arrays to tensors for training.
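The training loop in train.py is roughly the following (a sketch; the optimizer
choice, learning rate, and checkpoint naming are assumptions):

    import torch

    def train(model, loader, epochs, device, criterion):
        model.to(device)
        model.train()
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
        for epoch in range(epochs):
            for frames, labels in loader:
                frames, labels = frames.to(device), labels.to(device)
                optimizer.zero_grad()
                loss = criterion(model(frames), labels)   # per-frame key loss
                loss.backward()                           # backpropagation
                optimizer.step()
            torch.save(model.state_dict(), f"checkpoint_epoch{epoch}.pt")  # checkpoint each epoch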
test_model.py:
Contains functions and scripts for evaluating the trained model on
a test dataset.
May 9 (Jonie)
Looked over model.py, ran the model test to generate result graphs for
the poster.
Updated the main method in test_model.py to make sure we can create
.npy files from video files.
Ported play.py from TensorFlow to PyTorch for consistency with the
model and training code.
May 9 (Lixing)
Since we switched to torch, I rewrote dataset.py, expanding the
VideoDataset() from train.py to the torch Dataset API so the model can
train on the whole dataset instead of just one video.
Adjusted train.py so we can use Apple MPS.
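Roughly, the torch-side dataset and device selection look like this (a sketch;
the file layout/naming and one-item-per-video granularity are assumptions):

    import glob
    import numpy as np
    import torch
    from torch.utils.data import Dataset

    class VideoDataset(Dataset):
        def __init__(self, data_dir):
            self.video_paths = sorted(glob.glob(f"{data_dir}/*_frames.npy"))
            self.label_paths = sorted(glob.glob(f"{data_dir}/*_labels.npy"))

        def __len__(self):
            return len(self.video_paths)              # indexes across all recorded videos

        def __getitem__(self, i):                     # one item = one full video
            frames = torch.from_numpy(np.load(self.video_paths[i])).float()
            labels = torch.from_numpy(np.load(self.label_paths[i])).float()
            return frames, labels

    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")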
May 10 (Lixing)
The original model architecture doesn't work well. Tried increasing the
parameter count; tried switching from a CNN to a Transformer for feature
extraction.
May 11 (Lixing)
1) Large revision to the main model architecture. The old architecture was
ResNet18 -> AvgPool2d -> 1-layer GRU -> Linear. The problem with this was
that the pooling layers significantly limited the model's ability to represent
spatial information, which rhythm games rely on heavily.
The new model is:
ResNet18 (MaxPool removed, vertical stride reduced) -> split the
128 feature maps horizontally into 4 chunks, and additionally group
chunks 1-2 and 3-4 into a 0th and a 5th chunk (6 chunks, one per key) ->
for each chunk, a Conv2d compresses the chunk (128 channels) to a
single-channel sequence by collapsing the width to 1 while retaining the
height -> a 3-layer GRU on each sequence -> Linear. (Rough sketch below.)
This new model captures spatial information well and reduces interference
between tracks.
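Rough sketch of the new architecture (assumptions: 224-tall, 128-wide input
frames, each per-track GRU running over the frame sequence in time, and the
specific truncation point / GRU width below; see model.py for the real version):

    import torch
    import torch.nn as nn
    import torchvision

    class TrackModel(nn.Module):
        def __init__(self, feat_h=112, feat_w=32, hidden=96):
            super().__init__()
            resnet = torchvision.models.resnet18(weights=None)
            resnet.maxpool = nn.Identity()                    # drop MaxPool to keep vertical detail
            # reduce the vertical stride in layer2 so note positions stay resolvable
            resnet.layer2[0].conv1.stride = (1, 2)
            resnet.layer2[0].downsample[0].stride = (1, 2)
            self.backbone = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu,
                                          resnet.layer1, resnet.layer2)   # -> 128 channels
            # 6 chunk widths: the two grouped halves (keys 0 and 5) plus the 4 quarters
            widths = [feat_w // 2, feat_w // 4, feat_w // 4,
                      feat_w // 4, feat_w // 4, feat_w // 2]
            self.track_convs = nn.ModuleList(
                [nn.Conv2d(128, 1, kernel_size=(1, w)) for w in widths])   # width -> 1, height kept
            self.grus = nn.ModuleList(
                [nn.GRU(feat_h, hidden, num_layers=3, batch_first=True) for _ in widths])
            self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in widths])

        def forward(self, x):                         # x: (batch, T, 3, H, W) video clips
            b, t = x.shape[:2]
            fmap = self.backbone(x.flatten(0, 1))     # (batch*T, 128, H', W')
            q = torch.chunk(fmap, 4, dim=3)           # 4 track chunks along the width
            chunks = [torch.cat(q[:2], dim=3), *q, torch.cat(q[2:], dim=3)]  # 6 chunks, one per key
            logits = []
            for chunk, conv, gru, head in zip(chunks, self.track_convs, self.grus, self.heads):
                seq = conv(chunk).reshape(b, t, -1)   # (batch, T, H'): one height profile per frame
                out, _ = gru(seq)                     # this track's GRU, over time
                logits.append(head(out))              # (batch, T, 1)
            return torch.cat(logits, dim=-1)          # (batch, T, 6) per-frame key logits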
2) Designed and implemented a new loss function. The original was plain
BCE, and I observed that the model was fitting key states
(0 or 1) but not responding to key events (the times at which to switch
between 0 and 1) on time. The new one is a dynamic loss function,
built on BCE, that: a) punishes the model less for missing a key's
exact timestamp (it is allowed to be ~3 frames off), and b) punishes
the model more for missing sudden changes in the key labels. These
are done by preprocessing the labels with a 1d convolution and setting
the pos_weight of BCE dynamically to emphasize state changes in the labels.
See moving_avg() and dynamic_loss() in train.py.
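Conceptually, the two pieces look something like this (a sketch; the window
size, the weighting formula, and the scale factor are assumptions, and the
real moving_avg()/dynamic_loss() may differ):

    import torch
    import torch.nn.functional as F

    def moving_avg(labels, window=7):
        """Smooth 0/1 key labels over time with a 1d convolution (~+/-3 frames)."""
        # labels: (batch, T, 6) -> convolve each key track along the time axis
        kernel = torch.ones(1, 1, window, device=labels.device) / window
        x = labels.permute(0, 2, 1).reshape(-1, 1, labels.shape[1])   # (batch*6, 1, T)
        smoothed = F.conv1d(x, kernel, padding=window // 2)
        return smoothed.reshape(labels.shape[0], 6, -1).permute(0, 2, 1)

    def dynamic_loss(logits, labels, change_scale=4.0):
        soft = moving_avg(labels)                  # tolerate small timing offsets
        change = (soft - labels).abs()             # large near key-state transitions
        pos_weight = 1.0 + change_scale * change   # emphasize the transition frames
        return F.binary_cross_entropy_with_logits(logits, soft, pos_weight=pos_weight)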
3) Trained the model. Our final model has around 3M params, trained
for 14 epochs.
4) Extensive offline and online model testing. Changed the test
visualization so we now see raw model outputs rather than binarized
values, for more accurate evaluation. Recorded demo videos.
5) Implemented inference_forward() in model.py, suitable for real-time
inference.
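Roughly, real-time inference processes one frame at a time and carries each
track's GRU hidden state between calls (a sketch written against the
TrackModel sketch above; the real inference_forward() signature may differ):

    import torch

    @torch.no_grad()
    def inference_forward(model, frame, states=None):
        """frame: (1, 3, H, W); states: list of per-track GRU hidden states, or None."""
        if states is None:
            states = [None] * len(model.grus)
        fmap = model.backbone(frame)               # (1, 128, H', W')
        q = torch.chunk(fmap, 4, dim=3)
        chunks = [torch.cat(q[:2], dim=3), *q, torch.cat(q[2:], dim=3)]
        probs, new_states = [], []
        for chunk, conv, gru, head, h in zip(chunks, model.track_convs,
                                             model.grus, model.heads, states):
            seq = conv(chunk).reshape(1, 1, -1)    # a length-1 sequence for this single frame
            out, h = gru(seq, h)                   # reuse the hidden state from the previous frame
            new_states.append(h)
            probs.append(torch.sigmoid(head(out[:, -1])))
        return torch.cat(probs, dim=1), new_states # (1, 6) key probabilities + updated states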
6) Modified play.py to use grayscale screen capture for real-time
inference, because there was a mismatch between the color profiles of the
live capture and the mp4 videos in the training dataset, and the model
didn't recognize the orange keys well.
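The grayscale conversion itself is small (a sketch; the mss/cv2 usage here is
an assumption about play.py's internals):

    import cv2
    import numpy as np
    import mss

    with mss.mss() as sct:
        raw = np.array(sct.grab(sct.monitors[1]))      # BGRA screenshot of the main monitor
        gray = cv2.cvtColor(raw, cv2.COLOR_BGRA2GRAY)  # one channel, no color profile to mismatch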