cv_final

Computer Vision CSCI 1430 Final Project

Structure:
    capture.py    -> records screen and raw keyboard inputs
    preprocess.py -> preprocesses screen recordings and keyboard logs
    dataset.py    -> turns preprocessed data into a dataset for training
    model.py      -> neural network architecture
    train.py      -> trains the network
    test_model.py -> offline tests for the model
    play.py       -> runs the neural network, interacts with the game

Dev Notes:

Apr 24 (Lixing)
    Finished capture.py. Started writing function definitions for
    preprocess.py. Need to record more data.

    Note: current video resolution is 960x544; consider reducing it by
    half to speed up the neural network. We could also crop the
    video since we only need the center track... but in some cases the
    track is not in the middle, e.g. on the calibration page.

    Note: since this will be real-time inference of a neural network,
    we really need LOW LATENCY. An RNN is not very computationally
    heavy compared to a CNN, so the CNN is the bottleneck; we could
    use something like ResNet-50, or a shallower variant, to speed
    things up.

Apr 25 (Lixing)
    Reduced video resolution to 480x272 (half of original).
    Added more data.

Apr 26 (Lixing)
    Started implementing preprocess.py. Implemented 1) a key mapping that
    turns every relevant key into an integer index and 2) key_log array
    conversion and visualization that encodes every key_log as an
    array of shape (total_frame_count, 6). Improved documentation.
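
    A minimal sketch of the conversion; the actual key set and event
    format in preprocess.py are assumptions here:

        import numpy as np

        # Hypothetical key set for the six lanes; the real mapping
        # lives in preprocess.py.
        KEY_MAP = {"a": 0, "s": 1, "d": 2, "j": 3, "k": 4, "l": 5}

        def key_log_to_array(events, total_frame_count):
            """Convert (frame_idx, key, is_press) events into a
            (total_frame_count, 6) array of per-frame key states."""
            states = np.zeros((total_frame_count, 6), dtype=np.float32)
            held = np.zeros(6, dtype=np.float32)
            i = 0
            events = sorted(events)  # order by frame index
            for frame in range(total_frame_count):
                # Apply every event up to and including this frame.
                while i < len(events) and events[i][0] <= frame:
                    _, key, is_press = events[i]
                    held[KEY_MAP[key]] = 1.0 if is_press else 0.0
                    i += 1
                states[frame] = held
            return states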

Apr 27 (Lixing)
    Added more data. Bug fixes for preprocess.py.

May 1 (Jonie)
    Finished preprocess.py (added validation and trim_video).
    Implemented validation() so key logs have matching key press/release pairs.
    Created trim_video() to trim videos and sync them with key logs.
    Helped make the preprocessed data usable for model training.
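
    A sketch of the matched press/release check, assuming the same
    (frame_idx, key, is_press) event tuples as above; the real
    validation() may differ:

        def validate(events):
            """Return True iff every press has a matching release."""
            pressed = set()
            for _, key, is_press in sorted(events):
                if is_press:
                    if key in pressed:
                        return False  # double press without a release
                    pressed.add(key)
                else:
                    if key not in pressed:
                        return False  # release without a press
                    pressed.discard(key)
            return not pressed  # everything released by the end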

May 1 (Lixing)
    In capture.py, further reduced video size to 128x224 and cropped
    the videos to the tracks in the middle.
    To remediate videos already collected, preprocess.py's
    trim_video() automatically crops video frames before saving the video.
    Added more data. Cleaned a lot of data with validation().

May 6 (Lixing)
    Added more data. Finished dataset.py, which defines a Dataset
    class and uses tf.data.Dataset. When calling model.fit(), we can do:

        dataset = Dataset(capture_path, key_log_path, batch_size=8)
        model.fit(dataset.batched_data, epochs=60)

    or, when batch_size is not specified:

        model.fit(dataset.data, epochs=60)
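
    A sketch of the wrapper implied by this usage (the pipeline was
    TensorFlow at this point); how dataset.py actually loads frames and
    labels is an assumption:

        import numpy as np
        import tensorflow as tf

        class Dataset:
            def __init__(self, capture_path, key_log_path, batch_size=None):
                frames = np.load(capture_path)   # (N, H, W, C) frames
                labels = np.load(key_log_path)   # (N, 6) key-state array
                self.data = tf.data.Dataset.from_tensor_slices(
                    (frames, labels))
                if batch_size is not None:
                    self.batched_data = self.data.batch(batch_size)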

May 9 (Jonie)
    Wrote and finalized play.py.

    play.py loads the trained model and runs real-time gameplay: it
    uses screen capture and simulates key presses.
    Implemented a Player class with modular methods:
        get_loaded_model(): loads the trained model.
        preprocess_frame(): captures and processes frames.
        update_keys(): simulates key presses based on model output probabilities.
        release_keys(): releases all currently pressed keys.
        play(): main game loop. Loads the model, captures screen input, and
        simulates gameplay via key presses in real time.
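
    A rough sketch of that loop, using mss for capture and pynput for key
    simulation. Whether play.py uses these libraries, the key set, the
    press threshold, and the input size are all assumptions here:

        import cv2
        import numpy as np
        import torch
        from mss import mss
        from pynput.keyboard import Controller

        KEYS = ["a", "s", "d", "j", "k", "l"]  # hypothetical lane keys
        THRESHOLD = 0.5                        # assumed press threshold

        def preprocess_frame(frame):
            # Stand-in for Player.preprocess_frame(): grayscale,
            # resize to an assumed model input size, normalize.
            gray = cv2.cvtColor(frame, cv2.COLOR_BGRA2GRAY)
            small = cv2.resize(gray, (64, 112))
            return torch.from_numpy(small).float().div(255)[None, None]

        def play(model, region):
            keyboard = Controller()
            pressed = set()
            with mss() as sct, torch.no_grad():
                while True:
                    frame = np.array(sct.grab(region))  # BGRA screen grab
                    x = preprocess_frame(frame)
                    # Assume the model returns one logit per key.
                    probs = torch.sigmoid(model(x)).squeeze()
                    for i, key in enumerate(KEYS):
                        if probs[i] > THRESHOLD and key not in pressed:
                            keyboard.press(key)
                            pressed.add(key)
                        elif probs[i] <= THRESHOLD and key in pressed:
                            keyboard.release(key)
                            pressed.discard(key)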

May 9 (Tiffany)
    Finished writing, training, and testing the model.

    Split the original model.py into three separate files for better
    readability and modularity.

    model.py: contains the definition of DJMaxModel, i.e. the neural
        network architecture: a CNN feature extractor (ResNet18), an RNN
        sequence model (GRU by default), and fully connected output layers.

    train.py:
        Implements the training loop for the model, including loss
        calculation, backpropagation, and optimization (sketched after
        this list).
        Saves a model checkpoint at each epoch for later use.

        Also has some data processing (which may be redundant with the
        preprocessing code). The VideoDataset class handles the video
        and label data processing:
        - Loads preprocessed videos and key logs.
        - Resizes video frames to a smaller resolution (112x64) for faster
        processing.
        - Converts the video and label arrays to tensors for training.
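
    A sketch of that training loop; the optimizer, learning rate, batch
    size, and checkpoint naming are assumptions:

        import torch
        from torch.utils.data import DataLoader

        def train(model, dataset, epochs, device, loss_fn):
            loader = DataLoader(dataset, batch_size=8, shuffle=True)
            optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
            model.to(device).train()
            for epoch in range(epochs):
                for frames, labels in loader:
                    frames = frames.to(device)
                    labels = labels.to(device)
                    optimizer.zero_grad()
                    loss = loss_fn(model(frames), labels)  # loss calculation
                    loss.backward()                        # backpropagation
                    optimizer.step()                       # optimization
                # Save a checkpoint at each epoch for later use.
                torch.save(model.state_dict(),
                           f"checkpoint_epoch_{epoch}.pt")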

    test_model.py:
        Contains functions and scripts for evaluating the trained model on 
        a test dataset.

May 9 (Jonie)
    Looked over model.py, ran the model test to generate result graphs for 
    the poster.

    Updated main method in test_model to make sure we are able to create 
    .npy files from video files.

    Ported play.py from TensorFlow to PyTorch for consistency with the
    model and training code.

May 9 (Lixing)
    Since we switched to torch, I rewrote dataset.py, expanding
    VideoDataset() from train.py to the torch Dataset API so the model
    can train on the whole dataset instead of just one video.

    Adjusted train.py so we can use Apple MPS.
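
    Device selection, for reference: this is a standard PyTorch pattern,
    not necessarily the exact code in train.py:

        import torch

        # Prefer Apple MPS when available, fall back to CUDA, then CPU.
        if torch.backends.mps.is_available():
            device = torch.device("mps")
        elif torch.cuda.is_available():
            device = torch.device("cuda")
        else:
            device = torch.device("cpu")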

May 10 (Lixing)
    The original model architecture doesn't work well. Tried increasing
    the parameter count; tried switching from a CNN to a Transformer for
    feature extraction.

May 11 (Lixing)
    1) Large revision to the main model architecture. The old architecture
    was ResNet18 -> AvgPool2d -> 1-layer GRU -> Linear. The problem was
    that the pooling layers significantly limited the ability to represent
    spatial information, which rhythm games rely on heavily.

    The new model is:
    ResNet18 (removed MaxPool, reduced vertical stride) -> split the
    128 feature maps horizontally into 4 chunks, grouping chunks 1-2 and
    3-4 as additional 0th and 5th chunks -> for each chunk, a Conv2d
    compresses the chunk (128 channels) to a sequence (1 channel) by
    compressing width to 1 and retaining height -> 3-layer GRU on each
    sequence -> Linear.

    This new model captures spatial information well and reduces
    interference between tracks.
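
    A rough sketch of the chunking idea. The layer sizes, strides, and
    the stand-in backbone are assumptions; the real model.py differs in
    detail:

        import torch
        import torch.nn as nn

        class ChunkedTrackModel(nn.Module):
            def __init__(self, feat_channels=128, feat_width=8, hidden=64):
                super().__init__()
                # Stand-in for the modified ResNet18 backbone: keeps
                # height, halves width.
                self.backbone = nn.Sequential(
                    nn.Conv2d(1, feat_channels, 3, stride=(1, 2), padding=1),
                    nn.ReLU(),
                )
                w = feat_width // 4
                # One Conv2d per chunk: 128 channels -> 1, width -> 1,
                # height retained, giving one 1-D sequence per chunk.
                self.compress = nn.ModuleList(
                    nn.Conv2d(feat_channels, 1, kernel_size=(1, kw))
                    for kw in [2 * w, w, w, w, w, 2 * w]
                )
                self.grus = nn.ModuleList(
                    nn.GRU(1, hidden, num_layers=3, batch_first=True)
                    for _ in range(6)
                )
                self.head = nn.Linear(hidden, 1)

            def forward(self, x):                 # x: (B, 1, H, 16)
                f = self.backbone(x)              # (B, 128, H, 8)
                c = list(f.chunk(4, dim=3))       # 4 horizontal chunks
                # Chunks 1-2 and 3-4 also grouped as 0th and 5th chunks.
                c = [torch.cat(c[:2], 3), *c, torch.cat(c[2:], 3)]
                outs = []
                for chunk, conv, gru in zip(c, self.compress, self.grus):
                    seq = conv(chunk).squeeze(1)  # (B, H, 1) sequence
                    out, _ = gru(seq)
                    outs.append(self.head(out[:, -1]))
                return torch.cat(outs, dim=1)     # (B, 6) key logits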

    2) Designed and implemented a new loss function. The original was just
    BCE, and I observed that the model was fitting the key states
    (0 or 1) but not responding to key events (the times at which to
    switch between 0 and 1) on time. The new one is a dynamic loss
    function, built on BCE, that a) punishes the model less for missing
    a key's exact time stamp (it is allowed to be ~3 frames off) and
    b) punishes the model more for missing sudden changes in the key
    labels. These are done by preprocessing the labels via a 1-D
    convolution, and by setting the pos_weight of BCE dynamically to
    emphasize state changes in the labels. See moving_avg() and
    dynamic_loss() in train.py.
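
    A sketch of the two mechanisms. The smoothing kernel size and the
    weighting scheme are assumptions, and this sketch applies a
    per-element BCE weight where train.py sets pos_weight dynamically:

        import torch
        import torch.nn.functional as F

        def moving_avg(labels, k=7):
            """Smooth 0/1 key labels over time with a 1-D convolution so
            predictions within ~3 frames of an event are punished less.
            labels: (B, T, 6)."""
            B, T, C = labels.shape
            kernel = torch.ones(C, 1, k, device=labels.device) / k
            x = labels.transpose(1, 2)                 # (B, 6, T)
            sm = F.conv1d(x, kernel, padding=k // 2, groups=C)
            return sm.transpose(1, 2)                  # (B, T, 6)

        def dynamic_loss(logits, labels, change_weight=4.0):
            targets = moving_avg(labels)
            # Frames where a key state flips get a larger weight.
            change = (labels[:, 1:] != labels[:, :-1]).float()
            change = F.pad(change, (0, 0, 1, 0))       # back to (B, T, 6)
            weight = 1.0 + change_weight * change
            return F.binary_cross_entropy_with_logits(
                logits, targets, weight=weight)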

    3) Trained the model. Our final model has around 3M params, trained 
    for 14 epochs.

    4) Extensive model testing, offline and online tests. Changed the 
    test visualization, so now we see raw model output rather than 
    binarized values, for more accurate evaluation. Recorded demo videos. 

    5) Implemented inference_forward() in model.py, suitable for real-time 
    inference. 
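
    The usual way to make a recurrent model real-time is to carry the
    hidden state across frames so each frame costs one recurrent step
    instead of a full sequence re-run. A sketch under that assumption;
    extract_features, gru, and head are hypothetical attributes, not the
    actual model.py API:

        import torch

        class StatefulInference:
            def __init__(self, model):
                self.model = model.eval()
                self.hidden = None  # GRU state, carried across frames

            @torch.no_grad()
            def step(self, frame):
                # Hypothetical split into per-frame feature extraction
                # and a single recurrent step.
                feats = self.model.extract_features(frame.unsqueeze(0))
                out, self.hidden = self.model.gru(feats, self.hidden)
                return torch.sigmoid(self.model.head(out[:, -1]))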

    6) Modified play.py to use grayscale screen capture for real-time
    inference: there was a mismatch between the color profiles of the
    live capture and the mp4 videos in the training dataset, and the
    model didn't recognize the orange keys well.
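
    For reference, the conversion might look like this; whether play.py
    uses OpenCV for it is an assumption (mss returns BGRA frames):

        import cv2
        import numpy as np

        def to_model_gray(frame_bgra):
            # Grayscale the live grab so capture matches the training
            # videos regardless of color profile.
            gray = cv2.cvtColor(frame_bgra, cv2.COLOR_BGRA2GRAY)
            return gray.astype(np.float32) / 255.0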
