Shared memory issue while running Protenix example #74

Open
rakeshr10 opened this issue Feb 21, 2025 · 10 comments

@rakeshr10

rakeshr10 commented Feb 21, 2025

Hi, when I try to run the example I get the following error. How can I resolve this memory issue?

protenix predict --input examples/example.json --out_dir ./output --seeds 101

[Screenshot of the error]
@zhangyuxuann
Collaborator

@rakeshr10 it's a similar issue to #2

@rakeshr10
Author

@zhangyuxuann Thanks for your reply. Adding shared memory worked, but I was wondering if there is a way to run without it. I need to do large-scale runs by spinning up pods on Kubernetes, so relying on /dev/shm might become an issue.

Another question: is there a way to download the model weights once and point the inference script to them, instead of downloading them every time a new pod is spun up?

It would also be helpful if you could provide some guidance on how to use the DeepSpeed evoformer attention kernel without compiling it every time a new pod is spun up. I noticed that ninja workers are invoked to compile the evoformer code when the compiled CUDA kernel is not found in the /root/.cache folder, and this ends up taking a lot of time even though the inference itself is faster.

@zhangyuxuann
Collaborator

@rakeshr10

  1. PyTorch's DataLoader uses shared memory by default to speed up data loading. You can disable multi-process data loading by setting --num_workers 0 and try again.
  2. If you put or copy the weights and CCD cache into ./release_data/ when the pod is spun up, they will not be downloaded again.
  3. If you want to use the DeepSpeed evoformer attention kernel without compiling it every time a new pod is spun up, you can modify builder.py as follows. For example, copy your precompiled evoformer_attn.so to /root/.cache/torch_extensions/ or another path (a combined Dockerfile sketch is given after the snippets below).
# new_builder.py: modified section of deepspeed/ops/op_builder/builder.py, inside the
# method that JIT-builds the op (the call to torch.utils.cpp_extension.load);
# os, importlib and load are assumed to already be in scope there, as in the original file.
        if self.name == "evoformer_attn" and os.path.exists(
            "/root/.cache/torch_extensions/py39_cu121/evoformer_attn/evoformer_attn.so"
        ):
            # reuse the precompiled kernel instead of recompiling it with ninja/nvcc
            import sys

            sys.path.append("/root/.cache/torch_extensions/py39_cu121/evoformer_attn")
            op_module = importlib.import_module("evoformer_attn")
            print(
                "using precompilation from /root/.cache/torch_extensions/py39_cu121/evoformer_attn"
            )
        else:
            # fall back to the original JIT build
            op_module = load(
                name=self.name,
                sources=self.strip_empty_entries(sources),
                extra_include_paths=self.strip_empty_entries(extra_include_paths),
                extra_cflags=cxx_args,
                extra_cuda_cflags=nvcc_args,
                extra_ldflags=self.strip_empty_entries(self.extra_ldflags()),
                verbose=verbose,
            )

If you build your image with Docker, just copy the modified builder.py over the installed one, like:

COPY ./new_builder.py /usr/local/lib/python3.9/dist-packages/deepspeed/ops/op_builder/builder.py
# you can find the destination path with:
#   import deepspeed
#   print(deepspeed.__file__)
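Putting the three points together, here is a minimal Dockerfile sketch. The destination paths follow the examples above; the source paths (./release_data/, ./evoformer_attn.so, ./new_builder.py) and the /workspace working directory are assumptions, so adjust them to your own image layout.

# Sketch only: left-hand paths are files prepared locally, right-hand paths are assumptions.

# Point 2: bake the weights and CCD cache into the image so a new pod does not re-download them.
#          ./release_data/ should end up in the working directory from which protenix predict runs.
WORKDIR /workspace
COPY ./release_data/ /workspace/release_data/

# Point 3: ship the precompiled evoformer kernel and the patched builder.py so DeepSpeed
#          skips the ninja/nvcc compilation on startup.
COPY ./evoformer_attn.so /root/.cache/torch_extensions/py39_cu121/evoformer_attn/evoformer_attn.so
COPY ./new_builder.py /usr/local/lib/python3.9/dist-packages/deepspeed/ops/op_builder/builder.py

Inside such a pod, point 1 is then just a command-line flag, e.g. protenix predict --input examples/example.json --out_dir ./output --seeds 101 --num_workers 0.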

@rakeshr10
Author

@zhangyuxuann Thanks for your reply. I tried some of your suggestions and noticed the following.

  1. Setting --num_workers 0 worked, but it was interesting to see that this was required only when inference needed to run on GPU. One question: does setting --num_workers 0 affect batch inference jobs?
  2. I was able to package the compiled evoformer code into the Docker image; however, I noticed that the inference speed was not much different from normal inference. If I use layernorm, I do see a speedup during inference. Similar to evoformer, is there a way to package the precompiled layernorm CUDA kernel into the Docker image so that it does not recompile whenever a new pod is spun up?
  3. How do I specify bf16 weights for inference, and is there any speedup or benefit to using bf16 weights?

@zhangyuxuann
Collaborator

zhangyuxuann commented Feb 27, 2025

@rakeshr10

  1. Could you describe the first question in more detail? By the way, I think batch inference mode is not needed, because you can put multiple instances to be inferred in one JSON file and have different pods infer different JSON files.
  2. You can copy the precompiled fastfold_layer_norm_cuda.so to ./protenix/model/layer_norm and it will not recompile layernorm again (see the sketch below). The inference speedup from the evoformer kernel is more noticeable for long sequences.
  3. You can set configs.skip_amp.confidence_head = False and configs.skip_amp.sample_diffusion = False to enable mixed BF16 inference, which is faster and more memory-efficient for long sequences. For better performance, we recommend the default inference settings.
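For point 2, a minimal Dockerfile sketch. The destination is an assumption: copy the .so next to wherever protenix/model/layer_norm lives in your image (a repo checkout or the installed package; python -c "import protenix; print(protenix.__file__)" shows the latter), and fastfold_layer_norm_cuda.so is assumed to have been compiled once in an identical environment.

# Sketch only: /workspace/protenix is a placeholder for your protenix checkout or install location.
COPY ./fastfold_layer_norm_cuda.so /workspace/protenix/model/layer_norm/fastfold_layer_norm_cuda.so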

[Screenshot of the default inference settings]

@rakeshr10
Author

rakeshr10 commented Feb 27, 2025

@zhangyuxuann Thanks for your reply. I was able to add fastfold_layer_norm_cuda.so to the path and now it works. I noticed my inference settings are the same as in the picture, so I won't change them.

Regarding my first question: what I meant was that, with --num_workers 0, inference no longer requires shared memory when running on GPU. My question was whether setting --num_workers 0 causes any issue when running the protenix predict command on a multi-sequence JSON file, which I am assuming is batch inference, where the JSON file contains multiple sequences and MSA files.

I tried running the colabfold_msa.py script on a single FASTA file containing multiple sequence pairs. While colabfold_search generates multiple a3m files, the a3m processor does not convert all of the a3m files into the corresponding pairing and non-pairing a3m files needed to run protenix inference on multiple a3m files.

Also, it would be great if this script could take the FASTA and a3m files and convert them into JSON input files for running protenix inference.

@JinyuanSun


The issue may be caused by the FASTA format. Can you provide the input file so I can reproduce the issue?

@rakeshr10
Author

rakeshr10 commented Mar 3, 2025

This is the FASTA file, with a .txt extension. It has multiple pairs of sequences, which can be submitted as a batch sequence-search job to MMseqs2 using the colabfold_search command to do MSA generation for all pairs simultaneously.

It would be good if all of the a3m files generated for all pairs by the colabfold_search command could be converted into protenix-compatible a3m and JSON files using the colabfold_msa.py script.

interactors.txt

@rakeshr10
Author

@JinyuanSun Were you able to look into this issue and the request for the script to handle multiple a3m files?

@JinyuanSun


The easiest approach would be to split the file into individual FASTA sequence files and then use a for loop in the shell to process them, thanks.
Here is a script for you:

#!/bin/bash
# Split a multi-sequence FASTA into one file per sequence and run colabfold_msa.py on each.

input_fasta="input.fasta"
db_path="<path/to/colabfold_db>"
mmseqs_path="<path/to/mmseqs>"
output_dir="dimer_colabfold_msa"

mkdir -p split_fasta
mkdir -p "$output_dir"

# start a new split_fasta/seqN.fasta at every ">" header line
awk '/^>/{f="split_fasta/seq"++i".fasta"} {print > f}' "$input_fasta"

for file in split_fasta/*.fasta; do
    echo "Processing $file with colabfold_msa.py..."
    python3 scripts/colabfold_msa.py "$file" "$db_path" "$output_dir" \
        --db1 uniref30_2103_db \
        --db3 colabfold_envdb_202108_db \
        --mmseqs_path "$mmseqs_path"
done
