
Code for the NEPdb database

Query and filter literature using query_pub.py and filter_pub.py
Verify the sequence and site information using match_sequencing.py

Note: Use the tran4_id_sequeence.py script if the transcript names in the collected data come from different databases.
Use add_wtpep_NA.py to add the short wild-type peptide sequence (wild_peptide) and the antigen peptide length when they are not provided. Other information that cannot be inferred defaults to NA.
Unify data format with unifydata.R
Check sequences, antigens, and location information using check_align.py
Screen neoantigen-related literature using SVM-based scripts under svm/

We downloaded COSMIC mutation information for genes that are important in cancer, then screened and sorted it. The data came from 8,767 samples covering 69 primary histologies (cosmic_info_extract.py). We obtained 14,191 non-synonymous mutation sites from 683 genes, each appearing at least 3 times across all tumor samples, which yielded 516,036 short peptides of length 8, 9, 10, or 11 (cos_snp_pep_extract.py).
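
As an illustration of the peptide step, the sketch below enumerates every 8- to 11-residue window that covers a single substituted residue. It is a minimal example and not the code in cos_snp_pep_extract.py: the function name `mutant_peptides`, its arguments, and the sliding-window approach are assumptions made for illustration.

```python
def mutant_peptides(protein, pos, alt, lengths=(8, 9, 10, 11)):
    """Return all peptides of the given lengths that contain the mutated residue.

    protein -- wild-type protein sequence (hypothetical input)
    pos     -- 1-based position of the substituted residue
    alt     -- mutant amino acid (single letter)
    """
    if not 1 <= pos <= len(protein):
        raise ValueError("position outside the protein sequence")
    idx = pos - 1
    # Apply the substitution to get the mutant protein.
    mutant = protein[:idx] + alt + protein[idx + 1:]
    peptides = []
    for k in lengths:
        # A window of length k covers idx when start <= idx <= start + k - 1.
        first = max(0, idx - k + 1)
        last = min(idx, len(mutant) - k)
        for start in range(first, last + 1):
            peptides.append(mutant[start:start + k])
    return peptides
```

Sites near either end of a protein produce fewer windows, so the number of peptides per site is not constant.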
The programs for running netMHCpan4.0 and HLAthena are in their respective folders.
Extract the protein sequences from the Homo_sapiens.GRCh38.pep.all.fa file based on the transcript ID (cos_pro_fasta.py). The Ensembl reference was downloaded from: ftp://ftp.ensembl.org/pub/release-93/fasta/homo_sapiens/pep/Homo_sapiens.GRCh38.pep.all.fa.gz
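
The lookup by transcript ID can be sketched as follows. This is a minimal example and not the code in cos_pro_fasta.py. It assumes each FASTA header carries a `transcript:ENST...` field, as Ensembl peptide files do, and that version suffixes such as `.2` can be ignored when matching. The function name `proteins_by_transcript` is hypothetical.

```python
def proteins_by_transcript(lines, wanted_ids):
    """Map each requested transcript ID to its protein sequence.

    lines      -- iterable of FASTA lines (for example, an open text file)
    wanted_ids -- transcript IDs to extract; version suffixes are ignored
    """
    wanted = {t.split(".")[0] for t in wanted_ids}
    found = {}
    current = None  # transcript ID of the record being read, if wanted
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = None
            # Assumed header layout: space-separated fields, one of which
            # is "transcript:<ID>.<version>".
            for field in line.split():
                if field.startswith("transcript:"):
                    tid = field[len("transcript:"):].split(".")[0]
                    if tid in wanted:
                        current = tid
                        found[current] = []
                    break
        elif current is not None:
            found[current].append(line)
    return {tid: "".join(parts) for tid, parts in found.items()}
```

The downloaded file is gzip-compressed, so it can be read without unpacking by passing the handle from `gzip.open(path, "rt")` as `lines`.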
