Database Migration From YeastMine to AllianceMine

Ngoc Kim Ngan Tran edited this page Oct 16, 2024 · 11 revisions

Table of Contents

  1. Background
  2. Problems
  3. Investigation
  4. Proposed Solutions
  5. Appendix

Background

GRNsight includes four primary databases:

  1. Expression database
  2. Grnsettings database
  3. Network database
  4. Protein-protein database

This document focuses on the network database and protein-protein database. Both databases contain:

  • A gene table
  • A source table
  • A network table (referred to as physical interactions in the protein-protein database)

Additionally, the protein-protein database includes a protein table. All scripts for downloading and uploading data to these databases are located in the database folder.

Problems

  1. Why do we need to migrate?

  2. What other issues should we address while migrating?

    1. #1106 ERT1 is missing from the network database.

      image

      When generating the GRN network, querying ERT1 raised the error message "Not found in the database." This issue can also affect other genes.

Investigation

#1106 Missing genes in GRN network database

Our short-term solution was to create a union gene table between PPI and GRN, but now we want to investigate further why our query is missing ERT1 and possibly other genes.

The original query approach retrieved a list of regulators and then queried the targets of each regulator. However, ERT1 did not appear in that list of regulators, so its targets were never requested. When the full list of regulations was queried from AllianceMine instead, ~15 connections showed ERT1 as a regulator (relevant data in Box).
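The gap described above can be illustrated with a minimal sketch. The data here is entirely hypothetical (only the name ERT1 comes from this page), and the two functions are stand-ins for the real queries, not the actual scripts in the database folder. It shows how a regulator that is absent from the regulator list silently drops all of its edges, while a flat list of regulations keeps them.

```python
# Hypothetical toy data: (regulator, target) pairs standing in for the
# regulations that a single request would return. Gene names other than
# ERT1 are made up for illustration.
REGULATIONS = [
    ("ERT1", "GENE_A"),
    ("ERT1", "GENE_B"),
    ("REG_X", "GENE_C"),
    ("REG_Y", "GENE_A"),
]

# Hypothetical regulator list that, like the one described above, omits ERT1.
REGULATOR_LIST = ["REG_X", "REG_Y"]


def edges_by_regulator(regulator_list, regulations):
    """Regulator-first approach: look up targets one regulator at a time."""
    edges = set()
    for regulator in regulator_list:
        for source, target in regulations:
            if source == regulator:
                edges.add((source, target))
    return edges


def edges_from_flat_list(regulations):
    """Single-request approach: take every regulation as returned."""
    return set(regulations)


per_regulator = edges_by_regulator(REGULATOR_LIST, REGULATIONS)
flat = edges_from_flat_list(REGULATIONS)

# Edges that the regulator-first approach never sees.
missing = flat - per_regulator
missing_regulators = sorted({regulator for regulator, _ in missing})
print(missing_regulators)
```

In this toy setup, every edge in `missing` traces back to a regulator that was not in `REGULATOR_LIST`, which is the pattern reported for ERT1.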

We therefore examined the networks, genes, and proteins tables thoroughly to see whether any data is missing.

Networks

Server Database vs AllianceMine for GRN

A comparison between the server network database and the AllianceMine network:

networks regulators target genes

If you would like more details, please refer to this link.

Genes

AllianceMine provides a ready-made list of all yeast genes, so the query does not need to be built from scratch in the query builder. However, when I recreated the same query using the query builder, it returned more genes than the provided yeast genes list.

Screenshot 2024-10-14 at 8 39 11 PM

Currently, the gene table on the server is the union of the PPI and GRN gene tables. There are a total of 7016 genes on the server, 7173 genes in the yeast genes list, and 7337 genes in the gene query built with the query builder. After investigation, all genes present in the server database are also found in the query-builder gene query, which additionally returns 321 genes that are not on the server (7337 - 7016 = 321). Among these additional genes, 150 do not have associated display names. Of the 7016 genes present in the server database, 1639 have differing display names in the query builder, and of those 1639, 1618 lack display names entirely.
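A comparison of this kind can be expressed with plain set operations. The sketch below is an assumption about how such a check might look, not the script that produced the numbers above: the function name, the dictionary layout (gene ID mapped to display name, `None` when absent), and the toy gene IDs are all hypothetical.

```python
def compare_gene_tables(server, builder):
    """Compare two {gene_id: display_name} mappings.

    A display name of None (or an empty string) means the gene has no
    display name in that source.
    """
    server_ids = set(server)
    builder_ids = set(builder)

    extra = builder_ids - server_ids   # in query builder only
    shared = server_ids & builder_ids  # present in both
    differing = {g for g in shared if server[g] != builder[g]}

    return {
        "missing_from_builder": server_ids - builder_ids,
        "extra_in_builder": extra,
        "extra_without_name": {g for g in extra if not builder[g]},
        "differing_names": differing,
        "differing_and_unnamed": {g for g in differing if not builder[g]},
    }


# Hypothetical toy data, not real gene records.
server_genes = {"G1": "ABC1", "G2": "DEF2", "G3": "GHI3"}
builder_genes = {"G1": "ABC1", "G2": None, "G3": "XYZ3", "G4": None, "G5": "JKL5"}

report = compare_gene_tables(server_genes, builder_genes)
for key, genes in report.items():
    print(key, sorted(genes))
```

An empty `missing_from_builder` set corresponds to the finding that every server gene also appears in the query-builder result.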

Question: Should we add a source column to the gene table? If we add all the genes to the gene table, some genes may have no connections, and it then becomes harder to tell which source a gene comes from.

  • [KD] I'm not sure what is being asked, but I'm looking at the list of 321 genes from the query builder.
    • Anything that has a systematic name that begins with "YSC", for example "YSC0058" represents a gene that is not in the S288c strain of yeast that the genome is based on. We don't need to have those in our database.
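The rule in the comment above can be sketched as a simple prefix filter. The helper name `is_s288c_gene` and the sample list are hypothetical, and the case-insensitive match is an assumption rather than something stated in the comment.

```python
def is_s288c_gene(systematic_name):
    """Return False for systematic names that begin with "YSC".

    Assumption: the prefix check is case-insensitive.
    """
    return not systematic_name.upper().startswith("YSC")


# Illustrative input only; "YSC0058" is the example given in the comment.
names = ["YSC0058", "YAL001C", "YSC0012", "YBR239C"]
kept = [name for name in names if is_s288c_gene(name)]
print(kept)
```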

Protein-Protein Interactions and Proteins

This is how the PPI data is currently generated:

ppi

I created a single query that retrieves all the PPI, and here is the comparison:

ppi interactions

Although some interactions on the server are missing from AllianceMine, significantly more AllianceMine interactions are missing from the server. It is also reasonable for AllianceMine to lack some of the interactions on the server, because the two data sets were retrieved at different times.

I also investigated the proteins. I built a query in the query builder to retrieve all genes that have proteins.

proteins in interactions total proteins

From the Venn diagram, we can see that querying all PPI at once captured all the proteins in PPI, just missing some interactions. We also see that the server is missing several proteins.
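The Venn-diagram comparison can be reproduced with sets. This is a minimal sketch under two assumptions that are not stated on this page: that a physical interaction is treated as an unordered pair (so A-B and B-A count once), and that each interaction is identified only by its two protein names. All function names and the toy data are hypothetical.

```python
def normalize(pairs):
    """Treat each interaction as an unordered pair of protein names."""
    return {frozenset(pair) for pair in pairs}


def venn_counts(server_pairs, alliance_pairs):
    """Count interactions unique to each source and shared by both."""
    server = normalize(server_pairs)
    alliance = normalize(alliance_pairs)
    return {
        "server_only": len(server - alliance),
        "both": len(server & alliance),
        "alliancemine_only": len(alliance - server),
    }


def proteins_in(pairs):
    """Collect every protein that takes part in at least one interaction."""
    return {protein for pair in pairs for protein in pair}


# Hypothetical toy data; P4-P4 stands for a self-interaction.
server_pairs = [("P1", "P2"), ("P2", "P3")]
alliance_pairs = [("P2", "P1"), ("P3", "P4"), ("P4", "P4")]

counts = venn_counts(server_pairs, alliance_pairs)
proteins_missing_on_server = proteins_in(alliance_pairs) - proteins_in(server_pairs)
print(counts, sorted(proteins_missing_on_server))
```

If direction matters for the stored interactions, the `frozenset` normalization should be replaced with plain tuples.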

Proposed Solutions

GRN

Solution 1: Instead of querying only the regulators, query all genes, then follow the original procedure.

solution 2

Pros:
✅ Captures all networks.

Cons:
❌ Requires many queries.

However, on closer inspection, querying the networks this way is equivalent to querying all the regulations with filtering.

Solution 2 [Preferred]: Query the network by requesting the list of regulations in one request.

solution 1

The AllianceMine network provides a comprehensive list of all networks. A network can still be missing from one result set and present in another when the queries are run at different times.

Pros: Cons:
✅ Query all regulatory networks in one request.
✅ Capture all genes.
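As a rough illustration of why a single list of regulations also captures all genes, the sketch below derives both a gene list and a network edge list from one flat result. The function name `build_tables`, the row layout, and the gene names other than ERT1 are hypothetical, and the sketch does not call AllianceMine.

```python
def build_tables(regulations):
    """Derive gene rows and network rows from one flat list of regulations.

    Every gene that appears as a regulator or as a target ends up in the
    gene list, and duplicate edges are collapsed.
    """
    genes = set()
    network = set()
    for regulator, target in regulations:
        genes.add(regulator)
        genes.add(target)
        network.add((regulator, target))
    return sorted(genes), sorted(network)


# Hypothetical rows, including one duplicate edge.
rows = [("ERT1", "GENE_A"), ("REG_X", "ERT1"), ("ERT1", "GENE_A")]
gene_table, network_table = build_tables(rows)
print(gene_table)
print(network_table)
```

Because the gene list is built from both columns, a gene such as ERT1 is included whether it appears as a regulator, as a target, or as both.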

PPI

Solution: Query all genes and proteins in one query, and query all the interactions in a second query.

ppi new query

Pros: Cons:
✅ Significantly reduces the number of queries to the service.
✅ Captures all proteins.
✅ Captures all interactions.

Appendix

GRN Network AllianceMine Query Builder

image

Regulators Query

image

Query Builder for List of S. cerevisiae with taxon ID of 559292

Screenshot 2024-10-14 at 8 52 06 PM

Query Builder for PPI

Screenshot 2024-10-15 at 1 52 36 PM

Query Builder for Proteins

Uploading Screenshot 2024-10-15 at 2.11.23 PM.png…
