Database Migration From YeastMine to AllianceMine
GRNsight includes four primary databases:
- Expression database
- Grnsettings database
- Network database
- Protein-protein database
This document focuses on the network database and protein-protein database. Both databases contain:
- A gene table
- A source table
- A network table (referred to as physical interactions in the protein-protein database)
Additionally, the protein-protein database includes a protein table. All scripts for downloading and uploading data to these databases are located in the database folder.
Why do we need to migrate?
- YeastMine is no longer supported, and we need to transition to AllianceMine.
What other issues should we address while migrating?
#1106 ERT1 is missing from the network database.
When generating the GRN network, querying ERT1 raised the error message "Not found in the database." This issue may also affect other genes.
#1106 Missing genes in GRN network database
Our short-term solution was to make the gene table the union of the PPI and GRN gene tables, but now we want to investigate further why our query misses ERT1 and possibly other genes.
The original query approach retrieved a list of regulators and then queried the targets of each regulator. However, ERT1 was not in the list of regulators returned by that process. When querying the list of regulations from AllianceMine instead, ~15 connections showed ERT1 as a regulator. The relevant data is in Box.
We will thoroughly investigate the network, gene, and protein tables to see whether any data is missing.
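The difference between the two query strategies can be illustrated with a small sketch. The rows, gene names other than ERT1, and function names below are made up for illustration and are not AllianceMine data or the project's actual scripts; the point is only that an edge is lost whenever its regulator is absent from the regulator list.

```python
# Toy (regulator, target) rows; placeholders, not real AllianceMine results.
regulations = [
    ("ERT1", "TARGET_A"),
    ("ERT1", "TARGET_B"),
    ("REG_X", "TARGET_C"),
]


def edges_via_regulator_list(regulator_list, regulations):
    """Regulator-first approach: keep only edges whose regulator is in the list."""
    return {(reg, tgt) for reg, tgt in regulations if reg in regulator_list}


def edges_via_regulation_list(regulations):
    """Regulation-first approach: take every edge directly."""
    return set(regulations)


# Assume the regulator list came back without ERT1.
regulator_list = {"REG_X"}

partial = edges_via_regulator_list(regulator_list, regulations)
full = edges_via_regulation_list(regulations)

# Regulators that appear only when the regulations are queried directly.
missing_regulators = {reg for reg, _ in full - partial}
print(sorted(missing_regulators))
```

Running the same kind of difference against the real tables would list every regulator, not just ERT1, that the regulator-first approach drops.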
A comparison between the server network database and the AllianceMine network:
If you would like more details, please refer to this link.
AllianceMine provides a ready-made list of all yeast genes, so there is no need to build that query from scratch. However, when I built the same query myself in the query builder, it returned more genes than the provided yeast genes list.
Currently, the gene table on the server is the union of the PPI and GRN gene tables. There are a total of 7016 genes on the server, 7173 genes in the yeast genes list, and 7337 genes in the query-builder result. After investigation, all genes present in the server database are also found in the query-builder result, which therefore contains an extra set of 321 genes (7337 - 7016). Among these additional genes, 150 do not have associated display names. Of the 7016 genes present in the server database, 1639 have a different display name in the query-builder result, and of those 1639, 1618 lack a display name entirely.
Questions: Should we add a source column to the gene table? If we add all the genes to the gene table, some genes may have no connections, which makes it harder to tell which source a gene came from.
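A comparison like the one above can be reproduced with plain set operations. The sketch below assumes each gene table has been loaded into a dictionary mapping systematic name to display name (`None` when there is no display name); the function name and the toy entries are hypothetical.

```python
def compare_gene_tables(server, mine):
    """Compare two {systematic_name: display_name} mappings."""
    shared = server.keys() & mine.keys()
    differs = {g for g in shared if server[g] != mine[g]}
    return {
        "only_in_server": server.keys() - mine.keys(),
        "only_in_mine": mine.keys() - server.keys(),
        "display_name_differs": differs,
        # Subset of the differing genes that have no display name in the mine.
        "display_name_missing_in_mine": {g for g in differs if not mine[g]},
    }


# Toy data with placeholder identifiers, not real genes.
server_genes = {"GENE_A": "AAA1", "GENE_B": "BBB1", "GENE_C": "CCC1"}
mine_genes = {"GENE_A": "AAA1", "GENE_B": None, "GENE_C": "CCC2", "GENE_D": None}

report = compare_gene_tables(server_genes, mine_genes)
for label, genes in report.items():
    print(label, sorted(genes))
```

With the real tables, `only_in_mine` would correspond to the 321 extra genes and `display_name_differs` to the 1639 genes discussed above.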
- [KD] I'm not sure what is being asked, but I'm looking at the list of 321 genes from the query builder.
- Anything with a systematic name that begins with "YSC" (for example, "YSC0058") represents a gene that is not in the S288c strain of yeast that the genome is based on. We don't need to have those in our database.
This is how the PPI data is currently generated:
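The rule in the comment above amounts to a simple prefix filter. This is a minimal sketch, assuming the systematic names are available as a list of uppercase strings; the function name is hypothetical.

```python
def drop_non_s288c(systematic_names):
    """Remove genes whose systematic name starts with "YSC" (not in S288c)."""
    return [name for name in systematic_names if not name.startswith("YSC")]


names = ["YSC0058", "YAL001C", "YSC0001"]
kept = drop_non_s288c(names)
print(kept)
```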
I created a query for all the PPI in one query, and here is the comparison:
We can see that although AllianceMine is missing some interactions that the server has, the server is missing significantly more interactions that AllianceMine has. It is also reasonable for AllianceMine to lack some interactions found on the server, because the two data sets were retrieved at different times.
I also investigated the proteins. I used the query builder to query all genes that have proteins.
From the Venn diagram, we can see that querying all PPI at once captured all the proteins in PPI, just missing some interactions. We also see that the server is missing several proteins.
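The Venn-diagram counts can be computed directly from the two interaction lists. The sketch below assumes each interaction is a pair of protein identifiers and treats A-B and B-A as the same interaction, which may not match how the physical interactions table actually stores them; the identifiers are placeholders.

```python
def normalize(pairs):
    """Treat each interaction as an unordered pair so (A, B) equals (B, A)."""
    return {frozenset(pair) for pair in pairs}


def venn_counts(server_pairs, mine_pairs):
    """Count interactions found only on the server, in both, and only in the mine."""
    server, mine = normalize(server_pairs), normalize(mine_pairs)
    return {
        "server_only": len(server - mine),
        "both": len(server & mine),
        "mine_only": len(mine - server),
    }


# Toy interactions with placeholder protein names.
server_pairs = [("P1", "P2"), ("P3", "P4")]
mine_pairs = [("P2", "P1"), ("P5", "P6"), ("P1", "P1")]

counts = venn_counts(server_pairs, mine_pairs)
print(counts)
```

The same function works for the protein comparison if single identifiers are wrapped as one-element tuples.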
Solution 1: Instead of querying regulators, query all genes, then follow the original procedure.
Pros: | Cons: |
---|---|
✅ Capture all networks | ❌ Requires many queries. |
However, when I looked further into how the networks are queried, this turned out to be equivalent to querying all the regulations and then filtering.
Solution 2 [Preferred]: Query the network by requesting the list of regulations in one request.
AllianceMine provides a comprehensive list of all regulations. A given connection can still be absent from one result and present in another when the queries are run at different times.
Pros: | Cons: |
---|---|
✅ Query all regulatory networks in one request. | |
✅ Capture all genes. | |
Solution: Query all genes and proteins in one query, and query all the interactions in one query.
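As a rough illustration of Solution 2, both the gene table and the network table can be derived from a single list of regulation rows, so every gene that appears in a connection also gets a gene row. The row layout and field names (`regulator_id`, `target_id`, and so on) are assumptions made for this sketch, not the actual AllianceMine column names.

```python
def build_tables(rows):
    """Build gene and network tables from one list of regulation rows."""
    genes = {}       # systematic name -> display name
    network = set()  # unique (regulator, target) pairs
    for row in rows:
        genes[row["regulator_id"]] = row["regulator_name"]
        genes[row["target_id"]] = row["target_name"]
        network.add((row["regulator_id"], row["target_id"]))
    return genes, sorted(network)


# Toy rows with placeholder identifiers; the second row is a duplicate.
rows = [
    {"regulator_id": "G1", "regulator_name": "AAA1",
     "target_id": "G2", "target_name": "BBB1"},
    {"regulator_id": "G1", "regulator_name": "AAA1",
     "target_id": "G2", "target_name": "BBB1"},
    {"regulator_id": "G2", "regulator_name": "BBB1",
     "target_id": "G3", "target_name": None},
]

genes, network = build_tables(rows)
print(genes)
print(network)
```

Because regulators and targets both feed the gene table, a regulator such as ERT1 cannot be missing from it as long as it appears in at least one regulation row.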
Pros: | Cons: |
---|---|
✅ Significantly reduce the number of queries to the service. | |
✅ Capture all proteins. | |
✅ Capture all interactions. | |