User tests of SIF export: problems with demo file 1 and 2 #301

Closed
kdahlquist opened this issue Aug 1, 2016 · 8 comments

@kdahlquist
Collaborator

kdahlquist commented Aug 1, 2016

I am breaking this off from #287 because the issues are likely to be different depending on the file type.

I tested the SIF export with each of the demo files and with a sample unweighted and weighted network from @GraceJohnson that I had left over from last semester.

When I open the exported SIF files in Notepad, there are no line breaks. The files open in Wordpad, Word, and Excel with line breaks, so there might be some issue with Notepad itself. I don't know if there is something we want to do about this, but I wanted to mention it in case it is an issue.

I then attempted to open them in both BioTapestry and Cytoscape.

As for BioTapestry, they all seemed to open OK (as we expected, since we are not conforming to BioTapestry's convention for pos, neg relationship types, they all opened as "unweighted" networks).

However, there were hiccups with Cytoscape.

  • Cytoscape was not able to parse demo file 1 and demo file 2 properly. I think I know what is happening.
    • It looks like when you are exporting, you export the nodes that do not have any targets.
    • For demos 3 and 4, these nodes are in the middle of the list and do not affect the parsing.
    • For demos 1 and 2, the first node in the list, "ACE2" does not have any targets. When I delete this line and try opening in Cytoscape, it parses just fine, making me think that this is what the problem is.
    • I note that when you export a SIF file from Cytoscape, it does not output any nodes in the list that don't have any targets. I think that we should probably follow this practice as well.
    • By Cytoscape "not being able to parse" the file, I mean that it creates some unconnected nodes. The effect of the missing line breaks then becomes apparent, because gene labels that should be separate are smushed together.

Minor observation/feature request (now broken out as issue #308):

  • We might want to rethink what the default name for the demo files is when you export them. Currently, the name has the #, (, ), comma, and space characters in it because that is what we call the demos in the menu. For people who want to pipe files to different programs, this is going to make them cringe (as it did me). Any chance we can customize this so that the actual file names that the demos have in our test-files directory are used instead (these only have hyphens and underscores)?
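
If storing the original demo filename turns out to be awkward, one fallback would be to sanitize the menu label before using it as a filename. The following is a minimal sketch of that idea, not GRNsight's code: the function name `safe_export_name`, the fallback stem `network`, and the sample label are all hypothetical.

```python
import re


def safe_export_name(label: str, extension: str = "sif") -> str:
    """Build a filename that contains only letters, digits, hyphens, and underscores.

    `label` is assumed to be a display label such as a demo menu entry.
    """
    # Collapse every run of disallowed characters into a single hyphen,
    # then trim hyphens left over at either end.
    stem = re.sub(r"[^A-Za-z0-9_-]+", "-", label).strip("-")
    # Fall back to a generic stem if nothing usable remains (hypothetical choice).
    return f"{stem or 'network'}.{extension}"
```

For a made-up label, `safe_export_name("Demo #1: Unweighted GRN (21 genes, 50 edges)")` returns `Demo-1-Unweighted-GRN-21-genes-50-edges.sif`.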
@dondi
Owner

dondi commented Aug 1, 2016

Yes, Notepad is known to handle only one type of line break. There are three in use, and because some tool chains adjust line breaks while others don't, the tendency these days is to update the tools themselves to recognize all three kinds. Wordpad, Notepad++, Sublime Text, and Atom have been updated, but Notepad has not. The export code uses the \n variety, which is the most platform-agnostic line break. Windows uses \r\n natively (which is likely what Notepad is expecting); \r is the classic Mac OS variety (OS X itself uses \n).
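
To check which of the three line-break styles a given exported file actually contains, something like the following sketch could be used. It is only an illustration: the helper names `detect_line_endings` and `normalize_line_endings` are made up here and are not part of GRNsight.

```python
def detect_line_endings(text: str) -> dict:
    """Count each line-break style in `text`."""
    crlf = text.count("\r\n")
    # Every \r\n pair also contains one \r and one \n, so subtract the pairs.
    cr = text.count("\r") - crlf
    lf = text.count("\n") - crlf
    return {"crlf": crlf, "cr": cr, "lf": lf}


def normalize_line_endings(text: str, newline: str = "\n") -> str:
    """Rewrite all three line-break styles to `newline`."""
    # Convert \r\n first so its \r is not treated as a lone line break.
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    return unified if newline == "\n" else unified.replace("\n", newline)
```

When reading a file for this purpose in Python, open it with `open(path, newline="")` so that the line endings are not translated before they can be counted.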

OK, we can omit target-less nodes for SIF export.

As for filenames, the demos are actually the special case because we inject code that changes the label at the top to something different. The current code just derives the filename unconditionally from that label. I’ll look into finding a way to store both. Regularly-imported files should preserve the filename.

@kdahlquist
Collaborator Author

kdahlquist commented Aug 1, 2016

I agree that regularly-imported files should preserve the filename. Maybe we should allow the filename that shows for the demo files just be their regular filename, too. We would not need to change the descriptions in the Demo menu. I don't know if that makes it more or less confusing. Broken out as issue #308.

@kdahlquist changed the title from "User tests of SIF export" to "User tests of SIF export: problems with demo file 1 and 2" on Aug 2, 2016
@kdahlquist
Collaborator Author

I was able to verify this bug, which might be a bug in both GRNsight and Cytoscape. For clarity, I will describe again what happens.

First issue.

  • In GRNsight, load demo file 1, 21-genes_50-edges_Dahlquist-data_input.xlsx and export it to SIF.
  • Import this SIF file into Cytoscape.
    • Cytoscape is unable to parse the edges between nodes and the node labels are messed up.
  • To fix this, I opened the SIF file in Excel and deleted all of the nodes that had no targets.
    • This file now is imported properly into Cytoscape with all correct nodes and edge connections.
    • It turns out that I can also fix this problem by simply opening the SIF file in Excel and saving it as tab-delimited text without making other changes to it.

So, the conclusion is that having the first gene lack a target, combined with some other formatting that GRNsight does, causes the import to fail in Cytoscape. It could be that Excel's line break format fixes this. I suspect so because of the way the labels on the disconnected nodes in Cytoscape are smushed together.

Second issue with GRNsight.

  • However, the file where I deleted the nodes with no targets is no longer properly read by GRNsight (although it can be read correctly by Cytoscape).
    • The nodes that were deleted are missing, as are their edge connections, resulting in a 15-node, 39-edge graph instead of the 21-node, 50-edge graph.

So, I go back on what I said previously. It's not an issue with targetless nodes per se, but somehow the combination with the newline break (I think).

GRNsight needs to be able to handle SIF files whether or not targetless nodes are listed on lines of their own, as Cytoscape can.

For ease of reporting, I only tested this with the unweighted network for the moment.

I'm attaching the relevant test files.
GRNsight-to-SIF.zip

As we discussed at the meeting today, @dondi can wait on pursuing these fixes until @kdahlquist has completed more testing.

@dondi
Owner

dondi commented Aug 4, 2016

I investigated this bug today and, based on my tests, it looks like the issue was not the line breaks but the expectation that every line has tabs for the relationship and target gene columns, even for genes without targets, where there is nothing between the tabs.

I reached the conclusion this way:

  • First I visually inspected the files included above. They exhibited the difference in line endings, but I also noticed the extra tabs even if there was no data in between them.
  • I then temporarily changed the SIF export routine to use \r\n-style line endings. I exported a new file with this, then imported it into Cytoscape. That file still did not import correctly.
  • I changed the SIF export routine to always include the tabs for the relationship and target gene columns, even if a gene has no targets. I also reverted the line ending to \n. This file did import correctly into Cytoscape.

Given this, I finalized the code to export SIFs in this way, and also updated the unit tests. Further, it turned out that changing the export in this way was easier to code if we went strictly to binary lines (i.e., only one targeted gene per line, with genes repeating over multiple lines if they have multiple targets). Thus, the change dovetails nicely with the finalized spec for SIF export in #309.
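
The export format described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the actual GRNsight routine: the function name `export_sif`, the `pd` relationship label, the exact shape of a targetless line (gene name followed by two tabs), and the sample genes other than ACE2 are all assumptions.

```python
def export_sif(genes, edges, relationship="pd"):
    """Return SIF text with one target per line and \\n line endings.

    genes: ordered list of gene names.
    edges: list of (source, target) pairs.
    relationship: edge label; "pd" is an assumed placeholder.
    """
    targets_by_source = {}
    for source, target in edges:
        targets_by_source.setdefault(source, []).append(target)

    lines = []
    for gene in genes:
        targets = targets_by_source.get(gene, [])
        if targets:
            # Binary lines: repeat the source gene once per target.
            lines.extend(f"{gene}\t{relationship}\t{t}" for t in targets)
        else:
            # Targetless gene: keep both tabs even though the columns are empty.
            lines.append(f"{gene}\t\t")
    return "\n".join(lines)
```

With `genes = ["ACE2", "ASH1", "CIN5"]` and `edges = [("ASH1", "ACE2"), ("ASH1", "CIN5")]`, the first line of the output is `ACE2` followed by two tabs, and ASH1 appears on two separate lines, one per target.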

This export change has been uploaded to the beta v1.15 site. Let me know if you see the same results that I do.

@kdahlquist
Collaborator Author

Fresh exports of my GRNsight-to-SIF test files are now read correctly in GRNsight, Cytoscape, and BioTapestry (including Demos 1 and 2, which were not read correctly in Cytoscape before).

However, the second half of the bug has not been resolved.

Right now, if a gene has no target, it is required by GRNsight to be in the "source" column (even if there are now tabs). If a targetless gene only appears in the "target" column, the graph is not parsed correctly and the node for the targetless gene is missing, as are all the edges to other nodes that do exist.

I've attached two test files. They both should return the same graph in GRNsight, but currently they do not (YOX1 and 2 edges are missing when I remove it from the source column because it has no targets.)

Both of these files are read and return identical graphs in both Cytoscape and BioTapestry.
7-genes_10-edges_test_GRNsight-to-SIF.zip

@dondi
Owner

dondi commented Aug 4, 2016

Ah OK, I understand that better now. Yes, I can see from the code that this will happen. I'll factor this into the revision of SIF import.

@dondi
Owner

dondi commented Aug 5, 2016

Support for target-less genes that are mentioned only in the edges has been implemented, and is available in the current beta v1.15. I tested this with the 7/10 sample files above and got 7 genes and 10 edges for both files.
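
The import behavior described here, where a gene counts as a node whether it appears in the source column or only in a target column, can be sketched like this. It is a simplified illustration, not the GRNsight importer: `parse_sif` is a hypothetical name, the input is assumed to be tab-delimited, and the two-line sample data is made up apart from the gene names.

```python
def parse_sif(text):
    """Parse tab-delimited SIF text into (nodes, edges).

    nodes: gene names in order of first appearance.
    edges: list of (source, relationship, target) tuples.
    """
    nodes, edges, seen = [], [], set()

    def add_node(name):
        if name and name not in seen:
            seen.add(name)
            nodes.append(name)

    # splitlines() accepts \n, \r\n, and \r line endings alike.
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = [field.strip() for field in line.split("\t")]
        source = fields[0]
        relationship = fields[1] if len(fields) > 1 else ""
        add_node(source)
        # Register every non-empty target as a node, even if it never
        # appears in the source column.
        for target in fields[2:]:
            if target:
                add_node(target)
                edges.append((source, relationship, target))
    return nodes, edges
```

Under these assumptions, `"ACE2\tpd\tYOX1\nYOX1\t\t"` and `"ACE2\tpd\tYOX1"` parse to the same two nodes and one edge, which mirrors the expectation that both 7/10 sample files yield the same graph.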

@kdahlquist
Collaborator Author

Confirmed that this is fixed and closing.
