Commit 9aeb894

sbooeshaghi and detrout authored
seqspec enhancements (addition of library specification + read specification) (#33)
* added sequence spec to assay object. added read object to sequence spec object
* renamed assay_spec to library_spec, first implementation of seqspec index with read structure
* added changelog
* updated changelog
* updated seqspec image
* updated specification document and contribution document
* seqspec index defaults to indexing reads, select --region to index region. added more checks for seqspec check on reads. added read builder on website
* added assay builder to site
* added a lib protocol/kit and seq protocol/kit to assay. modified assay to be assay_id. changed read_name to name. changed publication_date to date. added sequencer specific read profile examples. added region examples. built read/region/assay web pages to display examples, along with style sheets and relevant javascript code to load examples. updated python classes and seqspec schema with relevant changes. added verification for assay date attribute.
* cleaning up repo, moving dev python notebook to correct location
* added barcode
* cleaned up reads, fixed tag issue with onlist in regions
* added truseq single index and novaseq
* Created using Colaboratory
* Created using Colaboratory
* seqspec print reads
* added get_seqspec by modality, renamed get_modality to get_libspec
* change lengths of random and onlist sequences
* update format for specs
* fixed min/max bug in seqspec check
* fixed element seqspec
* fixed element seqspec
* fixed element seqspec
* seqspec print joint libspec and seqspec
* Created using Colaboratory
* check that the primer id is in one of the atomic regions
* added libseq to print, i.e. jointly printing a sequence and library spec.
* added multiple checks to seqspec check, added libseq format for seqspec print
* fixed truseq naming convention, added libseq print
* fixed seqspec index and seqspec onlist to use the RegionCoordinate class
* remote onlist download with the kallisto multiple lists onlist format (#31)
* Support older versions of matplotlib
  The spines[["top", "bottom"...]] structure is a relatively recent update; this allows working with older versions of matplotlib.
* Get the test of seqspec check working again.
  The refactoring of repositories to split out the example specification yaml files means we didn't have any local files to try validating. So I had to use the stub I had added for other tests, however it needed some updates to be compatible with the library spec version of the schema. Also I did some mocking to avoid needing to create test fastq and barcode files.
* Increase the number of Xs in the random region
  The validator now checks that the length of the sequence string is "X" * max_len characters.
* Update minimal Region tests and add minimal Read tests.
* Make some minimal tests for the seqspec print functions
* update print command to use the replacement assay_id attribute
  Previously it was assay.
* My test assay used custom_primer which didn't have a color. I randomly picked sea green.
* Implement downloading lists via urls
  Also, to work with barcode lists hosted by the DACC, transparently decompress gzip files. The old read_list function took a filename, but I changed it to take the onlist object so it would have access to the location attribute to know if it should be reading locally or remotely instead of just guessing if the filename string started with a scheme url.
* Only return the onlist filename if it is a local file
  Even if there's one list but it's remote we need to download it and put it into a local file.
* Add onlist argument to specify combined barcode list file format.
  Kallisto has a format where multiple barcode lists are in one file separated by whitespace. That's different from the more common cartesian product format where all the lists are crossed with each other. This adds the kallisto format as -f multi, and adds an argument for the current version -f product, but treats it as a default.
* Fix test for project_regions_to_coordinates
* Minimally test RegionCoordinate and project_regions_to_coordinates
* test run_onlist_region and run_onlist_read
  A new accessor function was added to get onlists for the new read objects in addition to the older by region type. I also added some type annotations to be more clear that join_onlists needs a list of Onlist objects to work. (Since we need the full information to know if we need to download files)
* added Diane's contributions to CHANGELOG and slightly modified validate_check_args so it doesn't return the errors object (which is errors related to the spec and not to the arguments supplied)
* fixed cli 'options' to list out options and print default. improved typing annotation for a function
* added -s seqtype to seqspec print to help disambiguate between printing sequence_spec objects, library_spec objects, or both
* -s is actually spectype not seqtype in seqspec print, modified the CLI so that it has the options 'sequence', 'library', 'libseq' (which is both sequence and library)
* fixed the naming convention for spectype library, sequence, libseq and format png, tree, html, info, sequence in the seqspec print command
* clarified some cli descriptions, added -s specobject to seqspec onlist for specifying where to search for the onlist
* Libspec local caching (#32)
* Search for local copies of barcode files first.
  This will look in the current directory on the current machine first. Then it will search the local machine for the file at the full path portion of the filename, as interpreted by urlparse. Finally, if the location is remote, it will assume that the file is accessible at that remote URL. Adding in this local first search is to make pipeline development easier since it allows other code to handle retrieving the barcode files for us.
* Read IGVF utils environment variables for remote authentication
  This could easily be fleshed out more with more places to look for authentication information.
* The fake raw attribute really should be Bytes to match real requests
  The new read_remote logic has opening the different kinds of streams from the decompression wrapper, which meant all the streams needed to be opened in the same binary mode.
* run_onlist_region needs the region_id not region_type
* Return the local path to an already downloaded remote barcode file
  Siddarth pointed out that my first implementation always copied the barcode over to onlist_joined.txt even if there was only one barcode file. This uses the local override to see if the barcode file is available, and if it is and there's only one barcode file, it can return just that barcode's filename and not go through the copying function.
* add to changelog

---------

Co-authored-by: Diane Trout <[email protected]>
Co-authored-by: Diane Trout <[email protected]>
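The local-first search order described in the "Libspec local caching" message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the commit: `resolve_onlist_location`, its parameters, and the `"remote"` location string are hypothetical names chosen for the example; only `urlparse` is named in the commit message.

```python
from pathlib import Path
from urllib.parse import urlparse


def resolve_onlist_location(filename: str, location: str, cwd: Path) -> str:
    """Return a local path when one exists, otherwise the remote URL.

    Hypothetical helper: the name and signature are illustrative only.
    """
    parsed = urlparse(filename)

    # 1. A file with the same base name in the working directory wins.
    candidate = cwd / Path(parsed.path).name
    if candidate.is_file():
        return str(candidate)

    # 2. Next, try the full path portion of the filename on this machine.
    full_path = Path(parsed.path)
    if full_path.is_file():
        return str(full_path)

    # 3. Otherwise, trust a remote location to be reachable at its URL.
    if location == "remote":
        return filename

    raise FileNotFoundError(filename)
```

Checking the working directory first is what lets a pipeline stage barcode files itself and have them picked up without any network access.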
1 parent 470f45a commit 9aeb894

60 files changed, +3205 -2243 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@

 Genomic library structure depends on both the assay and sequencer (and kit) used to generate and bind the assay-specific construct to the sequencing adapters to generate a sequencing library. Therefore, a `seqspec` is specific to both a genomics assay and sequencer.

-A list of `seqspec` examples for multiple assays and sequencers can be found on [this website](https://igvf.github.io/seqspec/). Each `spec.yaml` describes the 5'->3' "Final library structure" for the assay and sequencer. Sequence specification files can be formatted with the `seqspec` command line tool.
+A list of `seqspec` examples for multiple assays and sequencers can be found on [this website](https://igvf.github.io/seqspec/). Each `spec.yaml` describes the 5'->3' "Final library structure" for the assay and sequencer and can be extended to include sequencer-specific read annotations. Sequence specification files can be formatted with the `seqspec` command line tool.

 <img alt="image" src="/docs/seqspec.png">

docs/CHANGELOG.md

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
+# Changelog
+
+## [0.2.0] - 2024-02-XX
+
+### Changed
+
+- `seqspec index` uses the primer and max length of the supplied `Read`
+- `assay_spec` renamed to `library_spec`
+- Reorganize specification document
+- Move contribution guidelines from `SPECIFICATION.md` to `CONTRIBUTION.md`
+- Move example `Region`s from `SPECIFICATION.md` to `seqspec/docs/regions`
+- `seqspec index` defaults to indexing reads, `--region` indexes regions
+- Change descriptors of attributes `assay_id`, `doi`
+- `Assay` attribute `assay` changed to `assay_id`
+- `Read` attribute `read_name` changed to `name`
+- `Assay` attribute `publication_date` changed to `date`
+- `Assay` attribute `sequencer` changed to `sequence_protocol`
+- `Assay` function `get_modality` changed to `get_libspec`
+- `Region` function `update_attr` uses the `max_len` to generate `random` and `onlist` sequence lengths instead of `min_len`
+- `get_region_by_type` changed to `get_region_by_region_type` to disambiguate between `region_type` and `sequence_type`
+- `seqspec onlist` (by default) searches for onlists in the `Region`s intersected by the `Read` passed to `-r`.
+- Support older versions of matplotlib by handling the `spines[["top", "bottom"...]]` structure
+- Increase the number of Xs in the random region to match `max_len` for validation
+- Update `seqspec print` command to use the replacement `assay_id` attribute instead of `assay`
+- Implement downloading onlists via URLs and transparently decompress gzip files
+- Change `read_list` function to take the `onlist` object for handling local and remote files
+- Add `onlist` argument to specify the combined barcode list file format (kallisto's multiple-lists-in-one-file format, `-f multi`, and the default cartesian product format, `-f product`)
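The difference between the two combined onlist formats can be illustrated with a short sketch. The function names, the empty-string separator for the product format, and the `-` padding for ragged lists are assumptions made for this example; the commit only states that `multi` puts several lists in one whitespace-separated file and that `product` crosses all lists with each other.

```python
from itertools import product, zip_longest


def join_product(lists: list[list[str]]) -> list[str]:
    """Cross every list with every other list: one combined barcode per line.

    Assumption: combined barcodes are concatenated without a separator.
    """
    return ["".join(combo) for combo in product(*lists)]


def join_multi(lists: list[list[str]], pad: str = "-") -> list[str]:
    """Put the i-th barcode of each list on line i, separated by whitespace.

    Assumption: shorter lists are padded with `pad`; the commit does not
    say how lists of unequal length are handled.
    """
    return [" ".join(row) for row in zip_longest(*lists, fillvalue=pad)]


if __name__ == "__main__":
    barcodes = [["AA", "CC"], ["G", "T"]]
    print(join_product(barcodes))  # four crossed barcodes
    print(join_multi(barcodes))    # two lines, one column per list
```

The product format grows multiplicatively with the number of lists, whereas the multi format grows only to the length of the longest list, which is why a separate format is useful for large barcode sets.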
+
+### Added
+
+- Add `sequence_spec` in the `Assay` object
+- Add `Read` object in the `sequence_spec` object
+- Add `sequence_spec` to the seqspec json schema
+- Add `Read` object to specification document
+- Add `Read` generator to website GUI
+- Add pattern matching to `date` in `Assay` (expected date format: DAY MONTH YEAR, where day is one or two digits, month is the full month name starting with a capital letter, and year is the full year)
+- Add `library_kit` to `Assay` object (kit that adds seq adapters)
+- Add `library_protocol` to `Assay` object (library that generates insert)
+- Add `sequence_kit` to `Assay` object
+- Add website to view example `seqspec` objects
+- Add `get_seqspec` to `Assay`, which returns the sequence structure for a given modality
+- Add multiple checks to `seqspec check`
+  - check read modalities exist in assay modalities
+  - check primer ids from seqspec are unique and exist as region ids in libspec
+  - check that the primer id exists as an atomic region (currently a strong assumption that may be relaxed in the future)
+  - check properties of multiple sequence types
+    - `fixed` and `regions` not null incompatible
+    - `joined` and `regions` null incompatible
+    - `random` and `regions` not null incompatible
+    - `random` must have `sequence` of all X's
+    - `onlist` and `onlist` property null incompatible
+  - check that the min len is less than or equal to the max len
+  - check that the length of the sequence is between min and max len
+  - Note a strong assumption in `seqspec print` is that the sequences have a length equal to the `max_len` for visualization purposes
+- Add `RegionCoordinate` object that maps `Region` min/max lengths to 0-indexed positions
+- `seqspec onlist` searches for onlists in a `Region` based on `--region` flag
+- Add type annotations for `join_onlists` to clarify it needs a list of `Onlist` objects
+- Add minimal tests for `RegionCoordinate`, `project_regions_to_coordinates`, `run_onlist_region`, `run_onlist_read`, and seqspec print functions
+- Add list of options to CLI for `-f FORMAT` within `seqspec onlist` and `seqspec print`
+- Add `-s SPECTYPE` to `seqspec print` to disambiguate printing `sequence`, `library`, or `libseq` objects. TODO wrap `seqspec info` into `seqspec print -f info`.
+- Add `-s SPECOBJECT` to `seqspec onlist`. Specify specific object `read`, `region`, or `region-type` for finding the `onlist`.
+- Add fetching ability for seqspec onlist from remote with IGVF credentials (credit to @detrout)
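The sequence-type and length rules listed under `seqspec check` above can be expressed as a small validator. This is a minimal sketch, not the project's implementation: `RegionSketch` and `check_region` are hypothetical names, "null" is interpreted as Python `None`, and only the fields needed by the listed rules are modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RegionSketch:
    """Hypothetical stand-in for a seqspec Region (checked fields only)."""
    region_id: str
    sequence_type: str  # "fixed", "joined", "random", or "onlist"
    sequence: str
    min_len: int
    max_len: int
    regions: Optional[list] = None
    onlist: Optional[dict] = None


def check_region(r: RegionSketch) -> list[str]:
    """Collect one message per violated rule instead of stopping early."""
    errors = []
    if r.sequence_type == "fixed" and r.regions is not None:
        errors.append(f"{r.region_id}: fixed region must not have sub-regions")
    if r.sequence_type == "joined" and r.regions is None:
        errors.append(f"{r.region_id}: joined region must have sub-regions")
    if r.sequence_type == "random":
        if r.regions is not None:
            errors.append(f"{r.region_id}: random region must not have sub-regions")
        if set(r.sequence) - {"X"}:
            errors.append(f"{r.region_id}: random sequence must be all X's")
    if r.sequence_type == "onlist" and r.onlist is None:
        errors.append(f"{r.region_id}: onlist region must define an onlist")
    if r.min_len > r.max_len:
        errors.append(f"{r.region_id}: min_len is greater than max_len")
    if not (r.min_len <= len(r.sequence) <= r.max_len):
        errors.append(f"{r.region_id}: sequence length outside [min_len, max_len]")
    return errors
```

Returning a list of messages rather than raising on the first failure mirrors the way a spec checker is typically used: the author wants to see every problem in one pass.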
+
+### Removed
+
+TODO:
+
+- Remove `lib_struct`
+- Remove `parent_id`
+
+### Fixed
+
+- Sequencing overlapping pairs now supported
+- `seqspec check` correctly handles sequence lengths longer than the stated min/max range
+- Fix test for `project_regions_to_coordinates`
+- Get the test of seqspec check working again by updating the schema for the refactored example specification YAML files and mocking fastq and barcode files
+- Only return the onlist filename if it's a local file, downloading remote lists when needed
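The idea behind `RegionCoordinate` and `project_regions_to_coordinates`, mapping consecutive regions to 0-indexed positions, can be sketched as below. The class and function here are hypothetical stand-ins, and two details are assumptions: positions are derived from `max_len` only (in line with the `seqspec print` assumption noted in the changelog), and intervals are half-open.

```python
from dataclasses import dataclass


@dataclass
class RegionCoordinateSketch:
    """Hypothetical stand-in for RegionCoordinate."""
    region_id: str
    start: int  # 0-indexed, inclusive
    stop: int   # 0-indexed, exclusive


def project(regions: list[tuple[str, int]]) -> list[RegionCoordinateSketch]:
    """Lay out (region_id, max_len) pairs end to end in 5'->3' order."""
    coords = []
    start = 0
    for region_id, max_len in regions:
        coords.append(RegionCoordinateSketch(region_id, start, start + max_len))
        start += max_len  # the next region begins where this one stops
    return coords


if __name__ == "__main__":
    for c in project([("barcode", 16), ("umi", 12), ("cdna", 98)]):
        print(c.region_id, c.start, c.stop)
```

With half-open intervals, the length of a region is simply `stop - start`, and a read can be intersected with the layout by comparing its own start and stop against each region.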

docs/CONTRIBUTING.md

Lines changed: 43 additions & 4 deletions
@@ -1,16 +1,25 @@
-## Contributing
+# Contributing

-Thank you for wanting to add a spec or improve `seqspec`. If you have a bug that is related to `seqspec` please create an issue.
+Thank you for wanting to add a spec or improve `seqspec`. If you have a bug that is related to `seqspec` please create an issue. This document outlines the process for suggesting improvements to the `seqspec` specification and the procedure for updating the specification.

-### Issues
+## Issues

 The issue should contain

 - the `seqspec` command that was run,
 - the error message, and
 - the `seqspec` and python version.

-### Specs and code changes
+## Improvements
+
+To suggest improvements to the seqspec project please do the following:
+
+- **Open an Issue**: For suggesting improvements, please open a new issue in the GitHub repository.
+- **Describe Your Suggestion**: Clearly describe the problem and your proposed solution. Include examples and use cases where possible.
+- **Engagement**: Encourage community feedback on the suggestion through comments.
+- **Iterate**: Be open to iterating on your suggestion based on community feedback.
+
+## Specs and code changes

 If you'd like to add assay sequence specifications or make modifications to the `seqspec` tool please do the following:

@@ -75,3 +84,33 @@ git push origin cool-new-feature
 5. Submit a pull request

 If you are unfamiliar with pull requests, you can find more information on the [GitHub help page.](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-requests)
+
+### Steps for Review
+
+1. **Initial Review**: A maintainer will review the suggestion for completeness and relevance.
+2. **Community Feedback**: A period for community feedback will follow.
+3. **Final Review**: The maintainers will make a final review, considering all feedback.
+
+### Decision Making
+
+- Decisions will be made based on the specification's goals, community feedback, and overall impact on the `seqspec` ecosystem.
+
+## Updating the Specification
+
+### Approval and Merging
+
+- Once approved, a maintainer will merge the changes into the specification.
+- Major changes may require a more detailed review process or a community vote.
+
+### Versioning and Change Log
+
+- **Versioning**: Follow semantic versioning. Major changes result in a version bump.
+- **Change Log**: Update the change log with a summary of the changes and contributors.
+
+### Testing and Validation
+
+- Ensure any changes are tested for compatibility and do not break existing functionality.
+
+## Conclusion
+
+We value your contributions and aim to make the process of improving the specification collaborative and transparent. For any questions, please contact the repository maintainers.
