Commit 43d0f66

Create new pds-deep-archive program and improve performance (#26)

* Resolutions for #13 and #21

  - Resolve #21 with a new driver program `aipsip` that generates the AIP and then uses it to make the SIP as well, leaving everything in the current working directory (along with two, count 'em, *two*, PDS labels for the price of one!).
  - Updates the Python `setuptools` metadata to generate the new `aipsip` (helps with #21).
  - Refactors logging and command-line argument setup (also for #21).
  - Unifies logging between `aipgen` and `sipgen` with the new `aipsip` so that there are `--debug` and `--quiet` options; without either you get a nominal amount of "hand-holding" output.
  - Resolve #13 so that instead of billions of redundant XML parsing and XPath lookups we use a local `sqlite3` database and LRU caching.
  - Factor out XML parsing from `aipgen` and `sipgen` so we can apply caching.
  - Clear up logging messages so we can know what's calling what.
  - Create a temp DB in `sipgen` and populate it with mappings from lidvids to XML files for rapid lookups. But see also #25 for other uses of that DB.
  - Add standardized `--version` arguments for all three programs.

  With these changes, running `sipgen` on my Mac¹ can process a 272GiB `insight_cameras` export in 1:03. On `pdsimg-int1`², it handles the 1.5TiB `insight_cameras` dataset in under 4 hours.

  Footnotes:

  - ¹ 2.4 GHz 8-core Intel Core i9, SSD
  - ² 2.3 GHz 8-core Intel Xeon Gold 6140, unknown drive

* Improvements for usability and bug fixes for validate errors

  * After running validate, there were a few minor fixes that needed to be implemented.
  * Commented out / removed several CLI options for the time being until functionality is fully developed.
  * Updated file naming to take into account bundle versioning separate from the AIP/SIP version.
  * Updated docs per new pds-deep-archive script which combines aipgen and sipgen.

Refs #21

Co-authored-by: Jordan Padams <[email protected]>
1 parent 2207dcf commit 43d0f66
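
The caching approach described in the commit message (a local `sqlite3` database plus LRU caching for repeated lookups) might look roughly like the sketch below. It is a minimal illustration and not the code from this commit: the table name `labels`, the helpers `make_lidvid_index` and `make_cached_lookup`, and the sample lidvids and paths are all hypothetical.

```python
import functools
import sqlite3


def make_lidvid_index(mappings):
    """Build an in-memory SQLite table mapping lidvids to XML label paths.

    `mappings` is an iterable of (lidvid, xml_path) pairs. The schema is an
    assumption made for this sketch.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE labels (lidvid TEXT PRIMARY KEY, xml_path TEXT NOT NULL)"
    )
    con.executemany(
        "INSERT INTO labels (lidvid, xml_path) VALUES (?, ?)", mappings
    )
    con.commit()
    return con


def make_cached_lookup(con, maxsize=1024):
    """Return a lookup function whose results are memoized with an LRU cache."""

    @functools.lru_cache(maxsize=maxsize)
    def lookup(lidvid):
        row = con.execute(
            "SELECT xml_path FROM labels WHERE lidvid = ?", (lidvid,)
        ).fetchone()
        return row[0] if row else None

    return lookup


# Hypothetical sample data, for illustration only.
con = make_lidvid_index([
    ("urn:nasa:pds:example_bundle::1.0", "example_bundle/bundle.xml"),
    ("urn:nasa:pds:example_bundle:data::1.0", "example_bundle/data/collection.xml"),
])
lookup = make_cached_lookup(con)
```

Repeated calls to `lookup` with the same lidvid are served from the cache instead of hitting the database again, which is the kind of redundancy the commit message says it removes.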

9 files changed: +412 -343 lines changed

README.rst (+4 -93)

@@ -10,6 +10,8 @@ Archival Information System (OAIS_) standards.
  Features
  ========

+ • Provides an exectuble Python script ``pds-deep-archive``. Run ``pds-deep-archive --help`` for
+ more details.
  • Provides an exectuble Python script ``aipgen``. Run ``aipgen --help`` for
  more details.
  • Provides an exectuble Python script ``sipgen``. Run ``sipgen --help`` for
@@ -42,6 +44,7 @@ well as ``libxsl2`` 1.1.28 or later.

  4. You should now be able to run the deep archive utilities::

+ (pds-deep-archive) bash> pds-deep-archive --help
  (pds-deep-archive) bash> aipgen --help
  (pds-deep-archive) bash> sipgen --help

@@ -63,102 +66,10 @@ To build the software for distribution:
  3. A tar.gz should now be available in the ``dist/`` directory for distribution.


- Usage
- =====
-
- 1. If not already activated, activate your virtualenv::
-
- bash> $HOME/.virtualenvs/pds-deep-archive/bin/activate
- (pds-deep-archive) bash>
-
- 2. Then you can run aipgen. Here's a basic example using data in the test directory::
-
- (pds-deep-archive) bash> aipgen test/data/ladee_test/ladee_mission_bundle/LADEE_Bundle_1101.xml
- INFO 🏃‍♀️ Starting AIP generation for test/data/ladee_test/ladee_mission_bundle/LADEE_Bundle_1101.xml
- INFO 🧾 Writing checksum manifest for /Users/kelly/Documents/Clients/JPL/PDS/Development/pds-deep-archive/test/data/ladee_test/ladee_mission_bundle to ladee_mission_bundle_checksum_manifest_v1.0.tab
- INFO 🚢 Writing transfer manifest for /Users/kelly/Documents/Clients/JPL/PDS/Development/pds-deep-archive/test/data/ladee_test/ladee_mission_bundle to ladee_mission_bundle_transfer_manifest_v1.0.tab
- INFO 🏷 Writing AIP label to ladee_mission_bundle_aip_v1.0.xml
- INFO 🎉 Success! All done, files generated:
- INFO • Checksum manifest: ladee_mission_bundle_checksum_manifest_v1.0.tab
- INFO • Transfer manifest: ladee_mission_bundle_transfer_manifest_v1.0.tab
- INFO • XML label: ladee_mission_bundle_aip_v1.0.xml
- INFO 👋 Thanks for using this program! Bye!
-
- 3. You can also run sipgen. Here is a basic usage example using data in the test directory::
-
- (pds-deep-archive) bash> sipgen -c ladee_mission_bundle_checksum_manifest_v1.0.tab -s PDS_ATM -n -b https://atmos.nmsu.edu/PDS/data/PDS4/LADEE/ test/data/ladee_test/ladee_mission_bundle/LADEE_Bundle_1101.xml
- ⚙︎ ``sipgen`` — Submission Information Package (SIP) Generator, version 0.0.0
- 🎉 Success! From test/data/ladee_test/ladee_mission_bundle/LADEE_Bundle_1101.xml, generated these output files:
- • Manifest: ladee_mission_bundle_sip_v1.0.tab
- • Label: ladee_mission_bundle_sip_v1.0.xml
-
- Note how the checksum manifest from ``aipgen`` was the input to ``-c`` in
- ``sipgen``.
-
- Full usage from the ``--help`` flag to ``aipgen``::
-
- usage: aipgen [-h] [-v] IN-BUNDLE.XML
-
- Generate an Archive Information Package or AIP. An AIP consists of three
- files: ➀ a "checksum manifest" which contains MD5 hashes of *all* files in a
- product; ➁ a "transfer manifest" which lists the "lidvids" for files within
- each XML label mentioned in a product; and ➂ an XML label for these two files.
- You can use the checksum manifest file ➀ as input to ``sipgen`` in order to
- create a Submission Information Package.
-
- positional arguments:
- IN-BUNDLE.XML Root bundle XML file to read
-
- optional arguments:
- -h, --help show this help message and exit
- -v, --verbose Verbose logging; defaults False
-
- And usage from the ``--help`` flag for ``sipgen``::
-
- usage: sipgen [-h] [-a {MD5,SHA-1,SHA-256}] -s
- {PDS_ATM,PDS_ENG,PDS_GEO,PDS_IMG,PDS_JPL,PDS_NAI,PDS_PPI,PDS_PSI,PDS_RNG,PDS_SBN}
- [-u URL | -n] [-k] [-c AIP-CHECKSUM-MANIFEST.TAB]
- [-b BUNDLE_BASE_URL] [-v] [-i PDS4_INFORMATION_MODEL_VERSION]
- IN-BUNDLE.XML
-
- Generate Submission Information Packages (SIPs) from bundles. This program
- takes a bundle XML file as input and produces two output files: ① A Submission
- Information Package (SIP) manifest file; and ② A PDS XML label of that file.
- The files are created in the current working directory when this program is
- run. The names of the files are based on the logical identifier found in the
- bundle file, and any existing files are overwritten. The names of the
- generated files are printed upon successful completion.
-
- positional arguments:
- IN-BUNDLE.XML Bundle XML file to read
-
- optional arguments:
- -h, --help show this help message and exit
- -a {MD5,SHA-1,SHA-256}, --algorithm {MD5,SHA-1,SHA-256}
- File hash (checksum) algorithm; default MD5
- -s {PDS_ATM,PDS_ENG,PDS_GEO,PDS_IMG,PDS_JPL,PDS_NAI,PDS_PPI,PDS_PSI,PDS_RNG,PDS_SBN}, --site {PDS_ATM,PDS_ENG,PDS_GEO,PDS_IMG,PDS_JPL,PDS_NAI,PDS_PPI,PDS_PSI,PDS_RNG,PDS_SBN}
- Provider site ID for the manifest's label; default
- None
- -u URL, --url URL URL to the registry service; default https://pds-dev-
- el7.jpl.nasa.gov/services/registry/pds
- -n, --offline Run offline, scanning bundle directory for matching
- files instead of querying registry service
- -k, --insecure Ignore SSL/TLS security issues; default False
- -c AIP-CHECKSUM-MANIFEST.TAB, --aip AIP-CHECKSUM-MANIFEST.TAB
- Archive Information Product checksum manifest file
- -b BUNDLE_BASE_URL, --bundle-base-url BUNDLE_BASE_URL
- Base URL prepended to URLs in the generated manifest
- for local files in "offline" mode
- -v, --verbose Verbose logging; defaults False
- -i PDS4_INFORMATION_MODEL_VERSION, --pds4-information-model-version PDS4_INFORMATION_MODEL_VERSION
- Specify PDS4 Information Model version to generate
- SIP. Must be 1.13.0.0+; default 1.13.0.0
-
-
  Documentation
  =============

- Additional documentation is available in the ``docs`` directory and also TBD.
+ Installation and Usage information can be found in the documentation online at https://nasa-pds-incubator.github.io/pds-deep-archive/ or the latest version is maintained under the ``docs`` directory.



docs/source/development/index.rst (+2 -2)

@@ -9,8 +9,8 @@ build it out::
  python3 bootstrap.py
  bin/buildout

- At this point, you'll have the ``aipgen`` and ``sipgen`` programs ready to run
- as ``bin/aipgen`` and ``bin/sipgen`` that's set up to use source Python code
+ At this point, you'll have the ``pds-deep-archive``, ``aipgen``, ``sipgen`` programs ready to run
+ as ``bin/pds-deep-archive``, ``bin/aipgen``, and ``bin/sipgen`` that's set up to use source Python code
  under ``src``. Changes you make to the code are reflected in ``bin/sipgen``
  immediately.

docs/source/usage/index.rst (+69 -76)

@@ -1,81 +1,77 @@
  🏃‍♀️ Usage
  ===========

- This package provides two executables, ``aipgen`` that generats Archive
- Information Packages; and ``sipgen``, that generates Submission Information
- Package (SIP)—both from PDS bundles.
-
- Running ``aipgen --help`` or ``sipgen --help`` will give a summary of the
+ This package provides one primary executable, ``pds-deep-archive`` that generates both
+ and Archive Information Package (AIP) and a Submission Information Package (SIP). The
+ SIP is what is delivered by the PDS to the NASA Space Science Data Coordinated Archive (NSSDCA).
+ For more information about the products produced, see the following references:
+ * OAIS Information - http://www.oais.info/
+ * AIP Information - https://www.iasa-web.org/tc04/archival-information-package-aip
+ * SIP Information - https://www.iasa-web.org/tc04/submission-information-package-sip
+
+ This package also comes with the two sub-components of ``pds-deep-archive`` that can be ran
+ individually:
+ * ``aipgen`` that generates Archive Information Packages from a PDS4 bundle
+ * ``sipgen`` that generates Submission Information from a PDS4 bundle
+
+ Running ``pds-deep-archive --help`` will give a summary of the
  command-line invocation, its required arguments, and any options that refine
  the behavior. For example, to create an AIP from the LADEE 1101 bundle in
- ``test/data/ladee_test/ladee_mission_bundle/LADEE_Bundle_1101.xml`` run::
+ ``test/data/ladee_test/mission_bundle/LADEE_Bundle_1101.xml`` run::

- aipgen test/data/ladee_test/ladee_mission_bundle/LADEE_Bundle_1101.xml
+ aipgen test/data/ladee_test/mission_bundle/LADEE_Bundle_1101.xml

  The program will print::

- INFO 🏃‍♀️ Starting AIP generation for test/data/ladee_test/ladee_mission_bundle/LADEE_Bundle_1101.xml
- INFO 🧾 Writing checksum manifest for /Users/kelly/Documents/Clients/JPL/PDS/Development/pds-deep-archive/test/data/ladee_test/ladee_mission_bundle to ladee_mission_bundle_checksum_manifest_v1.0.tab
- INFO 🚢 Writing transfer manifest for /Users/kelly/Documents/Clients/JPL/PDS/Development/pds-deep-archive/test/data/ladee_test/ladee_mission_bundle to ladee_mission_bundle_transfer_manifest_v1.0.tab
- INFO 🏷 Writing AIP label to ladee_mission_bundle_aip_v1.0.xml
- INFO 🎉 Success! All done, files generated:
- INFO • Checksum manifest: ladee_mission_bundle_checksum_manifest_v1.0.tab
- INFO • Transfer manifest: ladee_mission_bundle_transfer_manifest_v1.0.tab
- INFO • XML label: ladee_mission_bundle_aip_v1.0.xml
- INFO 👋 Thanks for using this program! Bye!
-
- This creates three output files in the current directory as part of the AIP:
+ INFO 👟 PDS Deep Archive, version 0.0.0
+ INFO 🏃‍♀️ Starting AIP generation for test/data/ladee_test/mission_bundle/LADEE_Bundle_1101.xml

- • ``ladee_mission_bundle_checksum_manifest_v1.0.tab``, the checksum manifest
- • ``ladee_mission_bundle_transfer_manifest_v1.0.tab``, the transfer manifest
- • ``ladee_mission_bundle_aip_v1.0.xml``, the label for these two files
+ INFO 🎉 Success! AIP done, files generated:
+ INFO • Checksum manifest: ladee_mission_bundle_v1.0_checksum_manifest_v1.0.tab
+ INFO • Transfer manifest: ladee_mission_bundle_v1.0_transfer_manifest_v1.0.tab
+ INFO • XML label for them both: ladee_mission_bundle_v1.0_aip_v1.0.xml

- The checkum manifest may then be fed into ``sipgen`` to create the SIP::
+ INFO 🏃‍♀️ Starting SIP generation for test/data/ladee_test/mission_bundle/LADEE_Bundle_1101.xml

- sipgen --aip ladee_mission_bundle_checksum_manifest_v1.0.tab ladee_mission_bundle_checksum_manifest_v1.0.tab --s PDS_ATM --offline --bundle-base-url https://atmos.nmsu.edu/PDS/data/PDS4/LADEE/ test/data/ladee_test/ladee_mission_bundle/LADEE_Bundle_1101.xml
+ INFO 🎉 Success! From /Users/jpadams/Documents/proj/pds/pdsen/workspace/pds-deep-archive/test/data/ladee_test/mission_bundle/LADEE_Bundle_1101.xml, generated these output files:
+ INFO • SIP Manifest: ladee_mission_bundle_v1.0_sip_v1.0.tab
+ INFO • XML label for the SIP: ladee_mission_bundle_v1.0_sip_v1.0.xml

- This program will print::
+ INFO 👋 That's it! Thanks for making an AIP and SIP with us today. Bye!

- ⚙︎ ``sipgen`` — Submission Information Package (SIP) Generator, version 0.0.0
- 🎉 Success! From test/data/ladee_test/ladee_mission_bundle/LADEE_Bundle_1101.xml, generated these output files:
- • Manifest: ladee_mission_bundle_sip_v1.0.tab
- • Label: ladee_mission_bundle_sip_v1.0.xml
+ This creates 5 output files in the current directory as part of the AIP and SIP Generation:

- And two new files will appear in the current directory:
+ • ``ladee_mission_bundle_v1.0_checksum_manifest_v1.0.tab``, the checksum manifest
+ • ``ladee_mission_bundle_v1.0_transfer_manifest_v1.0.tab``, the transfer manifest
+ • ``ladee_mission_bundle_v1.0_aip_v1.0.xml``, the label for these two files

- • ``ladee_mission_bundle_sip_v1.0.tab``, the created SIP manifest as a
+ • ``ladee_mission_bundle_v1.0_sip_v1.0.tab``, the created SIP manifest as a
  tab-separated values file.
- • ``ladee_mission_bundle_sip_v1.0.xml``, an PDS label for the SIP file.
-
- For reference, the full "usage" message from ``aipgen`` is::
-
- usage: aipgen [-h] [-v] IN-BUNDLE.XML
-
- Generate an Archive Information Package or AIP. An AIP consists of three
- files: ➀ a "checksum manifest" which contains MD5 hashes of *all* files in a
- product; ➁ a "transfer manifest" which lists the "lidvids" for files within
- each XML label mentioned in a product; and ➂ an XML label for these two files.
- You can use the checksum manifest file ➀ as input to ``sipgen`` in order to
- create a Submission Information Package.
-
- positional arguments:
- IN-BUNDLE.XML Root bundle XML file to read
-
- optional arguments:
- -h, --help show this help message and exit
- -v, --verbose Verbose logging; defaults False
-
- For reference, the full "usage" message from ``sipgen`` follows::
-
- usage: sipgen [-h] [-a {MD5,SHA-1,SHA-256}] -s
- {PDS_ATM,PDS_ENG,PDS_GEO,PDS_IMG,PDS_JPL,PDS_NAI,PDS_PPI,PDS_PSI,PDS_RNG,PDS_SBN}
- [-u URL | -n] [-k] [-c AIP-CHECKSUM-MANIFEST.TAB]
- [-b BUNDLE_BASE_URL] [-v] [-i PDS4_INFORMATION_MODEL_VERSION]
- IN-BUNDLE.XML
+ • ``ladee_mission_bundle_v1.0_sip_v1.0.xml``, an PDS label for the SIP file.
+
+ For reference, the full "usage" message from ``pds-deep-archive`` is::
+
+ $ pds-deep-archive --help
+ usage: pds-deep-archive [-h] [--version] -s
+ {PDS_ATM,PDS_ENG,PDS_GEO,PDS_IMG,PDS_JPL,PDS_NAI,PDS_PPI,PDS_PSI,PDS_RNG,PDS_SBN}
+ [-n] -b BUNDLE_BASE_URL [-d] [-q]
+ IN-BUNDLE.XML
+
+ Generate an Archive Information Package (AIP) and a Submission Information
+ Package (SIP). This creates three files for the AIP in the current directory
+ (overwriting them if they already exist):
+ ➀ a "checksum manifest" which contains MD5 hashes of *all* files in a product
+ ➁ a "transfer manifest" which lists the "lidvids" for files within each XML
+ label mentioned in a product
+ ➂ an XML label for these two files.
+
+ It also creates two files for the SIP (also overwriting them if they exist):
+ ① A "SIP manifest" file; and an XML label of that file too. The names of
+ the generated files are based on the logical identifier found in the
+ bundle file, and any existing files are overwritten. The names of the
+ generated files are printed upon successful completion.
+ ② A PDS XML label of that file.

- Generate Submission Information Packages (SIPs) from bundles. This program
- takes a bundle XML file as input and produces two output files: ① A Submission
- Information Package (SIP) manifest file; and ② A PDS XML label of that file.
  The files are created in the current working directory when this program is
  run. The names of the files are based on the logical identifier found in the
  bundle file, and any existing files are overwritten. The names of the
@@ -86,22 +82,19 @@ For reference, the full "usage" message from ``sipgen`` follows::

  optional arguments:
  -h, --help show this help message and exit
- -a {MD5,SHA-1,SHA-256}, --algorithm {MD5,SHA-1,SHA-256}
- File hash (checksum) algorithm; default MD5
+ --version show program's version number and exit
  -s {PDS_ATM,PDS_ENG,PDS_GEO,PDS_IMG,PDS_JPL,PDS_NAI,PDS_PPI,PDS_PSI,PDS_RNG,PDS_SBN}, --site {PDS_ATM,PDS_ENG,PDS_GEO,PDS_IMG,PDS_JPL,PDS_NAI,PDS_PPI,PDS_PSI,PDS_RNG,PDS_SBN}
- Provider site ID for the manifest's label; default
- None
- -u URL, --url URL URL to the registry service; default https://pds-dev-
- el7.jpl.nasa.gov/services/registry/pds
+ Provider site ID for the manifest's label
  -n, --offline Run offline, scanning bundle directory for matching
- files instead of querying registry service
- -k, --insecure Ignore SSL/TLS security issues; default False
- -c AIP-CHECKSUM-MANIFEST.TAB, --aip AIP-CHECKSUM-MANIFEST.TAB
- Archive Information Product checksum manifest file
+ files instead of querying registry service. NOTE: By
+ default, set to True until online mode is available.
  -b BUNDLE_BASE_URL, --bundle-base-url BUNDLE_BASE_URL
- Base URL prepended to URLs in the generated manifest
- for local files in "offline" mode
- -v, --verbose Verbose logging; defaults False
- -i PDS4_INFORMATION_MODEL_VERSION, --pds4-information-model-version PDS4_INFORMATION_MODEL_VERSION
- Specify PDS4 Information Model version to generate
- SIP. Must be 1.13.0.0+; default 1.13.0.0
+ Base URL for Node data archive. This URL will be
+ prepended to the bundle directory to form URLs to the
+ products. For example, if we are generating a SIP for
+ mission_bundle/LADEE_Bundle_1101.xml, and bundle-base-
+ url is https://atmos.nmsu.edu/PDS/data/PDS4/LADEE/,
+ the URL in the SIP will be https://atmos.nmsu.edu/PDS/
+ data/PDS4/LADEE/mission_bundle/LADEE_Bundle_1101.xml.
+ -d, --debug Log debugging messages for developers
+ -q, --quiet Don't log informational messages
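
The `--version`, `-d/--debug`, and `-q/--quiet` options shown in the help text above could be wired to Python's `logging` levels along the lines of the sketch below. This is an assumption about one reasonable way to do it with `argparse`, not the commit's actual code; `build_parser`, the `loglevel` destination, and the choice of `WARNING` for quiet mode are hypothetical.

```python
import argparse
import logging


def build_parser(prog, version="0.0.0"):
    """Create a parser with shared --version, --debug and --quiet options."""
    parser = argparse.ArgumentParser(prog=prog)
    parser.add_argument(
        "--version", action="version", version=f"%(prog)s {version}"
    )
    # Both flags write to the same destination; with neither flag given,
    # the level stays at INFO (the nominal "hand-holding" output).
    parser.add_argument(
        "-d", "--debug", action="store_const", dest="loglevel",
        const=logging.DEBUG, default=logging.INFO,
        help="Log debugging messages for developers",
    )
    parser.add_argument(
        "-q", "--quiet", action="store_const", dest="loglevel",
        const=logging.WARNING, default=logging.INFO,
        help="Don't log informational messages",
    )
    return parser


def configure_logging(args):
    """Apply the level selected on the command line to the root logger."""
    logging.basicConfig(level=args.loglevel, format="%(levelname)s %(message)s")


args = build_parser("pds-deep-archive").parse_args(["--debug"])
configure_logging(args)
```

Sharing one `build_parser` helper across several programs is one way to keep their logging options consistent.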

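The `--bundle-base-url` help text says the base URL is prepended to the bundle directory to form product URLs. A minimal sketch of that joining step, using the example values from the help text, is shown here. The function name `product_url` is hypothetical, and the slash handling is an assumption rather than the program's documented behavior.

```python
def product_url(bundle_base_url, relative_path):
    """Prepend the base URL to a bundle-relative path.

    Exactly one slash is placed between the two parts (an assumption made
    for this sketch).
    """
    base = bundle_base_url if bundle_base_url.endswith("/") else bundle_base_url + "/"
    return base + relative_path.lstrip("/")


url = product_url(
    "https://atmos.nmsu.edu/PDS/data/PDS4/LADEE/",
    "mission_bundle/LADEE_Bundle_1101.xml",
)
print(url)
```
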
setup.py (+2 -1)

@@ -64,7 +64,8 @@
      entry_points={
          'console_scripts': [
              'sipgen=pds.aipgen.sip:main',
-             'aipgen=pds.aipgen.aip:main'
+             'aipgen=pds.aipgen.aip:main',
+             'pds-deep-archive=pds.aipgen.main:main'
          ]
      },
      namespace_packages=['pds'],
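
Each `console_scripts` entry above has the form `name=module:function`: the installed command `name` calls `function` from `module`. The sketch below illustrates how such a spec resolves to a callable. It is only an illustration (setuptools and pip generate their own wrapper scripts), the helper `load_entry_point_spec` is hypothetical, and it uses the standard-library `json:dumps` as a stand-in target because `pds.aipgen.main` may not be installed.

```python
import importlib


def load_entry_point_spec(spec):
    """Split 'name=module:function' and import the referenced callable."""
    name, _, target = spec.partition("=")
    module_name, _, attr = target.partition(":")
    module = importlib.import_module(module_name.strip())
    return name.strip(), getattr(module, attr.strip())


# Stand-in spec for illustration; the real entry is
# 'pds-deep-archive=pds.aipgen.main:main'.
command, func = load_entry_point_spec("dumps-demo=json:dumps")
print(command, func([1, 2]))
```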
