bitdruid
diff --git a/README.md b/README.md
Lines changed: 37 additions & 42 deletions
diff --git a/dev/pip_build.sh b/dev/pip_build.sh
Lines changed: 4 additions & 8 deletions
diff --git a/pyproject.toml b/pyproject.toml
Lines changed: 29 additions & 0 deletions
diff --git a/pywaybackup/Arguments.py b/pywaybackup/Arguments.py
Lines changed: 31 additions & 29 deletions
diff --git a/pywaybackup/Exception.py b/pywaybackup/Exception.py
Lines changed: 2 additions & 2 deletions
@@ -2,7 +2,8 @@
 
 [![PyPI](https://img.shields.io/pypi/v/pywaybackup)](https://pypi.org/project/pywaybackup/)
 [![PyPI - Downloads](https://img.shields.io/pypi/dm/pywaybackup)](https://pypi.org/project/pywaybackup/)
-![Python Version](https://img.shields.io/badge/Python-3.6-blue)
+![Python Version](https://img.shields.io/badge/Python-3.8-blue)
+![Python_Sqlite3 Version](https://img.shields.io/badge/Python_Sqlite3-3.25-blue)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 
 Downloading archived web pages from the [Wayback Machine](https://archive.org/web/).
@@ -37,18 +38,20 @@ This tool allows you to download content from the Wayback Machine (archive.org).
 ## Arguments
 
 - `-h`, `--help`: Show the help message and exit.
-- `-a`, `--about`: Show information about the tool and exit.
+- `-v`, `--version`: Show information about the tool and exit.
 
 ### Required
 
 - **`-u`**, **`--url`**:<br>
   The URL of the web page to download. This argument is required.
 
 #### Mode Selection (Choose One)
-- **`-c`**, **`--current`**:<br>
-  Download the latest version of each file snapshot. You will get a rebuild of the current website with all available files (but not any original state because new and old versions are mixed).
-- **`-f`**, **`--full`**:<br>
+- **`-a`**, **`--all`**:<br>
   Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
+- **`-l`**, **`--last`**:<br>
+  Download the last version of each file snapshot. You will get one directory with a rebuild of the page. It contains the last version of each file within your specified `--range`.
+- **`-f`**, **`--first`**:<br>
+  Download the first version of each file snapshot. You will get one directory with a rebuild of the page. It contains the first version of each file within your specified `--range`.
 - **`-s`**, **`--save`**:<br>
   Save a page to the Wayback Machine. (beta)
 
@@ -65,7 +68,7 @@ Limits the amount of snapshots to query from the CDX server. If an existing CDX
 
 - **Range Selection:**<br>
   Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
-  (year 2019, year+month 201901, year+month+day 20190101, year+month+day+hour 2019010112)
+  (year 2019, year+month 201901, year+month+day 20190101, year+month+day+hour 2019010112)
    - **`-r`**, **`--range`**:<br>
      Specify the range in years for which to search and download snapshots.
    - **`--start`**:<br>
@@ -102,39 +105,47 @@ Specifies delay between download requests in seconds. Default is no delay (0).
 <!-- - **`--convert-links`**:<br>
 If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->
 
-## Special:
+### Special:
 
 - **`--reset`**:  
   If set, the job will be reset, and any existing `cdx`, `db`, `csv` files will be **deleted**. This allows you to start the job from scratch without considering previously downloaded data.
 
 - **`--keep`**:  
   If set, all files will be kept after the job is finished. This includes the `cdx` and `db` file. Without this argument, they will be deleted if the job finished successfully.
 
-### Examples
+## Usage
 
-Download the latest snapshot of all available files:<br>
-`waybackup -u http://example.com -c`
+### Handling Interrupted Jobs
+When a job is interrupted (for any reason), `pywaybackup` resumes it from where it left off. It automatically detects existing job data (based on the URL and the <u>**optional query parameters**</u>, including the output directory) and resumes the process without requiring manual intervention. Here's how the tool handles different scenarios:
 
-Download the latest snapshot of a specific file (e.g., a login page):<br>
-`waybackup -u http://example.com/login.html -c --explicit`
+- **Default Behavior:** 
+  - On restarting the same job (same URL, <u>**optional query parameters**</u>, and output directory), the tool will:
+    - Reuse the existing `.cdx` and `.db` files.
+    - Resume downloading snapshots from the last successful point.
+    - Skip previously downloaded files to save time and resources.
 
-Download all snapshots within the last 5 years and prevent redirects:<br>
-`waybackup -u http://example.com -f -r 5 --no-redirect`
+- **Manual Reset with `--reset`:** 
+  - This command deletes any existing `.cdx` and `.db` files associated with the job and starts the process from scratch.
+  - Useful if:
+    - The previous data is corrupted.
+    - You want to re-query the snapshots without considering previously downloaded data.
 
-Download all snapshots from a specific range (2020 to December 12, 2022) with 4 workers, and show a progress bar:<br>
-`waybackup -u http://example.com -f --start 2020 --end 20221212 --workers 4 --progress`
+- **Preserving Job Data with `--keep`:** 
+  - Normally, `.cdx` and `.db` files are deleted after the job finishes successfully.
+  - Use `--keep` to retain these files for future use (e.g., re-analysis or extending the query later).
 
-Download all snapshots and save the output in a specific folder with 3 workers:<br>
-`waybackup -u http://example.com -f -r 5 -o /home/user/Downloads/snapshots --workers 3`
-
-Download all snapshots but only images and CSS files, filtering for specific filetypes (jpg, css):<br>
-`waybackup -u http://example.com -f --filetype jpg,css`
-
-Download all timestamps but start over and ignore existing progress, log the output, and retry 3 times if any error occurs:<br>
-`waybackup -u http://example.com -f --log --retry 3 --reset`
+> **Note 1:** Resuming only works if the output directory is the same as the one used for the initial job.
+> 
+> **Note 2:** `--reset` does NOT currently delete already downloaded files. You have to remove them manually.
+  
+### Example
 
-Download the latest snapshot, follow no redirects but keep the database and cdx-file:<br>
-`waybackup -u http://example.com -c --no-redirect --keep`
+1. Start downloading all available snapshots:<br>`waybackup -u https://example.com -a`
+2. Interrupt the process with `CTRL+C`.
+3. Run the same command again. The tool will detect the existing job data and resume downloading from the last completed point:<br>`waybackup -u https://example.com -a`
+> **Important:** `waybackup -u https://example.com -l` -> The tool will NOT resume because a necessary identifier changed (here the mode switched from `-a` to `-l`).
+4. This deletes any existing .cdx and .db files associated with the job and starts the process from scratch:<br>`waybackup -u https://example.com -a --reset`
+5. This ensures all job-related files are kept for future use, such as re-analysis or extending the query later:<br>`waybackup -u https://example.com -a --keep`
 
 ## Output path structure
 
@@ -195,22 +206,6 @@ For download queries:
 ]
 ```
 
-For list queries:
-
-```
-[
-   {
-      "digest": "DIGESTOFSNAPSHOT",
-      "id": 1,
-      "mimetype": "text/html",
-      "status": "200",
-      "timestamp": "yyyymmddhhmmss",
-      "url": "http://example.com/"
-   },
-   ...
-]
-```
-
 ### Debugging
 
 Exceptions will be written into `waybackup_error.log` (each run overwrites the file).
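
The range selection rule in the README hunk above (a timestamp may be cut off after the year, month, day, hour, and so on) can be illustrated with a small sketch. This is an assumption about how a partial timestamp could be expanded into inclusive bounds, not the tool's actual implementation; `expand_timestamp` and both template constants are hypothetical names.

```python
# Hypothetical helper, not part of pywaybackup: expands a partial
# YYYYMMDDhhmmss timestamp into an inclusive (lower, upper) pair of
# 14-digit strings.
LOWER_TEMPLATE = "00000101000000"  # smallest value for each missing field
UPPER_TEMPLATE = "99991231235959"  # largest value for each missing field
VALID_LENGTHS = (4, 6, 8, 10, 12, 14)  # year, +month, +day, +hour, +min, +sec


def expand_timestamp(partial):
    text = str(partial)  # argparse hands --start/--end over as int
    if not (text.isascii() and text.isdigit()) or len(text) not in VALID_LENGTHS:
        raise ValueError(f"not a (partial) YYYYMMDDhhmmss timestamp: {text!r}")
    # Fill the missing right-hand fields from the templates.
    # Simplification: the upper bound always uses day 31, so it is only
    # meant for string comparison, not for calendar arithmetic.
    lower = text + LOWER_TEMPLATE[len(text):]
    upper = text + UPPER_TEMPLATE[len(text):]
    return lower, upper
```

For example, `expand_timestamp(2019)` covers the whole year, while `expand_timestamp("2019010112")` covers a single hour.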
 
@@ -4,13 +4,9 @@
 SCRIPT_PATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
 TARGET_PATH="$SCRIPT_PATH/.."
 
-# install dependencies
-pip install twine wheel setuptools
+pip install build twine
 
-# build 
-python $TARGET_PATH/setup.py sdist bdist_wheel --verbose
-python -m twine upload dist/*
-#pip install -e $TARGET_PATH
+python -m build "$TARGET_PATH"
+twine upload "$TARGET_PATH"/dist/*
 
-# clean up
-rm -rf $TARGET_PATH/build $TARGET_PATH/dist # $TARGET_PATH/*.egg-info
+rm -rf $TARGET_PATH/build $TARGET_PATH/dist $TARGET_PATH/*.egg-info
@@ -0,0 +1,29 @@
+[build-system]
+requires = ["setuptools", "wheel"]
+build-backend = "setuptools.build_meta"
+
+[tool.setuptools]
+packages = ["pywaybackup"]
+
+[project]
+name = "pywaybackup"
+version = "3.0.1"
+description = "Query and download archive.org as simple as possible."
+authors = [
+    { name = "bitdruid", email = "[email protected]" }
+]
+license = { file = "LICENSE" }
+readme = "README.md"
+requires-python = ">=3.8"
+dependencies = [
+    "requests==2.31.0",
+    "tqdm==4.66.2",
+    "python-magic==0.4.27; sys_platform == 'linux'",
+    "python-magic-bin==0.4.14; sys_platform == 'win32'",
+]
+
+[project.scripts]
+waybackup = "pywaybackup.main:main"
+
+[project.urls]
+homepage = "https://github.com/bitdruid/python-wayback-machine-downloader"
@@ -2,33 +2,33 @@
 import sys
 import os
 import argparse
+from importlib.metadata import version
 
 from pywaybackup.helper import url_split, sanitize_filename
 
-from pywaybackup.__version__ import __version__
-
 class Arguments:
-
+    
     def __init__(self):
-
+        
         parser = argparse.ArgumentParser(description='Download from wayback machine (archive.org)')
-        parser.add_argument('-a', '--about', action='version', version='%(prog)s ' + __version__ + ' by @bitdruid -> https://github.com/bitdruid')
-
+        parser.add_argument('-v', '--version', action='version', version='%(prog)s ' + version("pywaybackup") + ' by @bitdruid -> https://github.com/bitdruid')
+        
         required = parser.add_argument_group('required (one exclusive)')
         required.add_argument('-u', '--url', type=str, metavar="", help='url (with subdir/subdomain) to download')
         exclusive_required = required.add_mutually_exclusive_group(required=True)
-        exclusive_required.add_argument('-c', '--current', action='store_true', help='download the latest version of each file snapshot')
-        exclusive_required.add_argument('-f', '--full', action='store_true', help='download snapshots of all timestamps')
+        exclusive_required.add_argument('-a', '--all', action='store_true', help='download snapshots of all timestamps')
+        exclusive_required.add_argument('-l', '--last', action='store_true', help='download the last version of each file snapshot')
+        exclusive_required.add_argument('-f', '--first', action='store_true', help='download the first version of each file snapshot')
         exclusive_required.add_argument('-s', '--save', action='store_true', help='save a page to the wayback machine')
-
+        
         optional = parser.add_argument_group('optional query parameters')
         optional.add_argument('-e', '--explicit', action='store_true', help='search only for the explicit given url')
         optional.add_argument('-r', '--range', type=int, metavar="", help='range in years to search')
         optional.add_argument('--start', type=int, metavar="", help='start timestamp format: YYYYMMDDhhmmss')
         optional.add_argument('--end', type=int, metavar="", help='end timestamp format: YYYYMMDDhhmmss')
         optional.add_argument('--filetype', type=str, metavar="", help='filetypes to download comma separated (e.g. "html,css")')
         optional.add_argument('--limit', type=int, nargs='?', const=True, metavar='int', help='limit the number of snapshots to download')
-
+        
         behavior = parser.add_argument_group('manipulate behavior')
         behavior.add_argument('-o', '--output', type=str, metavar="", help='output folder - defaults to current directory')
         behavior.add_argument('--log', action='store_true', help='save a log file into the output folder')
@@ -39,57 +39,59 @@ def __init__(self):
         behavior.add_argument('--workers', type=int, default=1, metavar="", help='number of workers (simultaneous downloads)')
         # behavior.add_argument('--convert-links', action='store_true', help='Convert all links in the files to local paths. Requires -c/--current')
         behavior.add_argument('--delay', type=int, default=0, metavar="", help='delay between each download in seconds')
-
+        
         special = parser.add_argument_group('special')
         special.add_argument('--reset', action='store_true', help='reset the job and ignore existing cdx/db/csv files')
         special.add_argument('--keep', action='store_true', help='keep all files after the job finished')
-
+        
         args = parser.parse_args(args=None if sys.argv[1:] else ['--help']) # if no arguments are given, print help
-
+        
         required_args = {action.dest: getattr(args, action.dest) for action in exclusive_required._group_actions}
         optional_args = {action.dest: getattr(args, action.dest) for action in optional._group_actions}
         args.query_identifier = str(args.url) + str(required_args) + str(optional_args)
-
+        
         # if args.convert_links and not args.current:
         #     parser.error("--convert-links can only be used with the -c/--current option")
-
+        
         self.args = args
-
+    
     def get_args(self):
         return self.args
 
 class Configuration:
-
+    
     @classmethod
     def init(cls):
-
+        
         cls.args = Arguments().get_args()
         for key, value in vars(cls.args).items():
             setattr(Configuration, key, value)
-
+        
         # args now attributes of Configuration // Configuration.output, ...
         cls.command = ' '.join(sys.argv[1:])
         cls.domain, cls.subdir, cls.filename = url_split(cls.url)
-
+        
         if cls.output is None:
             cls.output = os.path.join(os.getcwd(), "waybackup_snapshots")
         os.makedirs(cls.output, exist_ok=True)
-
+        
         if cls.log is True:
             cls.log = os.path.join(cls.output, f"waybackup_{sanitize_filename(cls.url)}.log")
-
-        if cls.full:
-            cls.mode = "full"
-        if cls.current:
-            cls.mode = "current"
-
+        
+        if cls.all:
+            cls.mode = "all"
+        if cls.last:
+            cls.mode = "last"
+        if cls.first:
+            cls.mode = "first"
+        
         if cls.filetype:
             cls.filetype = [ft.lower().strip() for ft in cls.filetype.split(",")]
-
+        
         cls.cdxfile = os.path.join(cls.output, f"waybackup_{sanitize_filename(cls.url)}.cdx")
         cls.dbfile = os.path.join(cls.output, f"waybackup_{sanitize_filename(cls.url)}.db")
         cls.csvfile = os.path.join(cls.output, f"waybackup_{sanitize_filename(cls.url)}.csv")
-
+        
         if cls.reset:
             os.remove(cls.cdxfile) if os.path.isfile(cls.cdxfile) else None
             os.remove(cls.dbfile) if os.path.isfile(cls.dbfile) else None
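
The `query_identifier` built in this hunk is what makes a job resumable only with the same URL, mode, and optional query parameters. The following is a reduced sketch of that idea, assuming a parser with just a few of the flags; `build_parser` and the function `query_identifier` are illustrative names, and `_group_actions` is a private argparse attribute that the diff itself relies on.

```python
import argparse


def build_parser():
    # Reduced stand-in for the real parser: one URL, three modes, one option.
    parser = argparse.ArgumentParser(prog="waybackup-sketch")
    parser.add_argument("-u", "--url", type=str)
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("-a", "--all", action="store_true")
    mode.add_argument("-l", "--last", action="store_true")
    mode.add_argument("-f", "--first", action="store_true")
    optional = parser.add_argument_group("optional query parameters")
    optional.add_argument("-r", "--range", type=int)
    return parser, mode, optional


def query_identifier(argv):
    parser, mode, optional = build_parser()
    args = parser.parse_args(argv)
    # Collect the parsed values per group, as the diff does.
    mode_args = {a.dest: getattr(args, a.dest) for a in mode._group_actions}
    optional_args = {a.dest: getattr(args, a.dest) for a in optional._group_actions}
    return str(args.url) + str(mode_args) + str(optional_args)
```

Under these assumptions, rerunning with `-a` yields the same identifier, while switching to `-l` or adding `-r 5` yields a different one, so the earlier job would not be picked up.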
 
@@ -7,7 +7,7 @@
 
 import re
 
-from pywaybackup.__version__ import __version__
+from importlib.metadata import version
 
 class Exception:
 
@@ -59,7 +59,7 @@ def exception(cls, message: str, e: Exception, tb=None):
             cls.new_debug = False
             f = open(debug_file, "w")
             f.write("-------------------------\n")
-            f.write(f"Version: {__version__}\n")
+            f.write(f"Version: {version('pywaybackup')}\n")
             f.write("-------------------------\n")
             f.write(f"Command: {cls.command}\n")
             f.write("-------------------------\n\n")
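
Both Python hunks replace `__version__` with a lookup through `importlib.metadata`. A minimal sketch of that lookup follows, assuming a fallback is wanted when the package is not installed (for example when running from a source checkout); `installed_version` and `debug_header` are hypothetical helpers, not functions from the repository. Reusing the f-string's own quote character inside a replacement field only parses on Python 3.12 and later, so the sketch uses single quotes inside the double-quoted f-string to stay valid on 3.8.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name, default="unknown"):
    """Return the installed version of dist_name, or default if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # Raised when no distribution metadata is found for dist_name.
        return default


def debug_header(dist_name):
    # Single quotes inside the f-string keep this line valid before 3.12.
    return f"Version: {installed_version(dist_name, default='unknown')}\n"
```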