Commit 4a79a80

Merge branch 'r/3.0.1'
1 parent 933f4ee commit 4a79a80

10 files changed

+181
-183
lines changed

README.md

Lines changed: 37 additions & 42 deletions
@@ -2,7 +2,8 @@
22

33
[![PyPI](https://img.shields.io/pypi/v/pywaybackup)](https://pypi.org/project/pywaybackup/)
44
[![PyPI - Downloads](https://img.shields.io/pypi/dm/pywaybackup)](https://pypi.org/project/pywaybackup/)
5-
![Python Version](https://img.shields.io/badge/Python-3.6-blue)
5+
![Python Version](https://img.shields.io/badge/Python-3.8-blue)
6+
![Python_Sqlite3 Version](https://img.shields.io/badge/Python_Sqlite3-3.25-blue)
67
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
78

89
Downloading archived web pages from the [Wayback Machine](https://archive.org/web/).
@@ -37,18 +38,20 @@ This tool allows you to download content from the Wayback Machine (archive.org).
3738
## Arguments
3839

3940
- `-h`, `--help`: Show the help message and exit.
40-
- `-a`, `--about`: Show information about the tool and exit.
41+
- `-v`, `--version`: Show information about the tool and exit.
4142

4243
### Required
4344

4445
- **`-u`**, **`--url`**:<br>
4546
The URL of the web page to download. This argument is required.
4647

4748
#### Mode Selection (Choose One)
48-
- **`-c`**, **`--current`**:<br>
49-
Download the latest version of each file snapshot. You will get a rebuild of the current website with all available files (but not any original state because new and old versions are mixed).
50-
- **`-f`**, **`--full`**:<br>
49+
- **`-a`**, **`--all`**:<br>
5150
Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
51+
- **`-l`**, **`--last`**:<br>
52+
Download the last version of each file snapshot. You will get one directory with a rebuild of the page. It contains the last version of each file within your specified `--range`.
53+
- **`-f`**, **`--first`**:<br>
54+
Download the first version of each file snapshot. You will get one directory with a rebuild of the page. It contains the first version of each file within your specified `--range`.
5255
- **`-s`**, **`--save`**:<br>
5356
Save a page to the Wayback Machine. (beta)
5457

@@ -65,7 +68,7 @@ Limits the amount of snapshots to query from the CDX server. If an existing CDX
6568

6669
- **Range Selection:**<br>
6770
Specify the range in years, or a specific timestamp for the start, the end, or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can give just a year, or increase specificity by extending the timestamp from the left.<br>
68-
(year 2019, year+month 201901, year+month+day 20190101, year+month+day+hour 2019010112)
71+
(year 2019, year+month 201901, year+month+day 20190101, year+month+day+hour 2019010112)
6972
- **`-r`**, **`--range`**:<br>
7073
Specify the range in years for which to search and download snapshots.
7174
- **`--start`**:<br>
@@ -102,39 +105,47 @@ Specifies delay between download requests in seconds. Default is no delay (0).
102105
<!-- - **`--convert-links`**:<br>
103106
If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->
104107

105-
## Special:
108+
### Special:
106109

107110
- **`--reset`**:
108111
If set, the job will be reset, and any existing `cdx`, `db`, `csv` files will be **deleted**. This allows you to start the job from scratch without considering previously downloaded data.
109112

110113
- **`--keep`**:
111114
If set, all files will be kept after the job is finished. This includes the `cdx` and `db` file. Without this argument, they will be deleted if the job finished successfully.
112115

113-
### Examples
116+
# Usage
114117

115-
Download the latest snapshot of all available files:<br>
116-
`waybackup -u http://example.com -c`
118+
### Handling Interrupted Jobs
119+
When a job is interrupted (for any reason), `pywaybackup` is designed to resume the job from where it left off. It automatically detects existing job data (based on the URL and <u>**optional query parameters**</u> - including output directory) and resumes the process without requiring manual intervention. Here's how the tool handles different scenarios:
117120

118-
Download the latest snapshot of a specific file (e.g., a login page):<br>
119-
`waybackup -u http://example.com/login.html -c --explicit`
121+
- **Default Behavior:**
122+
- On restarting the same job (same URL, <u>**optional query parameters**</u>, and output directory), the tool will:
123+
- Reuse the existing `.cdx` and `.db` files.
124+
- Resume downloading snapshots from the last successful point.
125+
- Skip previously downloaded files to save time and resources.
120126

121-
Download all snapshots within the last 5 years and prevent redirects:<br>
122-
`waybackup -u http://example.com -f -r 5 --no-redirect`
127+
- **Manual Reset with `--reset`:**
128+
- This command deletes any existing `.cdx` and `.db` files associated with the job and starts the process from scratch.
129+
- Useful if:
130+
- The previous data is corrupted.
131+
- You want to re-query the snapshots without considering previously downloaded data.
123132

124-
Download all snapshots from a specific range (2020 to December 12, 2022) with 4 workers, and show a progress bar:<br>
125-
`waybackup -u http://example.com -f --start 2020 --end 20221212 --workers 4 --progress`
133+
- **Preserving Job Data with `--keep`:**
134+
- Normally, `.cdx` and `.db` files are deleted after the job finishes successfully.
135+
- Use `--keep` to retain these files for future use (e.g., re-analysis or extending the query later).
126136

127-
Download all snapshots and save the output in a specific folder with 3 workers:<br>
128-
`waybackup -u http://example.com -f -r 5 -o /home/user/Downloads/snapshots --workers 3`
129-
130-
Download all snapshots but only images and CSS files, filtering for specific filetypes (jpg, css):<br>
131-
`waybackup -u http://example.com -f --filetype jpg,css`
132-
133-
Download all timestamps but start over and ignore existing progress, log the output, and retry 3 times if any error occurs:<br>
134-
`waybackup -u http://example.com -f --log --retry 3 --reset`
137+
> **Note1:** The resumption process only works if the output directory remains the same as the one used during the initial job.
138+
>
139+
> **Note2:** `--reset` will NOT delete the already downloaded files for now. You have to remove them 'by hand'.
140+
141+
### Example
135142

136-
Download the latest snapshot, follow no redirects but keep the database and cdx-file:<br>
137-
`waybackup -u http://example.com -c --no-redirect --keep`
143+
1. Start downloading all available snapshots:<br>`waybackup -u https://example.com -a`
144+
2. Interrupt the process with `CTRL+C`.<br>
145+
3. The tool will detect the existing job data and resume downloading from the last completed point:<br>`waybackup -u https://example.com -a`
146+
> **Important:** `waybackup -u https://example.com -l` -> The tool will NOT resume, because a necessary identifier (here the mode) has changed.
147+
4. This deletes any existing .cdx and .db files associated with the job and starts the process from scratch:<br>`waybackup -u https://example.com -a --reset`
148+
5. This ensures all job-related files are kept for future use, such as re-analysis or extending the query later:<br>`waybackup -u https://example.com -a --keep`
138149
139150
## Output path structure
140151

@@ -195,22 +206,6 @@ For download queries:
195206
]
196207
```
197208

198-
For list queries:
199-
200-
```
201-
[
202-
{
203-
"digest": "DIGESTOFSNAPSHOT",
204-
"id": 1,
205-
"mimetype": "text/html",
206-
"status": "200",
207-
"timestamp": "yyyymmddhhmmss",
208-
"url": "http://example.com/"
209-
},
210-
...
211-
]
212-
```
213-
214209
### Debugging
215210

216211
Exceptions will be written into `waybackup_error.log` (each run overwrites the file).
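
The `--all`, `--last`, and `--first` modes described in the README changes above can be illustrated with a minimal sketch. This is not the tool's implementation: the `select_snapshots` function, the `Snapshot` tuple layout, and the sample data are hypothetical and only show how "first" and "last" reduce a snapshot list to one entry per URL.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (timestamp "YYYYMMDDhhmmss", url)
Snapshot = Tuple[str, str]


def select_snapshots(snapshots: Iterable[Snapshot], mode: str) -> List[Snapshot]:
    """Pick snapshots for a mode: "all", "first", or "last"."""
    # Fixed-width timestamps sort chronologically when compared as strings.
    ordered = sorted(snapshots)
    if mode == "all":
        return ordered
    if mode not in ("first", "last"):
        raise ValueError(f"unknown mode: {mode!r}")
    chosen: Dict[str, Snapshot] = {}
    for snap in ordered:
        url = snap[1]
        # "last" keeps overwriting with newer entries; "first" keeps the earliest.
        if mode == "last" or url not in chosen:
            chosen[url] = snap
    return sorted(chosen.values())


snaps = [
    ("20200101000000", "http://example.com/"),
    ("20210101000000", "http://example.com/"),
    ("20200601000000", "http://example.com/style.css"),
]
print(select_snapshots(snaps, "first"))
print(select_snapshots(snaps, "last"))
```

With `"all"` every snapshot is kept, which corresponds to one folder per timestamp in the README description.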

dev/pip_build.sh

Lines changed: 4 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -4,13 +4,9 @@
44
SCRIPT_PATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
55
TARGET_PATH="$SCRIPT_PATH/.."
66

7-
# install dependencies
8-
pip install twine wheel setuptools
7+
pip install build twine
98

10-
# build
11-
python $TARGET_PATH/setup.py sdist bdist_wheel --verbose
12-
python -m twine upload dist/*
13-
#pip install -e $TARGET_PATH
9+
python -m build "$TARGET_PATH"
10+
twine upload "$TARGET_PATH"/dist/*
1411

15-
# clean up
16-
rm -rf $TARGET_PATH/build $TARGET_PATH/dist # $TARGET_PATH/*.egg-info
12+
rm -rf "$TARGET_PATH/build" "$TARGET_PATH/dist" "$TARGET_PATH"/*.egg-info

pyproject.toml

Lines changed: 29 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,29 @@
1+
[build-system]
2+
requires = ["setuptools", "wheel"]
3+
build-backend = "setuptools.build_meta"
4+
5+
[tool.setuptools]
6+
packages = ["pywaybackup"]
7+
8+
[project]
9+
name = "pywaybackup"
10+
version = "3.0.1"
11+
description = "Query and download archive.org as simple as possible."
12+
authors = [
13+
{ name = "bitdruid", email = "[email protected]" }
14+
]
15+
license = { file = "LICENSE" }
16+
readme = "README.md"
17+
requires-python = ">=3.8"
18+
dependencies = [
19+
"requests==2.31.0",
20+
"tqdm==4.66.2",
21+
"python-magic==0.4.27; sys_platform == 'linux'",
22+
"python-magic-bin==0.4.14; sys_platform == 'win32'",
23+
]
24+
25+
[project.scripts]
26+
waybackup = "pywaybackup.main:main"
27+
28+
[project.urls]
29+
homepage = "https://github.com/bitdruid/python-wayback-machine-downloader"
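
The `[project.scripts]` entry above maps the `waybackup` command to a `module:attribute` string. As a rough, simplified sketch of what such a string means (not the installer's actual code), the hypothetical helper `resolve_entry_point` below imports the module part and looks up the attribute part. It is demonstrated on a standard-library target so it runs without `pywaybackup` installed.

```python
import importlib


def resolve_entry_point(spec: str):
    """Resolve a "package.module:attribute" string to the object it names."""
    module_name, _, attr_path = spec.partition(":")
    obj = importlib.import_module(module_name)
    # The attribute part may be dotted (e.g. "Class.method"), so walk it step by step.
    for name in attr_path.split("."):
        if name:
            obj = getattr(obj, name)
    return obj


# Stand-in target from the standard library; for the entry above the spec
# would be "pywaybackup.main:main".
dumps = resolve_entry_point("json:dumps")
print(dumps([1]))
```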

pywaybackup/Arguments.py

Lines changed: 31 additions & 29 deletions
Original file line numberDiff line numberDiff line change
@@ -2,33 +2,33 @@
22
import sys
33
import os
44
import argparse
5+
from importlib.metadata import version
56

67
from pywaybackup.helper import url_split, sanitize_filename
78

8-
from pywaybackup.__version__ import __version__
9-
109
class Arguments:
11-
10+
1211
def __init__(self):
13-
12+
1413
parser = argparse.ArgumentParser(description='Download from wayback machine (archive.org)')
15-
parser.add_argument('-a', '--about', action='version', version='%(prog)s ' + __version__ + ' by @bitdruid -> https://github.com/bitdruid')
16-
14+
parser.add_argument('-v', '--version', action='version', version='%(prog)s ' + version("pywaybackup") + ' by @bitdruid -> https://github.com/bitdruid')
15+
1716
required = parser.add_argument_group('required (one exclusive)')
1817
required.add_argument('-u', '--url', type=str, metavar="", help='url (with subdir/subdomain) to download')
1918
exclusive_required = required.add_mutually_exclusive_group(required=True)
20-
exclusive_required.add_argument('-c', '--current', action='store_true', help='download the latest version of each file snapshot')
21-
exclusive_required.add_argument('-f', '--full', action='store_true', help='download snapshots of all timestamps')
19+
exclusive_required.add_argument('-a', '--all', action='store_true', help='download snapshots of all timestamps')
20+
exclusive_required.add_argument('-l', '--last', action='store_true', help='download the last version of each file snapshot')
21+
exclusive_required.add_argument('-f', '--first', action='store_true', help='download the first version of each file snapshot')
2222
exclusive_required.add_argument('-s', '--save', action='store_true', help='save a page to the wayback machine')
23-
23+
2424
optional = parser.add_argument_group('optional query parameters')
2525
optional.add_argument('-e', '--explicit', action='store_true', help='search only for the explicit given url')
2626
optional.add_argument('-r', '--range', type=int, metavar="", help='range in years to search')
2727
optional.add_argument('--start', type=int, metavar="", help='start timestamp format: YYYYMMDDhhmmss')
2828
optional.add_argument('--end', type=int, metavar="", help='end timestamp format: YYYYMMDDhhmmss')
2929
optional.add_argument('--filetype', type=str, metavar="", help='filetypes to download comma separated (e.g. "html,css")')
3030
optional.add_argument('--limit', type=int, nargs='?', const=True, metavar='int', help='limit the number of snapshots to download')
31-
31+
3232
behavior = parser.add_argument_group('manipulate behavior')
3333
behavior.add_argument('-o', '--output', type=str, metavar="", help='output folder - defaults to current directory')
3434
behavior.add_argument('--log', action='store_true', help='save a log file into the output folder')
@@ -39,57 +39,59 @@ def __init__(self):
3939
behavior.add_argument('--workers', type=int, default=1, metavar="", help='number of workers (simultaneous downloads)')
4040
# behavior.add_argument('--convert-links', action='store_true', help='Convert all links in the files to local paths. Requires -c/--current')
4141
behavior.add_argument('--delay', type=int, default=0, metavar="", help='delay between each download in seconds')
42-
42+
4343
special = parser.add_argument_group('special')
4444
special.add_argument('--reset', action='store_true', help='reset the job and ignore existing cdx/db/csv files')
4545
special.add_argument('--keep', action='store_true', help='keep all files after the job finished')
46-
46+
4747
args = parser.parse_args(args=None if sys.argv[1:] else ['--help']) # if no arguments are given, print help
48-
48+
4949
required_args = {action.dest: getattr(args, action.dest) for action in exclusive_required._group_actions}
5050
optional_args = {action.dest: getattr(args, action.dest) for action in optional._group_actions}
5151
args.query_identifier = str(args.url) + str(required_args) + str(optional_args)
52-
52+
5353
# if args.convert_links and not args.current:
5454
# parser.error("--convert-links can only be used with the -c/--current option")
55-
55+
5656
self.args = args
57-
57+
5858
def get_args(self):
5959
return self.args
6060

6161
class Configuration:
62-
62+
6363
@classmethod
6464
def init(cls):
65-
65+
6666
cls.args = Arguments().get_args()
6767
for key, value in vars(cls.args).items():
6868
setattr(Configuration, key, value)
69-
69+
7070
# args now attributes of Configuration // Configuration.output, ...
7171
cls.command = ' '.join(sys.argv[1:])
7272
cls.domain, cls.subdir, cls.filename = url_split(cls.url)
73-
73+
7474
if cls.output is None:
7575
cls.output = os.path.join(os.getcwd(), "waybackup_snapshots")
7676
os.makedirs(cls.output, exist_ok=True)
77-
77+
7878
if cls.log is True:
7979
cls.log = os.path.join(cls.output, f"waybackup_{sanitize_filename(cls.url)}.log")
80-
81-
if cls.full:
82-
cls.mode = "full"
83-
if cls.current:
84-
cls.mode = "current"
85-
80+
81+
if cls.all:
82+
cls.mode = "all"
83+
if cls.last:
84+
cls.mode = "last"
85+
if cls.first:
86+
cls.mode = "first"
87+
8688
if cls.filetype:
8789
cls.filetype = [ft.lower().strip() for ft in cls.filetype.split(",")]
88-
90+
8991
cls.cdxfile = os.path.join(cls.output, f"waybackup_{sanitize_filename(cls.url)}.cdx")
9092
cls.dbfile = os.path.join(cls.output, f"waybackup_{sanitize_filename(cls.url)}.db")
9193
cls.csvfile = os.path.join(cls.output, f"waybackup_{sanitize_filename(cls.url)}.csv")
92-
94+
9395
if cls.reset:
9496
os.remove(cls.cdxfile) if os.path.isfile(cls.cdxfile) else None
9597
os.remove(cls.dbfile) if os.path.isfile(cls.dbfile) else None
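
The `query_identifier` built in `Arguments.__init__` concatenates the URL with the mode flags and the optional query parameters. The sketch below rebuilds that idea on a reduced parser to show why switching the mode produces a different identifier. `build_identifier` is a hypothetical helper, only a subset of the real arguments is included, and how the tool uses the identifier for resuming is not shown here.

```python
import argparse


def build_identifier(argv):
    """Build an identifier string from the URL, mode flags, and optional query args."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-u", "--url", type=str)
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("-a", "--all", action="store_true")
    mode.add_argument("-l", "--last", action="store_true")
    mode.add_argument("-f", "--first", action="store_true")
    optional = parser.add_argument_group("optional query parameters")
    optional.add_argument("-r", "--range", type=int)

    args = parser.parse_args(argv)
    # Same approach as the diff: collect each group's values by destination name.
    mode_args = {a.dest: getattr(args, a.dest) for a in mode._group_actions}
    optional_args = {a.dest: getattr(args, a.dest) for a in optional._group_actions}
    return str(args.url) + str(mode_args) + str(optional_args)


first_run = build_identifier(["-u", "https://example.com", "-a"])
same_job = build_identifier(["-u", "https://example.com", "-a"])
other_mode = build_identifier(["-u", "https://example.com", "-l"])
print(first_run == same_job)
print(first_run == other_mode)
```

Note that `_group_actions` is a private `argparse` attribute, so this approach may break with future Python versions.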

pywaybackup/Exception.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,7 @@
77

88
import re
99

10-
from pywaybackup.__version__ import __version__
10+
from importlib.metadata import version
1111

1212
class Exception:
1313

@@ -59,7 +59,7 @@ def exception(cls, message: str, e: Exception, tb=None):
5959
cls.new_debug = False
6060
f = open(debug_file, "w")
6161
f.write("-------------------------\n")
62-
f.write(f"Version: {__version__}\n")
62+
f.write(f"Version: {version('pywaybackup')}\n")
6363
f.write("-------------------------\n")
6464
f.write(f"Command: {cls.command}\n")
6565
f.write("-------------------------\n\n")
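
Both changed modules now read the version through `importlib.metadata.version`, which raises `PackageNotFoundError` when the distribution is not installed (for example when running from a plain source checkout). A minimal sketch of a defensive lookup follows. The `installed_version` helper and its `"unknown"` fallback are assumptions, not part of this commit. Reusing the same quote type inside an f-string expression is only accepted from Python 3.12 on, so the sketch uses single quotes inside the expression to stay compatible with the declared `requires-python = ">=3.8"`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str) -> str:
    """Return the installed distribution version, or "unknown" if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "unknown"


# Single quotes inside the f-string expression keep this valid on Python 3.8 to 3.11.
line = f"Version: {installed_version('pywaybackup')}\n"
print(line, end="")
```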
