Downloading archived web pages from the [Wayback Machine](https://archive.org/web/).
@@ -37,18 +38,20 @@ This tool allows you to download content from the Wayback Machine (archive.org).
 ## Arguments

 - `-h`, `--help`: Show the help message and exit.
-- `-a`, `--about`: Show information about the tool and exit.
+- `-v`, `--version`: Show information about the tool and exit.

 ### Required

 - **`-u`**, **`--url`**:<br>
 The URL of the web page to download. This argument is required.

 #### Mode Selection (Choose One)
-- **`-c`**, **`--current`**:<br>
-Download the latest version of each file snapshot. You will get a rebuild of the current website with all available files (but not any original state because new and old versions are mixed).
-- **`-f`**, **`--full`**:<br>
+- **`-a`**, **`--all`**:<br>
 Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
+- **`-l`**, **`--last`**:<br>
+Download the last version of each file snapshot. You will get one directory with a rebuild of the page. It contains the last version of each file of your specified `--range`.
+- **`-f`**, **`--first`**:<br>
+Download the first version of each file snapshot. You will get one directory with a rebuild of the page. It contains the first version of each file of your specified `--range`.
 - **`-s`**, **`--save`**:<br>
 Save a page to the Wayback Machine. (beta)

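Note on `--last` versus `--first`: the difference can be illustrated with a short Python sketch. This is not taken from the tool's source. The `Snapshot` tuple and the `pick_versions` helper are hypothetical names, and the sketch assumes timestamps are `YYYYMMDDhhmmss` strings, which sort chronologically when compared as text.

```python
from typing import NamedTuple


class Snapshot(NamedTuple):
    """Hypothetical record for one archived file (not the tool's own type)."""
    timestamp: str  # assumed format: YYYYMMDDhhmmss
    url: str


def pick_versions(snapshots, mode="last"):
    """Keep one snapshot per URL: the earliest for 'first', the latest for 'last'."""
    if mode not in ("first", "last"):
        raise ValueError("mode must be 'first' or 'last'")
    chosen = {}
    for snap in snapshots:
        current = chosen.get(snap.url)
        if current is None:
            chosen[snap.url] = snap
        elif mode == "last" and snap.timestamp > current.timestamp:
            chosen[snap.url] = snap
        elif mode == "first" and snap.timestamp < current.timestamp:
            chosen[snap.url] = snap
    return chosen
```

With `--all`, by contrast, no such selection would happen: every timestamp keeps its own folder.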
@@ -65,7 +68,7 @@ Limits the amount of snapshots to query from the CDX server. If an existing CDX

 - **Range Selection:**<br>
 Specify a range in years, or give a specific start timestamp, end timestamp, or both. If you specify the `range` argument, the `start` and `end` arguments are ignored. Timestamp format: YYYYMMDDhhmmss. You can give only a year, or increase specificity by extending the timestamp from left to right.<br>
 (year 2019, year+month 201901, year+month+day 20190101, year+month+day+hour 2019010112)
 - **`-r`**, **`--range`**:<br>
 Specify the range in years for which to search and download snapshots.
 - **`--start`**:<br>
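Note on the timestamp format: a partial timestamp is a left-anchored prefix of `YYYYMMDDhhmmss`. The Python sketch below shows one way such a prefix could be validated and padded to a full 14-digit bound. It is an illustration under assumptions, not the tool's implementation; `expand_timestamp` is a hypothetical helper name.

```python
import re

# Allowed prefix lengths: YYYY, +MM, +DD, +hh, +mm, +ss
VALID_LENGTHS = {4, 6, 8, 10, 12, 14}

# Padding templates for the missing right-hand part of a timestamp.
LOWEST = "00000101000000"   # month 01, day 01, time 00:00:00
HIGHEST = "99991231235959"  # month 12, day 31, time 23:59:59


def expand_timestamp(value, end=False):
    """Pad a partial timestamp to 14 digits.

    A start bound is padded with the lowest values and an end bound with
    the highest. The end bound is only an upper limit for string
    comparison; it may not be a real calendar date (e.g. day 31 in February).
    """
    if not re.fullmatch(r"\d+", value) or len(value) not in VALID_LENGTHS:
        raise ValueError(f"not a valid timestamp prefix: {value!r}")
    template = HIGHEST if end else LOWEST
    return value + template[len(value):]
```

For example, `expand_timestamp("2019")` yields `20190101000000`, and `expand_timestamp("2019", end=True)` yields `20191231235959`.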
@@ -102,39 +105,47 @@ Specifies delay between download requests in seconds. Default is no delay (0).
 <!-- - **`--convert-links`**:<br>
 If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->

-## Special:
+### Special:

 - **`--reset`**:
 If set, the job will be reset, and any existing `cdx`, `db`, `csv` files will be **deleted**. This allows you to start the job from scratch without considering previously downloaded data.

 - **`--keep`**:
 If set, all files will be kept after the job is finished. This includes the `cdx` and `db` file. Without this argument, they will be deleted if the job finished successfully.

-### Examples
+# Usage

-Download the latest snapshot of all available files:<br>
-`waybackup -u http://example.com -c`
+### Handling Interrupted Jobs
+When a job is interrupted (for any reason), `pywaybackup` is designed to resume the job from where it left off. It automatically detects existing job data (based on the URL and <u>**optional query parameters**</u>, including the output directory) and resumes the process without requiring manual intervention. Here's how the tool handles different scenarios:

-Download the latest snapshot of a specific file (e.g., a login page):<br>
+1. Start downloading all available snapshots:<br>`waybackup -u https://example.com -a`
+2. Interrupt the process with `CTRL+C`.<br>
+3. The tool will detect the existing job data and resume downloading from the last completed point:<br>`waybackup -u https://example.com -a`
+> **Important:** `waybackup -u https://example.com -c` -> The tool will NOT resume because a necessary identifier changed (`-c` instead of `-a`).
+4. This deletes any existing `.cdx` and `.db` files associated with the job and starts the process from scratch:<br>`waybackup -u https://example.com -a --reset`
+5. This ensures all job-related files are kept for future use, such as re-analysis or extending the query later:<br>`waybackup -u https://example.com -a --keep`

 ## Output path structure

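Note on job detection: "Handling Interrupted Jobs" says existing job data is matched by the URL and the optional query parameters. One way to picture this is a key derived from exactly those inputs, so that any changed parameter produces a different key and therefore no resume. The Python sketch below is purely illustrative. `job_key` and its parameter names are hypothetical, and the tool's real identifier scheme is not shown in this diff.

```python
import hashlib
import json


def job_key(url, **params):
    """Derive a stable identifier from the URL and the query-defining parameters.

    Assumption: flags that only control cleanup (such as --reset or --keep)
    are not passed in, so they do not change the key.
    """
    # sort_keys makes the key independent of the order the parameters were given in
    payload = json.dumps({"url": url, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Same URL and parameters: same key, so an interrupted job could be found again.
same = job_key("https://example.com", mode="all") == job_key("https://example.com", mode="all")

# Different mode: different key, so the earlier job data would not match.
changed = job_key("https://example.com", mode="all") != job_key("https://example.com", mode="current")
```

Under this model, switching from `-a` to `-c` starts a new job because the derived key no longer matches the stored one.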
@@ -195,22 +206,6 @@ For download queries:
 ]
 ```

-For list queries:
-
-```
-[
-{
-"digest": "DIGESTOFSNAPSHOT",
-"id": 1,
-"mimetype": "text/html",
-"status": "200",
-"timestamp": "yyyymmddhhmmss",
-"url": "http://example.com/"
-},
-...
-]
-```
-
 ### Debugging

 Exceptions will be written into `waybackup_error.log` (each run overwrites the file).