All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
0.5.2 - 2023-09-24
- Improved handling of empty attribute values (<img alt="">) and valueless attributes (<iframe seamless>).
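The distinction between an empty attribute value and a valueless attribute can be illustrated with Python's standard html.parser module, which reports the former as an empty string and the latter as None. The sketch below is only an illustration of that distinction, not pywebarchive's actual rewriting code; the AttributeCollector class and render_attrs() helper are hypothetical names introduced for this example.

```python
import html
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Hypothetical helper: record (tag, attrs) for every start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


def render_attrs(attrs):
    """Serialize attributes, keeping valueless ones valueless."""
    parts = []
    for name, value in attrs:
        if value is None:
            # Valueless attribute, e.g. <iframe seamless>
            parts.append(name)
        else:
            # Any string value, including the empty one in <img alt="">
            parts.append(f'{name}="{html.escape(value, quote=True)}"')
    return " ".join(parts)


collector = AttributeCollector()
collector.feed('<img alt=""><iframe seamless></iframe>')
collector.close()
rendered = [render_attrs(attrs) for _, attrs in collector.seen]
```

Testing for None explicitly, rather than for truthiness, is what keeps alt="" from being collapsed into a bare alt.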
0.5.1 - 2022-10-08
- Document the function of the WebResource.frame_name property.
0.5.0 - 2022-04-16
- More complete documentation for the WebArchive and WebResource classes.
- Documentation on pywebarchive's internals.
- Unit test for subresource URLs occurring as literal text.
- Massively overhaul the README.
- Improved the documentation for the webarchive module.
- Expanded and clarified various code comments.
- Use a with clause for proper cleanup in test/extracted_archive_display.py.
- Rename WebArchive.extract()'s single_file argument to the more descriptive embed_subresources (potentially backwards-incompatible change).
- Raise a WebArchiveError when attempting to extract a webarchive with no main resource.
- Raise a WebArchiveError when attempting to convert a webarchive with no main resource to HTML.
- Return the correct value for WebArchive.resource_count() if no main resource is present.
- Removed the unnecessary <!-- Processed by pywebarchive --> comment previously added to extracted pages.
0.4.1 - 2022-03-26
- Call close() in WebArchive.__exit__().
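As a rough illustration of why this fix matters, a context manager only releases its resource if __exit__() actually calls close(). The sketch below uses a hypothetical ArchiveHandle class as a stand-in; it is not pywebarchive's implementation, and the closed attribute is an assumption made for the example.

```python
class ArchiveHandle:
    """Hypothetical stand-in for an object that must be closed after use."""

    def __init__(self):
        self.closed = False

    def close(self):
        # Release whatever the object holds; here we only record the state.
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Without this call, leaving the with block would clean up nothing.
        self.close()
        return False  # never suppress exceptions raised inside the block


with ArchiveHandle() as handle:
    still_open = not handle.closed
```

Returning False from __exit__() means an exception raised inside the with block still propagates after cleanup.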
0.4.0 - 2022-03-26
- Context manager (with statement) support in the WebArchive class.
- The WebArchive.close() method.
- The WebArchive.parent property.
- Support for the mode argument in webarchive.open() (though only read mode remains implemented).
- Further cleaned up internal APIs.
- Improved module documentation.
- Ensure an encoding is always specified when creating a text WebResource.
- Removed duplicated code in test/extracted_archive_display.py.
0.3.3 - 2021-11-05
- Unit tests for HTML- and CSS-rewriting logic.
- Build script for the Windows version of Webarchive Extractor.
- Clean up the WebResource class's internal API.
- Do not force a newline after the doctype in HTMLRewriter.handle_decl().
- Moved test_extracted_archive_display from the unit tests to a separate script.
- Removed test_extracted_archive_display's dependency on Tkinter.
- Rewrite URLs in inline CSS code when extracting.
0.3.2 - 2021-09-26
- The module version number in webarchive.__version__.
- Initial support for command-line arguments in extractor-gui.py.
- The --version argument in extractor.py and extractor-gui.py.
- Further code cleanup.
- Give more descriptive names to various internals.
- Support HTML subresources.
- Handle non-HTML subresources incorrectly served as text/html.
- Update the module description in setup.py to match its documentation.
- Specify a text encoding in WebArchiveTest.test_webarchive_to_html() so the test will pass on Windows.
- Make webbrowser an optional dependency in extractor.py to match extractor-gui.py.
0.3.1 - 2021-09-25
- Unit test for WebArchive.to_html().
- Massively expanded module documentation.
- Don't delete the srcset attribute from <img>.
- Embed style sheets in single-file mode using data URIs rather than <style>.
- Cleaned up various internals.
- Handle srcset entries without a width or pixel density descriptor.
- Embed subresources recursively when calling WebResource.to_data_uri() on an archive's main resource.
- Don't escape HTML entities in a <script> or <style> block.
- Correctly handle non-HTML main resources.
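To show what handling srcset entries without a descriptor involves, here is a simplified sketch that treats the descriptor as optional when rewriting URLs. It is not pywebarchive's code: parse_srcset(), rewrite_srcset(), and the url_map argument are hypothetical, and the sketch assumes URLs contain no commas (so it would not cope with data URIs, which the full HTML parsing algorithm handles).

```python
def parse_srcset(value):
    """Split a srcset value into (url, descriptor) pairs.

    The descriptor is None when an entry has neither a width (e.g. 480w)
    nor a pixel density (e.g. 2x) descriptor.
    """
    candidates = []
    for entry in value.split(","):
        parts = entry.split()
        if not parts:
            continue  # skip empty entries such as a trailing comma
        url = parts[0]
        descriptor = parts[1] if len(parts) > 1 else None
        candidates.append((url, descriptor))
    return candidates


def rewrite_srcset(value, url_map):
    """Replace each URL found in url_map, preserving any descriptor."""
    rewritten = []
    for url, descriptor in parse_srcset(value):
        new_url = url_map.get(url, url)
        if descriptor is None:
            rewritten.append(new_url)
        else:
            rewritten.append(f"{new_url} {descriptor}")
    return ", ".join(rewritten)
```

For example, rewrite_srcset("a.png, b.png 2x", {"a.png": "local/a.png"}) leaves the second entry unchanged and rewrites the first even though it has no descriptor.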
0.3.0 - 2021-07-18
- Experimental support for extracting webarchives to single-file HTML documents.
- External scripts and style sheets are replaced with inline content.
- External images are embedded using data URIs.
- New command-line options for extractor.py:
  - -s / --single-file to extract archive contents to a single HTML file.
  - -o / --open-page to open the extracted webpage when finished.
- New WebArchive class methods:
  - get_local_path() returns the basename of the file created when a specified subresource is extracted.
  - get_subframe_archive() returns the subframe archive corresponding to a specified URL.
  - get_subresource() returns the subresource corresponding to a specified URL.
  - to_html() returns the archive's contents as a single-file HTML document.
- The WebResource.archive property, which identifies a given resource's parent WebArchive.
- The WebArchiveError exception.
- Moved the development status up to beta.
- Correctly handle "empty" tags like <img /> in XHTML documents.
- Fixed local resource paths for extracted subframe archives.
- The Extractor class, included only for backwards compatibility with the poorly thought-out 0.1.0 API.
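The single-file mode introduced in this release embeds external images using data URIs. As a minimal sketch of that general technique (not the library's own code), a resource's bytes can be base64-encoded and prefixed with its MIME type. The make_data_uri() helper is a hypothetical name chosen for this example.

```python
import base64


def make_data_uri(data: bytes, mime_type: str) -> str:
    """Build a base64 data URI from raw bytes (hypothetical helper)."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Example with a tiny text payload; an image would use e.g. "image/png".
example_uri = make_data_uri(b"hi", "text/plain")
```

The resulting string can be used anywhere a URL is accepted, such as the src attribute of an <img> element, at the cost of roughly a one-third size increase from base64 encoding.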
0.2.4 - 2020-02-22
- Unit tests.
- extractor-gui.py can now open converted files on non-Windows platforms.
0.2.3 - 2019-09-02
- Code cleanup release; no user-visible changes.
0.2.2 - 2018-10-21
- Various bugfixes, mainly involving subframe archives.
0.2.1 - 2018-10-20
- Graphical extraction tool.
- Support for subframe archives.
- Various bugfixes.
Note: Version 0.2.0 was pulled shortly after posting due to problems with its setup.py script.
0.1.1 - 2018-10-19
- The open() function as the preferred way to open a WebArchive.
- Moved extraction into the main WebArchive class.
- Massive internal cleanup.
- The Extractor class from the poorly thought-out initial API.
0.1.0 - 2018-10-16
- Initial public release.