diff --git a/tutorials/FITS-large/FITS-large.ipynb b/tutorials/FITS-large/FITS-large.ipynb new file mode 100644 index 00000000..e5deb248 --- /dev/null +++ b/tutorials/FITS-large/FITS-large.ipynb @@ -0,0 +1,989 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "5cd63a37-6a0c-4c92-a970-3e2ad24cbde2", + "metadata": {}, + "source": [ + "# Working with large FITS files\n", + "\n", + "This tutorial, built on the [Create a very large FITS file from scratch](https://docs.astropy.org/en/stable/generated/examples/io/skip_create-large-fits.html) guide, works through building a very large (too large to fit in memory) FITS file with multiple HDUs. It covers creating both large Image and Table extensions. It is aimed at users already quite familiar with the FITS format.\n", + "\n", + "\n", + "## Authors\n", + "C. E. Brasseur\n", + "\n", + "## Learning Goals\n", + "* Build a *large* FITS file (*large* means too large to fit in memory all at once)\n", + "* Make a *large* FITS Image extension\n", + "* Make a *large* FITS Table extension\n", + "\n", + "## Keywords\n", + "FITS, file input/output, memory mapping\n", + "\n", + "## Companion Content\n", + "- [FITS Standard](https://fits.gsfc.nasa.gov/fits_standard.html)\n", + "- [Astropy FITS Documentation](https://docs.astropy.org/en/stable/io/fits/)\n", + "- [Python madvise documentation](https://docs.python.org/3/library/mmap.html#mmap.mmap.madvise)\n", + "- [Madvise system call documentation](https://man7.org/linux/man-pages/man2/madvise.2.html)\n", + "- [Create a very large FITS file from scratch](https://docs.astropy.org/en/stable/generated/examples/io/skip_create-large-fits.html)\n", + "\n", + "## Summary\n", + "\n", + "This is an advanced tutorial. We will be building a very large multi-extension FITS file from scratch, going through both how to create Image and Table extensions too large to fit in memory, and how to fill those structures once created.\n", + "\n", + "If you don't want to know about the inner workings of the FITS format, just stop here. If you don't want to know but nevertheless need to, proceed with caution; that's how I started, and now here I am writing this tutorial. " + ] + }, + { + "cell_type": "markdown", + "id": "7dfd952d-56e3-4963-8540-f43020a7ea9e", + "metadata": {}, + "source": [ + "## Building a large FITS file\n", + "\n", + "1. [Imports](#Imports)\n", + "2. [Primary HDU](#Primary-HDU)\n", + "3. [Large Image HDU](#Large-Image-HDU)\n", + "4. [Large Table HDU](#Large-Table-HDU)\n", + "5. [Adding an Extra Small HDU](#Adding-an-Extra-Small-HDU)\n", + "6. [Cleanup](#Cleanup)\n", + "\n", + "### But before we begin...\n", + "\n", + "There are a few things we need to know about the FITS format, so I will collect them here. And if you are, either now or in the future, wondering \"why is it like that?\", the answer is that the FITS format was originally designed and optimised for magnetic tape. This means it was designed for sequential reading rather than random access, so a FITS file has no index (listing at what bytes various parts of the file start) but is instead formatted in 2880-byte chunks, so that a tape reader head can simply skip forward by 2880 bytes repeatedly and check if a new section has begun. This is mostly trivia for us as modern users, but there are a few implications. Firstly, when reading FITS files the Astropy FITS module by default reads them \"lazily\", meaning that it does not tabulate all the extensions until it needs to (i.e. 
when the user requests a specific extension or calls the `info()` function). Secondly, and most crucially for this tutorial, when creating FITS extensions manually, the most important part of making a new *valid* extension is ensuring that the number of bytes is a multiple of 2880. These are of course but a few of the quirks of the FITS format; to read about all of them in their full and eccentric glory, see the [FITS standard](https://fits.gsfc.nasa.gov/fits_standard.html) document.\n", + "\n", + "**A short review of terminology:**\n", + "- The basic block of a FITS file is called a Header Data Unit (HDU).\n", + "- Each HDU contains two elements: the header and the data.\n", + "- A FITS file consists of one or more HDUs.\n", + "- Astropy represents a FITS file as an HDUList, where each extension is an HDU of a specific type (e.g. PrimaryHDU, ImageHDU, etc.).\n", + "- For more details see the [Astropy FITS Documentation](https://docs.astropy.org/en/stable/io/fits/)." + ] + }, + { + "cell_type": "markdown", + "id": "e51c21c1", + "metadata": {}, + "source": [ + "## Imports\n", + "\n", + "We don't need many modules for this. The central one is of course `astropy.io.fits`; the `mmap` import helps with efficiency, but is not available on all systems, and is ultimately not essential. So if yours is a system without it, never fear: you can just comment out those lines." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "670f28e1", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "from time import time\n", + "\n", + "import numpy as np\n", + "\n", + "from astropy.io import fits\n", + "from astropy.table import Table\n", + "\n", + "from mmap import MADV_SEQUENTIAL" + ] + }, + { + "cell_type": "markdown", + "id": "97374113", + "metadata": {}, + "source": [ + "### A little helper function\n", + "\n", + "Because this tutorial is all about building a huge file, we'll write a little function to print the file size in a variety of units." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ab165db6", + "metadata": {}, + "outputs": [], + "source": [ + "def print_file_size(path, unit=\"B\"):\n", + "    \n", + "    size = os.path.getsize(path)\n", + "    \n", + "    if unit==\"KB\":\n", + "        size /= 1e3\n", + "        fmt = '.1f'\n", + "    elif unit==\"MB\":\n", + "        size /= 1e6\n", + "        fmt = '.1f'\n", + "    elif unit==\"GB\":\n", + "        size /= 1e9\n", + "        fmt = '.1f'\n", + "    elif unit==\"FITS\":\n", + "        size /= 2880\n", + "        unit = \"FITS block(s)\"\n", + "        fmt = '.1f'\n", + "    \n", + "    else:\n", + "        unit = \"Bytes\"\n", + "        fmt = 'd'\n", + "    \n", + "    print(f\"{size:{fmt}} {unit}\")" + ] + }, + { + "cell_type": "markdown", + "id": "fc642cac", + "metadata": {}, + "source": [ + "## Primary HDU\n", + "\n", + "In this tutorial we are going to build a properly formatted multi-extension FITS file, so before we get into the matter of creating a massive FITS file we will build a basic Primary HDU and write it to file, where it will be the basis for our monstrous FITS file."
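+ ] + }, + { + "cell_type": "markdown", + "id": "1a2b3c4d", + "metadata": {}, + "source": [ + "Before we do, here is a quick, purely illustrative check of the 2880-byte block rule described above (nothing in this cell is needed for the rest of the tutorial): even a completely empty header serializes to exactly one full FITS block." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2b3c4d5e", + "metadata": {}, + "outputs": [], + "source": [ + "# Purely illustrative: even an empty Header pads out to a whole 2880-byte block\n", + "empty_header_bytes = fits.Header().tostring()\n", + "print(f\"{len(empty_header_bytes)} bytes, remainder {len(empty_header_bytes) % 2880}\")"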
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5f301e48-ab85-422d-817e-7926e4c1d6c8", + "metadata": {}, + "outputs": [], + "source": [ + "# Make some header entries for important information\n", + "primary_header_cards = [(\"ORIGIN\", 'Fancy Archive', \"Where the data came from\"),\n", + "                        (\"DATE\", '2024-03-05', \"Creation date\"),\n", + "                        (\"MJD\", 60374, \"Creation date in MJD\"),\n", + "                        (\"CREATOR\", 'Me', \"Who created this file\")] \n", + "\n", + "# Build the Primary HDU object and put it in an HDU list\n", + "primary_hdu = fits.PrimaryHDU(header=fits.Header(primary_header_cards))\n", + "hdu_list = fits.HDUList([primary_hdu])\n", + "\n", + "# Write the HDU list to file\n", + "big_fits_fle = \"./patagotitan.fits\"\n", + "hdu_list.writeto(big_fits_fle, overwrite=True)" + ] + }, + { + "cell_type": "markdown", + "id": "049580b1", + "metadata": {}, + "source": [ + "Before we continue, let's verify that our (currently tiny) FITS file is valid. We will do this by calling the `info()` function, which will raise an error if the file is not in a valid FITS format." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6c38858f", + "metadata": {}, + "outputs": [], + "source": [ + "with fits.open(big_fits_fle) as hdu_list:\n", + "    hdu_list.info()" + ] + }, + { + "cell_type": "markdown", + "id": "8e14e379", + "metadata": {}, + "source": [ + "Let's also look at the file size in bytes and FITS blocks. Note that it is exactly one FITS block." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "714965ad", + "metadata": {}, + "outputs": [], + "source": [ + "print_file_size(big_fits_fle)\n", + "print_file_size(big_fits_fle, \"FITS\")" + ] + }, + { + "cell_type": "markdown", + "id": "71cf02d6", + "metadata": {}, + "source": [ + "## Large Image HDU\n", + "\n", + "Now we get to the meat of the tutorial. We are going to expand the FITS file to fit a 40,000 x 40,000 pixel image (~13 GB).\n", + "\n", + "*Note*: If this is problematically big for your system, adjust `array_dims` below. All of the steps will still work as expected with smaller data; it's simply an unnecessarily complex method when dealing with data sizes that fit in memory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "042b5a63", + "metadata": {}, + "outputs": [], + "source": [ + "array_dims = [40_000, 40_000]" + ] + }, + { + "cell_type": "markdown", + "id": "41ef407f", + "metadata": {}, + "source": [ + "First we build an ImageHDU object with a small data array. The data in the array does not matter because we won't be using it, but the data type needs to be correct, and you need to take note of how many bytes per element go with that data type. In this example we are building a `float64` array, so each element uses 8 bytes of memory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f9c8df42", + "metadata": {}, + "outputs": [], + "source": [ + "data = np.zeros((100, 100), dtype=np.float64)\n", + "hdu = fits.ImageHDU(data)" + ] + }, + { + "cell_type": "markdown", + "id": "43bac0e2", + "metadata": {}, + "source": [ + "Now we pull out just the header, and adjust the NAXIS keywords to match our desired giant-array dimensions. This is a critical step: it tells the FITS file how large the data array is, and it must match the size of the data array we add.\n", + "\n", + "We also set an EXTNAME, which is optional but helpful because it allows us to refer to that extension by name as well as by index." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c8d874e3", + "metadata": {}, + "outputs": [], + "source": [ + "header = hdu.header\n", + "\n", + "header[\"NAXIS2\"] = array_dims[0]\n", + "header[\"NAXIS1\"] = array_dims[1]\n", + "\n", + "header[\"EXTNAME\"] = 'BIG_IMG' " + ] + }, + { + "cell_type": "markdown", + "id": "943335ed", + "metadata": {}, + "source": [ + "The next step is to write *just the header* to the end of our soon-to-balloon FITS file.\n", + "\n", + "*Note:* At the end of this step our file is temporarily NOT a valid FITS file." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "93568d65", + "metadata": {}, + "outputs": [], + "source": [ + "with open(big_fits_fle, 'ab') as FITSFLE: # 'ab' means open to append bytes\n", + "    FITSFLE.write(bytearray(header.tostring(), encoding=\"utf-8\"))" + ] + }, + { + "cell_type": "markdown", + "id": "1704c93b", + "metadata": {}, + "source": [ + "Now we calculate the number of bytes we need for our gargantuan array, remembering that the result *must* be a multiple of 2880 bytes to conform to the FITS standard. \n", + "\n", + "*Note:* The `astype(np.int64)` is not necessary on all systems, but some still default to int32 and would therefore throw an overflow error. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0a494a5a", + "metadata": {}, + "outputs": [], + "source": [ + "elt_size = 8 # Bytes needed for an array element\n", + "arraysize_in_bytes = ((np.prod(array_dims).astype(np.int64) * elt_size + 2880 - 1) // 2880) * 2880" + ] + }, + { + "cell_type": "markdown", + "id": "4a970eb1", + "metadata": {}, + "source": [ + "Now we need to expand the file by that many bytes. To do this we seek to the desired new end of the file and write a null byte." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f99cd09b", + "metadata": {}, + "outputs": [], + "source": [ + "filelen = os.path.getsize(big_fits_fle) \n", + " \n", + "with open(big_fits_fle, 'r+b') as FITSFLE:\n", + "    FITSFLE.seek(filelen + arraysize_in_bytes - 1)\n", + "    FITSFLE.write(b'\\0')" + ] + }, + { + "cell_type": "markdown", + "id": "5c168f65", + "metadata": {}, + "source": [ + "Now let's see how big our FITS file has become." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ea9e1dd7", + "metadata": {}, + "outputs": [], + "source": [ + "print_file_size(big_fits_fle, \"GB\")\n", + "print_file_size(big_fits_fle, \"FITS\")" + ] + }, + { + "cell_type": "markdown", + "id": "c0c1e191", + "metadata": {}, + "source": [ + "So just about 13 GB as expected. And a lot more FITS blocks, but still an exact multiple of the FITS block size, which is what we want to see." + ] + }, + { + "cell_type": "markdown", + "id": "0910723e", + "metadata": {}, + "source": [ + "### Filling the big array\n", + "\n", + "Now we have a big ol' empty array that we want to put some stuff in. At this point we are working with a gigantic file, so we need to start being careful we don't ask for it to be loaded wholesale into memory (if you *do* try to do that, it won't break anything; you will just get a `Cannot allocate memory` error). \n", + "\n", + "What this means is that the `memmap` argument must be set to `True`. This is usually the default (unless you have changed it in your astropy configuration settings). There are also a number of modes the file can be opened with:\n", + "- `readonly`: Default behavior. 
Opens the file in readonly mode, meaning that to save any changes you need to write to a whole new file (or overwrite the existing one). This means that while with `memmap=True` the entire FITS file is not loaded into memory, the system is prepared to load it all into memory if the user changes something, and so will still throw an error if it is not *possible* to allocate memory for the whole file, even though it does not allocate it right away. For big files where this is not possible it will fall back to `denywrite` mode and produce a warning.\n", + "- `denywrite`: This is similar to readonly except that it does not allow the FITS object to be altered and then written to a new file. For our tremendous FITS file we will use this mode when we want to access but not change the file.\n", + "- `update`: This mode allows a file to be updated in place.\n", + "- `append`: This allows more extensions to be added to an existing FITS file, but does *not* allow changing data already in the file when it is opened. \n", + "\n", + "We will use the `update` mode to fill our immense array in place." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bd09b429", + "metadata": {}, + "outputs": [], + "source": [ + "hdu_list = fits.open(big_fits_fle, mode='update', memmap=True)" + ] + }, + { + "cell_type": "markdown", + "id": "a98664fb", + "metadata": {}, + "source": [ + "Before we commence, we will call the `info()` function to verify that our FITS file is valid and the enormous array is the size we expect." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6426bc2d", + "metadata": {}, + "outputs": [], + "source": [ + "hdu_list.info()" + ] + }, + { + "cell_type": "markdown", + "id": "8ac94786", + "metadata": {}, + "source": [ + "Pulling out the majestic array for convenience." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3be4e66f", + "metadata": {}, + "outputs": [], + "source": [ + "data_array = hdu_list[1].data" + ] + }, + { + "cell_type": "markdown", + "id": "42698c9a", + "metadata": {}, + "source": [ + "If you are on a system with the `madvise` call (you're on your own figuring that out), you can set madvise to `MADV_SEQUENTIAL` for the `data_array`. This tells the memory mapper that you are going to be accessing the array sequentially, which allows it to be more efficient in how it handles memory allocation. How much this actually affects the time it takes to perform the filling operation will depend on your specific system and the array sizes you are working with.\n", + "\n", + "*Note:* If you change how you fill your array to something not sequential, don't set this." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "85eb00fe", + "metadata": {}, + "outputs": [], + "source": [ + "# Comment out these lines if your system does not have madvise\n", + "mm = fits.util._get_array_mmap(data_array)\n", + "mm.madvise(MADV_SEQUENTIAL)" + ] + }, + { + "cell_type": "markdown", + "id": "306ca830", + "metadata": {}, + "source": [ + "Now we fill the jumbo array block by block. We want the block size to comfortably fit in memory. The `block_size` I am using yields an ~1.3 GB array; adjust as your system requires.\n", + "\n", + "We also print the time it takes for every block. If you have the `MADV_SEQUENTIAL` flag set, the individual block fill operations will generally take longer and the close operation will be quite fast, while if the `MADV_SEQUENTIAL` flag is not set the reverse is generally true. 
This is because in the first case the data is flushed to disk as each block is written, while in the second it builds up in memory until the system needs more memory or the file is closed, at which point it is all written at once. Which is more efficient on your setup will vary with block and file size." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bb2a0903", + "metadata": {}, + "outputs": [], + "source": [ + "block_size = 4000\n", + "\n", + "tt = 0\n", + "for i,j in enumerate(range(0, array_dims[0], block_size)):\n", + "    it = time()\n", + "    \n", + "    sub_arr = np.ones((block_size,array_dims[1]))*i\n", + "    data_array[j:j+block_size,:] = sub_arr\n", + "    \n", + "    tm = time()-it\n", + "    print(f\"{i}: {tm:.0f} sec\")\n", + "    tt += tm\n", + "    \n", + "it = time()\n", + "hdu_list.close()\n", + "tm = time()-it\n", + "print(f\"Closing: {tm:.0f} sec\")\n", + "tt += tm\n", + "    \n", + "print(f\"Total fill time: {tt/60:.1f} min\")" + ] + }, + { + "cell_type": "markdown", + "id": "54875a56", + "metadata": {}, + "source": [ + "### Checking the file contents\n", + "\n", + "We've filled our outsize array, but we want to make sure that it is correct. So we'll open the elephantine file in `denywrite` mode and check." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c452f3c6", + "metadata": {}, + "outputs": [], + "source": [ + "hdu_list = fits.open(big_fits_fle, mode='denywrite', memmap=True)\n", + "data_array = hdu_list[1].data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5ba09b5a", + "metadata": {}, + "outputs": [], + "source": [ + "it = time()\n", + "for i,j in enumerate(range(0, array_dims[0], block_size)):\n", + "    print(f\"{i}: Data match is {(data_array[j:j+block_size,:] == i).all()}: {time()-it:.0f} sec\")\n", + "    it = time()\n", + "    \n", + "hdu_list.close()" + ] + }, + { + "cell_type": "markdown", + "id": "4a7298d9", + "metadata": {}, + "source": [ + "## Large Table HDU\n", + "\n", + "In the last section we expanded our FITS file to add a colossal image extension; in this section we will do the same for a table extension. The method is similar, but with a few key differences.\n", + "\n", + "As with the mighty array, we start by making a small table where the specific data is not important but the data types are. In particular, the maximum string length for columns cannot be changed on the fly, so any string columns must be given the maximum number of characters needed."
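+ ] + }, + { + "cell_type": "markdown", + "id": "3c4d5e6f", + "metadata": {}, + "source": [ + "To make that concrete, here is a small illustrative sketch (the `DEMO` column is made up and not used anywhere else): the width of a string column is baked into the binary table header as its `TFORM`, so a 10-character column becomes a fixed 10-byte `10A` field and cannot hold longer strings later." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4d5e6f7a", + "metadata": {}, + "outputs": [], + "source": [ + "# Illustrative sketch: a column's dtype fixes its on-disk width.\n", + "# A 10-character string column becomes a fixed 10-byte '10A' field.\n", + "demo_col = fits.Column(name=\"DEMO\", format=\"10A\", array=[\"abc\", \"abcdefghij\"])\n", + "demo_hdu = fits.BinTableHDU.from_columns([demo_col])\n", + "print(demo_hdu.header[\"TFORM1\"], demo_hdu.header[\"NAXIS1\"])"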
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f1b4c31e", + "metadata": {}, + "outputs": [], + "source": [ + "small_tbl = Table(names=[\"Name\", \"Population\", \"Prince\", \"Years since fall\", \"Imports\", \"Exports\"],\n", + " dtype=['U128', int, 'U128', np.float64, 'U2048', 'U2048'],\n", + " rows=[[\"Vangaveyave\", 1297382, \"Oriana\", 34.6, \"wine, cheese\", \"ahalo cloth, pearls, foamwork\"],\n", + " [\"Tkinele\", 50000, \"n/a\", 92.3, \"none\", \"none\"],\n", + " [\"Amboloyo\", 50937253, \"Rufus\", 1504.2, \"pears, textiles, spices\", \"wine, timber\"],\n", + " [\"Xiputl\", 3627373, \"Anastasiya\", 346.8, \"silk, perfumes, pigments\", \"stone, cotton\"],\n", + " [\"Old Damara\", 437226732, \"Melissa Damara\", 25.3, \"wool, timber\", \"spices, silk\"],\n", + " [\"Western Dair\", 8045728302, \"Belu\", 876.3, \"shellfish, salt\", \"cured meat, wool\"]])\n", + "\n", + "table_hdu = fits.BinTableHDU(data=small_tbl)\n", + "table_hdu.header[\"EXTNAME\"] = \"BIG_TABLE\"" + ] + }, + { + "cell_type": "markdown", + "id": "6be82b91", + "metadata": {}, + "source": [ + "The header for this table HDU gives us the information to determine how many bytes we need for our mammoth table." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "528aa250", + "metadata": {}, + "outputs": [], + "source": [ + "table_hdu.header" + ] + }, + { + "cell_type": "markdown", + "id": "535ee86c", + "metadata": {}, + "source": [ + "The `NAXIS1` keyword gives the length of a single table row in bytes, and the `NAXIS2` keyword holds the number of rows in the table. So to get the total size of the humongous table in bytes we simply multiply `NAXIS1` by the number of rows desired (adjusting for FITS block size). Here I choose a million rows which is about 4GB, adjust as necessary for your system." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b42bae4b", + "metadata": {}, + "outputs": [], + "source": [ + "num_rows = 1_000_000\n", + "tablesize_in_bytes = ((table_hdu.header[\"NAXIS1\"]*num_rows + 2880 - 1) // 2880) * 2880" + ] + }, + { + "cell_type": "markdown", + "id": "95a95ad0", + "metadata": {}, + "source": [ + "Now we adjust the `NAXIS2` keyword to match our new table length and write just the header to the end of our towering FITS file, as we did for the oversize array extension." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "13c3c36e", + "metadata": {}, + "outputs": [], + "source": [ + "table_hdu.header[\"NAXIS2\"] = num_rows\n", + "\n", + "with open(big_fits_fle, 'ab') as FITSFLE:\n", + " FITSFLE.write(bytearray(table_hdu.header.tostring(), encoding=\"utf-8\"))" + ] + }, + { + "cell_type": "markdown", + "id": "a8ea704c", + "metadata": {}, + "source": [ + "Before we expand the file, we'll look at the current file size." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d86193af", + "metadata": {}, + "outputs": [], + "source": [ + "print_file_size(big_fits_fle, \"GB\")" + ] + }, + { + "cell_type": "markdown", + "id": "e9579e19", + "metadata": {}, + "source": [ + "Now, just as for the vast array, we seek `tablesize_in_bytes` beyond the current end of the file and write a null byte." 
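+ ] + }, + { + "cell_type": "markdown", + "id": "5e6f7a8b", + "metadata": {}, + "source": [ + "As a quick sanity check before we do, we can confirm where the roughly 4 GB figure comes from: `NAXIS1` should simply be the sum of the column byte widths (two 128-character strings, one integer, one float64, and two 2048-character strings), and multiplying by the number of rows (plus block padding) gives `tablesize_in_bytes`. The check below assumes the integer column defaulted to 64-bit (8 bytes)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6f7a8b9c", + "metadata": {}, + "outputs": [], + "source": [ + "# Sanity check on the arithmetic: NAXIS1 should equal the summed column widths\n", + "# (assumes the integer column defaulted to 64-bit, i.e. 8 bytes)\n", + "expected_row_bytes = 128 + 8 + 128 + 8 + 2048 + 2048\n", + "print(table_hdu.header[\"NAXIS1\"], expected_row_bytes)\n", + "print(f\"~{tablesize_in_bytes / 1e9:.2f} GB for {num_rows:,} rows\")"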
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b51a1aec", + "metadata": {}, + "outputs": [], + "source": [ + "filelen = os.path.getsize(big_fits_fle)\n", + "\n", + "with open(big_fits_fle, 'r+b') as FITSFLE:\n", + "    FITSFLE.seek(filelen + tablesize_in_bytes - 1)\n", + "    FITSFLE.write(b'\\0')" + ] + }, + { + "cell_type": "markdown", + "id": "1180aa5d", + "metadata": {}, + "source": [ + "And we can see that the file size has indeed increased by about 4 GB." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8f9a7371", + "metadata": {}, + "outputs": [], + "source": [ + "print_file_size(big_fits_fle, \"GB\")" + ] + }, + { + "cell_type": "markdown", + "id": "b44ff84f", + "metadata": {}, + "source": [ + "### Adding data to the titanic table\n", + "\n", + "We can now open the prodigious FITS file in `update` mode and fill in our table. Here we'll run a little comparison. FITS files store table data row by row, so it should be faster to fill the table by row rather than by column (and doing so allows us to again advise the memory mapper with `MADV_SEQUENTIAL`), but memory handling is complex and system dependent, so when it really matters it's best to do testing for your individual setup." + ] + }, + { + "cell_type": "markdown", + "id": "1490c29a", + "metadata": {}, + "source": [ + "We'll start by printing the file info to ensure we have the valid FITS file we expect." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e0fecbe9", + "metadata": {}, + "outputs": [], + "source": [ + "hdu_list = fits.open(big_fits_fle, mode='update', memmap=True)\n", + "hdu_list.info()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ccfc3ff2", + "metadata": {}, + "outputs": [], + "source": [ + "table_data = hdu_list[\"BIG_TABLE\"].data\n", + "\n", + "tt = 0\n", + "\n", + "for col in small_tbl.colnames:\n", + "    it = time()\n", + "    \n", + "    if col == \"Name\":\n", + "        table_data[col] = [\"Zunidth\", \"Astandalas\"]*(num_rows//2)\n", + "    elif col == \"Population\":\n", + "        table_data[col] = np.arange(num_rows)\n", + "    elif col == \"Prince\":\n", + "        table_data[col] = [\"Cliopher Lord Mdang\", \"His Radiancy Artorin Damara\"]*(num_rows//2)\n", + "    elif col == \"Years since fall\":\n", + "        table_data[col] = np.linspace(5, 1000, num_rows)\n", + "    elif col == \"Imports\":\n", + "        table_data[col] = [\"tea\", \"roses\"]*(num_rows//2)\n", + "    elif col == \"Exports\":\n", + "        table_data[col] = [\"magic\", \"empire\"]*(num_rows//2)\n", + "\n", + "    tm = time()-it\n", + "    print(f\"{col}: {tm:.0f} sec\")\n", + "    tt += tm\n", + "    \n", + "it = time() \n", + "hdu_list.close()\n", + "tm = time()-it\n", + "print(f\"Closing: {tm:.0f} sec\")\n", + "tt += tm\n", + "\n", + "print(f\"Total fill time: {tt/60:.1f} min\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "122c4f4d", + "metadata": {}, + "outputs": [], + "source": [ + "hdu_list = fits.open(big_fits_fle, mode='update', memmap=True)\n", + "table_data = hdu_list[\"BIG_TABLE\"].data\n", + "\n", + "# Comment out these lines if your system does not have madvise\n", + "mm = fits.util._get_array_mmap(table_data)\n", + "mm.madvise(MADV_SEQUENTIAL)\n", + "\n", + "block_len = 200_000\n", + "data_list = (list(small_tbl.as_array())*(block_len//len(small_tbl) + 1))[:block_len]\n", + "\n", + "tt = 0\n", + "for j,i in enumerate(range(0, num_rows, block_len)):\n", + "    it = time()\n", + "    \n", + "    table_data[i:i+block_len] = data_list\n", + "    \n", + "    tm = time()-it\n", + "    print(f\"{j}: 
{tm:.0f} sec\")\n", + "    tt += tm\n", + "\n", + "it = time()\n", + "hdu_list.close()\n", + "tm = time()-it\n", + "print(f\"Closing: {tm:.0f} sec\")\n", + "tt += tm\n", + "    \n", + "print(f\"Total fill time: {tt/60:.1f} min\")" + ] + }, + { + "cell_type": "markdown", + "id": "25393d83", + "metadata": {}, + "source": [ + "For my system as I write this tutorial the times are incredibly close, but your mileage may vary." + ] + }, + { + "cell_type": "markdown", + "id": "0446f16a", + "metadata": {}, + "source": [ + "### Checking our data\n", + "\n", + "Now let's again open up our behemothic FITS file and check that the data was written correctly." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e0351aad", + "metadata": {}, + "outputs": [], + "source": [ + "hdu_list = fits.open(big_fits_fle, mode='denywrite', memmap=True)\n", + "hdu_list.info()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "753fa7d0", + "metadata": {}, + "outputs": [], + "source": [ + "table_data = hdu_list[\"BIG_TABLE\"].data" + ] + }, + { + "cell_type": "markdown", + "id": "b76e5baa", + "metadata": {}, + "source": [ + "Checking the first few rows of each fill block." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ea442ec3", + "metadata": {}, + "outputs": [], + "source": [ + "for j,i in enumerate(range(0, num_rows, block_len)):\n", + "    val = (small_tbl == Table(table_data[i:i+6])).all() \n", + "    print(f\"{j}: {val}\") " + ] + }, + { + "cell_type": "markdown", + "id": "cdb1ae37", + "metadata": {}, + "source": [ + "Closing the file." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "25c00a1f", + "metadata": {}, + "outputs": [], + "source": [ + "hdu_list.close()" + ] + }, + { + "cell_type": "markdown", + "id": "d49f4502", + "metadata": {}, + "source": [ + "## Adding an Extra Small HDU\n", + "\n", + "The last thing we will do is add another small HDU to the monumental FITS file. We can do this in the usual way because the extension we are adding is of a normal size." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "abaddafc", + "metadata": {}, + "outputs": [], + "source": [ + "small_hdu = fits.ImageHDU(data=np.random.random((10,10)))\n", + "small_hdu.header[\"EXTNAME\"] = \"MINI_IMG\"" + ] + }, + { + "cell_type": "markdown", + "id": "4b17f3c9", + "metadata": {}, + "source": [ + "Before we add the additional HDU we'll remind ourselves of the current whopping file size." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9093ab91", + "metadata": {}, + "outputs": [], + "source": [ + "print_file_size(big_fits_fle, \"KB\")" + ] + }, + { + "cell_type": "markdown", + "id": "4fb74ccb", + "metadata": {}, + "source": [ + "Because we don't have to do anything funky with the file size, we can just open the magnificent FITS file in `append` mode and write the whole HDU, and it is a very fast operation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "20d9f611", + "metadata": {}, + "outputs": [], + "source": [ + "with fits.open(big_fits_fle, mode='append', memmap=True) as hdu_list:\n", + "    hdu_list.append(small_hdu)" + ] + }, + { + "cell_type": "markdown", + "id": "786dfbe5", + "metadata": {}, + "source": [ + "And we can see that adding this tiny additional extension barely changes the monster file size."
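+ ] + }, + { + "cell_type": "markdown", + "id": "7a8b9c0d", + "metadata": {}, + "source": [ + "As a small aside: because `append` pads the new extension out to whole 2880-byte blocks for us, the file should also still be an exact number of FITS blocks, which we can confirm with our helper function before printing the size in KB below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8b9c0d1e", + "metadata": {}, + "outputs": [], + "source": [ + "# The appended HDU is padded for us, so the file is still a whole number of FITS blocks\n", + "print_file_size(big_fits_fle, \"FITS\")"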
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "04b27963", + "metadata": {}, + "outputs": [], + "source": [ + "print_file_size(big_fits_fle, \"KB\")" + ] + }, + { + "cell_type": "markdown", + "id": "5eaf267e", + "metadata": {}, + "source": [ + "And now if we open the mondo FITS file we can see that additional extension." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1f48dbf6", + "metadata": {}, + "outputs": [], + "source": [ + "with fits.open(big_fits_fle, mode='denywrite', memmap=True) as hdu_list:\n", + " hdu_list.info()" + ] + }, + { + "cell_type": "markdown", + "id": "7849a3b4", + "metadata": {}, + "source": [ + "## Cleanup\n", + "\n", + "Lastly, we'll remove the leviathan file we created." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1031b8db", + "metadata": {}, + "outputs": [], + "source": [ + "os.remove(big_fits_fle)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "70d1dc83", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "language_info": { + "codemirror_mode": { + "name": "ipython" + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/tutorials/FITS-large/requirements.txt b/tutorials/FITS-large/requirements.txt new file mode 100644 index 00000000..daf575c0 --- /dev/null +++ b/tutorials/FITS-large/requirements.txt @@ -0,0 +1,2 @@ +astropy +numpy