Commit

Some formulation review to the subsection on the structure of ICAT
data files
RKrahl committed Jan 3, 2024
1 parent 3b9367e commit 33f8650
Showing 1 changed file with 29 additions and 22 deletions.
51 changes: 29 additions & 22 deletions doc/src/file-icatdata.rst
@@ -29,37 +29,44 @@ Data files are partitioned in chunks. This is done to avoid having
 the whole file, e.g. the complete inventory of the ICAT, at once in
 memory. The problem is that objects contain references to other
 objects (e.g. Datafiles refer to Datasets, the latter refer to
-Investigations, and so forth). We keep an index of the objects in
-order to resolve these references. But there is a memory versus time
-tradeoff: we cannot keep all the objects in the index, that would
-again mean the complete inventory of the ICAT. And we can't know
-beforehand which object is going to be referenced later on, so we
-don't know which one to keep and which one to discard from the index.
-Fortunately we can query objects we discarded once back from the ICAT
-server. But this is expensive. So the strategy is as follows: keep
-all objects from the current chunk in the index and discard the
-complete index each time a chunk has been processed. This will work
-fine if objects are mostly referencing other objects from the same
-chunk and only a few references go across chunk boundaries.
+Investigations, and so forth). We keep an index of the objects as
+cache in order to resolve these references. But there is a memory
+versus time tradeoff: we cannot keep all the objects in the index,
+that would again mean the complete inventory of the ICAT. And we
+can't know beforehand which object is going to be referenced later on,
+so we don't know which one to keep and which one to discard from the
+index. Fortunately we can query objects that we discarded once back
+from the ICAT server. But this is expensive. So the strategy is as
+follows: keep all objects from the current chunk in the index and
+discard the complete index each time a chunk has been
+processed. [#dc]_ This will work fine if objects are mostly
+referencing other objects from the same chunk and only a few
+references go across chunk boundaries.

 Therefore, we want these chunks to be small enough to fit into memory,
 but at the same time large enough to keep as many relations between
 objects as possible local in a chunk. It is in the responsibility of
 the writer of the data file to create the chunks in this manner.

 The objects that get written to the data file and how this file is
-organized is controlled by lists of ICAT search expressions, see
-:meth:`icat.dumpfile.DumpFileWriter.writeobjs`. There is some degree
-of flexibility: an object may include related objects in an
-one-to-many relation, just by including them in the search expression.
-In this case, these related objects should not have a search
-expression on their own again. For instance, the search expression
-for Grouping may include UserGroup. The UserGroups will then be
-embedded in their respective grouping in the data file. There should
-not be a search expression for UserGroup then.
+organized is controlled by lists of ICAT search expressions or entity
+objects, see :meth:`icat.dumpfile.DumpFileWriter.writeobjs`. There is
+some degree of flexibility: an object may include related objects in
+an one-to-many relation. In this case, these related objects should
+not be added on their own again. For instance, you may write User,
+Grouping, and UserGroup as separate objects into the file. In this
+case, the UserGroup entries must properly reference related User and
+Grouping. Alternatively you may include the UserGroups in the
+corresponding Grouping objects. In this case, you must not add the
+UserGroups again on their own.

 Objects related in a many-to-one relation must always be included in
 the search expression. This is also true if the object is
 indirectly related to one of the included objects. In this case,
 only a reference to the related object will be included in the data
-file. The related object must have its own list entry.
+file. The related object must have its own entry.
+
+
+.. [#dc] There is one exception: DataCollections don't have a
+   uniqueness constraint and can't reliably be searched by
+   attributes. They are always kept in the index.
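
The caching strategy described in the changed paragraph can be illustrated with a small, self-contained sketch. This is not the python-icat implementation: the ``ChunkIndex`` class, the ``load`` function, the record tuples, and the dict standing in for the ICAT server are all hypothetical names chosen for illustration. The sketch only shows the idea of keeping the objects of the current chunk in an index, discarding the index after each chunk (except for DataCollections, as the footnote states), and falling back to an expensive server lookup when a reference crosses a chunk boundary.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

# Hypothetical record: (unique key, entity type, key of the referenced object or None).
Record = Tuple[str, str, Optional[str]]


class ChunkIndex:
    """Cache of objects seen in the current chunk (illustration only)."""

    def __init__(self, server: Dict[str, dict],
                 keep_types: Iterable[str] = ("DataCollection",)):
        self.server = server                  # plain dict standing in for the ICAT server
        self.keep_types: Set[str] = set(keep_types)
        self.index: Dict[str, dict] = {}
        self.server_lookups = 0

    def add(self, key: str, obj: dict) -> None:
        self.index[key] = obj

    def resolve(self, key: str) -> dict:
        """Return the referenced object, asking the server on a cache miss."""
        if key not in self.index:
            self.server_lookups += 1          # the expensive path
            self.index[key] = self.server[key]
        return self.index[key]

    def end_chunk(self) -> None:
        """Discard the index, except for types that are always kept."""
        self.index = {k: o for k, o in self.index.items()
                      if o["type"] in self.keep_types}


def load(chunks: List[List[Record]]) -> int:
    """Create all objects chunk by chunk; return the number of server lookups."""
    server: Dict[str, dict] = {}
    index = ChunkIndex(server)
    for chunk in chunks:
        for key, kind, ref in chunk:
            obj = {"type": kind, "ref": index.resolve(ref) if ref else None}
            server[key] = obj                 # "create" the object on the server
            index.add(key, obj)
        index.end_chunk()
    return index.server_lookups


records: List[Record] = [
    ("inv1", "Investigation", None),
    ("ds1", "Dataset", "inv1"),
    ("df1", "Datafile", "ds1"),
]
print(load([records]))                    # all references stay inside one chunk
print(load([records[:2], records[2:]]))   # one reference crosses a chunk boundary
```

With all three records in one chunk no server lookup is needed; splitting the Datafile into a second chunk forces one lookup for its Dataset. This is the reason the text asks the writer of the data file to keep related objects in the same chunk where possible.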
