Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
223 changes: 223 additions & 0 deletions proposals/000-content-iterator.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,223 @@
# Content Iterator

* Author: [Mickaël Menu](https://github.com/mickael-menu)
* Review PR: [#177](https://github.com/readium/architecture/pull/177)

## Summary

A new Publication Service to iterate through a publication's content extracted as semantic elements.


## Motivation

The Content Iterator service provides a building block for many high-level features requiring access to the raw content of a publication, such as:

* Text to speech
* Accessibility readers
* Basic search
* Full-text search indexing
* Image or audio indexes

Today, implementing such features is complex because you need to:

1. (maybe) Find a starting location in one of the reading order resources.
2. Iterate through the reading order, opening and closing resources when needed.
3. Extracting the textual or media content from the raw resource, which is different for every supported media type.

The Content Iterator handles all that in a media type agnostic way.

## Developer Guide

### Iterating the content of a publication

First, check if a `Publication` can be iterated using `publication.isContentIterable`.

Then, you can request a `ContentIterator` from the publication:

```typescript
let iterator = publication.contentIteratorFrom(locator)
if (iterator) {
...
}
```

The starting `Locator` is optional. When missing, the iteration will start from the very beginning of the publication.


Once you have a valid `ContentIterator` instance, you can crawl through the publication content in both directions, using `previous` or `next`.

```typescript
var content: Content?
while (content = iterator.next()) {
...
}

iterator.close()
```

These APIs will return the `Content` elements in the reading order until it reaches the end (or the beginning for `previous`) of the publication. In which case, any additional calls will return `null`.

:warning: Don't forget to `close` the iterator when you are done, to discard opened resources.

### Extracting the data from `Content` elements

The `Content` elements are value objects containing:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some new elements to take into account:

  • SVG
    • can have an aria-label or title child element
  • MathML
    • extracting the tt child elements could be useful
  • MathJax


* a `Locator` targeting the piece of content
* the associated `Data`

There are several types of `Data` which can be returned by the iterator. Depending on your use case, you might want to filter on the type of data to get only what you need.

#### Media data

Media data are returned when a resource embeds another media resource in the reading flow. They hold a publication `Link` to the embedded resource.

Two kind of media data are currently supported:

* `Audio` for audio clips.
* `Image` for embedded images.
* It also holds an optional `description` string for accessibility purposes.

#### Text data

The `Text` data is used for the text elements inlined in the publication resources. Each text element matches a semantic item represented as a `TextStyle` in the data object:

* `heading(level: Int)` for text headings, with an associated level
* `body` for a basic body paragraph
* `caption` for a caption associated to an image
* `footnote` for footnotes at the end of the resource
* `quote` for a blockquote
* `listItem` for a single list item

Each `Text` data is split in one or more spans containing the text content, a `Locator` and the language.

## Reference Guide

### Types and APIs

#### `ContentIterationService` Interface (implements `PublicationService`)

Provides `ContentIterator` instances to crawl the content of a `Publication`.

##### Methods

* `iteratorFrom(start: Locator?) -> ContentIterator?`
* Creates a `ContentIterator` starting from the given location.
* Returns `null` if no iterator can be created, for example because no resources are iterable.

##### `Publication` Helpers

* `isContentIterable: Boolean = findService(ContentIterationService::class) != null`
* Returns whether this `Publication` can be iterated on.
* `contentIteratorFrom(start: Locator?) -> ContentIterator? = findService(ContentIterationService::class)?.iteratorFrom(start)`
* Creates a `ContentIterator` starting from the given location.

#### `ContentIterator` Interface

Iterates over `Content` elements.

This interface does not depend on `Publication`.

##### Methods

* `previous() -> Content?`
* Returns the previous `Content` element in the iterator, or `null` when reaching the beginning.
* `next() -> Content?`
* Returns the next `Content` element in the iterator, or `null` when reaching the end.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should make it explicit that these methods make the iterator move forward or backward.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the documentation comment? I'm fine making it more explicit there.

I wouldn't change the function signature as it's pretty common though:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was talking about the documentation.

* `close()`
* Closes and discard resources held by this iterator.

#### `Content` Class

Represents a single semantic content element.

##### Properties

* `locator: Locator`
* Locator targeting this element in the `Publication`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we be able to get a locator for any target data? Think of three successive images without HTML ids. I believe that neither fragments nor text after/before can be used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't guarantee that it will be the case for all media types, but for the ones we have so far I think so.

With image elements, you can use a cssSelector in the locations object. I filled it for all locators in the HtmlResourceContentIterator, as even with a text context it can help limit the search scope to a parent element.

If there's ever a case where we can't target precisely, this Locator should at least match the closest targetable parent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see.

* `data: Content.Data`
* Data associated with this element.

#### `Content.Data` Interface

A marker interface for a `Content` associated data.

#### `Content.Data.Text` Class (implements `Content.Data`)

Holds a textual element's spans and style.

##### Properties

* `style: TextStyle`
* Semantic style for this element.
* `spans: [TextSpan]`
* List of text spans in this element.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the rationale behind grouping multiple TextSpans into a single Content.Data.Text?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example if you have the body paragraph:

<body lang="en">
    <p>The correct pronunciation is <span lang="fr">croissant</span>, and not croissant.</p>
</body>

We want the whole paragraph as a single semantic Text element. However to not loose the information about the language (which is important for TTS), you can split the Text element into a list of spans which are basically "attributed ranges" in the text element:

{
  "locator": {},
  "data": {
    "style": "body",
    "spans": [
      { "text": "The correct pronunciation is ", "language": "en" },
      { "text": "croissant", "language": "fr" },
      { "text": ", and not croissant.", "language": "en" }
    ]
  }
}

Like @danielweck mentioned on the call, the term span can be confusing when parsing HTML, as they are no 1-1 relationship between the HTML spans and the produced elements.

However thinking more about this, I think the term is still the most accurate. The problem happens only with HTML media types and the semantic seems to be correct:

I think it even matches the meaning from the HTML spec:

The span element doesn't mean anything on its own, but can be useful when used together with the global attributes, e.g. class, lang, or dir.
https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-span-element

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danielweck I'm renaming span into segment to lift any ambiguity. (Other candidates: part, chunk, portion)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will also rename style into role, to remove the notion of appearance that style brings.


#### `TextStyle` Enum

Semantic style for a text element.

* `heading(level: Int)` for text headings, with an associated level
* `body` for a basic body paragraph
* `caption` for a caption associated to an image
* `footnote` for footnotes at the end of the resource
* `quote` for a blockquote
* `listItem` for a single list item

#### `TextSpan`

A span is a ranged text in a parent text element holding attributes.

##### Properties

* `locator: Locator`
* Locator targeting the text span in the resource.
* `language: Language?`
* BCP-47 language code.
* `text: String`
* Actual text content.

### Default implementations

#### `PublicationContentIterator` Class (implements `ContentIterator`)

A composite `ContentIterator` which iterates through a whole `Publication` and delegates the iteration inside a given resource to media type-specific iterators.

##### Constructors

* `PublicationContentIterator(publication: Publication, start: Locator?, resourceContentIteratorFactories: [ResourceContentIteratorFactory])`
* `publication` – The `Publication` which will be iterated through.
* `start` – Starting `Locator` in the publication.
* `resourceContentIteratorFactories` – List of `ResourceContentIteratorFactory` which will be used to create the iterator for each resource. The factories are tried in order until there's a match.

##### Function types

* `ResourceContentIteratorFactory = (resource: Resource, locator: Locator) -> ContentIterator?`
* Creates a `ContentIterator` instance for the given `resource`, starting from `locator`.
* Returns `null` if the resource media type is not supported.

#### `DefaultContentIterationService` Class (implements `ContentIterationService`)

This `ContentIterationService` takes a list of `ResourceContentIteratorFactory` and returns instances of `PublicationContentIterator`.

#### `HTMLResourceContentIterator` Class (implements `ContentIterator`)

A `ContentIterator` which can crawl through an HTML resource.


## Drawbacks and Limitations

### List or Tree?

Many media types represent their content as a tree (e.g. HTML DOM), while the proposed service iterates through a flat list of elements.

A flat list is much easier to manipulate, especially when starting from a mid-resource location. Besides, the features that would use this building block are mostly interested in getting the content flatten in the reading flow direction rather than a tree. However, we do loose information from the original content that might be useful in other use cases. There is a tension here that is difficult to resolve.

One particular element that would benefit from a tree structure is a list of items. We could add a new `TextStyle` supporting some local tree nodes for this.

## Future Possibilities

### Extending `HTMLResourceContentIterator`

HTML content can be quite complex and an app might want to filter out some DOM elements in the iteration. This could be implemented as an extension point in `HTMLResourceContentIterator`.