Proposal: Content Iterator #177
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,223 @@ | ||
| # Content Iterator | ||
|
|
||
| * Author: [Mickaël Menu](https://github.com/mickael-menu) | ||
| * Review PR: [#177](https://github.com/readium/architecture/pull/177) | ||
|
|
||
| ## Summary | ||
|
|
||
| A new Publication Service to iterate through a publication's content extracted as semantic elements. | ||
|
|
||
|
|
||
| ## Motivation | ||
|
|
||
| The Content Iterator service provides a building block for many high-level features requiring access to the raw content of a publication, such as: | ||
|
|
||
| * Text to speech | ||
| * Accessibility readers | ||
| * Basic search | ||
| * Full-text search indexing | ||
| * Image or audio indexes | ||
|
|
||
| Today, implementing such features is complex because you need to: | ||
|
|
||
| 1. Optionally, find a starting location in one of the reading order resources. | ||
| 2. Iterate through the reading order, opening and closing resources when needed. | ||
| 3. Extract the textual or media content from the raw resource, which works differently for every supported media type. | ||
|
|
||
| The Content Iterator handles all that in a media type agnostic way. | ||
|
|
||
| ## Developer Guide | ||
|
|
||
| ### Iterating the content of a publication | ||
|
|
||
| First, check if a `Publication` can be iterated using `publication.isContentIterable`. | ||
|
|
||
| Then, you can request a `ContentIterator` from the publication: | ||
|
|
||
| ```typescript | ||
| let iterator = publication.contentIteratorFrom(locator) | ||
| if (iterator) { | ||
| ... | ||
| } | ||
| ``` | ||
|
|
||
| The starting `Locator` is optional. When omitted, the iteration starts from the very beginning of the publication. | ||
|
|
||
|
|
||
| Once you have a valid `ContentIterator` instance, you can crawl through the publication content in both directions, using `previous` or `next`. | ||
|
|
||
| ```typescript | ||
| for (let content = iterator.next(); content !== null; content = iterator.next()) { | ||
| ... | ||
| } | ||
|
|
||
| iterator.close() | ||
| ``` | ||
|
|
||
| These APIs return the `Content` elements in reading order until the iterator reaches the end of the publication (or the beginning, for `previous`). After that, any additional call returns `null`. | ||
|
|
||
| :warning: Don't forget to `close` the iterator when you are done, to discard opened resources. | ||
|
|
||
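The loop above can be made robust against errors with a `try`/`finally`. This is a minimal sketch, not the reference implementation: the interfaces are trimmed-down versions of the ones in the Reference Guide, the `href` field on `Locator` is assumed, and `ArrayContentIterator` and `forEachContent` are hypothetical helpers that exist only to make the example runnable.

```typescript
// Minimal shapes based on the Reference Guide; `href` is an assumed Locator field.
interface Locator { href: string }
interface Content { locator: Locator; data: unknown }
interface ContentIterator {
  previous(): Content | null
  next(): Content | null
  close(): void
}

// Hypothetical in-memory iterator, used only to make the sketch runnable.
class ArrayContentIterator implements ContentIterator {
  private readonly items: Content[]
  private index = 0 // cursor sits between elements: next() returns items[index]
  closed = false

  constructor(items: Content[]) {
    this.items = items
  }

  previous(): Content | null {
    return this.index > 0 ? this.items[--this.index] ?? null : null
  }

  next(): Content | null {
    return this.index < this.items.length ? this.items[this.index++] ?? null : null
  }

  close(): void {
    this.closed = true
  }
}

// Visits every remaining element, closing the iterator even if `visit` throws.
function forEachContent(iterator: ContentIterator, visit: (content: Content) => void): void {
  try {
    for (let content = iterator.next(); content !== null; content = iterator.next()) {
      visit(content)
    }
  } finally {
    iterator.close()
  }
}
```

Wrapping the loop this way means callers cannot forget `close`, at the cost of consuming the iterator in one go.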
| ### Extracting the data from `Content` elements | ||
|
|
||
| The `Content` elements are value objects containing: | ||
|
|
||
| * a `Locator` targeting the piece of content | ||
| * the associated `Data` | ||
|
|
||
| There are several types of `Data` which can be returned by the iterator. Depending on your use case, you might want to filter on the type of data to get only what you need. | ||
|
|
||
| #### Media data | ||
|
|
||
| Media data are returned when a resource embeds another media resource in the reading flow. They hold a publication `Link` to the embedded resource. | ||
|
|
||
| Two kinds of media data are currently supported: | ||
|
|
||
| * `Audio` for audio clips. | ||
| * `Image` for embedded images. | ||
|     * It also holds an optional `description` string for accessibility purposes. | ||
|
|
||
| #### Text data | ||
|
|
||
| The `Text` data is used for the text elements inlined in the publication resources. Each text element matches a semantic item represented as a `TextStyle` in the data object: | ||
|
|
||
| * `heading(level: Int)` for text headings, with an associated level | ||
| * `body` for a basic body paragraph | ||
| * `caption` for a caption associated with an image | ||
| * `footnote` for footnotes at the end of the resource | ||
| * `quote` for a blockquote | ||
| * `listItem` for a single list item | ||
|
|
||
| Each `Text` data is split into one or more spans, each containing the text content, a `Locator` and the language. | ||
|
|
||
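Filtering on the type of data could look like the following sketch. The proposal does not say how `Data` subtypes are distinguished at runtime, so the `kind` discriminant, the string-typed `style`, and the names `TextContentData`, `ImageContentData`, `AudioContentData` and `readableText` are all assumptions made for illustration.

```typescript
// Shapes loosely based on the Reference Guide. The `kind` discriminant and the
// string-typed `style` are assumptions of this sketch, not part of the proposal.
interface Locator { href: string }
interface TextSpan { locator: Locator; language?: string; text: string }
interface TextContentData { kind: "text"; style: string; spans: TextSpan[] }
interface ImageContentData { kind: "image"; link: { href: string }; description?: string }
interface AudioContentData { kind: "audio"; link: { href: string } }
type ContentData = TextContentData | ImageContentData | AudioContentData
interface Content { locator: Locator; data: ContentData }

// Returns the readable text of an element: the concatenated spans for text
// data, the optional description for images, and null for anything else.
function readableText(content: Content): string | null {
  const data = content.data
  switch (data.kind) {
    case "text":
      return data.spans.map((span) => span.text).join("")
    case "image":
      return data.description ?? null
    default:
      return null
  }
}
```

A text-to-speech feature might call something like `readableText` on each element and skip the `null` results, while an image index would look only at the `image` case.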
| ## Reference Guide | ||
|
|
||
| ### Types and APIs | ||
|
|
||
| #### `ContentIterationService` Interface (implements `PublicationService`) | ||
|
|
||
| Provides `ContentIterator` instances to crawl the content of a `Publication`. | ||
|
|
||
| ##### Methods | ||
|
|
||
| * `iteratorFrom(start: Locator?) -> ContentIterator?` | ||
|     * Creates a `ContentIterator` starting from the given location. | ||
|     * Returns `null` if no iterator can be created, for example because no resources are iterable. | ||
|
|
||
| ##### `Publication` Helpers | ||
|
|
||
| * `isContentIterable: Boolean = findService(ContentIterationService::class) != null` | ||
|     * Returns whether this `Publication` can be iterated on. | ||
| * `contentIteratorFrom(start: Locator?) -> ContentIterator? = findService(ContentIterationService::class)?.iteratorFrom(start)` | ||
|     * Creates a `ContentIterator` starting from the given location. | ||
|
|
||
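The two helpers can be sketched as thin wrappers around a service lookup. This is an illustrative sketch only: the `name` property used to locate the service, the constructor taking a list of services, and the private `findContentIterationService` method are assumptions standing in for `findService(ContentIterationService::class)`.

```typescript
interface Locator { href: string }
interface ContentIterator { close(): void } // previous() and next() omitted for brevity

// `name` is an assumed property, used here only to look the service up.
interface PublicationService { readonly name: string }
interface ContentIterationService extends PublicationService {
  iteratorFrom(start: Locator | null): ContentIterator | null
}

class Publication {
  private readonly services: PublicationService[]

  constructor(services: PublicationService[]) {
    this.services = services
  }

  // Stand-in for `findService(ContentIterationService::class)`.
  private findContentIterationService(): ContentIterationService | null {
    const found = this.services.find((service) => service.name === "content-iteration")
    return (found as ContentIterationService | undefined) ?? null
  }

  // True when a content iteration service is attached to this publication.
  get isContentIterable(): boolean {
    return this.findContentIterationService() !== null
  }

  // Returns null when there is no service, or when the service cannot create an iterator.
  contentIteratorFrom(start: Locator | null = null): ContentIterator | null {
    return this.findContentIterationService()?.iteratorFrom(start) ?? null
  }
}
```

Note that `isContentIterable` only reports that a service exists. Since `iteratorFrom` may still return `null`, callers should check the returned iterator as well.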
| #### `ContentIterator` Interface | ||
|
|
||
| Iterates over `Content` elements. | ||
|
|
||
| This interface does not depend on `Publication`. | ||
|
|
||
| ##### Methods | ||
|
|
||
| * `previous() -> Content?` | ||
|     * Returns the previous `Content` element in the iterator, or `null` when reaching the beginning. | ||
| * `next() -> Content?` | ||
|     * Returns the next `Content` element in the iterator, or `null` when reaching the end. | ||
|
Member
You should make it explicit that these methods make the iterator move forward or backward.
Member
Author
In the documentation comment? I'm fine making it more explicit there. I wouldn't change the function signature as it's pretty common though:
Member
Yeah, I was talking about the documentation. |
||
| * `close()` | ||
|     * Closes the iterator and discards the resources it holds. | ||
|
|
||
| #### `Content` Class | ||
|
|
||
| Represents a single semantic content element. | ||
|
|
||
| ##### Properties | ||
|
|
||
| * `locator: Locator` | ||
|     * Locator targeting this element in the `Publication`. | ||
|
Member
Will we be able to get a locator for any target data? Think of three successive images without HTML ids. I believe that neither fragments nor text after/before can be used.
Member
Author
I can't guarantee that it will be the case for all media types, but for the ones we have so far I think so. With image elements, you can use a If there's ever a case where we can't target precisely, this
Member
Ok, I see. |
||
| * `data: Content.Data` | ||
|     * Data associated with this element. | ||
|
|
||
| #### `Content.Data` Interface | ||
|
|
||
| A marker interface for the data associated with a `Content` element. | ||
|
|
||
| #### `Content.Data.Text` Class (implements `Content.Data`) | ||
|
|
||
| Holds a textual element's spans and style. | ||
|
|
||
| ##### Properties | ||
|
|
||
| * `style: TextStyle` | ||
|     * Semantic style for this element. | ||
| * `spans: [TextSpan]` | ||
|     * List of text spans in this element. | ||
|
Member
What's the rationale behind grouping multiple
Member
Author
For example if you have the body paragraph: We want the whole paragraph as a single semantic
{
  "locator": {},
  "data": {
    "style": "body",
    "spans": [
      { "text": "The correct pronunciation is ", "language": "en" },
      { "text": "croissant", "language": "fr" },
      { "text": ", and not croissant.", "language": "en" }
    ]
  }
}
Like @danielweck mentioned on the call, the term However thinking more about this, I think the term is still the most accurate. The problem happens only with HTML media types and the semantic seems to be correct:
I think it even matches the meaning from the HTML spec:
Member
Sounds good!
Member
Author
@danielweck I'm renaming
Member
Author
I will also rename |
||
|
|
||
| #### `TextStyle` Enum | ||
|
|
||
| Semantic style for a text element. | ||
|
|
||
| * `heading(level: Int)` for text headings, with an associated level | ||
| * `body` for a basic body paragraph | ||
| * `caption` for a caption associated with an image | ||
| * `footnote` for footnotes at the end of the resource | ||
| * `quote` for a blockquote | ||
| * `listItem` for a single list item | ||
|
|
||
| #### `TextSpan` | ||
|
|
||
| A span is a range of text within a parent text element, holding its own attributes. | ||
|
|
||
| ##### Properties | ||
|
|
||
| * `locator: Locator` | ||
|     * Locator targeting the text span in the resource. | ||
| * `language: Language?` | ||
|     * BCP-47 language code. | ||
| * `text: String` | ||
|     * Actual text content. | ||
|
|
||
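One way a text-to-speech feature might use the per-span `language` is to merge adjacent spans that share a language, so each chunk can be handed to a single voice. The sketch below assumes a simplified `TextSpan` without its `locator`; `Utterance` and `utterances` are hypothetical names that are not part of the proposal.

```typescript
// Simplified span: the `locator` property is left out of this sketch.
interface TextSpan { language?: string; text: string }

// Hypothetical unit of speech: a run of text in a single language.
interface Utterance { language: string | undefined; text: string }

// Merges consecutive spans that share a language into one utterance each.
function utterances(spans: TextSpan[]): Utterance[] {
  const result: Utterance[] = []
  for (const span of spans) {
    const last = result[result.length - 1]
    if (last !== undefined && last.language === span.language) {
      last.text += span.text
    } else {
      result.push({ language: span.language, text: span.text })
    }
  }
  return result
}
```

Only neighbouring spans are merged, so the order of the text is preserved and a paragraph that switches language mid-sentence still yields one utterance per switch.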
| ### Default implementations | ||
|
|
||
| #### `PublicationContentIterator` Class (implements `ContentIterator`) | ||
|
|
||
| A composite `ContentIterator` which iterates through a whole `Publication` and delegates the iteration inside a given resource to media type-specific iterators. | ||
|
|
||
| ##### Constructors | ||
|
|
||
| * `PublicationContentIterator(publication: Publication, start: Locator?, resourceContentIteratorFactories: [ResourceContentIteratorFactory])` | ||
|     * `publication` – The `Publication` which will be iterated through. | ||
|     * `start` – Starting `Locator` in the publication. | ||
|     * `resourceContentIteratorFactories` – List of `ResourceContentIteratorFactory` which will be used to create the iterator for each resource. The factories are tried in order until there's a match. | ||
|
|
||
| ##### Function types | ||
|
|
||
| * `ResourceContentIteratorFactory = (resource: Resource, locator: Locator) -> ContentIterator?` | ||
|     * Creates a `ContentIterator` instance for the given `resource`, starting from `locator`. | ||
|     * Returns `null` if the resource media type is not supported. | ||
|
|
||
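The "tried in order until there's a match" rule can be sketched as follows. Only the factory signature comes from the proposal: the `Resource` shape, the helper names `createResourceIterator` and `collectForward`, the forward-only traversal, and the choice to skip resources that no factory supports are assumptions of this simplified sketch.

```typescript
// Assumed minimal shapes; only the factory signature comes from the proposal.
interface Locator { href: string }
interface Resource { href: string; mediaType: string }
interface Content { locator: Locator; data: unknown }
interface ContentIterator {
  next(): Content | null
  close(): void
  // previous() is omitted to keep the sketch short.
}
type ResourceContentIteratorFactory =
  (resource: Resource, locator: Locator) => ContentIterator | null

// Tries each factory in order and returns the first iterator produced.
function createResourceIterator(
  resource: Resource,
  locator: Locator,
  factories: ResourceContentIteratorFactory[]
): ContentIterator | null {
  for (const factory of factories) {
    const iterator = factory(resource, locator)
    if (iterator !== null) return iterator
  }
  return null
}

// Forward-only walk over the reading order. Resources that no factory
// supports are skipped (an assumption of this sketch).
function collectForward(
  readingOrder: Resource[],
  factories: ResourceContentIteratorFactory[]
): Content[] {
  const collected: Content[] = []
  for (const resource of readingOrder) {
    const iterator = createResourceIterator(resource, { href: resource.href }, factories)
    if (iterator === null) continue
    try {
      for (let content = iterator.next(); content !== null; content = iterator.next()) {
        collected.push(content)
      }
    } finally {
      iterator.close() // release the resource before opening the next one
    }
  }
  return collected
}
```

Because the first matching factory wins, an app wanting to override the handling of a media type would presumably place its own factory before the default ones.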
| #### `DefaultContentIterationService` Class (implements `ContentIterationService`) | ||
|
|
||
| This `ContentIterationService` takes a list of `ResourceContentIteratorFactory` and returns instances of `PublicationContentIterator`. | ||
|
|
||
| #### `HTMLResourceContentIterator` Class (implements `ContentIterator`) | ||
|
|
||
| A `ContentIterator` which can crawl through an HTML resource. | ||
|
|
||
|
|
||
| ## Drawbacks and Limitations | ||
|
|
||
| ### List or Tree? | ||
|
|
||
| Many media types represent their content as a tree (e.g. HTML DOM), while the proposed service iterates through a flat list of elements. | ||
|
|
||
| A flat list is much easier to manipulate, especially when starting from a mid-resource location. Besides, the features that would use this building block are mostly interested in getting the content flattened in the reading flow direction rather than as a tree. However, we do lose information from the original content that might be useful in other use cases. There is a tension here that is difficult to resolve. | ||
|
|
||
| One particular element that would benefit from a tree structure is a list of items. We could add a new `TextStyle` supporting some local tree nodes for this. | ||
|
|
||
| ## Future Possibilities | ||
|
|
||
| ### Extending `HTMLResourceContentIterator` | ||
|
|
||
| HTML content can be quite complex and an app might want to filter out some DOM elements in the iteration. This could be implemented as an extension point in `HTMLResourceContentIterator`. | ||
|
|
||
Some new elements to take into account:
* `aria-label` or `title` child element
* `tt` child elements could be useful