Reproducible experimental protocols for multimedia (audio, video, text) databases.
$ pip install pyannote.database
In pyannote.database jargon, a resource can be any multimedia entity (e.g. an image, an audio file, a video file, or a webpage). In its simplest form, it is modeled as a pyannote.database.ProtocolFile instance (basically a dict on steroids) with a uri key (URI stands for unique resource identifier) that identifies the entity.
Metadata may be associated with a resource by adding keys to its ProtocolFile. For instance, one could add a label key to an image resource describing whether it depicts a chihuahua or a muffin.
A database is a collection of resources of the same nature (e.g. a collection of audio files). It is modeled as a pyannote.database.Database instance.
An experimental protocol (pyannote.database.Protocol) usually defines three subsets:
- a train subset (e.g. used to train a neural network),
- a development subset (e.g. used to tune hyper-parameters),
- a test subset (e.g. used for evaluation).
Experimental protocols are defined via YAML configuration files:
Protocols:
  MyDatabase:
    Protocol:
      MyProtocol:
        train:
          uri: /path/to/train.lst
        development:
          uri: /path/to/development.lst
        test:
          uri: /path/to/test.lst
where /path/to/train.lst contains the list of unique resource identifiers (URIs) of the files in the train subset:
# /path/to/train.lst
filename1
filename2
Since version 5.0, configuration files must be loaded into the registry as follows:
from pyannote.database import registry
registry.load_database("/path/to/database.yml")
registry.load_database takes an optional mode keyword argument that controls what to do when loading a protocol whose name (e.g. MyDatabase.Protocol.MyProtocol) is already used by another protocol:
- LoadingMode.OVERRIDE to override the existing protocol with the new one (default behavior);
- LoadingMode.KEEP to keep the existing protocol;
- LoadingMode.ERROR to raise a RuntimeException when such a conflict occurs.
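The three conflict policies can be pictured with a small stand-alone sketch. It does not use pyannote.database: the Mode enum and the register function are hypothetical names introduced only for illustration, and RuntimeError is used as a stand-in exception type.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical stand-in for the three loading modes listed above."""

    OVERRIDE = "override"
    KEEP = "keep"
    ERROR = "error"


def register(protocols: dict, name: str, protocol, mode: Mode = Mode.OVERRIDE):
    """Add `protocol` under `name`, resolving name conflicts according to `mode`."""
    if name in protocols:
        if mode is Mode.KEEP:
            return  # keep the existing protocol, ignore the new one
        if mode is Mode.ERROR:
            # RuntimeError is used here as a stand-in exception type
            raise RuntimeError(f"Protocol {name} is already registered.")
    protocols[name] = protocol  # new entry, or OVERRIDE


protocols = {}
name = "MyDatabase.Protocol.MyProtocol"
register(protocols, name, "first")
register(protocols, name, "second", mode=Mode.KEEP)
after_keep = protocols[name]
register(protocols, name, "third")  # default mode overrides
after_override = protocols[name]
try:
    register(protocols, name, "fourth", mode=Mode.ERROR)
    conflict_raised = False
except RuntimeError:
    conflict_raised = True
```

Here KEEP leaves the first entry untouched, the default mode replaces it, and ERROR refuses the conflicting entry.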
For backward compatibility with the 4.x branch, the following configuration files are loaded automatically when importing pyannote.database, in that order:
- ~/.pyannote/database.yml
- database.yml in the current working directory
- list of ;-separated path(s) in the PYANNOTE_DATABASE_CONFIG environment variable (e.g. /absolute/path.yml;relative/path.yml)
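The lookup order above can be sketched as follows. This is not the library's actual implementation: candidate_config_paths is a hypothetical helper, and only the three locations and the ; separator are taken from the list above.

```python
import os
from pathlib import Path


def candidate_config_paths(environ=None, cwd=None, home=None):
    """Return candidate configuration files in the documented order.

    Hypothetical helper for illustration only; it does not check
    whether the files actually exist.
    """
    environ = os.environ if environ is None else environ
    cwd = Path.cwd() if cwd is None else Path(cwd)
    home = Path.home() if home is None else Path(home)

    # 1. user-wide configuration file
    candidates = [home / ".pyannote" / "database.yml"]
    # 2. configuration file in the current working directory
    candidates.append(cwd / "database.yml")
    # 3. ';'-separated paths from the environment variable
    value = environ.get("PYANNOTE_DATABASE_CONFIG", "")
    candidates.extend(Path(p) for p in value.split(";") if p)
    return candidates


paths = candidate_config_paths(
    environ={"PYANNOTE_DATABASE_CONFIG": "/absolute/path.yml;relative/path.yml"},
    cwd="/work",
    home="/home/user",
)
```

The environ, cwd and home arguments exist only so that the sketch can be exercised without touching the real environment.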
Once loaded in the registry, protocols can be used in Python like this:
from pyannote.database import registry
registry.load_database("/path/to/database.yml")
protocol = registry.get_protocol('MyDatabase.Protocol.MyProtocol')
for resource in protocol.train():
    print(resource["uri"])
filename1
filename2
Paths defined in the configuration file can be absolute or relative to the directory containing the configuration file. For instance, the following file organization should work just fine:
.
├── database.yml
└── lists
    └── train.lst
with the content of database.yml as follows:
Protocols:
  MyDatabase:
    Protocol:
      MyProtocol:
        train:
          uri: lists/train.lst
The above MyDatabase.Protocol.MyProtocol protocol is not very useful, as it only allows iterating over a list of resources with a single 'uri' key. Metadata can be added to each resource with the following syntax:
Protocols:
  MyDatabase:
    Protocol:
      MyProtocol:
        train:
          uri: lists/train.lst
          speaker: rttms/train.rttm
          transcription: ctms/{uri}.ctm
and the following directory structure:
.
├── database.yml
├── lists
|   └── train.lst
├── rttms
|   └── train.rttm
└── ctms
    ├── filename1.ctm
    └── filename2.ctm
Now, resources have both 'speaker' and 'transcription' keys:
import pyannote.core
import spacy
from pyannote.database import registry
protocol = registry.get_protocol('MyDatabase.Protocol.MyProtocol')
for resource in protocol.train():
    assert "speaker" in resource
    assert isinstance(resource["speaker"], pyannote.core.Annotation)
    assert "transcription" in resource
    assert isinstance(resource["transcription"], spacy.tokens.Doc)
What happened exactly? Data loaders were automatically selected based on metadata file suffix:
- pyannote.database.loader.RTTMLoader for the speaker entry with .rttm suffix,
- pyannote.database.loader.CTMLoader for the transcription entry with .ctm suffix,
and used to populate speaker and transcription keys. In pseudo-code:
# instantiate loader registered with `.rttm` suffix
speaker = RTTMLoader('rttms/train.rttm')
# entries with {placeholders} serve as path templates
transcription_template = 'ctms/{uri}.ctm'
for resource in protocol.train():
    # unique resource identifier
    uri = resource['uri']
    # only select parts of `rttms/train.rttm` that are relevant to current resource,
    # convert it into a convenient data structure (here pyannote.core.Annotation),
    # and assign it to `'speaker'` resource key
    resource['speaker'] = speaker[uri]
    # replace placeholders in `transcription` path template
    ctm = transcription_template.format(uri=uri)
    # instantiate loader registered with `.ctm` suffix
    transcription = CTMLoader(ctm)
    # only select parts of the `ctms/{uri}.ctm` that are relevant to current resource
    # (here, most likely the whole file), convert it into a convenient data structure
    # (here spacy.tokens.Doc), and assign it to `'transcription'` resource key
    resource['transcription'] = transcription[uri]
pyannote.database provides built-in data loaders for a limited set of file formats: RTTMLoader for .rttm files, UEMLoader for .uem files, and CTMLoader for .ctm files. See the Custom data loaders section to learn how to add your own.
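Suffix-based selection of a loader can be thought of as a dictionary lookup keyed by file suffix. The sketch below is only an illustration under that assumption: FakeRTTMLoader, FakeCTMLoader, LOADERS and select_loader are hypothetical names, not part of pyannote.database.

```python
from pathlib import Path


class FakeRTTMLoader:
    """Stand-in for a loader registered with the '.rttm' suffix (hypothetical)."""

    def __init__(self, path: Path):
        self.path = path


class FakeCTMLoader:
    """Stand-in for a loader registered with the '.ctm' suffix (hypothetical)."""

    def __init__(self, path: Path):
        self.path = path


# hypothetical registry mapping a file suffix to a loader class
LOADERS = {".rttm": FakeRTTMLoader, ".ctm": FakeCTMLoader}


def select_loader(path_template: str):
    """Pick a loader class from the suffix of a (possibly templated) path."""
    suffix = Path(path_template).suffix
    if suffix not in LOADERS:
        raise ValueError(f"No loader registered for suffix {suffix!r}")
    return LOADERS[suffix]


speaker_loader = select_loader("rttms/train.rttm")
# placeholders such as {uri} do not affect the suffix
transcription_loader = select_loader("ctms/{uri}.ctm")
```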
When iterating over a protocol subset (e.g. using for resource in protocol.train()), resources are provided as instances of pyannote.database.ProtocolFile, which are basically dict instances whose values are computed lazily.
For instance, in the code above, the value returned by resource['speaker'] is only computed the first time it is accessed and then cached for all subsequent calls. See Custom data loaders section for more details.
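The compute-once-then-cache behavior can be illustrated with a plain dict subclass. This is a simplified sketch, not the actual ProtocolFile code: LazyFile and load_speaker are hypothetical names.

```python
class LazyFile(dict):
    """Dict whose missing keys are computed on first access, then cached.

    Simplified illustration only, not the actual ProtocolFile code.
    """

    def __init__(self, data, lazy=None):
        super().__init__(data)
        # maps a key to a function that computes its value from the file
        self.lazy = dict(lazy or {})

    def __missing__(self, key):
        if key not in self.lazy:
            raise KeyError(key)
        value = self.lazy[key](self)
        self[key] = value  # cache so the function runs only once
        return value


calls = []


def load_speaker(file):
    """Record each call so that caching can be observed."""
    calls.append(file["uri"])
    return f"annotation of {file['uri']}"


resource = LazyFile({"uri": "filename1"}, lazy={"speaker": load_speaker})
first = resource["speaker"]   # computed now
second = resource["speaker"]  # served from the cache
```

After both accesses, calls contains a single entry, which shows that the loading function ran only once.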
Similarly, resources can be augmented (or modified) on-the-fly with the preprocessors option of get_protocol. In the example below, a dummy key is added that simply returns the length of the uri string:
from pyannote.database import registry
from pyannote.database import ProtocolFile
def compute_dummy(resource: ProtocolFile):
    print("Computing 'dummy' key")
    return len(resource["uri"])
protocol = registry.get_protocol('Etape.SpeakerDiarization.TV',
                                 preprocessors={"dummy": compute_dummy})
resource = next(protocol.train())
resource["dummy"]
Computing 'dummy' key
A special case of preprocessors is pyannote.database.FileFinder, meant to automatically locate the media file associated with the uri.
Say audio files are available at the following paths:
.
└── /path/to
    └── audio
        ├── filename1.wav
        ├── filename2.mp3
        ├── filename3.wav
        ├── filename4.wav
        └── filename5.mp3
The FileFinder preprocessor relies on a Databases: section that should be added to the database.yml configuration file and indicates where to look for media files (using resource key placeholders):
Databases:
  MyDatabase:
    - /path/to/audio/{uri}.wav
    - /path/to/audio/{uri}.mp3
Protocols:
  MyDatabase:
    Protocol:
      MyProtocol:
        train:
          uri: lists/train.lst
Note that any pattern supported by pathlib.Path.glob is supported (but avoid ** as much as possible). Paths can also be relative to the location of database.yml. FileFinder will then do its best to locate the file at runtime:
from pyannote.database import registry
from pyannote.database import FileFinder
protocol = registry.get_protocol('MyDatabase.Protocol.MyProtocol',
                                 preprocessors={"audio": FileFinder()})
for resource in protocol.train():
    print(resource["audio"])
/path/to/audio/filename1.wav
/path/to/audio/filename2.mp3
A raw collection of files (i.e. without any train/development/test split) can be defined using the Collection task:
# ~/database.yml
Protocols:
  MyDatabase:
    Collection:
      MyCollection:
        uri: /path/to/collection.lst
        any_other_key: ... # see custom loader documentation
where /path/to/collection.lst contains the list of identifiers of the files in the collection:
# /path/to/collection.lst
filename1
filename2
filename3
It can then be used in Python like this:
from pyannote.database import registry
collection = registry.get_protocol('MyDatabase.Collection.MyCollection')
for file in collection.files():
    print(file["uri"])
filename1
filename2
filename3
A (temporal) segmentation protocol can be defined using the Segmentation task:
Protocols:
  MyDatabase:
    Segmentation:
      MyProtocol:
        classes:
          - speech
          - noise
          - music
        train:
          uri: /path/to/train.lst
          annotation: /path/to/train.rttm
          annotated: /path/to/train.uem
where /path/to/train.lst contains the list of identifiers of the files in the training set:
# /path/to/train.lst
filename1
filename2
/path/to/train.rttm contains the reference segmentation using the RTTM format:
# /path/to/train.rttm
SPEAKER filename1 1 3.168 0.800 <NA> <NA> speech <NA> <NA>
SPEAKER filename1 1 5.463 0.640 <NA> <NA> speech <NA> <NA>
SPEAKER filename1 1 5.496 0.574 <NA> <NA> music <NA> <NA>
SPEAKER filename1 1 10.454 0.499 <NA> <NA> music <NA> <NA>
SPEAKER filename2 1 2.977 0.391 <NA> <NA> noise <NA> <NA>
SPEAKER filename2 1 18.705 0.964 <NA> <NA> noise <NA> <NA>
SPEAKER filename2 1 22.269 0.457 <NA> <NA> speech <NA> <NA>
SPEAKER filename2 1 28.474 1.526 <NA> <NA> speech <NA> <NA>
/path/to/train.uem describes the annotated regions using UEM format:
filename1 NA 0.000 30.000
filename2 NA 0.000 30.000
filename2 NA 40.000 70.000
It is recommended to provide the annotated key even if it covers the whole file. Any part of annotation that lives outside of the provided annotated will be removed. It is also used by pyannote.metrics to remove un-annotated regions from the evaluation, and to prevent pyannote.audio from incorrectly considering empty un-annotated regions as negatives.
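The effect of annotated on annotation can be illustrated with plain tuples. This is a minimal sketch of interval cropping, not the library's code: crop is a hypothetical helper, and the segments are made up so that they straddle the boundaries of the annotated regions.

```python
def crop(segments, annotated):
    """Keep only the parts of `segments` that fall inside `annotated`.

    segments:  list of (start, end, label) tuples, in seconds
    annotated: list of (start, end) tuples, in seconds
    """
    cropped = []
    for start, end, label in segments:
        for region_start, region_end in annotated:
            new_start = max(start, region_start)
            new_end = min(end, region_end)
            if new_start < new_end:  # non-empty intersection
                cropped.append((new_start, new_end, label))
    return cropped


# annotated regions of filename2 from the UEM example above
annotated = [(0.0, 30.0), (40.0, 70.0)]
# hypothetical segments, chosen to straddle the region boundaries
segments = [(28.0, 32.0, "speech"), (35.0, 38.0, "noise"), (39.0, 41.0, "music")]
result = crop(segments, annotated)
```

The speech segment is trimmed at 30 seconds, the noise segment disappears because it lies entirely in the un-annotated gap, and the music segment only keeps the part after 40 seconds.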
The protocol can then be used in Python like this:
from pyannote.database import registry
protocol = registry.get_protocol('MyDatabase.Segmentation.MyProtocol')
for file in protocol.train():
    print(file["uri"])
    assert "annotation" in file
    assert "annotated" in file
filename1
filename2
A protocol can be defined specifically for speaker diarization using the SpeakerDiarization task:
Protocols:
  MyDatabase:
    SpeakerDiarization:
      MyProtocol:
        scope: file
        train:
          uri: /path/to/train.lst
          annotation: /path/to/train.rttm
          annotated: /path/to/train.uem
where /path/to/train.lst contains the list of identifiers of the files in the training set:
# /path/to/train.lst
filename1
filename2
/path/to/train.rttm contains the reference speaker diarization using the RTTM format:
# /path/to/train.rttm
SPEAKER filename1 1 3.168 0.800 <NA> <NA> speaker_A <NA> <NA>
SPEAKER filename1 1 5.463 0.640 <NA> <NA> speaker_A <NA> <NA>
SPEAKER filename1 1 5.496 0.574 <NA> <NA> speaker_B <NA> <NA>
SPEAKER filename1 1 10.454 0.499 <NA> <NA> speaker_B <NA> <NA>
SPEAKER filename2 1 2.977 0.391 <NA> <NA> speaker_C <NA> <NA>
SPEAKER filename2 1 18.705 0.964 <NA> <NA> speaker_C <NA> <NA>
SPEAKER filename2 1 22.269 0.457 <NA> <NA> speaker_A <NA> <NA>
SPEAKER filename2 1 28.474 1.526 <NA> <NA> speaker_A <NA> <NA>
/path/to/train.uem describes the annotated regions using UEM format:
filename1 NA 0.000 30.000
filename2 NA 0.000 30.000
filename2 NA 40.000 70.000
It is recommended to provide the annotated key even if it covers the whole file. Any part of annotation that lives outside of the provided annotated will be removed. It is also used by pyannote.metrics to remove un-annotated regions from the evaluation, and to prevent pyannote.audio from incorrectly considering empty un-annotated regions as non-speech.
It can then be used in Python like this:
from pyannote.database import registry
protocol = registry.get_protocol('MyDatabase.SpeakerDiarization.MyProtocol')
for file in protocol.train():
    print(file["uri"])
    assert "annotation" in file
    assert "annotated" in file
filename1
filename2
The scope parameter indicates the scope of speaker labels:
- file indicates that each file has its own set of speaker labels. There is no guarantee that speaker1 in filename1 is the same speaker as speaker1 in filename2.
- database indicates that all files in the database share the same set of speaker labels. speaker1 in database1/filename1 is the same speaker as speaker1 in database1/filename2.
- global indicates that the set of speaker labels is the same across all databases. speaker1 in database1 is the same speaker as speaker1 in database2.
scope is then directly accessible from file['scope'].
A simple speaker verification protocol can be defined by adding a trial entry to a SpeakerVerification task:
Protocols:
  MyDatabase:
    SpeakerVerification:
      MyProtocol:
        train:
          uri: /path/to/train.lst
          duration: /path/to/duration.map
          trial: /path/to/trial.txt
where /path/to/train.lst contains the list of identifiers of the files in the collection:
# /path/to/train.lst
filename1
filename2
filename3
...
/path/to/duration.map contains the duration of the files:
filename1 30.000
filename2 30.000
...
/path/to/trial.txt contains a list of trials:
1 filename1 filename2
0 filename1 filename3
...
1 stands for target trials and 0 for non-target trials.
In the example above, it means that the same speaker uttered files filename1 and filename2 and that filename1 and filename3 are from two different speakers.
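To make the trial file layout concrete, here is a minimal parsing sketch. parse_trials is a hypothetical helper, not a pyannote.database function; the dictionary keys simply mirror the reference, file1 and file2 keys used for trials elsewhere in this document.

```python
def parse_trials(lines):
    """Parse 'reference uri1 uri2' lines into dictionaries.

    Hypothetical helper; the parsing itself is only a sketch.
    """
    trials = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        reference, uri1, uri2 = line.split()
        trials.append(
            {
                "reference": int(reference),  # 1: target, 0: non-target
                "file1": {"uri": uri1},
                "file2": {"uri": uri2},
            }
        )
    return trials


trials = parse_trials(["1 filename1 filename2", "0 filename1 filename3"])
targets = [t for t in trials if t["reference"] == 1]
```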
The protocol can then be used in Python like this:
from pyannote.database import registry
protocol = registry.get_protocol('MyDatabase.SpeakerVerification.MyProtocol')
for trial in protocol.train_trial():
    print(f"{trial['reference']} {trial['file1']['uri']} {trial['file2']['uri']}")
1 filename1 filename2
0 filename1 filename3
Note that speaker verification protocols (SpeakerVerificationProtocol) are a subclass of speaker diarization protocols (SpeakerDiarizationProtocol). As such, they also define regular {subset} methods.
pyannote.database provides a way to combine several protocols (possibly from different databases) into one.
This is achieved by defining those "meta-protocols" in the configuration file with the special X database:
Requirements:
  - /path/to/my/database/database.yml # defines MyDatabase protocols
  - /path/to/my/other/database/database.yml # defines MyOtherDatabase protocols
Protocols:
  X:
    Protocol:
      MyMetaProtocol:
        train:
          MyDatabase.Protocol.MyProtocol: [train, development]
          MyOtherDatabase.Protocol.MyOtherProtocol: [train, ]
        development:
          MyDatabase.Protocol.MyProtocol: [test, ]
          MyOtherDatabase.Protocol.MyOtherProtocol: [development, ]
        test:
          MyOtherDatabase.Protocol.MyOtherProtocol: [test, ]
The new X.Protocol.MyMetaProtocol combines the train and development subsets of MyDatabase.Protocol.MyProtocol with the train subset of MyOtherDatabase.Protocol.MyOtherProtocol to build a meta train subset.
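Conceptually, a meta subset is the concatenation of the listed subsets. The sketch below shows this idea with itertools.chain on plain dictionaries; it is not how the library builds meta-protocols, and my_protocol, my_other_protocol and meta_subset are hypothetical stand-ins.

```python
from itertools import chain

# hypothetical stand-ins for the subsets of two existing protocols
my_protocol = {
    "train": [{"uri": "a1"}, {"uri": "a2"}],
    "development": [{"uri": "a3"}],
    "test": [{"uri": "a4"}],
}
my_other_protocol = {
    "train": [{"uri": "b1"}],
    "development": [{"uri": "b2"}],
    "test": [{"uri": "b3"}],
}

# same layout as the `train:` entry of the configuration above
meta_train_spec = [
    (my_protocol, ["train", "development"]),
    (my_other_protocol, ["train"]),
]


def meta_subset(spec):
    """Yield files from every listed subset, one protocol after another."""
    return chain.from_iterable(
        protocol[subset] for protocol, subsets in spec for subset in subsets
    )


meta_train = [file["uri"] for file in meta_subset(meta_train_spec)]
```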
This new "meta-protocol" can be used like any other protocol of the (fake) X database:
from pyannote.database import registry
protocol = registry.get_protocol('X.Protocol.MyMetaProtocol')
for resource in protocol.train():
    pass
For more complex protocols, you can create (and share) your own pyannote.database plugin.
A number of pyannote.database plugins are already available (search for pyannote.db on PyPI).
Everything about databases is stored in the registry.
from pyannote.database import registry
Any database can then be instantiated as follows:
database = registry.get_database("MyDatabase")
Some databases (especially multimodal ones) may be used for several tasks.
One can get a list of tasks using the get_tasks method:
database.get_tasks()
["SpeakerDiarization"]
pyannote.database provides built-in data loaders for a limited set of file formats: RTTMLoader for .rttm files, UEMLoader for .uem files, and CTMLoader for .ctm files.
In case those are not enough, pyannote.database supports the addition of custom data loaders using the pyannote.database.loader entry point.
Here is an example of a Python package called your_package that defines two custom data loaders for files with .ext1 and .ext2 suffix respectively.
# ~~~~~~~~~~~~~~~~ YourPackage/your_package/loader.py ~~~~~~~~~~~~~~~~
from pathlib import Path
from typing import Text
from pyannote.database import ProtocolFile
class Ext1Loader:
    def __init__(self, ext1: Path):
        print(f'Initializing Ext1Loader with {ext1}')
        # your code should obviously do something smarter.
        # see pyannote.database.loader.RTTMLoader for an example.
        self.ext1 = ext1
    def __call__(self, current_file: ProtocolFile) -> Text:
        uri = current_file["uri"]
        print(f'Processing {uri} with Ext1Loader')
        # your code should obviously do something smarter.
        # see pyannote.database.loader.RTTMLoader for an example.
        return f'{uri}.ext1'
class Ext2Loader:
    def __init__(self, ext2: Path):
        print(f'Initializing Ext2Loader with {ext2}')
        # your code should obviously do something smarter.
        # see pyannote.database.loader.RTTMLoader for an example.
        self.ext2 = ext2
    def __call__(self, current_file: ProtocolFile) -> Text:
        uri = current_file["uri"]
        print(f'Processing {uri} with Ext2Loader')
        # your code should obviously do something smarter.
        # see pyannote.database.loader.RTTMLoader for an example.
        return f'{uri}.ext2'
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The __init__ method expects a unique positional argument of type Path that provides the path to the data file in the custom data format.
__call__ expects a unique positional argument of type ProtocolFile and returns the data for the given file.
It is recommended to make __init__ as fast and light as possible and delegate all the data filtering and formatting to __call__. For instance, RTTMLoader.__init__ uses pandas to load the full .rttm file as fast as possible in a DataFrame, while RTTMLoader.__call__ takes care of selecting rows that correspond to the requested file and converting them into a pyannote.core.Annotation.
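The split between a light __init__ and a filtering __call__ can be sketched with a loader for two-column text files. DurationLoader is a hypothetical example written for this illustration (it is not one of the built-in loaders), and a plain dict stands in for a ProtocolFile so that the sketch runs without pyannote.database installed.

```python
import tempfile
from pathlib import Path


class DurationLoader:
    """Hypothetical loader for 'uri duration' text files."""

    def __init__(self, path: Path):
        # keep __init__ light: read the raw lines once, parse nothing yet
        self.lines = Path(path).read_text().splitlines()

    def __call__(self, current_file) -> float:
        # filtering and formatting happen here, for the requested file only
        uri = current_file["uri"]
        for line in self.lines:
            fields = line.split()
            if len(fields) == 2 and fields[0] == uri:
                return float(fields[1])
        raise KeyError(f"No duration found for {uri}")


with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / "duration.map"
    path.write_text("filename1 30.000\nfilename2 45.500\n")
    loader = DurationLoader(path)
    # a plain dict stands in for a ProtocolFile here
    duration = loader({"uri": "filename2"})
```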
At this point, pyannote.database has no idea of the existence of these new custom data loaders. They must be registered using the pyannote.database.loader entry point in your_package's setup.py, and the library must then be installed with pip install your_package (or pip install -e YourPackage/ if it is not published on PyPI yet).
# ~~~~~~~~~~~~~~~~~~~~~~~ YourPackage/setup.py ~~~~~~~~~~~~~~~~~~~~~~~
from setuptools import setup, find_packages
setup(
    name="your_package",
    packages=find_packages(),
    install_requires=[
        "pyannote.database >= 4.0",
    ],
    entry_points={
        "pyannote.database.loader": [
            # load files with extension '.ext1'
            # with your_package.loader.Ext1Loader
            ".ext1 = your_package.loader:Ext1Loader",
            # load files with extension '.ext2'
            # with your_package.loader.Ext2Loader
            ".ext2 = your_package.loader:Ext2Loader",
        ],
    }
)
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Now that .ext1 and .ext2 data loaders are registered, they will be used automatically by pyannote.database when parsing the sample demo/database.yml custom protocol configuration file.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~ demo/database.yml ~~~~~~~~~~~~~~~~~~~~~~~~~~
Protocols:
  MyDatabase:
    SpeakerDiarization:
      MyProtocol:
        train:
          uri: train.lst
          key1: train.ext1
          key2: train.ext2
# tell pyannote.database about the configuration file
>>> from pyannote.database import registry
>>> registry.load_database('demo/database.yml')
# load custom protocol
>>> protocol = registry.get_protocol('MyDatabase.SpeakerDiarization.MyProtocol')
# get first file of training set
>>> first_file = next(protocol.train())
Initializing Ext1Loader with file train.ext1
Initializing Ext2Loader with file train.ext2
# access its "key1" and "key2" keys.
>>> assert first_file["key1"] == 'fileA.ext1'
Processing fileA with Ext1Loader
>>> assert first_file["key2"] == 'fileA.ext2'
Processing fileA with Ext2Loader
# note how __call__ is only called now (and not before)
# this is why it is better to delegate all the filtering and formatting to __call__
>>> assert first_file["key1"] == 'fileA.ext1'
# note how __call__ is not called the second time thanks to ProtocolFile built-in cache
An experimental protocol can be defined programmatically by creating a class that inherits from Protocol and implements at least one of the train_iter, development_iter and test_iter methods:
class MyProtocol(Protocol):
    def train_iter(self) -> Iterator[Dict]:
        yield {"uri": "filename1", "any_other_key": "..."}
        yield {"uri": "filename2", "any_other_key": "..."}
{subset}_iter should return an iterator of dictionaries with
- "uri" key (mandatory) that provides a unique file identifier (usually the filename),
- any other key that the protocol may provide.
It can then be used in Python like this:
protocol = MyProtocol()
for file in protocol.train():
    print(file["uri"])
filename1
filename2
A collection can be defined programmatically by creating a class that inherits from CollectionProtocol and implements the files_iter method:
class MyCollection(CollectionProtocol):
    def files_iter(self) -> Iterator[Dict]:
        yield {"uri": "filename1", "any_other_key": "..."}
        yield {"uri": "filename2", "any_other_key": "..."}
        yield {"uri": "filename3", "any_other_key": "..."}
files_iter should return an iterator of dictionaries with
- a mandatory "uri" key that provides a unique file identifier (usually the filename),
- any other key that the collection may provide.
It can then be used in Python like this:
collection = MyCollection()
for file in collection.files():
    print(file["uri"])
filename1
filename2
filename3
A speaker diarization protocol can be defined programmatically by creating a class that inherits from SpeakerDiarizationProtocol and implements at least one of the train_iter, development_iter and test_iter methods:
class MySpeakerDiarizationProtocol(SpeakerDiarizationProtocol):
    def train_iter(self) -> Iterator[Dict]:
        yield {"uri": "filename1",
               "annotation": Annotation(...),
               "annotated": Timeline(...)}
        yield {"uri": "filename2",
               "annotation": Annotation(...),
               "annotated": Timeline(...)}
{subset}_iter should return an iterator of dictionaries with
- "uri" key (mandatory) that provides a unique file identifier (usually the filename),
- "annotation" key (mandatory for train and development subsets) that provides reference speaker diarization as a pyannote.core.Annotation instance,
- "annotated" key (recommended) that describes which part of the file has been annotated, as a pyannote.core.Timeline instance. Any part of "annotation" that lives outside of the provided "annotated" will be removed. This is also used by pyannote.metrics to remove un-annotated regions from its evaluation report, and by pyannote.audio to not consider empty un-annotated regions as non-speech.
- any other key that the protocol may provide.
It can then be used in Python like this:
protocol = MySpeakerDiarizationProtocol()
for file in protocol.train():
    print(file["uri"])
filename1
filename2
A speaker verification protocol implements the {subset}_trial functions, useful in the speaker verification validation process. Note that SpeakerVerificationProtocol is a subclass of SpeakerDiarizationProtocol. As such, it shares the same {subset}_iter methods, and requires a {subset}_iter method.
A speaker verification protocol can be defined programmatically by creating a class that inherits from SpeakerVerificationProtocol and implements at least one of the train_trial_iter, development_trial_iter and test_trial_iter methods:
class MySpeakerVerificationProtocol(SpeakerVerificationProtocol):
    def train_iter(self) -> Iterator[Dict]:
        yield {"uri": "filename1",
               "annotation": Annotation(...),
               "annotated": Timeline(...)}
        yield {"uri": "filename2",
               "annotation": Annotation(...),
               "annotated": Timeline(...)}
    def train_trial_iter(self) -> Iterator[Dict]:
        yield {"reference": 1,
               "file1": ProtocolFile(...),
               "file2": ProtocolFile(...)}
        yield {"reference": 0,
               "file1": {
                   "uri": "filename1",
                   "try_with": Timeline(...)
               },
               "file2": {
                   "uri": "filename3",
                   "try_with": Timeline(...)
               }
               }
{subset}_trial_iter should return an iterator of dictionaries with
- reference key (mandatory) that provides an int indicating whether file1 and file2 are uttered by the same speaker (1 is same, 0 is different),
- file1 key (mandatory) that provides the first file,
- file2 key (mandatory) that provides the second file.
Both file1 and file2 should be provided as dictionaries or pyannote.database.protocol.protocol.ProtocolFile instances with
- uri key (mandatory),
- try_with key (mandatory) that describes which part of the file should be used in the validation process, as a pyannote.core.Timeline instance,
- any other key that the protocol may provide.
It can then be used in Python like this:
protocol = MySpeakerVerificationProtocol()
for trial in protocol.train_trial():
    print(f"{trial['reference']} {trial['file1']['uri']} {trial['file2']['uri']}")
1 filename1 filename2
0 filename1 filename3