5 changes: 5 additions & 0 deletions .gitignore
@@ -60,3 +60,8 @@ target/

#Ipython Notebook
.ipynb_checkpoints
/.env*
*.out
*.geany
/data
/raw
6 changes: 6 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,6 @@
repos:
-   repo: https://github.com/pre-commit/mirrors-yapf
    rev: v0.28.0
    hooks:
    -   id: yapf
        args: [--in-place, --parallel, --recursive]
13 changes: 13 additions & 0 deletions .style.yapf
@@ -0,0 +1,13 @@
[style]
BASED_ON_STYLE = pep8
COLUMN_LIMIT = 160
COALESCE_BRACKETS = true
DEDENT_CLOSING_BRACKETS = true
BLANK_LINE_BEFORE_NESTED_CLASS_OR_DEF = true
SPACES_BEFORE_COMMENT = 1
SPLIT_COMPLEX_COMPREHENSION = true
SPACE_BETWEEN_ENDING_COMMA_AND_CLOSING_BRACKET = false
SPLIT_PENALTY_FOR_ADDED_LINE_SPLIT = 10
CONTINUATION_INDENT_WIDTH = 4
INDENT_WIDTH = 4
CONTINUATION_ALIGN_STYLE = SPACE
13 changes: 13 additions & 0 deletions .travis.yml
@@ -0,0 +1,13 @@
dist: xenial
sudo: required
language: python
python:
- "3.6"
install:
- sudo add-apt-repository -y ppa:deadsnakes/ppa
- sudo apt-get -yq update
- sudo apt-get -yq install python3.6 python3.6-dev python3.7 python3.7-dev
- pip install -r requirements-test.txt
script:
- ./pep8.sh
- tox
1 change: 1 addition & 0 deletions .yapfignore
@@ -0,0 +1 @@

3 changes: 3 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,3 @@
include README.md
include requirements.txt
include requirements-test.txt
65 changes: 46 additions & 19 deletions README.md
@@ -1,14 +1,52 @@
**[DEMO](http://bark.phon.ioc.ee/punctuator)** and **[DEMO2](http://bark.phon.ioc.ee/punctuator/game)**

# Punctuator

[![](https://img.shields.io/pypi/v/punctuator.svg)](https://pypi.python.org/pypi/punctuator)
[![Build Status](https://img.shields.io/travis/chrisspen/punctuator2.svg?branch=master)](https://travis-ci.org/chrisspen/punctuator)

This is a fork of [Ottokar Tilk's punctuator2](https://github.com/ottokart/punctuator2), cleaned up into a proper Python 3 package with tests.

**[DEMO](http://bark.phon.ioc.ee/punctuator)** and **[DEMO2](http://bark.phon.ioc.ee/punctuator/game)**

A bidirectional recurrent neural network model with attention mechanism for restoring missing inter-word punctuation in unsegmented text.

The model can be trained in two stages (second stage is optional):

1. The first stage is trained on punctuation-annotated text. Here the model learns to restore punctuation based on textual features only.
2. The optional second stage can be trained on punctuation- *and* pause-annotated text. In this stage the model learns to combine pause durations with textual features and adapts to the target domain. If pauses are omitted, only adaptation is performed. The second stage with pause durations can be used, for example, to restore punctuation in the output of an automatic speech recognition system.

# Installation

To install:

    virtualenv -p python3.7 .env
    . .env/bin/activate
    pip install punctuator

Additionally, you'll need a trained model. You can create your own following the instructions below, or you can use a pre-trained model from [here](https://drive.google.com/drive/folders/0B7BsN5f2F1fZQnFsbzJ3TWxxMms?usp=sharing).

Place these models in the `PUNCTUATOR_DATA_DIR` directory, which defaults to `~/.punctuator`.
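
As an illustration of where a model file is expected to live, here is a minimal sketch. It is not code from this package: `resolve_model_path` is a hypothetical helper, and it assumes that `PUNCTUATOR_DATA_DIR` is read from the environment and falls back to `~/.punctuator` when unset.

```python
import os
from pathlib import Path


def resolve_model_path(model_name, environ=None):
    """Return the expected location of a model file inside the data directory.

    Hypothetical helper for illustration only.
    """
    environ = os.environ if environ is None else environ
    # Assumption: PUNCTUATOR_DATA_DIR is an environment variable;
    # the README only states that it defaults to ~/.punctuator.
    data_dir = environ.get("PUNCTUATOR_DATA_DIR", "~/.punctuator")
    return Path(data_dir).expanduser() / model_name


if __name__ == "__main__":
    print(resolve_model_path("Demo-Europarl-EN.pcl"))
```

Passing a plain dictionary as `environ` makes the lookup easy to check without touching the real environment.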

For example, to download `Demo-Europarl-EN.pcl`, activate your virtual environment and run:

    . .env/bin/activate
    mkdir -p ~/.punctuator
    cd ~/.punctuator
    gdown https://drive.google.com/uc?id=0B7BsN5f2F1fZd1Q0aXlrUDhDbnM

To download other model files, find the Google Drive id via the share link, and substitute that in the command above.

# Usage

To use from the command line:

    cat input.txt | python punctuator.py model.pcl output.txt

To use from Python:

    from punctuator import Punctuator
    p = Punctuator('model.pcl')
    print(p.punctuate('some text'))

# How well does it work?

* A working demo can be seen here: http://bark.phon.ioc.ee/punctuator
@@ -64,7 +102,7 @@ _Overall_ | _75.7_ | _73.9_ | _74.8_
```to <sil=0.000> be <sil=0.100> ,COMMA or <sil=0.000> not <sil=0.000> to <sil=0.000> be <sil=0.150> ,COMMA that <sil=0.000> is <sil=0.000> the <sil=0.000> question <sil=1.000> .PERIOD```

Second-phase data can also omit pause annotations, in which case only target-domain adaptation is performed.
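
The pause-annotated format shown above can be read with a small parser. This is a minimal sketch, not code from this repository: `parse_pause_annotated` is a hypothetical helper, and it assumes that every punctuation token is a single symbol followed by an uppercase name, as in `,COMMA` and `.PERIOD`.

```python
import re

# A pause token such as <sil=0.150>; the number is the pause in seconds.
SIL_RE = re.compile(r"<sil=(\d+(?:\.\d+)?)>")
# Assumption: punctuation tokens are one symbol plus an uppercase name.
PUNCT_RE = re.compile(r"[^\w\s<][A-Z]+")


def parse_pause_annotated(line):
    """Parse one annotated line into (word, pause_seconds, punctuation) tuples.

    pause_seconds and punctuation are None when the word has no annotation.
    """
    records = []
    for token in line.split():
        sil = SIL_RE.fullmatch(token)
        is_punct = PUNCT_RE.fullmatch(token) is not None
        if (sil or is_punct) and not records:
            raise ValueError("annotation %r appears before any word" % token)
        if sil:
            records[-1][1] = float(sil.group(1))  # pause after the last word
        elif is_punct:
            records[-1][2] = token  # punctuation after the last word
        else:
            records.append([token, None, None])  # a new word
    return [tuple(record) for record in records]


if __name__ == "__main__":
    sample = "to <sil=0.000> be <sil=0.100> ,COMMA or <sil=0.000> not"
    for record in parse_pause_annotated(sample):
        print(record)
```

Text without `<sil=...>` tokens parses the same way, with every pause left as `None`, which matches the adaptation-only case.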

Make sure that the first words of sentences are not capitalized; capitalization would give the model unfair hints about period locations. Also, the text files you use for training and validation must be large enough (at least `minibatch_size` x `sequence_length` words, which is 128 x 50 = 6400 words with default settings); otherwise you might get an error.
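
The size requirement can be checked before training starts. The following is a rough sketch, not part of the package: both function names are hypothetical, and the check counts whitespace-separated tokens, so punctuation and pause tokens are counted as words and the result is only an approximation.

```python
def minimum_words(minibatch_size=128, sequence_length=50):
    """Smallest corpus size in words; the defaults are taken from the README."""
    return minibatch_size * sequence_length


def is_large_enough(text, minibatch_size=128, sequence_length=50):
    """Approximate check: counts whitespace-separated tokens as words."""
    return len(text.split()) >= minimum_words(minibatch_size, sequence_length)


if __name__ == "__main__":
    print(minimum_words())
    print(is_large_enough("to be or not to be"))
```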

# Configuration
@@ -123,23 +161,12 @@ or with:

if you want to see which words the model sees as UNKs (OOVs).

+# Development

-# Citing
-
-The software is described in:
-
-    @inproceedings{tilk2016,
-      author = {Ottokar Tilk and Tanel Alum{\"a}e},
-      title = {Bidirectional Recurrent Neural Network with Attention Mechanism for Punctuation Restoration},
-      booktitle = {Interspeech 2016},
-      year = {2016}
-    }
+Run all tests with:

-We used the [release v1.0](https://github.com/ottokart/punctuator2/releases/tag/v1.0) in the paper.
+    export TESTNAME=; tox

-# Alternatives
+Run a specific test in a specific environment with:

-* A fork from this repository that uses additional prosodic features: https://github.com/alpoktem/punkProse
-* Convolutional neural network with slightly smaller accuracy but much higher speed (50x faster): https://github.com/vackosar/keras-punctuator (additional details here: https://github.com/ottokart/punctuator2/issues/14)
-* A general sequence labeling model: https://github.com/marekrei/sequence-labeler that can be used for punctuation restoration with small modifications (example here: https://github.com/ottokart/sequence-labeler). Punctuator2 can be probably used for other sequence labeling problems as well.
-* Our previous approach with unidirectional LSTM (less accurate, but useful if you don't want to use Theano): https://github.com/ottokart/punctuator
+    export TESTNAME=.test_punctuate; tox -e py37
33 changes: 0 additions & 33 deletions convert_to_readable.py

This file was deleted.

19 changes: 0 additions & 19 deletions example/run.sh

This file was deleted.

11 changes: 11 additions & 0 deletions format-yapf-changed.sh
@@ -0,0 +1,11 @@
#!/bin/bash
FILES=$(git status --porcelain | grep -E "\.py$" | grep -v migration | grep -v "^D " | grep -v "^ D " | grep -v "^R " | awk '{print $2}')
VENV=${VENV:-.env}
$VENV/bin/yapf --version
if [ -z "$FILES" ]
then
echo "No Python changes detected."
else
echo "Checking: $FILES"
$VENV/bin/yapf --in-place --recursive $FILES
fi
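
For readers who find the `grep` pipeline in `format-yapf-changed.sh` hard to follow, the same filtering can be sketched in Python. This is an illustration, not a file in this change: `changed_python_files` is a hypothetical name, it takes the text printed by `git status --porcelain`, and it assumes paths contain no spaces (Git quotes such paths, which this sketch does not handle).

```python
# Status codes the shell script filters out: staged delete,
# unstaged delete, and staged rename.
SKIPPED_STATUSES = {"D ", " D", "R "}


def changed_python_files(porcelain_output):
    """Return changed .py paths from `git status --porcelain` output."""
    files = []
    for line in porcelain_output.splitlines():
        if len(line) < 4:
            continue  # too short to hold "XY path"
        status, path = line[:2], line[3:]
        if status in SKIPPED_STATUSES:
            continue
        # Mirror `grep -v migration`: skip any path mentioning migrations.
        if path.endswith(".py") and "migration" not in path:
            files.append(path)
    return files


if __name__ == "__main__":
    sample = " M punctuator/punc.py\n?? notes.txt\nD  old.py\n"
    print(changed_python_files(sample))
```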
4 changes: 4 additions & 0 deletions format-yapf.sh
@@ -0,0 +1,4 @@
#!/bin/bash
# Note: use this rarely; rely on the pre-commit hook instead.
yapf --in-place --recursive punctuator
yapf --in-place --recursive setup.py
19 changes: 19 additions & 0 deletions init_virtualenv.sh
@@ -0,0 +1,19 @@
#!/bin/bash
set -e

cd "$(dirname "$0")"

CACHE_DIR=/tmp/pip
REL_DIR=./

# Remove existing virtualenv if it exists.
[ -d $REL_DIR.env ] && rm -Rf $REL_DIR.env

# Create virtual environment with Python 3.7 (requires python3-venv package on Ubuntu)
python3.7 -m venv $REL_DIR.env
. $REL_DIR.env/bin/activate
pip install -U pip
pip install -U setuptools
pip install pypandoc

pip install --cache-dir $CACHE_DIR -r requirements.txt -r requirements-test.txt
2 changes: 2 additions & 0 deletions pep8.sh
@@ -0,0 +1,2 @@
#!/bin/bash
pylint --rcfile=pylint.rc punctuator setup.py
5 changes: 5 additions & 0 deletions publish.sh
@@ -0,0 +1,5 @@
#!/bin/bash
set -e
. .env/bin/activate
python setup.py sdist
twine upload dist/*