This repository contains the code, data-processing scripts, and modeling framework that accompany "Quantifying the Syntactic Information Content in Duration and Pause."
Our primary research questions are:
- How much information about syntax is carried by these specific prosodic features?
- Words themselves carry syntactic information; do prosodic features carry syntactic information beyond what the words already convey?
- How does this information content vary across different speech styles?
By answering these questions, we aim to provide formal, empirical evidence to evaluate accounts of the syntax-prosody interface (e.g., direct reference, indirect reference, prosody-driven syntax), pinpointing the degree to which, and the manner in which, word duration and pause length align with syntactic structure. Furthermore, this quantification offers a theoretical grounding for natural language processing (NLP) systems that leverage prosodic features to improve syntactic parsing, or vice versa.
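Concretely, each question reduces to a difference between conditional entropy estimates produced by the scripts below. A minimal sketch of the arithmetic, using placeholder numbers rather than results from the paper:

```python
# Information a prosodic feature F carries about syntax S, expressed as
# entropy reduction in bits. All numeric values below are placeholders.

def info_gain(h_s: float, h_s_given_f: float) -> float:
    """I(S; F) = H(S) - H(S|F): bits of syntactic information in F."""
    return h_s - h_s_given_f

# Hypothetical entropy estimates (bits):
H_S = 4.00          # H(S): parse entropy, unconditioned
H_S_DUR = 3.70      # H(S | duration)
H_S_W = 1.20        # H(S | W): conditioned on the words
H_S_W_DUR = 1.10    # H(S | W, duration)

print(f"I(S; duration)     = {info_gain(H_S, H_S_DUR):.2f} bits")      # Q1
print(f"I(S; duration | W) = {info_gain(H_S_W, H_S_W_DUR):.2f} bits")  # Q2
```

The second quantity isolates what duration contributes beyond the words themselves; comparing both across corpora addresses the third question.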
```
├── data/                             # .jsonl data files; contact Junghyun Min to obtain them
├── constituency/
│   ├── candor/                       # CANDOR pre-processing and alignment code and guide from Thomas H. Clark
│   │   ├── get_surprisals_candor.py
│   │   ├── merge_mfa_durations_candor.py
│   │   ├── postprocess_candor.py
│   │   ├── prep_for_mfa_candor.py
│   │   ├── process_av_features_candor.py
│   │   ├── mfa.sh
│   │   └── WORKFLOW_EXPLANATION.md
│   ├── data.py
│   ├── infer.py
│   ├── model.py
│   ├── significance.py
│   ├── stats.py
│   ├── util.py
│   └── wp2parse.py
├── scripts/                          # Scripts for calculating information-theoretic metrics
│   ├── run-mfa.sh                    # Run MFA to align CANDOR
│   ├── run-candor-preprocess.sh      # Preprocess the CANDOR dataset
│   ├── run-stanza.sh                 # Obtain silver Stanza parses
│   ├── run-t5-autoreg.sh             # Estimate H(S)
│   ├── run-dur2parse.sh              # Estimate H(S|duration)
│   ├── run-pau2parse.sh              # Estimate H(S|pause)
│   ├── run-text2parse.sh             # Estimate H(S|W)
│   ├── run-wd2parse.sh               # Estimate H(S|W,duration)
│   └── run-wp2parse.sh               # Estimate H(S|W,pause)
├── output/                           # Output tables, logs, and generated plots
├── requirements.txt                  # Python dependencies
├── viz.Rmd                           # Visualization (R Markdown)
└── README.md
```
- Clone the repository:
  ```shell
  git clone https://github.com/yourusername/syntax-prosody-information.git
  cd syntax-prosody-information
  ```
- Create a virtual environment and install dependencies:
  ```shell
  python -m venv prosody
  source prosody/bin/activate
  pip install -r requirements.txt
  ```
- Prepare the data
  - Download the LibriTTS data from https://www.openslr.org/60/.
  - Also download the MFA-aligned LibriTTSLabel data from https://github.com/kan-bayashi/LibriTTSLabel.
  - Run `constituency.util.LibriCorpusBuilder` to generate `data/constituency_corpus.json`.
  - Download the CANDOR data from https://guscooney.com/candor-dataset/.
  - Run Stages 1 and 2 outlined in `constituency/candor/WORKFLOW_EXPLANATION.md`; `scripts/run-candor-preprocess.sh` should be helpful.
  - Run MFA and merge the resulting values; `scripts/run-mfa.sh` should be helpful.
  - Run `constituency.util.CandorCorpusBuilder` to generate `data/candor_corpus.json`.

Note: MFA and the experimental environments did not play well together in my case; I am happy to provide the processed data on request.
All entropy-estimation experiments take the following form:

```shell
python -m constituency.wp2parse --use-duration --use-parse --data=candor --lr=3e-4
```

Refer to the bash scripts in `scripts/` and the argument parser in `constituency/wp2parse.py` for more details.
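Whether an entropy difference such as H(S|W) − H(S|W,duration) is reliably above zero can be checked with a paired bootstrap over per-sentence scores. The sketch below is a generic version of that technique, not the actual procedure in `constituency/significance.py`, which may differ:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap: fraction of resamples where the
    feature-less model (a) is NOT worse than the feature model (b).

    scores_a, scores_b: per-sentence negative log-likelihoods (bits),
    aligned by sentence; lower is better.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    count = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample sentences with replacement
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a <= mean_b:  # no improvement from the feature in this resample
            count += 1
    return count / n_resamples
```

A small returned fraction indicates the entropy reduction from the prosodic feature is consistent across resamples rather than driven by a few sentences.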
A preprint of this work is currently in preparation. If you use this code or methodology in your research prior to publication, please link to this repository and contact the author for the most up-to-date citation information.
Please refer to LICENSE.
Contact: Junghyun Min (jm3743@georgetown.edu)