Corplet

Corplet is a binary-corpus storage system that is designed to store extremely large sets of words in a space-optimized format. Corplet is particularly useful for speech and word tagging, and can be used for more complex categories than parts of speech.

Usage

Corplet is a scala package built with the sbt build system. Running the command $ sbt run in your shell or running the main object will start an interactive Repl where you can play around with Corplets.

----- Corplet Repl -----
------************------
|--By: Josh Weinstein--|
corp> open foo
:[Opened Corplet foo]
corp> insert foo rats
:[Inserted rats into Corp: foo]
corp> contains foo rats
:[foo = rats -> true]
corp> contains foo rats hello
:[foo = rats -> true]
:[foo = hello -> true]
corp> close foo
:[Closed Corplet foo]
corp> quit
:[Quit Corplet]

You can also use Corplet in scala or java projects.

Documentation

Click here to see the detailed documentation

Implementation

Corplet uses a binary-digit-trie data structure that are stored in .corp files. These tries are traversed through Java's nio file channels and MappedByteBuffers, which allow for asynchronus file input/output, and to only load minimal data into memory when traversing the corpus.

File Format

A .corp file consists of a three byte header, followed by an unlimited number of 243 byte BodyChunks that act as nodes in the binary trie.

Header:

[33][39][83]

BodyChunk:

-Gates-(Bytes)
[0][0][0][0][0]............[0] #27 Gates, 27 bytes

-Indexes-(64 bit Longs)
[0][0][0][0][0]............[0] #27 Indexes, 216 Bytes

Each index is a 64 bit number, identical to a Java Long, that acts as a pointer to some byte address later in the file. This is so the Corplet can be continuously grown by appending more chunks to the file without moving anything.

However, this also means a Corplet does not allow deletion of any words. You will need to create a new Corplet if you want to remove words.

`Phrases`:

A phrase is some string that contains any lower case letter or a space character. The characters in these strings

`Gate`:

Each gate can have the value 0, 1 or 2. This corresponds to whether or not a character mapped digit exists as a child node, or a phrase has ended.

0 -> There is no child chunk, and there is no phrase ending here.
1 -> There is a child chunk at the corresponding index slot, and a phrase has not ended here.
2 -> There could be a child chunk, or not, but a phrase ends here. This means the contans method will return true when finding a phrase that it's last character is mapped here.

Example

Here is a diagram to demonstrate how this works.

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
images		images
project		project
src		src
target		target
.gitattributes		.gitattributes
LICENSE.md		LICENSE.md
README.md		README.md
_config.yml		_config.yml
build.sbt		build.sbt
foo.corp		foo.corp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Corplet

Usage

Documentation

Implementation

File Format

`Phrases`:

`Gate`:

Example

About

Releases

Packages

Languages

License

jweinst1/corplet

Folders and files

Latest commit

History

Repository files navigation

Corplet

Usage

Documentation

Implementation

File Format

Phrases:

Gate:

Example

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

`Phrases`:

`Gate`:

Packages