1 change: 1 addition & 0 deletions docs/src/user-docs/index.md
@@ -50,6 +50,7 @@ Reference docs like format specifications, etc.
:caption: Quick start

quick-start/index
quick-start/text-v-json
quick-start/clp-json
quick-start/clp-text
:::
2 changes: 1 addition & 1 deletion docs/src/user-docs/quick-start/clp-json.md
@@ -45,7 +45,7 @@ sbin/compress.sh --timestamp-key '<timestamp-key>' <path1> [<path2> ...]

* `<path...>` are paths to JSON log files or directories containing such files.
* Each JSON log file should contain each log event as a
[separate JSON object](./index.md#clp-json), i.e., *not* as an array.
[separate JSON object](./text-v-json.md#clp-json), i.e., *not* as an array.
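
If your logs are stored as a single JSON array, they would need to be rewritten as separate objects before compression. The following is a minimal sketch of one way to do that, not part of CLP: `array_to_separate_objects` is a hypothetical helper, and it assumes the whole array fits in memory.

```python
import json


def array_to_separate_objects(array_text: str) -> str:
    """Rewrite a JSON array of log events as one JSON object per line.

    Hypothetical helper for illustration; it is not part of CLP.
    """
    events = json.loads(array_text)
    if not isinstance(events, list):
        raise ValueError("expected a top-level JSON array")
    # Emit each event as its own compact JSON object, one per line.
    lines = [json.dumps(event, separators=(",", ":")) for event in events]
    return "\n".join(lines) + "\n"


array_text = '[{"msg": "a"}, {"msg": "b"}]'
converted = array_to_separate_objects(array_text)
print(converted, end="")
```

The output contains two independent objects, one per line, with no enclosing brackets or separating commas.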

The compression script will output the compression ratio of each dataset you compress, or you can
use the UI to view overall statistics.
61 changes: 6 additions & 55 deletions docs/src/user-docs/quick-start/index.md
@@ -49,68 +49,19 @@ install or upgrade it by following the instructions for your OS.

There are two flavors of CLP:

* **[clp-json](#clp-json)** for compressing and searching **JSON** logs.
* **[clp-text](#clp-text)** for compressing and searching **unstructured text** logs.
* **`clp-json`** for compressing and searching **JSON** logs.
* **`clp-text`** for compressing and searching **unstructured text** logs.

:::{note}
Both flavors contain the same binaries but are configured with different values for the
`package.storage_engine` key in the package's config file (`etc/clp-config.yml`).
:::

### clp-json

The JSON flavor of CLP is appropriate for JSON logs, where each log event is an independent JSON
object. For example:

```json lines
{
"t": {
"$date": "2023-03-21T23:46:37.392"
},
"ctx": "conn11",
"msg": "Waiting for write concern."
}
{
"t": {
"$date": "2023-03-21T23:46:37.392"
},
"msg": "Set last op to system time"
}
```

The log file above contains two log events represented by two JSON objects printed one after the
other. Whitespace is ignored, so the log events could also appear with no newlines or indentation.

If you're using JSON logs, download and extract the `clp-json` release from the
[Releases][clp-releases] page, then proceed to the [clp-json quick-start](./clp-json.md) guide.

### clp-text

The text flavor of CLP is appropriate for unstructured text logs, where each log event contains a
timestamp and may span one or more lines.

:::{note}
If your logs don't contain timestamps or CLP can't automatically parse the timestamps in your logs,
it will treat each line as an independent log event.
:::

For example:

```text
2015-03-23T15:50:17.926Z INFO container_1 Transitioned from ALLOCATED to ACQUIRED
2015-03-23T15:50:17.927Z ERROR Scheduler: Error trying to assign container token
java.lang.IllegalArgumentException: java.net.UnknownHostException: i-e5d112ea
at org.apache.hadoop.security.buildTokenService(SecurityUtil.java:374)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
Caused by: java.net.UnknownHostException: i-e5d112ea
... 17 more
```

The log file above contains two log events, both beginning with a timestamp. The first is a single
line, while the second contains multiple lines.
Download and extract your chosen flavor from the [Releases][clp-releases] page, and then proceed to
the [appropriate quick-start guide](#using-clp).

If you're using unstructured text logs, download and extract the `clp-text` release from the
[Releases][clp-releases] page, then proceed to the [clp-text quick-start](./clp-text.md) guide.
If you're unsure which flavor suits your logs, or you'd like to compare the capabilities of the two
flavors, see the [clp-text vs. clp-json](./text-v-json.md) page.

---

115 changes: 115 additions & 0 deletions docs/src/user-docs/quick-start/text-v-json.md
@@ -0,0 +1,115 @@
# clp-text vs. clp-json

CLP comes in two flavors:

* **[clp-json](#clp-json)** for compressing and searching **JSON** logs.
* **[clp-text](#clp-text)** for compressing and searching **unstructured text** logs.

:::{note}
Both flavors contain the same binaries but are configured with different values for the
`package.storage_engine` key in the package's config file (`etc/clp-config.yml`).
:::

[Table 1](#table-1) compares the capabilities and limitations of the two flavors.

(table-1)=
:::{card}
<style>
.g,.r,.o{font-weight:700;font-style:normal}
.g::after{content:"✓";color:green}
.r::after{content:"✗";color:red}
.o::after{content:"!";color:orange}
</style>

|Capability|`clp-text`|`clp-json`|
|---|:---:|:---:|
|Compression of unstructured text logs|<b class="g"></b>|<b class="r"></b>|
|Compression of JSON logs|<b class="o"></b><sup>1</sup>|<b class="g"></b>|
|Compression of CLP IR files|<b class="g"></b>|<b class="r"></b>|
|Compression of CLP KV-IR files|<b class="r"></b>|<b class="r"></b>|
|Command line search|<b class="g"></b>|<b class="g"></b>|
|WebUI search|<b class="g"></b>|<b class="g"></b>|
|Decompression|<b class="g"></b>|<b class="r"></b>|
|Automatic timestamp parsing|<b class="o"></b><sup>2</sup>|<b class="o"></b><sup>2, 3</sup>|
|Preservation of time zone information|<b class="r"></b><sup>4</sup>|<b class="r"></b><sup>4</sup>|
|Retention control|<b class="g"></b>|<b class="g"></b>|
|Archive management|<b class="g"></b>|<b class="g"></b>|
|Dataset management|<b class="r"></b>|<b class="g"></b>|
|S3 support|<b class="r"></b>|<b class="g"></b>|
|Multi-node deployment|<b class="g"></b>|<b class="g"></b>|
|CLP + Presto integration|<b class="r"></b>|<b class="g"></b>|
|Parallel compression|<b class="g"></b>|<b class="g"></b>|

+++
**Table 1**: The capabilities and limitations of CLP's two flavors.

1) `clp-text` can compress and search JSON logs as if they were unstructured text, but it cannot
query individual fields.
2) Timestamp parsing is limited to specific supported formats; see
[clp-text timestamp formats][ts-text] and [clp-json timestamp formats][ts-json] for more details.
3) Timestamps are parsed automatically as long as the timestamp key for the logs is provided at
compression time using the `--timestamp-key` flag.
4) We hope to support preserving time zone information in a future update (tracked in
[this issue](https://github.com/y-scope/clp/issues/1290)).
:::
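
To shortlist a flavor from Table 1, the table can be transcribed into a small lookup. The sketch below is illustrative only and is not part of CLP: the support levels `yes`, `partial`, and `no` are assumed readings of the green, orange, and red markers, and `flavors_supporting` is a hypothetical helper.

```python
from typing import Dict, List

FLAVORS = ("clp-text", "clp-json")

# Rows transcribed from Table 1 as (clp-text, clp-json) support levels.
# Assumption: green = "yes", orange = "partial", red = "no".
_TABLE = {
    "Compression of unstructured text logs": ("yes", "no"),
    "Compression of JSON logs": ("partial", "yes"),
    "Compression of CLP IR files": ("yes", "no"),
    "Compression of CLP KV-IR files": ("no", "no"),
    "Command line search": ("yes", "yes"),
    "WebUI search": ("yes", "yes"),
    "Decompression": ("yes", "no"),
    "Automatic timestamp parsing": ("partial", "partial"),
    "Preservation of time zone information": ("no", "no"),
    "Retention control": ("yes", "yes"),
    "Archive management": ("yes", "yes"),
    "Dataset management": ("no", "yes"),
    "S3 support": ("no", "yes"),
    "Multi-node deployment": ("yes", "yes"),
    "CLP + Presto integration": ("no", "yes"),
    "Parallel compression": ("yes", "yes"),
}

CAPABILITIES: Dict[str, Dict[str, str]] = {
    name: dict(zip(FLAVORS, levels)) for name, levels in _TABLE.items()
}


def flavors_supporting(required: List[str]) -> List[str]:
    """Return the flavors that at least partially support every required capability."""
    return [
        flavor
        for flavor in FLAVORS
        if all(CAPABILITIES[cap][flavor] != "no" for cap in required)
    ]


print(flavors_supporting(["S3 support", "WebUI search"]))
print(flavors_supporting(["Decompression"]))
```

Partial support counts as a match here, so check the table's footnotes before relying on a shortlisted flavor.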

## clp-json

The JSON flavor of CLP is appropriate for JSON logs, where each log event is an independent JSON
object. For example:

```json lines
{
"t": {
"$date": "2023-03-21T23:46:37.392"
},
"ctx": "conn11",
"msg": "Waiting for write concern."
}
{
"t": {
"$date": "2023-03-21T23:46:37.392"
},
"msg": "Set last op to system time"
}
```

The log file above contains two log events represented by two JSON objects printed one after the
other. Whitespace is ignored, so the log events could also appear with no newlines or indentation.
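
To check that a file follows this layout before compressing it, you could split it into its top-level objects yourself. The sketch below uses Python's standard `json.JSONDecoder.raw_decode`; `iter_json_events` is a hypothetical helper, and this is not how CLP itself parses logs.

```python
import json
from typing import Any, Iterator


def iter_json_events(text: str) -> Iterator[Any]:
    """Yield each top-level JSON value from a string of concatenated values."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        event, pos = decoder.raw_decode(text, pos)
        yield event


pretty = """
{
  "t": {"$date": "2023-03-21T23:46:37.392"},
  "ctx": "conn11",
  "msg": "Waiting for write concern."
}
{
  "t": {"$date": "2023-03-21T23:46:37.392"},
  "msg": "Set last op to system time"
}
"""
# The same two events with all optional whitespace removed.
compact = "".join(
    json.dumps(event, separators=(",", ":")) for event in iter_json_events(pretty)
)

print(len(list(iter_json_events(pretty))))
print(list(iter_json_events(pretty)) == list(iter_json_events(compact)))
```

Both inputs yield the same two events, which illustrates that newlines and indentation between or inside objects do not change the log's content.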

If you're using JSON logs, download and extract the `clp-json` release from the
[Releases][clp-releases] page, then proceed to the [clp-json quick-start](./clp-json.md) guide.

## clp-text

The text flavor of CLP is appropriate for unstructured text logs, where each log event contains a
timestamp and may span one or more lines.

:::{note}
If your logs don't contain timestamps or CLP can't automatically parse the timestamps in your logs,
it will treat each line as an independent log event.
:::

For example:

```text
2015-03-23T15:50:17.926Z INFO container_1 Transitioned from ALLOCATED to ACQUIRED
2015-03-23T15:50:17.927Z ERROR Scheduler: Error trying to assign container token
java.lang.IllegalArgumentException: java.net.UnknownHostException: i-e5d112ea
at org.apache.hadoop.security.buildTokenService(SecurityUtil.java:374)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
Caused by: java.net.UnknownHostException: i-e5d112ea
... 17 more
```

The log file above contains two log events, both beginning with a timestamp. The first is a single
line, while the second contains multiple lines.
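
The grouping described above can be sketched as follows. This is only an illustration of the idea, not CLP's actual parser: `TIMESTAMP_PREFIX` is an assumed pattern that matches just the timestamp format in the example, and `split_into_events` is a hypothetical helper.

```python
import re
from typing import List

# Assumption: a line starting with an ISO 8601-like timestamp begins a new event.
TIMESTAMP_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z?\s")


def split_into_events(lines: List[str]) -> List[List[str]]:
    """Group lines into log events, each starting at a timestamped line."""
    # With no recognizable timestamps, treat every line as its own event.
    if not any(TIMESTAMP_PREFIX.match(line) for line in lines):
        return [[line] for line in lines]
    events: List[List[str]] = []
    for line in lines:
        if TIMESTAMP_PREFIX.match(line) or not events:
            events.append([line])
        else:
            # Lines without a timestamp continue the current event.
            events[-1].append(line)
    return events


log_lines = [
    "2015-03-23T15:50:17.926Z INFO container_1 Transitioned from ALLOCATED to ACQUIRED",
    "2015-03-23T15:50:17.927Z ERROR Scheduler: Error trying to assign container token",
    "java.lang.IllegalArgumentException: java.net.UnknownHostException: i-e5d112ea",
    "at org.apache.hadoop.security.buildTokenService(SecurityUtil.java:374)",
    "at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)",
    "Caused by: java.net.UnknownHostException: i-e5d112ea",
    "... 17 more",
]

for event in split_into_events(log_lines):
    print(len(event), "line(s):", event[0][:40])
```

Applied to the example log, the sketch produces one single-line event and one six-line event; applied to lines without timestamps, it falls back to one event per line.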

If you're using unstructured text logs, download and extract the `clp-text` release from the
[Releases][clp-releases] page, then proceed to the [clp-text quick-start](./clp-text.md) guide.

[clp-releases]: https://github.com/y-scope/clp/releases
<!-- markdownlint-disable-next-line MD013 -->
[ts-text]: https://github.com/y-scope/clp/blob/main/components/core/src/clp/TimestampPattern.cpp#L120
<!-- markdownlint-disable-next-line MD013 -->
[ts-json]: https://github.com/y-scope/clp/blob/main/components/core/src/clp_s/TimestampPattern.cpp#L210