CDK-29, CDK-466, CDK-827: Include use of CLI flume-config. #79

Open · wants to merge 5 commits into `master`
tutorials/create-events-dataset.md: 117 additions, 0 deletions
@@ -0,0 +1,117 @@
---
layout: page
title: Creating the Events Dataset
Contributor:

I think this tutorial needs more context as well. Maybe there should be an overall tutorial page that outlines the example goal, or maybe it can be included at the start of the tutorials. Otherwise, it isn't clear why we're creating "the" events dataset or what this is used for.

Contributor Author:

I'll add context at the start of each lesson.

---
## Purpose

This lesson shows you how to create a dataset suitable for storing standard event records, as defined in [The Unified Logging Infrastructure for Data Analytics at Twitter][paper]. You define a [dataset schema][schema], a [partition strategy][partstrat], and a URI that specifies the storage [scheme][scheme], then use [`kite-dataset create`][create] to make a Hive dataset.

[paper]:http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf
[schema]:{{site.baseurl}}/introduction-to-datasets.html#schemas
[partstrat]:{{site.baseurl}}/Partitioned-Datasets.html#partition-strategies
[scheme]:{{site.baseurl}}/introduction-to-datasets.html#uri-schemes
[create]:{{site.baseurl}}/cli-reference.html#create

### Prerequisites

* A [Quickstart VM][prepare] or instance of CDH 5.2 or later.
* The [kite-dataset][kite-dataset] command.

[prepare]:{{site.baseurl}}/tutorials/preparing-the-vm.html
[kite-dataset]:{{site.baseurl}}/Install-Kite.html

### Result

You create `dataset:hive:events`, where you can store standard event objects. You can use the dataset with several Kite tutorials that demonstrate data capture, storage, and analysis.
Contributor:

I like "You can use the dataset with...". Could you add links to those tutorials?

Contributor Author:

The modules stand alone. Structure is provided by a TOC.


## Defining the Schema

The `standard_event.avsc` schema is self-describing, with a _doc_ property for each field. StandardEvent records store the `user_id` for the person who initiates an event, the user's IP address, and a timestamp for when the event occurred.

### standard_event.avsc

```JSON
{
"name": "StandardEvent",
"namespace": "org.kitesdk.data.event",
"type": "record",
"doc": "A standard event type for logging, based on the paper 'The Unified Logging Infrastructure for Data Analytics at Twitter' by Lee et al, http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf",
"fields": [
{
"name": "event_initiator",
"type": "string",
"doc": "Source of the event in the format {client,server}_{user,app}; for example, 'client_user'. Required."
},
{
"name": "event_name",
"type": "string",
"doc": "A hierarchical name for the event, with parts separated by ':'. Required."
},
{
"name": "user_id",
"type": "long",
"doc": "A unique identifier for the user. Required."
},
{
"name": "session_id",
"type": "string",
"doc": "A unique identifier for the session. Required."
},
{
"name": "ip",
"type": "string",
"doc": "The IP address of the host where the event originated. Required."
},
{
"name": "timestamp",
"type": "long",
"doc": "The point in time when the event occurred, represented as the number of milliseconds since January 1, 1970, 00:00:00 GMT. Required."
}
]
}
```
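
For reference, a single record that conforms to this schema might look like the following. The field values are illustrative, not taken from the example application.

```JSON
{
  "event_initiator": "client_user",
  "event_name": "web:message",
  "user_id": 1,
  "session_id": "C48B1234A6E4D9F0",
  "ip": "10.0.0.15",
  "timestamp": 1418152800000
}
```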

## Defining the Partition Strategy

Analytics for the `events` dataset are time-based. Partitioning the dataset on the `timestamp` field allows Kite to go directly to the files for a particular day, ignoring files outside the time period. Partition strategies are defined in JSON format. See [Partition Strategy JSON Format][partition-strategies].
Contributor:

Can you define what you mean by analytics being time-based? An example would help.

Contributor:

What are the assumptions for users reading this tutorial? This paragraph assumes the reader is familiar with partitioning, which may not be the case. I think it would be better to explain it in this order:

  1. Typical uses are time-limited: you care about a day or an hour of the data, not all of it.
  2. To take advantage of this and not read data you don't need, use a year/month/day organization for the data.
  3. This is called partitioning and is configured with a partition strategy. Include links to relevant reference docs.

Contributor Author:

Added links in the Purpose section, to cover conceptual topics.


The following sample defines a strategy that partitions a dataset by _year_, _month_, and _day_, based on a _timestamp_ field.

### partition_year_month_day.json

```JSON
[ {
"source" : "timestamp",
"type" : "year",
"name" : "year"
}, {
"source" : "timestamp",
"type" : "month",
"name" : "month"
}, {
"source" : "timestamp",
"type" : "day",
"name" : "day"
} ]
```

[partition-strategies]:{{site.baseurl}}/Partition-Strategy-Format.html
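
You can write this file by hand, as shown above, or generate it with the CLI. The following is a sketch, assuming the `partition-config` subcommand available in your Kite release (run `kite-dataset help partition-config` to confirm the exact options); it produces an equivalent strategy file:

```
kite-dataset partition-config timestamp:year timestamp:month timestamp:day \
--schema ~/standard_event.avsc \
-o ~/partition_year_month_day.json
```

With this strategy, records written on 12 November 2014, for example, are grouped under a `year=2014/month=11/day=12` partition, so time-limited queries read only the matching directories. The exact directory layout depends on how the dataset is configured.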

## Creating the Events Dataset Using the Kite CLI

Create the _events_ dataset using the default Hive scheme.

To create the _events_ dataset:

1. Open a terminal window.
Contributor:

Nit: This doesn't need to be a step because we don't want the user to open a new terminal each time. You could instead say "In a terminal window, run the create command . . ." That would get rid of the need for numbering here since it's just one command and some explanation.

Contributor Author:

I prefer this as is.

1. Use the `create` command to create the dataset. This example assumes that you stored the schema and partition definitions in your home directory. Substitute the correct path if you stored them in a different location.

```
kite-dataset create events \
--schema ~/standard_event.avsc \
--partition-by ~/partition_year_month_day.json
```

Contributor:

I think it would be more clear to have the user do this in a known directory, like the kite-examples directory. This would remove the need to discuss directories other than to include a step to go to the appropriate one at the start of each tutorial.

Contributor Author:

We've been back and forth on the directory. The home directory is fine for now. I might revisit this in the future.


Use [Hue][hue] to confirm that the dataset appears in your table list and is ready to use.

[hue]:http://quickstart.cloudera:8888/beeswax/execute#query
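
You can also check the result from the command line. This is a quick sanity check, assuming the `info` and `schema` subcommands in your Kite release:

```
kite-dataset info events
kite-dataset schema events
```

The output should reflect the schema and partition strategy you supplied when you created the dataset.
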
tutorials/flume-capture-events.md: 199 additions, 0 deletions
@@ -0,0 +1,199 @@
---
layout: page
title: Capturing Events with Flume
---

## Purpose
Contributor:

Somewhere in this section, please note that this is optional and the reader can generate data with the other tutorial. That way, readers that aren't interested in Flume can skip this.

Contributor Author:

Readers who are not interested in Flume will not go to a page titled "Capturing Events with Flume."


This lesson demonstrates how to configure Flume to capture events from a web application with minimal impact on application performance or the user experience. Flume collects individual events and writes them to the dataset in groups.

The Flume agent receives the events over inter-process communication (IPC), and writes the events to the Hive file sink. Each time you send a message, Log4j writes a new `INFO` line in the terminal window.
Contributor:

I don't think this needs to mention IPC. IPC covers a wide variety of communication methods. What actually happens is the Flume agent listens for events that are sent from an application, in this case the sample web application. That application sends events by "logging" them through Log4j, which also logs the event to the terminal.


This example demonstrates how to generate Flume configuration information from the Kite CLI. In addition, JSP and servlet samples allow you to test the data capture mechanism.
Contributor:

The web app demonstrates how to send events to Flume using Log4j. You can use the web app to send a few sample events. Is that what you meant by "test the data capture mechanism"?


### Prerequisites

* A VM or cluster configured with Flume user impersonation. See [Preparing the Virtual Machine][vm].
* An [Events dataset][events] in which to capture session events.

[vm]:{{site.baseurl}}/tutorials/preparing-the-vm.html
[events]:{{site.baseurl}}/tutorials/create-events-dataset.html

### Result

Flume is configured to listen for events on a Tomcat server instance. Use the JSP and servlets to send events to Tomcat. Log4j logs each event to the terminal window. Flume stores the events in `dataset:hive:events`.
Contributor:

Tomcat is just a service that runs the web application. Flume doesn't work with Tomcat at all, though the application running in Tomcat is sending events to Flume.


## Configuring Flume

Follow these steps to configure Flume to channel log information directly to the `events` dataset. You first generate the configuration using the Kite command-line interface, copy the results, paste them into the Flume configuration file, and then restart Flume.
Contributor:

I think this section should explain what the Kite command generates and what Flume is doing as a result.


You can configure Flume for this example using either Cloudera Manager or the command line.
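
The `flume-config` command used below writes out an ordinary Flume agent definition: an Avro source that listens for events sent by the application's Log4j appender, a channel that buffers them, and a Kite `DatasetSink` that writes them to `dataset:hive:events`. The following sketch shows the general shape of that configuration. It is illustrative only; the agent name, component names, port, and sink properties in the generated file depend on your Kite and Flume versions.

```
tier1.sources = avro-event-source
tier1.channels = avro-event-channel
tier1.sinks = kite-dataset

# Avro source: listens for events sent by the Log4j Flume appender
tier1.sources.avro-event-source.type = avro
tier1.sources.avro-event-source.bind = 0.0.0.0
tier1.sources.avro-event-source.port = 41415
tier1.sources.avro-event-source.channels = avro-event-channel

# Memory channel: buffers events in RAM; suitable for a demo
tier1.channels.avro-event-channel.type = memory
tier1.channels.avro-event-channel.capacity = 10000

# Kite dataset sink: accumulates events and writes them to the dataset
tier1.sinks.kite-dataset.type = org.apache.flume.sink.kite.DatasetSink
tier1.sinks.kite-dataset.channel = avro-event-channel
tier1.sinks.kite-dataset.kite.dataset.uri = dataset:hive:events
tier1.sinks.kite-dataset.kite.batchSize = 100
```

The `--channel-type memory` flag in the steps below matches the memory channel shown here. A memory channel can lose buffered events if the agent restarts, which is acceptable for this demo; production deployments typically use a durable file channel instead.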

### Configuring Flume in Cloudera Manager

1. In a terminal window, type `kite-dataset flume-config --channel-type memory events`.
Contributor:

I think the command could use more explanation. What does it do and why? What might I change for production instead of the demo? (The channel would be a file channel in production, by the way.)

1. Copy the output from the terminal window.
1. Open Cloudera Manager.
Contributor:

The VM setup instructions don't assume you're using Cloudera Manager, so I don't think this should either. The alternative is to copy the flume config and restart:

sudo cp flume.conf /etc/flume-ng/conf/flume.conf
sudo /etc/init.d/flume-ng-agent restart

Contributor Author:

I added a separate list of instructions for configuring Flume from the command line.

1. Under __Status__, click the link to __Flume__.
1. Choose the __Configuration__ tab.
1. Click __Agent Base Group__.
1. Right-click the Configuration File text area and choose __Select All__.
Contributor:

Nit: "Configuration File" should be distinguished as something to look for with formatting.

1. Right-click the Configuration File text area and choose __Paste__.
1. Click __Save Changes__.
1. From the __Actions__ menu, choose __Restart__, and confirm the action.

### Configuring Flume from the Command Line

1. In a terminal window, enter `kite-dataset flume-config --channel-type memory events -o flume.conf`.
1. To update Flume configuration, enter `sudo cp flume.conf /etc/flume-ng/conf/flume.conf`.
1. To restart the Flume agent, enter `sudo /etc/init.d/flume-ng-agent restart`.

Flume is now configured to listen for web application events and record them in the `events` dataset.
Contributor:

This doesn't need to state "web application" events. Just events. It would be great to explain more of the config and what this is doing with Flume. As it is, this treats Flume as a black box and doesn't help the reader understand what is going on inside it. Kite's configuration is a starting point, so it is important for the reader to know what Kite is telling Flume to do.

Contributor:

I noted above where the extra explanation should go.


## Running the Web Application

Follow these steps to build the web application, start the Tomcat server, and then use the web application to generate events that are sent to the Hadoop dataset.
Contributor:

This section should start with context about how the web application sends data to Flume. That's the most important part of it and the other half of the important tutorial content.


1. In a terminal window, navigate to `kite-examples/demo`.
1. To compile the application, enter `mvn install`.
1. To start the Tomcat server, enter `mvn tomcat7:run`.
1. In a web browser, enter the URL [`http://quickstart.cloudera:8034/demo-logging-webapp/`][logging-app].
1. On the web form, enter any user ID and a message, and then click **Send** to create a web event.

View the log messages in the terminal window where you launched Tomcat. View the records in Hive using the Hue File Browser.

[logging-app]:http://quickstart.cloudera:8034/demo-logging-webapp/
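
As an alternative to the Hue File Browser, you can confirm that events reached the dataset with the Kite CLI. This assumes the `show` subcommand in your Kite release, which prints a small number of records:

```
kite-dataset show events
```

Because Flume writes events in batches, there can be a short delay before new records appear.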

## Creating Web Application Pages
Contributor:

This has a lot of discussion on the servlet and JSP, but the focus should be on the architecture: The servlet (or any application) uses Log4j to "log" events, which are Avro records. Log4j is configured to send those events to Flume. Flume accumulates events and writes them to the Dataset.

Most people want to know why we are doing it this way:

  1. Why Flume? Because it is reliable, widely used for this purpose, and has good integration.
  2. Why not write directly to the dataset in Hadoop? Because we want to accumulate events to write into the same file without disrupting the web application. Flume will take responsibility for individual events and see that they are written in groups to the dataset. The application wants to move on after a single event.

Contributor Author:

Adding context will help, but the essential issue here is that all we're doing is configuring Flume to watch for events, and that's achieved using the CLI. There's very little Kite stuff involved in this example.

Contributor:

> There's very little Kite stuff involved in this example.

I agree, but that highlights the need to make what is happening very clear. Focusing on the web application doesn't help the user understand the message we want them to, but discussing why they care about Kite's support in Flume does.

Contributor:

I think this should be "Understanding" not "Creating" because the app is already written.

Contributor:

I think the logging configuration and StandardEvent code should be explained above this point, so the rest of the tutorial is optional for those readers that want to understand the rest of the web app in detail.


These JSP and servlet examples create message events that can be captured by Flume. The examples are not Kite- or Flume-specific; they send messages to the Tomcat server, and Flume captures the events independently of the web application.

## index.jsp

The default landing page for the web application is `index.jsp`. It defines a form with fields for an arbitrary User ID and a message. The __Send__ button submits the input values to the Tomcat server.

```JSP
<html>
  <head>
    <title>Kite Example</title>
  </head>
  <body>
    <h2>Kite Example</h2>
    <form name="input" action="send" method="get">
      User ID: <input type="text" name="user_id" value="1">
      Message: <input type="text" name="message" value="Hello!">
      <input type="submit" value="Send">
    </form>
  </body>
</html>
```

## LoggingServlet

When you submit a message from the JSP, the LoggingServlet receives and processes the request. The following is mostly standard servlet code, with some notes about application-specific snippets.

```Java
package org.kitesdk.examples.demo;
```

The servlet parses information from the request to create a StandardEvent object. However, you won't find any source code for `org.kitesdk.data.event.StandardEvent`. During the Maven build, the `avro-maven-plugin` runs before the compile phase and generates a Java class for each `.avsc` file in the `/main/avro` folder. The generated classes have the methods required to build corresponding Avro `SpecificRecord` objects of that type. `SpecificRecord` objects permit efficient access to object fields.
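
If you want to look at the generated class yourself, one option is to run the code-generation phase and inspect the build output. This is a sketch that assumes the plugin's default output location; the demo project's module layout may differ.

```
cd kite-examples/demo
mvn generate-sources
# The generated source typically lands under a generated-sources directory
# in the build output, for example:
# target/generated-sources/avro/org/kitesdk/data/event/StandardEvent.java
```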

```Java

import org.kitesdk.data.event.StandardEvent;
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

```

This example sends Log4j messages directly to the Hive data sink via Flume.

```Java
import org.apache.log4j.Logger;

public class LoggingServlet extends HttpServlet {

  private final Logger logger = Logger.getLogger(LoggingServlet.class);

  @Override
  protected void doGet(HttpServletRequest request, HttpServletResponse response)
      throws ServletException, IOException {

    response.setContentType("text/html");
```

Create a PrintWriter instance to write the response page.

```Java
    PrintWriter pw = response.getWriter();

    pw.println("<html>");
    pw.println("<head><title>Kite Example</title></head>");
    pw.println("<body>");
```

Get the user ID and message values from the servlet request.

```Java
    String userId = request.getParameter("user_id");
    String message = request.getParameter("message");
```

If there's no message, don't create a log entry.

```Java
    if (message == null) {
      pw.println("<p>No message specified.</p>");

```

Otherwise, print the message at the top of the page body.

```Java
    } else {
      pw.println("<p>Message: " + message + "</p>");

```

Create a new StandardEvent builder.

```Java
      StandardEvent event = StandardEvent.newBuilder()
```
The event initiator is a user on the client. The event is a web message. You can set these values as string literals, because the event initiator and event name are always the same.

```Java
          .setEventInitiator("client_user")
          .setEventName("web:message")
```

Parse the user-supplied ID as a long integer.

```Java
          .setUserId(Long.parseLong(userId))

```

The application obtains the session ID and IP address from the request object, and creates a timestamp based on the local machine clock.

```Java
          .setSessionId(request.getSession(true).getId())
          .setIp(request.getRemoteAddr())
          .setTimestamp(System.currentTimeMillis())
```

Build the StandardEvent object, and then send the object to the logger with the level _info_.
Contributor:

This is the important part. Log4j is configured to pass these events to Flume, so this is where the data is actually sent from the application.


```Java
          .build();
      logger.info(event);
    }
    pw.println("<p><a href=\"/demo-logging-webapp\">Home</a></p>");
    pw.println("</body></html>");
  }
}
```
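
The `logger.info(event)` call is all the servlet does to hand off the event; delivery to Flume happens because Log4j is configured with a Flume appender. The demo ships its own logging configuration, but a minimal `log4j.properties` sketch of that kind of setup looks like the following. The logger name, hostname, and port are illustrative and must match your application package and the Flume Avro source.

```
# Route events logged by the demo classes to the local Flume agent
log4j.logger.org.kitesdk.examples.demo=INFO, flume

log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname=localhost
log4j.appender.flume.Port=41415
log4j.appender.flume.UnsafeMode=true
```

The appender serializes the StandardEvent (an Avro `SpecificRecord`) and sends it to the agent's Avro source. From there, Flume takes responsibility for the event, batching it with others and writing the group to the dataset, so the web request can return immediately.
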
Contributor:

There should be some summary, next steps, etc. here.
