CDK-29, CDK-466, CDK-827: Include use of CLI flume-config. #79

base: master
Changes from all commits: f9e33e7, 16e423c, b9f1150, eef961f, bf0f4fa

@@ -0,0 +1,117 @@
---
layout: page
title: Creating the Events Dataset
---

## Purpose

This lesson shows you how to create a dataset suitable for storing standard event records, as defined in [The Unified Logging Infrastructure for Data Analytics at Twitter][paper]. You define a [dataset schema][schema], a [partition strategy][partstrat], and a URI that specifies the storage [scheme][scheme], then use [`kite-dataset create`][create] to make a Hive dataset.

[paper]:http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf
[schema]:{{site.baseurl}}/introduction-to-datasets.html#schemas
[partstrat]:{{site.baseurl}}/Partitioned-Datasets.html#partition-strategies
[scheme]:{{site.baseurl}}/introduction-to-datasets.html#uri-schemes
[create]:{{site.baseurl}}/cli-reference.html#create

### Prerequisites

* A [Quickstart VM][prepare] or instance of CDH 5.2 or later.
* The [kite-dataset][kite-dataset] command.

[prepare]:{{site.baseurl}}/tutorials/preparing-the-vm.html
[kite-dataset]:{{site.baseurl}}/Install-Kite.html

### Result

You create `dataset:hive:events`, where you can store standard event objects. You can use the dataset with several Kite tutorials that demonstrate data capture, storage, and analysis.

> **Comment:** I like "You can use the dataset with...". Could you add links to those tutorials?

> **Reply:** The modules stand alone. Structure is provided by a TOC.

## Defining the Schema

The `standard_event.avsc` schema is self-describing, with a _doc_ property for each field. StandardEvent records store the `user_id` for the person who initiates an event, the user's IP address, and a timestamp for when the event occurred.

### standard_event.avsc

```JSON
{
  "name": "StandardEvent",
  "namespace": "org.kitesdk.data.event",
  "type": "record",
  "doc": "A standard event type for logging, based on the paper 'The Unified Logging Infrastructure for Data Analytics at Twitter' by Lee et al, http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf",
  "fields": [
    {
      "name": "event_initiator",
      "type": "string",
      "doc": "Source of the event in the format {client,server}_{user,app}; for example, 'client_user'. Required."
    },
    {
      "name": "event_name",
      "type": "string",
      "doc": "A hierarchical name for the event, with parts separated by ':'. Required."
    },
    {
      "name": "user_id",
      "type": "long",
      "doc": "A unique identifier for the user. Required."
    },
    {
      "name": "session_id",
      "type": "string",
      "doc": "A unique identifier for the session. Required."
    },
    {
      "name": "ip",
      "type": "string",
      "doc": "The IP address of the host where the event originated. Required."
    },
    {
      "name": "timestamp",
      "type": "long",
      "doc": "The point in time when the event occurred, represented as the number of milliseconds since January 1, 1970, 00:00:00 GMT. Required."
    }
  ]
}
```

## Defining the Partition Strategy

Analytics for the `events` dataset are time-based; a typical query counts or summarizes the events that occurred on a particular day or range of days. Partitioning the dataset on the `timestamp` field allows Kite to go directly to the files for a particular day, ignoring files outside the time period. Partition strategies are defined in JSON format. See [Partition Strategy JSON Format][partition-strategies].

> **Comment:** Can you define what you mean by analytics are time-based? An example would help understanding.

> **Comment:** What are the assumptions for users reading this tutorial? This paragraph assumes the reader is familiar with partitioning, which may not be the case. I think it would be better to explain it in this order:

> **Reply:** Added links in the Purpose section, to cover conceptual topics.

The following sample defines a strategy that partitions a dataset by _year_, _month_, and _day_, based on a _timestamp_ field.

### partition_year_month_day.json

```JSON
[ {
  "source" : "timestamp",
  "type" : "year",
  "name" : "year"
}, {
  "source" : "timestamp",
  "type" : "month",
  "name" : "month"
}, {
  "source" : "timestamp",
  "type" : "day",
  "name" : "day"
} ]
```

[partition-strategies]:{{site.baseurl}}/Partition-Strategy-Format.html
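
With this strategy, Kite writes records into nested directories derived from the `timestamp` value. For example, an event that occurred on 12 September 2014 would land in a directory similar to the following (the warehouse path shown is illustrative; the exact location depends on your Hive configuration):

```
/user/hive/warehouse/events/year=2014/month=9/day=12/
```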

## Creating the Events Dataset Using the Kite CLI

Create the _events_ dataset using the default Hive scheme.

To create the _events_ dataset:

1. Open a terminal window.

   > **Comment:** Nit: This doesn't need to be a step because we don't want the user to open a new terminal each time. You could instead say "In a terminal window, run the

   > **Reply:** I prefer this as is.

1. Use the `create` command to create the dataset. This example assumes that you stored the schema and partition definitions in your home directory. Substitute the correct path if you stored them in a different location.

   ```
   kite-dataset create events \
   --schema ~/standard_event.avsc \
   --partition-by ~/partition_year_month_day.json
   ```

   > **Comment:** I think it would be more clear to have the user do this in a known directory, like the kite-examples directory. This would remove the need to discuss directories other than to include a step to go to the appropriate one at the start of each tutorial.

   > **Reply:** We've been back and forth on the directory. The home directory is fine for now. I might revisit this in the future.

Use [Hue][hue] to confirm that the dataset appears in your table list and is ready to use.

[hue]:http://quickstart.cloudera:8888/beeswax/execute#query
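
If you prefer the command line, you can also spot-check the new dataset with the Kite CLI. For example, assuming your version of the CLI includes the `schema` command, the following should print back the StandardEvent schema you supplied:

```
kite-dataset schema events
```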

@@ -0,0 +1,199 @@
---
layout: page
title: Capturing Events with Flume
---

## Purpose

> **Comment:** Somewhere in this section, please note that this is optional and the reader can generate data with the other tutorial. That way, readers that aren't interested in Flume can skip this.

> **Reply:** Readers who are not interested in Flume will not go to a page titled "Capturing Events with Flume."

This lesson demonstrates how you can configure Flume to capture events from a web application with minimal impact on performance or the user. Flume collects individual events and writes them in groups to the dataset.

The Flume agent receives the events over inter-process communication (IPC), and writes the events to the `events` dataset. Each time you send a message, Log4j writes a new `INFO` line in the terminal window.

> **Comment:** I don't think this needs to mention IPC. IPC covers a wide variety of communication methods. What actually happens is the Flume agent listens for events that are sent from an application, in this case the sample web application. That application sends events by "logging" them through Log4j, which also logs the event to the terminal.
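
Under the hood, the application sends events by logging them through Log4j, with a Flume appender attached to the application's logger. A minimal sketch of such a `log4j.properties` entry is shown below; the logger name, hostname, and port are illustrative, must match the Avro source in your Flume configuration, and the demo's actual file may set additional Avro-related properties.

```
log4j.logger.org.kitesdk.examples.demo=INFO, flume
log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname=quickstart.cloudera
log4j.appender.flume.Port=41415
```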

This example demonstrates how to generate Flume configuration information from the Kite CLI. In addition, JSP and servlet samples let you send sample events to test the data capture mechanism.

> **Comment:** The web app demonstrates how to send events to Flume using Log4j. You can use the web app to send a few sample events. Is that what you meant by "test the data capture mechanism"?

### Prerequisites

* A VM or cluster configured with Flume user impersonation. See [Preparing the Virtual Machine][vm].
* An [Events dataset][events] in which to capture session events.

[vm]:{{site.baseurl}}/tutorials/preparing-the-vm.html
[events]:{{site.baseurl}}/tutorials/create-events-dataset.html

### Result

Flume is configured to listen for events sent by the web application running on a Tomcat server instance. Use the JSP and servlet to send events. Log4j logs each event to the terminal window. Flume stores the events in `dataset:hive:events`.

> **Comment:** Tomcat is just a service that runs the web application. Flume doesn't work with Tomcat at all, though the application running in Tomcat is sending events to Flume.

## Configuring Flume

Follow these steps to configure Flume to channel log information directly to the `events` dataset. You first generate the configuration information using the Kite command-line interface, apply it to the Flume configuration file, and then restart Flume.

> **Comment:** I think this section should explain what the Kite command generates and what Flume is doing as a result.

You can configure Flume for this example using either Cloudera Manager or the command line.

### Configuring Flume in Cloudera Manager

1. In a terminal window, type `kite-dataset flume-config --channel-type memory events`.

   > **Comment:** I think the command could use more explanation. What does it do and why? What might I change for production instead of the demo? (The channel would be a file channel in production, by the way)

1. Copy the output from the terminal window.
1. Open Cloudera Manager.

   > **Comment:** The VM setup instructions don't assume you're using Cloudera Manager, so I don't think this should either. The alternative is to copy the flume config and restart:

   > **Reply:** I added a separate list of instructions for configuring Flume from the command line.

1. Under __Status__, click the link to __Flume__.
1. Choose the __Configuration__ tab.
1. Click __Agent Base Group__.
1. Right-click the __Configuration File__ text area and choose __Select All__.

   > **Comment:** Nit: "Configuration File" should be distinguished as something to look for with formatting.

1. Right-click the __Configuration File__ text area and choose __Paste__.
1. Click __Save Changes__.
1. From the __Actions__ menu, choose __Restart__, and confirm the action.

### Configuring Flume from the Command Line

1. In a terminal window, enter `kite-dataset flume-config --channel-type memory events -o flume.conf`.
1. To update the Flume configuration, enter `sudo cp flume.conf /etc/flume-ng/conf/flume.conf`.
1. To restart the Flume agent, enter `sudo /etc/init.d/flume-ng-agent restart`.

Flume is now configured to listen for events and record them in the `events` dataset.

> **Comment:** This doesn't need to state "web application" events. Just events. It would be great to explain more of the config and what this is doing with Flume. As it is, this treats Flume as a black box and doesn't help the reader understand what is going on inside it. Kite's configuration is a starting point, so it is important for the reader to know what Kite is telling Flume to do.

> **Reply:** I noted above where the extra explanation should go.
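
The generated configuration defines a complete Flume agent: an Avro source that listens for events from the Log4j appender, a channel (a memory channel here because of `--channel-type memory`; a production deployment would typically use a file channel), and a Kite dataset sink that writes the events to the Hive `events` dataset. The exact agent, source, and sink names, the port, and the property set depend on your Kite version; the following is only an illustrative sketch of the kind of configuration `flume-config` produces.

```
tier1.sources = avro-source
tier1.channels = mem-channel
tier1.sinks = kite-sink

# Avro source: receives events sent by the application's Flume Log4j appender
tier1.sources.avro-source.type = avro
tier1.sources.avro-source.channels = mem-channel
tier1.sources.avro-source.bind = 0.0.0.0
tier1.sources.avro-source.port = 41415

# Memory channel: fast, but events are lost if the agent restarts
tier1.channels.mem-channel.type = memory
tier1.channels.mem-channel.capacity = 10000

# Kite dataset sink: writes batches of events to the Hive events dataset
tier1.sinks.kite-sink.type = org.apache.flume.sink.kite.DatasetSink
tier1.sinks.kite-sink.channel = mem-channel
tier1.sinks.kite-sink.kite.repo.uri = repo:hive
tier1.sinks.kite-sink.kite.dataset.name = events
```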

## Running the Web Application

Follow these steps to build the web application, start the Tomcat server, and then use the web application to generate events that are sent to the `events` dataset.

> **Comment:** This section should start with context about how the web application sends data to Flume. That's the most important part of it and the other half of the important tutorial content.

1. In a terminal window, navigate to `kite-examples/demo`.
1. To compile the application, enter `mvn install`.
1. To start the Tomcat server, enter `mvn tomcat7:run`.
1. In a web browser, enter the URL [`http://quickstart.cloudera:8034/demo-logging-webapp/`][logging-app].
1. On the web form, enter any user ID and a message, and then click **Send** to create a web event.

View the log messages in the terminal window where you launched Tomcat. View the records in Hive using the Hue File Browser.

[logging-app]:http://quickstart.cloudera:8034/demo-logging-webapp/
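
You can also inspect the captured records from the command line. Assuming your version of the Kite CLI includes the `show` command, the following prints the first few records stored in the dataset:

```
kite-dataset show events
```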

## Understanding the Web Application Pages

> **Comment:** This has a lot of discussion on the servlet and JSP, but the focus should be on the architecture: The servlet (or any application) uses Log4j to "log" events, which are Avro records. Log4j is configured to send those events to Flume. Flume accumulates events and writes them to the Dataset. Most people want to know why we are doing it this way:

> **Reply:** Adding context will help, but the essential issue here is that all we're doing is configuring Flume to watch for events, and that's achieved using the CLI. There's very little Kite stuff involved in this example.

> **Comment:** I agree, but that highlights the need to make what is happening very clear. Focusing on the web application doesn't help the user understand the message we want them to, but discussing why they care about Kite's support in Flume does.

> **Comment:** I think this should be "Understanding" not "Creating" because the app is already written.

> **Comment:** I think the logging configuration and StandardEvent code should be explained above this point, so the rest of the tutorial is optional for those readers that want to understand the rest of the web app in detail.

These JSP and servlet examples create message events that can be captured by Flume. These examples are not Kite- or Flume-specific; they send messages to the Tomcat server, and Flume captures the events independent of the web application.

## index.jsp

The default landing page for the web application is `index.jsp`. It defines a form with fields for an arbitrary User ID and a message. The __Send__ button submits the input values to the Tomcat server.

```JSP
<html>
  <head>
    <title>Kite Example</title>
  </head>
  <body>
    <h2>Kite Example</h2>
    <form name="input" action="send" method="get">
      User ID: <input type="text" name="user_id" value="1">
      Message: <input type="text" name="message" value="Hello!">
      <input type="submit" value="Send">
    </form>
  </body>
</html>
```

## LoggingServlet

When you submit a message from the JSP, the LoggingServlet receives and processes the request. The following is mostly standard servlet code, with some notes about application-specific snippets.

```Java
package org.kitesdk.examples.demo;
```

The servlet parses information from the request to create a StandardEvent object. However, you won't find any source code for `org.kitesdk.data.event.StandardEvent`. During the Maven build, the `avro-maven-plugin` runs before the compile phase and generates a Java class from each `.avsc` file in the `src/main/avro` folder. The generated classes have the methods required to build corresponding Avro `SpecificRecord` objects of that type. `SpecificRecord` objects permit efficient access to object fields.
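
Before looking at the servlet's imports, here is a sketch of how that code generation is typically configured in the module's `pom.xml`. The directories are illustrative defaults, and the demo's actual build file may differ.

```XML
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <!-- compiles each .avsc schema into a SpecificRecord Java class -->
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>${basedir}/src/main/avro</sourceDirectory>
        <outputDirectory>${basedir}/target/generated-sources/avro</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
```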

```Java
import org.kitesdk.data.event.StandardEvent;
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
```

This example sends Log4j messages to Flume, which writes them to the `events` dataset.

```Java
import org.apache.log4j.Logger;

public class LoggingServlet extends HttpServlet {

  private final Logger logger = Logger.getLogger(LoggingServlet.class);

  @Override
  protected void doGet(HttpServletRequest request, HttpServletResponse
      response) throws ServletException, IOException {

    response.setContentType("text/html");
```

Create a PrintWriter instance to write the response page.

```Java
    PrintWriter pw = response.getWriter();

    pw.println("<html>");
    pw.println("<head><title>Kite Example</title></head>");
    pw.println("<body>");
```

Get the user ID and message values from the servlet request.

```Java
    String userId = request.getParameter("user_id");
    String message = request.getParameter("message");
```

If there's no message, don't create a log entry.

```Java
    if (message == null) {
      pw.println("<p>No message specified.</p>");
```

Otherwise, print the message at the top of the page body.

```Java
    } else {
      pw.println("<p>Message: " + message + "</p>");
```

Create a new StandardEvent builder.

```Java
      StandardEvent event = StandardEvent.newBuilder()
```

The event initiator is a user on the client. The event is a web message. You can set these values as string literals, because the event initiator and event name are always the same.

```Java
          .setEventInitiator("client_user")
          .setEventName("web:message")
```

Parse the arbitrary user ID, provided by the user, as a long integer.

```Java
          .setUserId(Long.parseLong(userId))
```

The application obtains the session ID and IP address from the request object, and creates a timestamp based on the local machine clock.

```Java
          .setSessionId(request.getSession(true).getId())
          .setIp(request.getRemoteAddr())
          .setTimestamp(System.currentTimeMillis())
```

Build the StandardEvent object, and then send the object to the logger with the level _info_.

> **Comment:** This is the important part. Log4j is configured to pass these events to Flume, so this is where the data is actually sent from the application.

```Java
          .build();
      logger.info(event);
    }
    pw.println("<p><a href=\"/demo-logging-webapp\">Home</a></p>");
    pw.println("</body></html>");
  }
}
```
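
Once the application is running, you can also trigger the servlet directly instead of using the form; for example, with `curl` (the URL and parameters below mirror the form fields and assume the Tomcat instance started earlier):

```
curl 'http://quickstart.cloudera:8034/demo-logging-webapp/send?user_id=1&message=Hello'
```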

> **Comment:** There should be some summary, next steps, etc. here.

> **Comment:** I think this tutorial needs more context as well. Maybe there should be an overall tutorial page that outlines the example goal, or maybe it can be included at the start of the tutorials. Otherwise, it isn't clear why we're creating "the" events dataset or what this is used for.

> **Reply:** I'll add context at the start of each lesson.