CDK-928: Utility to generate events to existing table. #23
base: master
Conversation
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;

public abstract class BaseEventsTool extends Configured implements Tool {
It doesn't look like any of the code in this class is used, so it would be better to remove it and make GenerateEvents implement Tool directly.
Excellent. Done.
baseTimestamp = System.currentTimeMillis();

View<StandardEvent> events = Datasets.load(
    (args.length == 1 ? args[0] : "dataset:hive:events"), StandardEvent.class);
I noted this elsewhere, but I think it would be better to use a variable rather than the inline test here.
Is this wrong, or just different? Are you suggesting that the test should set the variable before the load method? If the argument is invalid, does setting it outside the load method change the result? If the code must change before publication, please provide the acceptable alternate code rather than having me guess at what I should do.
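If I'm reading the suggestion correctly, the inline test would move into a named variable before the `load` call. A minimal, self-contained sketch of that pattern (the class and method names here are illustrative, and the Kite `Datasets.load` call is shown only in a comment since it needs the Kite SDK on the classpath):

```java
public class DatasetUriExample {

    // Default dataset URI, taken from the diff above.
    static final String DEFAULT_URI = "dataset:hive:events";

    // The reviewer's suggestion, as I read it: resolve the URI into a
    // named variable before loading, instead of inlining the ternary test.
    static String resolveUri(String[] args) {
        return (args.length == 1) ? args[0] : DEFAULT_URI;
    }

    public static void main(String[] args) {
        String datasetUri = resolveUri(args);
        // In the real tool this would then be passed to Kite:
        // View<StandardEvent> events = Datasets.load(datasetUri, StandardEvent.class);
        System.out.println(datasetUri);
    }
}
```

Functionally the two forms are equivalent for valid and invalid arguments alike; the variable just gives the fallback a name and keeps the `load` call on one readable line.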
With an eye toward modularization, I've repurposed CreateEvents.java from the Spark example and placed it in `org/kitesdk/examples/data`. This lets the customer create the events dataset using the CLI, then populate it with a substantial number of records using the Java utility. The same dataset can be used for the Flume and Spark examples, without having to delete it after running their respective jobs.

In GenerateEvents, I essentially swapped the CreateEvents `create()` method with `load()`. I added the Avro plug-in to `pom.xml`, copied the `avro` folder with `standard_event.avsc` into the `main` directory, and copied `BaseEventsTool.java` to `org/kitesdk/examples/data`.

In my environment, it compiles, runs, and populates the events table as expected.
**Update:** The random records were a little too random: if `user_id`, `session_id`, and `ip` differ on every record, the Crunch utility finds no sessions to aggregate when it runs. I revised the `run` method to generate `user_id`, `session_id`, and `ip` first, then used a for loop to generate 1-25 random events that share them. I also modified the `randomTimestamp` method to increase the base interval and add random padding, creating more realistic session durations.
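A rough, self-contained sketch of that revised structure. The `Event` class here is a stand-in for `StandardEvent`, and the value ranges and padding constants are illustrative assumptions, not the actual code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.UUID;

public class SessionSketch {

    static final Random RAND = new Random();

    // Minimal stand-in for StandardEvent: only the fields discussed above.
    static class Event {
        final long userId;
        final String sessionId;
        final String ip;
        final long timestamp;

        Event(long userId, String sessionId, String ip, long timestamp) {
            this.userId = userId;
            this.sessionId = sessionId;
            this.ip = ip;
            this.timestamp = timestamp;
        }
    }

    // Fix user_id, session_id, and ip once per session, then emit 1-25
    // events that share them, so the Crunch job has sessions to aggregate.
    static List<Event> generateSession(long baseTimestamp) {
        long userId = RAND.nextInt(100);
        String sessionId = UUID.randomUUID().toString();
        String ip = "192.168." + RAND.nextInt(256) + "." + RAND.nextInt(256);

        int numEvents = 1 + RAND.nextInt(25); // 1-25 events per session
        List<Event> events = new ArrayList<>();
        long ts = baseTimestamp;
        for (int i = 0; i < numEvents; i++) {
            ts = randomTimestamp(ts);
            events.add(new Event(userId, sessionId, ip, ts));
        }
        return events;
    }

    // Advance the clock by a base interval plus random padding so the
    // resulting session durations look realistic (constants are guesses).
    static long randomTimestamp(long previous) {
        return previous + 60_000L + RAND.nextInt(30_000);
    }
}
```

The key difference from the earlier version is that the identifying fields are hoisted out of the loop, so every event in a session groups together downstream.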
I'm happy to incorporate any changes that make the code more elegant; my changes just make it work.