- Validate segments generated by realtime ingestion against raw data in HDFS.
- Rebuild segments that have discrepancies with the raw data in HDFS.
- Collapse existing segments into lower-granularity segments.
```
$ bin/dumbo
You must supply -s PATH!
Usage: bin/dumbo (options)
    -d, --database PATH              path to database config, defaults to "database.json"
    -D, --debug                      enable debug output
    -N, --dryrun                     do not submit tasks to overlord (dry-run)
    -e, --environment ENVIRONMENT    set the daemon environment
        --force                      force segment generation regardless of state
    -i, --interval INTERVAL          force an explicit interval
    -l, --limit LIMIT                limit the number of tasks to spawn (defaults to unlimited)
    -m, --mode MODE                  mode to perform (verify, merge, compact)
        --name NAME                  process name
    -n, --namenodes LIST             HDFS namenodes (comma separated), defaults to "localhost"
    -f, --offset HOURS               offset from now used as interval end, defaults to 2 hours
    -o, --overlord HOST[:PORT]       overlord hostname and port, defaults to "localhost:8090"
    -r, --reverse BOOL               run jobs in reverse order
    -s, --sources PATH               path to sources config (required)
    -t, --topics LIST                topics to process (comma separated), defaults to all in sources.json
    -w, --window HOURS               scan window in hours, defaults to 24 hours
    -z, --zookeeper URI              ZooKeeper URI, defaults to "localhost:2181/druid"
        --zookeeper-path PATH        Druid's discovery path within ZooKeeper, defaults to "/discovery"
    -h, --help                       show this message
```
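For example, a dry run of verify against two namenodes could look like this (all hostnames are placeholders):

```
$ bin/dumbo -s sources.json -m verify \
    -n namenode1.example.com,namenode2.example.com \
    -o overlord.example.com:8090 \
    -N
```

With `-N` set, no tasks are submitted to the overlord.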
The repo contains examples for `database.json` and `sources.json`; hypothetical sketches of both follow the notes below.

- HDFS contains data in gzipped files in Gobblin-style folders
- `database.json` is used to initialize Sequel
- `sources.json` uses keys in the format `"service/dataSource"` as established in ruby-druid
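These sketches are assumptions, not the repo's actual examples: the `database.json` keys shown are standard Sequel connection options, but their exact use here is an assumption, and in `sources.json` only the `"service/dataSource"` key format and the `input.epoc` field come from this README. All `<...>` values are placeholders.

`database.json`:

```json
{
  "adapter": "postgres",
  "host": "<db-host>",
  "database": "<db-name>",
  "user": "<db-user>",
  "password": "<db-password>"
}
```

`sources.json`:

```json
{
  "<service>/<dataSource>": {
    "input": {
      "epoc": "<earliest-complete-HDFS-timestamp>"
    }
  }
}
```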
Verify uses Gobblin counters in HDFS to compare the total number of events in HDFS against the number in Druid. To do this, it relies on a hard-coded count aggregation named `"events"`.

If a `source['input']['epoc']` is set, verify restricts the interval so it does not go beyond that point. This is useful if you know you have incomplete HDFS data and want to keep the existing segments.
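Since the count is hard-coded, every dataSource that verify checks needs a count aggregator with exactly that name. A standard Druid count aggregator entry looks like this (a minimal fragment, not a complete ingestion spec):

```json
{
  "metricsSpec": [
    { "type": "count", "name": "events" }
  ]
}
```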
Compaction verifies `segmentGranularity` and schema. All tasks are spawned as Druid 0.17 native tasks.
- Fork it
- Create your feature branch (`git checkout -b my-new-feature`)
- Commit your changes (`git commit -am 'Add some feature'`)
- Push to the branch (`git push origin my-new-feature`)
- Create a new Pull Request
Based on remerge/dumbo (a from-scratch rewrite of druid-dumbo v1).