Feature: Lambda Scanner #645

Open
gladkikhartem opened this issue Jan 30, 2024 · 1 comment
Labels: enhancement (New feature or request), future

Comments

gladkikhartem commented Jan 30, 2024

Hey guys,
Just want to say that you are on the right track!
I had an idea similar to yours - store logs on S3 and then query them - and I've spent some time analyzing it.
Maybe I'll use your tool instead of reinventing the wheel on my next project :)
For now I just wanted to share some info that may be useful to you:

My motivation for building such a tool was the fact that you can't realistically store AWS logs outside of AWS, because you have to pay insane DataTransfer costs to move them to cheaper storage.
Because of this, all SaaS log solutions for AWS are insanely expensive.
I think S3 is the only decent option for cost-effectively storing logs on AWS.

However, when I want to find something in those terabytes of logs on S3, I have a problem: too much data has to be scanned.
Some queries simply can't leverage Parquet push-down filtering and will have to scan all the data on S3.

If you have the same issue, I would suggest adding a powerful feature to Parseable: a Lambda Scanner.
You can delegate filtering to 1000s of Lambdas in parallel and scan files on S3 at speeds of 100 Gb/s or more.
That way I can run a cheap instance for log ingestion and delegate the infrequent heavy query processing to Lambda.
I've already tested this with Lambda and simple JSON parsing in Go, and it works pretty well.
I was able to achieve incredible speeds with this approach, and you might want to try it too. A rough sketch of the fan-out side is below.
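To make the idea concrete, here is a minimal sketch of the fan-out side, assuming the AWS SDK for Go v2 and a hypothetical Lambda function named `log-scanner` that accepts a bucket/key/filter payload. The bucket, keys and payload fields are placeholders I made up for illustration, not anything Parseable provides today.

```go
// Fan-out sketch: one synchronous Lambda invocation per S3 object,
// all invocations started in parallel from the caller.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"sync"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/lambda"
)

// ScanRequest is a hypothetical payload: one S3 object plus the filter to apply.
type ScanRequest struct {
	Bucket string `json:"bucket"`
	Key    string `json:"key"`
	Filter string `json:"filter"`
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := lambda.NewFromConfig(cfg)

	// Placeholder object keys; in practice this would be the list of log files to scan.
	keys := []string{"logs/part-0001.json.gz", "logs/part-0002.json.gz"}

	var wg sync.WaitGroup
	for _, key := range keys {
		wg.Add(1)
		go func(key string) {
			defer wg.Done()
			payload, _ := json.Marshal(ScanRequest{Bucket: "my-log-bucket", Key: key, Filter: "error"})
			out, err := client.Invoke(ctx, &lambda.InvokeInput{
				FunctionName: aws.String("log-scanner"), // hypothetical function name
				Payload:      payload,
			})
			if err != nil {
				log.Printf("%s: %v", key, err)
				return
			}
			// The scanner Lambda returns the matching lines for its object.
			fmt.Printf("%s -> %s\n", key, out.Payload)
		}(key)
	}
	wg.Wait()
}
```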

I was able to run 500 Lambdas, each scanning a 100 MB compressed JSON file on S3 (273 MB raw size).
Each Lambda took on average 2 seconds to process its file, and all 500 Lambdas finished in 11 seconds (it would have been about 3 seconds if I had started the job from an AWS server instead of my local machine, whose network couldn't handle 500 simultaneous requests).
So I was able to almost instantly scan 50 GB of compressed (136 GB raw) JSON data.
And I paid only about $0.01 for this request.
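For reference, the per-Lambda work in that test could look roughly like the sketch below: read one gzip-compressed JSON-lines object from S3 and return the lines that match a filter. I've simplified my real test to a plain substring match, and the request fields mirror the hypothetical payload above.

```go
// Scanner Lambda sketch: stream one gzip-compressed object from S3,
// scan it line by line, and return the matching lines.
package main

import (
	"bufio"
	"compress/gzip"
	"context"
	"strings"

	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// ScanRequest matches the hypothetical payload sent by the fan-out caller.
type ScanRequest struct {
	Bucket string `json:"bucket"`
	Key    string `json:"key"`
	Filter string `json:"filter"`
}

func handler(ctx context.Context, req ScanRequest) ([]string, error) {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return nil, err
	}
	obj, err := s3.NewFromConfig(cfg).GetObject(ctx, &s3.GetObjectInput{
		Bucket: aws.String(req.Bucket),
		Key:    aws.String(req.Key),
	})
	if err != nil {
		return nil, err
	}
	defer obj.Body.Close()

	// Decompress on the fly; each line is assumed to be one JSON log record.
	gz, err := gzip.NewReader(obj.Body)
	if err != nil {
		return nil, err
	}
	var matches []string
	sc := bufio.NewScanner(gz)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long log lines
	for sc.Scan() {
		if strings.Contains(sc.Text(), req.Filter) {
			matches = append(matches, sc.Text())
		}
	}
	return matches, sc.Err()
}

func main() {
	lambda.Start(handler)
}
```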

In theory, raw JSON scanning speeds of up to 250 GB per second can be achieved with this solution (roughly 2000 concurrent Lambdas at the ~135 MB/s per Lambda measured above).
With such power and cost-efficiency you could sell your SaaS solution to big clients with terabytes of logs who are currently paying millions to Datadog, Splunk, etc.

nitisht (Member) commented Jan 30, 2024

Thanks for the kind words @gladkikhartem, this seems interesting. Please allow us some time to discuss and we'll get back.

nitisht added the enhancement (New feature or request) and future labels on Feb 18, 2024