-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change partitioning strategy for online processing #793
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
|
* add the fink_mm pipeline into raw2science * pep8 requirements and add documentation and comments * fix bugs and problems with fink CI, restore the stream test, preparation for fink-mm test * pep8 * unit test fixed * add stream_integration argument * add echo path in test * fixed pythonpath * fixed pythonpath * install fink-mm dev version, to remove after the test dev phase * add datatest for all topics * add datasim for join with gcn * add gcn data test * pep8 * integrate fink-mm distribution to the broker * update fink-mm commit in workflow, pep8 * add mechanism to avoid bad schema inference of spark dataframe with fink-mm * raw2science too short in CI to generate MM join data * add tests for fink-mm offline * review modification, fix the fink_mm offline test conf * fix distribution CI, drop new timestamp column for fink_mm, convert new timestamp column into string for fink-broker * pep8 * fix parser default * Format files * Remove NIGHT declaration duplicate * Style * Fix headers * Ruff formatting * Fix module path * Merge mm utils into a single module mm_utils.py * Refactor the fink-mm section in raw2science * Refactor distribute * Cleaning files (#849) * Remove the need for SCRAM * Update fink bin * Increase the number of shuffle partition for SSO * Push all alerts in once * Update science elasticc * Use subscribePattern instead of subscribe * Update scheduler * Format code * Discard alerts with i band measurements (#839) * Add new argument in configuration file * Update conf files * Add missing argument * Format raw2science * Better path management * Fix bug in path * Check if files exist -- not just the folder (#851) * Check if files exist -- not just the folder * Bump fink-filters to 3.29, and test it on CI * Bump fink-filters to 3.30 * Improve verbosity when trying to launch fink-mm * Apply ruff * Add missing parameter in the conf file * Check HDFS folder is not empty before launching services * Wait for one batch to complete before launching * Switch Docker image * Use the streaming DF to infer schema (#853) --------- Co-authored-by: JulienPeloton <[email protected]>
|
* Update the rowkey construction * Update tester to enable capability to test one file * Improve CD process - increase argoCD usage - use Spark operator - use Minio operator - add Helm chart for fink-broker - Improve logging management - Use finkctl to create kafka secret - Bump ciux to v0.0.3-rc4 - Bump ktbx to v1.1.3-rc1 - Increase sync checks in CI * Wait for input topic to exist * Wait for fink-producer secret to appear * Add temporary hack to CI * Improve code format an linting * add delta time (#860) * Fix typo * Fix column name when constructing the rowkey (#864) * Fix column name when constructing the rowkey * Reformat * Remove unused (and wrong) row key addition * PEP8 * Fix bug in column names * Improve logging message * Trigger GHA build via cron * Remove tmate session in ci * Remove sudo for docker prune in ci * Improve pip dependencies management Add Dockerfile to ciux source pathes Increase parameters management Add separate log level for spark Improve build script configuration * Improve fink-broker configuration * Fix ciux init in CI * Improve fink startup script * Use finkctl new release * Document release management * Ruff * clean CI yaml * Ruff * Force ipv4 for Kafka * Change the path to the fink alert simulator * Restore the path. We have a problem because schema cannot be read. * Update the configuration for the schema * Change get_fink_logger into init_logger --------- Co-authored-by: Fabrice Jammes <[email protected]> Co-authored-by: Anais Möller <[email protected]> Co-authored-by: Fabrice Jammes <[email protected]>
…n Sentinel for the science and e2e-gha for the noscience
* Add script to detect hostless candidates * notes * updated model (#869) * Bump fink-science and fink-filters * Switch to Telegram bot for hostless detection * Ruff * Bump requirements for fink-utils * Fix missing args * Split database operations --------- Co-authored-by: Anais Möller <[email protected]>
|
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
IMPORTANT: Please create an issue first before opening a Pull Request.
Linked to issue(s):
What changes were proposed in this pull request?
This PR modifies the online processing to remove the partitioning of the data. Concretely, we were storing data under:
now it changed to:
where
NIGHT=YYYYMMDD
. Note that thearchive
folder remains untouched however:Also
dropduplicates
condition in stream2raw which was slowing down the processing (and was fixed before).brokerIngestTimestamp
,brokerStartProcessTimestamp
,brokerEndProcessTimestamp
) to profile the processing at large. It is not clear if the last 2 will work as expected (functional programming!)How was this patch tested?
CI & cloud