Change partitioning strategy for online processing #793

Merged
JulienPeloton merged 36 commits into master from issue/770/duplicates
Aug 26, 2024

@JulienPeloton JulienPeloton commented Jan 18, 2024

IMPORTANT: Please create an issue first before opening a Pull Request.
Linked to issue(s):

What changes were proposed in this pull request?

This PR removes the year/month/day partitioning of the data written by the online processing. Previously, data was stored under:

online/raw/year=<YYYY>/month=<MM>/day=<DD>
online/science/year=<YYYY>/month=<MM>/day=<DD>

It is now stored under:

online/raw/<NIGHT>
online/science/<NIGHT>

where NIGHT=YYYYMMDD. The layout of the archive folder is unchanged:

archive/raw/year=<YYYY>/month=<MM>/day=<DD>
archive/science/year=<YYYY>/month=<MM>/day=<DD>
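As a rough illustration of the two layouts, here is a minimal sketch that builds both paths from a NIGHT string. The helper names `online_path` and `archive_path` are hypothetical and are not taken from the fink-broker code base; only the path templates come from the description above.

```python
from datetime import datetime


def _parse_night(night: str) -> datetime:
    """Parse a NIGHT string (YYYYMMDD); raise ValueError if it is malformed."""
    if len(night) != 8:
        raise ValueError(f"NIGHT must be 8 characters (YYYYMMDD), got {night!r}")
    return datetime.strptime(night, "%Y%m%d")


def online_path(stage: str, night: str) -> str:
    """New online layout: a single folder per night, no partition columns."""
    _parse_night(night)  # validation only
    return f"online/{stage}/{night}"


def archive_path(stage: str, night: str) -> str:
    """Archive layout (unchanged): partitioned by year, month and day."""
    date = _parse_night(night)
    return f"archive/{stage}/year={date:%Y}/month={date:%m}/day={date:%d}"


print(online_path("raw", "20240118"))
print(archive_path("science", "20240118"))
```

Under these assumptions the online folder for a night is addressed directly, while the archive keeps the partitioned layout.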

Also

  1. we remove the dropduplicates condition in stream2raw, which was slowing down the processing (and was fixed before).
  2. we add new fields to the alert packets (brokerIngestTimestamp, brokerStartProcessTimestamp, brokerEndProcessTimestamp) to profile the processing at large. It is not clear whether the last two will work as expected (functional programming!).
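One possible reading of the "functional programming!" caveat is lazy evaluation: declaring a processing stage does not run it, so a timestamp attached by a stage is taken when a record is actually materialized. The sketch below illustrates that with plain Python generators; it is not Spark code and not the broker's implementation. The helpers `add_timestamp` and `make_counter_clock` and the `objectId` values are hypothetical, and a deterministic counter stands in for the wall clock.

```python
from itertools import count
from typing import Callable, Dict, Iterable, Iterator


def make_counter_clock() -> Callable[[], int]:
    """Deterministic stand-in for a wall clock: returns 1, 2, 3, ..."""
    ticks = count(1)
    return lambda: next(ticks)


def add_timestamp(
    records: Iterable[Dict], field: str, clock: Callable[[], int]
) -> Iterator[Dict]:
    """Lazily attach clock() to each record under `field`."""
    for record in records:
        yield {**record, field: clock()}


clock = make_counter_clock()
alerts = [{"objectId": "a"}, {"objectId": "b"}]

# Declaring the stages does not execute them: no tick is consumed here.
started = add_timestamp(alerts, "brokerStartProcessTimestamp", clock)
ended = add_timestamp(started, "brokerEndProcessTimestamp", clock)

ticks_before_execution = clock()  # first tick is still available

# Materializing the pipeline pulls records one by one through both stages.
processed = list(ended)
for record in processed:
    print(record)
```

In this toy model the start and end stamps of each record are taken back to back at materialization time, whatever was declared between the two stages, which is the kind of behaviour that could make the last two fields less informative than intended.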

How was this patch tested?

CI & cloud

@JulienPeloton JulienPeloton added this to the 3.2 milestone Jan 18, 2024
@JulienPeloton JulienPeloton changed the title [stream2raw] Change partitioning strategy for online processing Jan 18, 2024

Quality Gate passed

Issues
1 New issue

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

FusRoman and others added 4 commits June 11, 2024 07:14
* add the fink_mm pipeline into raw2science

* pep8 requirements and add documentation and comments

* fix bugs and problems with fink CI, restore the stream test, preparation for fink-mm test

* pep8

* unit test fixed

* add stream_integration argument

* add echo path in test

* fixed pythonpath

* fixed pythonpath

* install fink-mm dev version, to remove after the test dev phase

* add datatest for all topics

* add datasim for join with gcn

* add gcn data test

* pep8

* integrate fink-mm distribution to the broker

* update fink-mm commit in workflow, pep8

* add mechanism to avoid bad schema inference of spark dataframe with fink-mm

* raw2science too short in CI to generate MM join data

* add tests for fink-mm offline

* review modification, fix the fink_mm offline test conf

* fix distribution CI, drop new timestamp column for fink_mm, convert new timestamp column into string for fink-broker

* pep8

* fix parser default

* Format files

* Remove NIGHT declaration duplicate

* Style

* Fix headers

* Ruff formatting

* Fix module path

* Merge mm utils into a single module mm_utils.py

* Refactor the fink-mm section in raw2science

* Refactor distribute

* Cleaning files (#849)

* Remove the need for SCRAM

* Update fink bin

* Increase the number of shuffle partition for SSO

* Push all alerts at once

* Update science elasticc

* Use subscribePattern instead of subscribe

* Update scheduler

* Format code

* Discard alerts with i band measurements (#839)

* Add new argument in configuration file

* Update conf files

* Add missing argument

* Format raw2science

* Better path management

* Fix bug in path

* Check if files exist -- not just the folder (#851)

* Check if files exist -- not just the folder

* Bump fink-filters to 3.29, and test it on CI

* Bump fink-filters to 3.30

* Improve verbosity when trying to launch fink-mm

* Apply ruff

* Add missing parameter in the conf file

* Check HDFS folder is not empty before launching services

* Wait for one batch to complete before launching

* Switch Docker image

* Use the streaming DF to infer schema (#853)

---------

Co-authored-by: JulienPeloton <[email protected]>

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

JulienPeloton and others added 2 commits July 26, 2024 07:57
* Update the rowkey construction

* Update tester to enable capability to test one file

* Improve CD process

- increase argoCD usage
- use Spark operator
- use Minio operator
- add Helm chart for fink-broker
- Improve logging management
- Use finkctl to create kafka secret
- Bump ciux to v0.0.3-rc4
- Bump ktbx to v1.1.3-rc1
- Increase sync checks in CI
  * Wait for input topic to exist
  * Wait for fink-producer secret to appear

* Add temporary hack to CI

* Improve code format and linting

* add delta time (#860)

* Fix typo

* Fix column name when constructing the rowkey (#864)

* Fix column name when constructing the rowkey

* Reformat

* Remove unused (and wrong) row key addition

* PEP8

* Fix bug in column names

* Improve logging message

* Trigger GHA build via cron

* Remove tmate session in ci

* Remove sudo for docker prune in ci

* Improve pip dependencies management

- Add Dockerfile to ciux source paths
- Increase parameters management
- Add separate log level for spark
- Improve build script configuration

* Improve fink-broker configuration

* Fix ciux init in CI

* Improve fink startup script

* Use finkctl new release

* Document release management

* Ruff

* clean CI yaml

* Ruff

* Force ipv4 for Kafka

* Change the path to the fink alert simulator

* Restore the path. We have a problem because schema cannot be read.

* Update the configuration for the schema

* Change get_fink_logger into init_logger

---------

Co-authored-by: Fabrice Jammes <[email protected]>
Co-authored-by: Anais Möller <[email protected]>

@JulienPeloton JulienPeloton merged commit dbb8f5c into master Aug 26, 2024
14 checks passed
@fjammes fjammes deleted the issue/770/duplicates branch August 28, 2024 14:35
Development

Successfully merging this pull request may close these issues.

No need for online partitioning?
[stream2raw] do not remove duplicate
2 participants