Change partitioning strategy for online processing #793

JulienPeloton · 2024-01-18T09:42:41Z

Linked to issue(s):

Closes [stream2raw] do not remove duplicate #770
Closes No need for online partitioning? #783

What changes were proposed in this pull request?

This PR modifies the online processing to remove the partitioning of the data. Concretely, we were storing data under:

online/raw/year=<YYYY>/month=<MM>/day=<DD>
online/science/year=<YYYY>/month=<MM>/day=<DD>

It is now stored under:

online/raw/<NIGHT>
online/science/<NIGHT>

where NIGHT=YYYYMMDD. Note, however, that the archive folder layout is unchanged:

archive/raw/year=<YYYY>/month=<MM>/day=<DD>
archive/science/year=<YYYY>/month=<MM>/day=<DD>
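
The two layouts can be sketched as small path builders. This is only an illustration, not the broker's code: `online_path` and `archive_path` are hypothetical helper names, and zero-padded month and day values are assumed from the `<MM>` and `<DD>` placeholders above.

```python
from datetime import date


def online_path(kind: str, night: date) -> str:
    """New online layout: one flat folder per night, named YYYYMMDD."""
    return f"online/{kind}/{night:%Y%m%d}"


def archive_path(kind: str, night: date) -> str:
    """Archive layout (unchanged): year/month/day partition folders."""
    return f"archive/{kind}/year={night:%Y}/month={night:%m}/day={night:%d}"


night = date(2024, 1, 18)
print(online_path("raw", night))       # online/raw/20240118
print(archive_path("science", night))  # archive/science/year=2024/month=01/day=18
```

With the flat layout, a job that only needs one night of online data can read a single folder directly, without resolving three levels of partition directories.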

Also:

* We remove the dropduplicates condition in stream2raw, which was slowing down the processing and is no longer needed (the problem it worked around was fixed earlier).
* We add new fields to the alert packets (brokerIngestTimestamp, brokerStartProcessTimestamp, brokerEndProcessTimestamp) to profile the processing at large. It is not clear whether the last two will work as expected (functional programming!).
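
As a rough sketch of how the three new fields could be used for profiling, the snippet below derives two durations from one alert. It is an assumption-laden illustration: `profile_alert` is a hypothetical helper, the alert is modelled as a plain dictionary, and the fields are assumed to hold `datetime` values (the actual packets may store them differently, for example as strings).

```python
from datetime import datetime, timedelta


def profile_alert(alert: dict) -> dict:
    """Return waiting time and processing time for one alert, in seconds."""
    ingest = alert["brokerIngestTimestamp"]
    start = alert["brokerStartProcessTimestamp"]
    end = alert["brokerEndProcessTimestamp"]
    return {
        "wait_s": (start - ingest).total_seconds(),   # ingest -> start of processing
        "process_s": (end - start).total_seconds(),   # start -> end of processing
    }


t0 = datetime(2024, 1, 18, 9, 0, 0)
alert = {
    "brokerIngestTimestamp": t0,
    "brokerStartProcessTimestamp": t0 + timedelta(seconds=2),
    "brokerEndProcessTimestamp": t0 + timedelta(seconds=5),
}
print(profile_alert(alert))  # {'wait_s': 2.0, 'process_s': 3.0}
```

If the last two timestamps do not behave as expected, for instance if they come out identical for every alert, `process_s` would always be 0, so a check like this is a quick way to spot the problem.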

How was this patch tested?

CI & cloud

sonarqubecloud · 2024-02-20T09:43:13Z

Quality Gate passed

Issues
1 New issue

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

* add the fink_mm pipeline into raw2science
* pep8 requirements and add documentation and comments
* fix bugs and problems with fink CI, restore the stream test, preparation for fink-mm test
* pep8
* unit test fixed
* add stream_integration argument
* add echo path in test
* fixed pythonpath
* fixed pythonpath
* install fink-mm dev version, to remove after the test dev phase
* add datatest for all topics
* add datasim for join with gcn
* add gcn data test
* pep8
* integrate fink-mm distribution to the broker
* update fink-mm commit in workflow, pep8
* add mechanism to avoid bad schema inference of spark dataframe with fink-mm
* raw2science too short in CI to generate MM join data
* add tests for fink-mm offline
* review modification, fix the fink_mm offline test conf
* fix distribution CI, drop new timestamp column for fink_mm, convert new timestamp column into string for fink-broker
* pep8
* fix parser default
* Format files
* Remove NIGHT declaration duplicate
* Style
* Fix headers
* Ruff formatting
* Fix module path
* Merge mm utils into a single module mm_utils.py
* Refactor the fink-mm section in raw2science
* Refactor distribute
* Cleaning files (#849)
* Remove the need for SCRAM
* Update fink bin
* Increase the number of shuffle partition for SSO
* Push all alerts in once
* Update science elasticc
* Use subscribePattern instead of subscribe
* Update scheduler
* Format code
* Discard alerts with i band measurements (#839)
* Add new argument in configuration file
* Update conf files
* Add missing argument
* Format raw2science
* Better path management
* Fix bug in path
* Check if files exist -- not just the folder (#851)
* Check if files exist -- not just the folder
* Bump fink-filters to 3.29, and test it on CI
* Bump fink-filters to 3.30
* Improve verbosity when trying to launch fink-mm
* Apply ruff
* Add missing parameter in the conf file
* Check HDFS folder is not empty before launching services
* Wait for one batch to complete before launching
* Switch Docker image
* Use the streaming DF to infer schema (#853)

---------

Co-authored-by: JulienPeloton <[email protected]>

sonarqubecloud · 2024-06-13T06:14:05Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

* Update the rowkey construction
* Update tester to enable capability to test one file
* Improve CD process
  - increase argoCD usage
  - use Spark operator
  - use Minio operator
  - add Helm chart for fink-broker
  - Improve logging management
  - Use finkctl to create kafka secret
  - Bump ciux to v0.0.3-rc4
  - Bump ktbx to v1.1.3-rc1
  - Increase sync checks in CI
* Wait for input topic to exist
* Wait for fink-producer secret to appear
* Add temporary hack to CI
* Improve code format an linting
* add delta time (#860)
* Fix typo
* Fix column name when constructing the rowkey (#864)
* Fix column name when constructing the rowkey
* Reformat
* Remove unused (and wrong) row key addition
* PEP8
* Fix bug in column names
* Improve logging message
* Trigger GHA build via cron
* Remove tmate session in ci
* Remove sudo for docker prune in ci
* Improve pip dependencies management
  Add Dockerfile to ciux source pathes
  Increase parameters management
  Add separate log level for spark
  Improve build script configuration
* Improve fink-broker configuration
* Fix ciux init in CI
* Improve fink startup script
* Use finkctl new release
* Document release management
* Ruff
* clean CI yaml
* Ruff
* Force ipv4 for Kafka
* Change the path to the fink alert simulator
* Restore the path. We have a problem because schema cannot be read.
* Update the configuration for the schema
* Change get_fink_logger into init_logger

---------

Co-authored-by: Fabrice Jammes <[email protected]>
Co-authored-by: Anais Möller <[email protected]>
Co-authored-by: Fabrice Jammes <[email protected]>

* Add script to detect hostless candidates
* notes
* updated model (#869)
* Bump fink-science and fink-filters
* Switch to Telegram bot for hostless detection
* Ruff
* Bump requirements for fink-utils
* Fix missing args
* Split database operations

---------

Co-authored-by: Anais Möller <[email protected]>

sonarqubecloud · 2024-08-26T12:39:00Z

Quality Gate passed

Issues
5 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

JulienPeloton added 6 commits January 18, 2024 08:57

Remove dropduplicates which is no more needed

9fcd4bf

Remove partitioning for online data

1bf32a0

Add timings in the alert packets

ee5ba52

Fix path for distribution

a5952d4

Modify database script with new paths

566b757

Enable DB integration tests

6c79f12

JulienPeloton added the apache spark, apache parquet and streaming labels Jan 18, 2024

JulienPeloton added this to the 3.2 milestone Jan 18, 2024

JulienPeloton changed the title ~~[stream2raw]~~ Change partitioning strategy for online processing Jan 18, 2024

JulienPeloton added 6 commits January 18, 2024 12:55

Rename paths for test data

278db90

Fix missing import

a8ec0a6

Update configuration file with dependencies

82da46b

Update paths

5cedfac

Merge remote-tracking branch 'origin' into issue/770/duplicates

7bb3592

Merge remote-tracking branch 'origin' into issue/770/duplicates

1202d0f

JulienPeloton added 4 commits May 22, 2024 14:42

Fix conflicts

d77d4f8

Check and lint

029a0c4

Fix typo when importing module

58b0bfc

Remove unused code

1468c2e

JulienPeloton mentioned this pull request May 31, 2024

[Bug] Parquet reader produces non-nullable columns #852

Closed

FusRoman and others added 4 commits June 11, 2024 07:14

Fix conflict with origin

be7430a

Do not filter alerts containing i-band only in the history (#855)

41ef904

Update ZTF schedule (#857)

410c106

JulienPeloton and others added 2 commits July 26, 2024 07:57

Merge branch 'master' into issue/770/duplicates

9710782

JulienPeloton and others added 14 commits July 29, 2024 16:23

Update the topic value for helm

8b699c3

Add night as asrgument for stream2raw in the CI

50a1d4a

Update conf

7a78ef4

Increase the default Kafka delivery.timeout.ms

bd21c56

Change topic name for Sentinel

227eed1

Use the night argument to make the partitioning instead of the alert timestamp

24ad76a

Relaunch CI

1790b18

Add a new argument --noscience to bin/fink

ff9e880

Cast only if fields exist

7d8e934

Update distribute with old Kafka setup

26f3d47

Put on hold e2e tests on VD until a solution is found. Only relying on Sentinel for the science and e2e-gha for the noscience

594b227

Check the output topic in Kafka for Sentinel

f7ef099

Ignore unecessary rule

6d45905

JulienPeloton merged commit dbb8f5c into master Aug 26, 2024
14 checks passed

fjammes deleted the issue/770/duplicates branch August 28, 2024 14:35
