Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect user_identifier selected if snowplow__user_sql or user_identifiers is inconsistent #193

Open
1 of 6 tasks
jethron opened this issue Feb 7, 2025 · 0 comments
Open
1 of 6 tasks
Assignees
Labels
status:needs_triage Needs maintainer triage. type:bug Bugs or weaknesses. The issue has to contain steps to reproduce.

Comments

@jethron
Copy link

jethron commented Feb 7, 2025

Describe the bug

When building base_create_snowplow_sessions_lifecycle_manifest with the snowplow__user_sql var populated, or if snowplow__user_identifiers has > 1 value, if the value produced is inconsistent throughout a session, the value chosen as user_identifier can be unexpected because the values are aggregated via max().

This can occur for instance if user_id is one of the preferred values and one or more event in the session does not have a user_id; e.g. with a snowplow__user_sql like coalesce(user_id, domain_userid) if some events are missing the user_id, then the domain_userid will be selected for those events instead; the final user_identifier when will be whichever value is largest, which will not consistently be user_id if the random domain_userid happens to be larger.

This is unexpected because the intent between coalescing multiple values is generally that the first found value is preferred but you will randomly get other values.

For snowplow__user_identifiers, it would be preferable if rather than max(coalesce(..., ..., ...)) the computation was coalesce(max(...), max(...), max(...)) so preferences would be more deterministic.

For snowplow__user_sql you can not use this approach because you can not nest aggregate functions. As a temporary workaround we're using an approach like:

coalesce(
    max(user_id),
    max(domain_userid),
    max(network_userid)
)) over (partition by session_identifier

to turn the outer max() call into a window function and allow the inner aggregates; this is obviously undesirable.

Steps to reproduce

  • Configure user_identifier options that are inconsistent throughout a session
  • Get unexpected user_identifier values

Expected results

snowplow__user_sql: "coalesce(max(user_id), max(domain_userid), max(network_userid))"
select session_identifier, coalesce(max(user_id), max(domain_userid)) user_identifier
from (
  select 'a' session_identifier, null user_id, '1' domain_userid
  union all
  select 'a' session_identifier, '2' user_id, '1' domain_userid
  union all
  select 'b' session_identifier, null user_id, '3' domain_userid
  union all
  select 'b' session_identifier, '2' user_id, '3' domain_userid
)
group by 1
session_identifier user_identifier
a 2
b 2

Actual results

snowplow__user_sql: "coalesce(user_id, domain_userid, network_userid)"
select session_identifier, max(coalesce(user_id, domain_userid)) user_identifier
from (
  select 'a' session_identifier, null user_id, '1' domain_userid
  union all
  select 'a' session_identifier, '2' user_id, '1' domain_userid
  union all
  select 'b' session_identifier, null user_id, '3' domain_userid
  union all
  select 'b' session_identifier, '2' user_id, '3' domain_userid
)
group by 1
session_identifier user_identifier
a 2
b 3

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • databricks
  • other (specify: ____________)

Additional context

This won't impact most users that use the default identifier settings, but for those that have their own surrogate values used for user identification if an event is missing the surrogates for any reason this issue occurs.

@jethron jethron added the type:bug Bugs or weaknesses. The issue has to contain steps to reproduce. label Feb 7, 2025
@github-actions github-actions bot added the status:needs_triage Needs maintainer triage. label Feb 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status:needs_triage Needs maintainer triage. type:bug Bugs or weaknesses. The issue has to contain steps to reproduce.
Projects
None yet
Development

No branches or pull requests

3 participants