Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concurrent bulk generation of Snowflake IDs from same machine is failing #22

Open
anubhav-pandey1 opened this issue Feb 26, 2024 · 9 comments

Comments

@anubhav-pandey1
Copy link

anubhav-pandey1 commented Feb 26, 2024

If we try to bulk generate multiple snowflake IDs concurrently from the same machine, all the IDs are same:

const generateSnowflakeId = (machineId = null) => {
  const config = {
    instance_id: machineId,
    custom_epoch: new Date("2001-01-01").getTime(),
  };
  const uid = new Snowflake(config);
  const snowflakeId = uid.getUniqueID();
  return snowflakeId;
};

const COUNT = 10;
const CURRENT_MACHINE_ID = 1; // Assumed
const snowflakeIds = [];
for (let i = 0; i < COUNT; i++) {
  const snowflakeId = generateSnowflakeId(CURRENT_MACHINE_ID);
  snowflakeIds.push(objectId);
}

All the IDs are the same. Am I doing something incorrectly?

Node.js version: 16.17 and 18.17
Package version: 2.0.1

@anubhav-pandey1
Copy link
Author

anubhav-pandey1 commented Feb 26, 2024

It seems that the sequence ID is not being updated after creating the first snowflake.
I'm making this hypothesis because deliberately blocking the thread for 1 ms using sleep(1) seems to do the trick for generating unique snowflakes.

This would make it a critical issue as the library would essentially stop performing well in highly-concurrent horizontally scaled environments.

@anubhav-pandey1
Copy link
Author

I've raised a PR that seems to fix the issue. Please feel free to use the code and mould it to suit your coding standards and style.

@tangledbytes
Copy link
Owner

Hi @anubhav-pandey1, I am not sure if I understand the issue here. One Snowflake instance is supposed to be used per machine. Each instance should get a unique machine ID.

In the above example, it seems that 10 instances of Snowflake objects are being created with the same machine ID.

Am I missing something here?

@tangledbytes
Copy link
Owner

I also have a test in the CI which tests exactly this to make sure we don't generate duplicate IDs: https://github.com/tangledbytes/nodejs-snowflake/blob/master/tests/snowflake_core.rs#L144

@anubhav-pandey1
Copy link
Author

anubhav-pandey1 commented Feb 27, 2024

If a single machine decides to bulk generate 10 or 1000 objects (for instance for a bulk insert SQL query for payments), they should have different identifiers - this is a fundamental requirement for any unique identification system. The Snowflake ID system is supposed to support this by having an atomically incremented (thread-safe) sequence number so that 1000s of Snowflake IDs can be concurrently bulk generated by the same machine.

If all the IDs generated by the same machine within the same millisecond are the same, then I think it is not a unique identifier (UID) and also not Snowflake.

@anubhav-pandey1
Copy link
Author

anubhav-pandey1 commented Feb 27, 2024

Also, your test seems to be running on a sub-second granularity in a single-threaded environment. This could be why it missed this issue.

High-throughput systems with high-concurrency would work on a sub-millisecond granularity ie. they can generate thousands of objects within the same millisecond. Such systems are usually multi-threaded too and might run into race conditions with the current version of Snowflake ID generation.

I would recommend using my shared Node.js code to attempt bulk generating many IDs concurrently and the issue will be replicated.

I've also raised a PR (#23) to attempt fixing this issue.

@anubhav-pandey1
Copy link
Author

anubhav-pandey1 commented Mar 2, 2024

Hey @tangledbytes, a gentle request to review the PR #23 and merge the fix if it is acceptable.

@tangledbytes
Copy link
Owner

Hi @anubhav-pandey1, how do you plan on adding multi threads to nodejs? If worker threads or other things are supposed to be used then each thread should gets its own instance. There is absolutely no reason to have this thread-safe because there is no reason to share states across different threads. Just let each thread have their own instance (different machine ids though). As for your code, from my perception, it is not the right way to use, the correct code would look like:

const CURRENT_MACHINE_ID = 1; // Assumed
const config = {
    instance_id: CURRENT_MACHINE_ID,
    custom_epoch: new Date("2001-01-01").getTime(),
};

const uid = new Snowflake(config);

const generateSnowflakeId = () => {
  const snowflakeId = uid.getUniqueID();
  return snowflakeId;
};

const COUNT = 10;

const snowflakeIds = [];
for (let i = 0; i < COUNT; i++) {
  const snowflakeId = generateSnowflakeId();
  snowflakeIds.push(objectId);
}

Also, I would like you to clarify your use case here.

@Benjamin-Willard
Copy link

Yeah I ran into this exact problem when designing our system. I would run into multiple ids of the same millisecond when creating records on the same database instance using AWS Lambda. This is why my database contains a server that gets created when a new instance is spun up or a new lambda source is being generated. I have there a specific server id that gets passed to the global Snowflake configuration of the instance id. After doing this I no longer had problems.

So what @tangledbytes suggested is correct. If you are using like nodejs cluster or something of that sort. Each "worker thread" should have it's own unique id that is generated per configuration instance_id.

Based on your setup, you would probably want to do something like a snowflake cache of a single instance per instance_id.

An example of what I have done is this. Using Sequelize as the database ORM.

let __SnowflakeInstance: Snowflake;
const SnowflakeGenerator = () => {
    if( !__SnowflakeInstance ) {
        __SnowflakeInstance = new Snowflake({
            custom_epoch: 1324512000000,
            instance_id: Reflect.get(global, '__SNOWFLAKE_INSTANCE_ID__') || 1
        })
    }
    return __SnowflakeInstance;
};

Then I have something like

bindInstanceId(instanceId: number) {
        Reflect.set(global, '__SNOWFLAKE_INSTANCE_ID__', instanceId);
    }

In my main DbSchema class object. I call this right after I initialize the database and load the server's configuration. In this case I then bind that instance.

Example of an server database entry that I have that uses two different instance ids.

Screenshot 2024-04-19 at 1 03 23 AM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants