Skip to content

Conversation

jmpesp
Copy link
Contributor

@jmpesp jmpesp commented Sep 3, 2025

Silo users and groups have traditionally defined "external id" as whatever the silo's identity provider (or in the case of ApiOnly silos, an operator) assigns for them: this could be username, email, a random id, etc.

SCIM throws one of many wrenches at us because the specification defines an external id attribute for users and groups that is distinct from other related attributes that could be sent, including username and email. Importantly the SCIM client that is creating a user or group could send a NULL external ID, which is not supported by the current field's database type.

The first step to supporting SCIM's extra attributes is to add the silo's user provision field to our user and group models, and use this field to change the semantics of the external id field depending on what provision type is used. For the existing ApiOnly and JIT silos this will stay the same.

The second step that will help with type safety is to use this new user provision type field to separate the silo user and group database models into distinct silo user and group types at the datastore level, and have separate enum variants for each type that wrap an inner typed object:

pub enum SiloUser {
    /// A User created via the API
    ApiOnly(SiloUserApiOnly),

    /// A User created during an authenticated SAML login
    Jit(SiloUserJit),
}

Call sites now have to use either the ApiOnly or Jit variant of this new higher level silo user type. With the addition of user provision type to the silo user model, Nexus can also now check that the provision type on the user matches the silo, or (like with some of the local idp functions) accept only one specific variant.

The third step that will also help with type safety is to have typed lookups for silo users and groups, as different user provision types will have different fields that they enforce uniqueness for: for example, only search for ApiOnly and JIT users using external_id, but for SCIM use username.

One possible bug that this commit addresses is that the creation of the recovery silo did not have a check to ensure that the created silo was a "LocalOnly" identity type. The code prior to this commit could create a SamlJit silo, which is not ideal for a recovery silo!

Another possible bug potential addressed is a mismatch in the silo's user provision type, and the users within that silo. This can be enforced via functions that only accept one variant of the new higher level enum, and is checked at the lower layers.

Another bug addressed: in silo_group_ensure there was a small window between silo_group_ensure_query and unwrapping the result of silo_group_optional_lookup where, if it were possible to delete groups, a racing delete would cause the unwrap to panic.

Also! Fill in test_ensure_same_silo_group, why that was left truncated is beyond me, but don't look at the git blame please.

Silo users and groups have traditionally defined "external id" as
whatever the silo's identity provider (or in the case of ApiOnly silos,
an operator) assigns for them: this could be username, email, a random
id, etc.

SCIM throws one of many wrenches at us because the specification defines
an external id attribute for users and groups that is distinct from
other related attributes that could be sent, including username and
email. Importantly the SCIM client that is creating a user or group
could send a NULL external ID, which is not supported by the current
field's database type.

The first step to supporting SCIM's extra attributes is to add the
silo's user provision field to our user and group models, and use this
field to change the semantics of the external id field depending on what
provision type is used. For the existing ApiOnly and JIT silos this will
stay the same.

The second step that will help with type safety is to use this new user
provision type field to separate the silo user and group database models
into distinct silo user and group types at the datastore level, and have
separate enum variants for each type that wrap an inner typed object:

```rust
pub enum SiloUser {
    /// A User created via the API
    ApiOnly(SiloUserApiOnly),

    /// A User created during an authenticated SAML login
    Jit(SiloUserJit),
}
```

Call sites now have to use either the ApiOnly or Jit variant of this new
higher level silo user type. With the addition of user provision type to
the silo user model, Nexus can also now check that the provision type on
the user matches the silo, or (like with some of the local idp
functions) accept only one specific variant.

The third step that will also help with type safety is to have typed
lookups for silo users and groups, as different user provision types
will have different fields that they enforce uniqueness for: for
example, only search for ApiOnly and JIT users using external_id, but
for SCIM use username.

One possible bug that this commit addresses is that the creation of the
recovery silo did not have a check to ensure that the created silo was a
"LocalOnly" identity type. The code prior to this commit could create a
SamlJit silo, which is not ideal for a recovery silo!

Another possible bug potential addressed is a mismatch in the silo's
user provision type, and the users within that silo. This can be
enforced via functions that only accept one variant of the new higher
level enum, and is checked at the lower layers.

Another bug addressed: in `silo_group_ensure` there was a small window
between `silo_group_ensure_query` and unwrapping the result of
`silo_group_optional_lookup` where, if it were possible to delete
groups, a racing delete would cause the unwrap to panic.

Also! Fill in `test_ensure_same_silo_group`, why that was left truncated
is beyond me, but don't look at the git blame please.
Copy link
Contributor

@papertigers papertigers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a pass through this and found some nits. It might be worth having a second set of eyes on the diesel bits from someone who touches crdb more often than I do.

pub struct SiloGroup {
#[diesel(embed)]
identity: SiloGroupIdentity,
pub identity: SiloGroupIdentity,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these need to be fully pub or can we make them pub(crate)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, these need to be fully pub as the DataStore structs need to access those fields.

Comment on lines 439 to 449
match &recovery_silo.identity_mode {
shared::SiloIdentityMode::LocalOnly => {
// Ok
}

shared::SiloIdentityMode::SamlJit => {
return Err(RackInitError::Silo(Error::invalid_request(
"recovery silo should only use identity mode LocalOnly",
)));
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this match statement will grow with the addition of SCIM we can probably distill the logic down to just:

Suggested change
match &recovery_silo.identity_mode {
shared::SiloIdentityMode::LocalOnly => {
// Ok
}
shared::SiloIdentityMode::SamlJit => {
return Err(RackInitError::Silo(Error::invalid_request(
"recovery silo should only use identity mode LocalOnly",
)));
}
}
if !matches!(
&recovery_silo.identity_mode,
shared::SiloIdentityMode::LocalOnly
) {
return Err(RackInitError::Silo(Error::invalid_request(
"recovery silo should only use identity mode LocalOnly",
)));
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, done in e5b315b

silo_id: record.silo_id,
// SAFETY: there is a database constraint that prevents a group
// with provision type 'api_only' from having a null external id
external_id: record.external_id.unwrap(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this .expect("database constraint exists")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 5690907

silo_id: record.silo_id,
// SAFETY: there is a database constraint that prevents a group
// with provision type 'jit' from having a null external id
external_id: record.external_id.unwrap(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this .expect("database constraint exists")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 5690907

Comment on lines +338 to +339
// This function is _not_ transactional, meaning a delete could race
// between the ensure query and the lookup.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it okay if the delete occurs after the lookup and we return silo_group?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's ok - it's unlikely in the first place as the lookup occurs by ID, but it's better not to panic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the thing that would be deleted, if that happened? is it the silo group we just created, or the silo itself? because i think "not found" is definitely correct if the whole silo was being torn down...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this function it's the silo group. I've opened #9145. I think there's a bunch more work to do here related to silo user and group requests.

silo_id: record.silo_id,
// SAFETY: there is a database constraint that prevents a user
// with provision type 'api_only' from having a null external id
external_id: record.external_id.unwrap(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above with the expect

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 5690907

silo_id: record.silo_id,
// SAFETY: there is a database constraint that prevents a user
// with provision type 'jit' from having a null external id
external_id: record.external_id.unwrap(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above with the expect

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 5690907

CHECK (
CASE user_provision_type
WHEN 'api_only' THEN external_id IS NOT NULL
WHEN 'jit' THEN external_id IS NOT NULL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future SCIM will allow this to be NULL?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's right :)

jmpesp added a commit to jmpesp/omicron that referenced this pull request Sep 30, 2025
Build on oxidecomputer#8981 and add the SCIM user provision type to users and groups.
Users must have a user name, and optionally have an external id field
and/or an active field. Groups must have a display name and
optionally have an external id field.

Also add a new SiloIdentityMode named SamlScim.

A few functions have to change to account for these new variants, but
not many. Silos using the SamlScim identity mode will use all the same
SAML code that exists today.
@benjaminleonard
Copy link
Contributor

Discussed briefly in our chat, is it worth exposing the external ID on the user_view endpoint to assist with debugging, or is that likely not useful?

@jmpesp jmpesp removed the request for review from david-crespo October 1, 2025 14:53
@jmpesp
Copy link
Contributor Author

jmpesp commented Oct 1, 2025

Discussed briefly in our chat, is it worth exposing the external ID on the user_view endpoint to assist with debugging, or is that likely not useful?

I'd say not useful at this point. For these user types, external id matches the user name, and for SCIM user types, external id is the provisioning client's unique ID.


return Err(Error::invalid_request(
"user provision type of silo does not match silo group",
));
Copy link
Contributor

@david-crespo david-crespo Oct 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might it be helpful to say the two types here like in the above error log?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and, is the user expected to know what "user provision type" means? given that this isn't an internal error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

87d5a66 changes this to an internal error and adds the two types

Comment on lines 28 to 29
/// the identity provider's ID for this group. There is a database
/// constraint that this field must be non-null for those provision types.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpicky, feel free to ignore: it might be worth saying the name of the constraint here so that the reader can search for it if needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no I like this, added in 1aee591

Comment on lines +39 to +57
pub fn new_api_only_group(
id: SiloGroupUuid,
silo_id: Uuid,
external_id: String,
) -> Self {
Self {
identity: SiloGroupIdentity::new(id),
time_deleted: None,
silo_id,
user_provision_type: UserProvisionType::ApiOnly,
external_id: Some(external_id),
}
}

pub fn new_jit_group(
id: SiloGroupUuid,
silo_id: Uuid,
external_id: String,
) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why two separate constructors instead of taking UserProvisionType as an argument? is that because we're eventually going to have a SCIM UserProvisionType that will be constructed differently?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right yeah, and specifically the new_scim_group method will require different fields to be non-null (and additional fields that could be null)

Comment on lines +195 to +212
let silo_admin_group =
match new_silo_params.identity_mode.user_provision_type() {
shared::UserProvisionType::ApiOnly => {
db::model::SiloGroup::new_api_only_group(
silo_group_id,
silo_id,
admin_group_name.clone(),
)
}

shared::UserProvisionType::Jit => {
db::model::SiloGroup::new_jit_group(
silo_group_id,
silo_id,
admin_group_name.clone(),
)
}
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this sure seems like it would be a lot simpler if we just had one constructor that you pass the user_provision_type into, but i'm guessing a subsequent change will make it make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's additional behaviour changes required in this function (and others) based on the user provision type. Hopefully the additional "add the scim user provision type" PR will make this make more sense, I just wanted to break out the changes in the smallest parts that made sense (maybe this one was too small!).

Comment on lines 86 to 87
// SAFETY: there is a database constraint that prevents a group
// with provision type 'api_only' from having a null external id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: I would really prefer to follow the conventions that all-caps // SAFETY: comments are specifically for unsafe blocks and not for other correctness-critical things (you could reasonably argue that, if a // SAFETY: comment is for "anything that you need to know for the program to do the right thing", then all comments should have SAFETY in them...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good - addressed this in 1aee591

Comment on lines 100 to 101
// SAFETY: there is a database constraint that prevents a group
// with provision type 'jit' from having a null external id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed this also in 1aee591

fn from(u: SiloGroupApiOnly) -> views::Group {
views::Group {
id: u.id,
// TODO the use of external_id as display_name is temporary
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this something that will be done in this PR or subsequently?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hah this was copied over from the existing From, I'm not sure the long term plan here.


return Err(Error::invalid_request(
"user provision type of silo does not match silo group",
));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and, is the user expected to know what "user provision type" means? given that this isn't an internal error

Comment on lines 388 to 392
if silo.user_provision_type != lookup_user_provision_type {
return Err(Error::invalid_request(format!(
"silo provision type {:?} does not match lookup {:?}",
silo.user_provision_type, lookup_user_provision_type,
)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

above, we also log an error for this, should we do that here as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a log in 87d5a66

Comment on lines +338 to +339
// This function is _not_ transactional, meaning a delete could race
// between the ensure query and the lookup.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the thing that would be deleted, if that happened? is it the silo group we just created, or the silo itself? because i think "not found" is definitely correct if the whole silo was being torn down...

);

return Err(Error::invalid_request(
"user provision type of silo does not match silo user",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similarly to silo group, should we include the types here? also is a user expected to understand this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I was too focused on "invalid request" that some part of the code was making to the relevant functions, and that didn't connect with the fact that this is an external Error - I've changed this to a 500 in 87d5a66

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants