Add tests #26
base: main
Looks good, but we need coverage of synthesis mode. See above comments. Once the env vars are added to secrets, we'll merge.
@@ -0,0 +1,12 @@
/* retrieve from 1pass. Note is called pytest env file */
Do we need this in the SDK repo? It's only relevant for the backend.
I see, this is for the pipeline tests.
def check_dataset_str(original_text: str, dataset_str: str):
    # Extract all redacted portions using regex pattern for [ENTITY_TYPE_*]
    redaction_pattern = r"\[([A-Z_]+)(?:_[a-zA-Z0-9]+)?\]"
    redactions = re.findall(redaction_pattern, dataset_str)

    # Replace all redactions with empty string to get the non-redacted text
    non_redacted_text = re.sub(redaction_pattern, "", dataset_str)

    # Check if the non-redacted portions exist in the original text
    for segment in non_redacted_text.split():
        if segment.strip():  # Skip empty segments
            assert segment in original_text, (
                f"Non-redacted segment '{segment}' not found in original text"
            )

    # Ensure we found at least one redaction
    assert len(redactions) > 0, "No redactions found in the dataset string"
This is good, but note that it doesn't apply in synthesis mode. I'd suggest a similar method that:
1. asserts len(spans) > 0
2. asserts that original_text[span['start']:span['end']] == span['text']
3. asserts that dataset_str[span['new_start']:span['new_end']] == span['new_text']
This is a slightly different test than yours, so it can be done in addition to it, but the main point is that it exercises synthesis mode as well. We could add additional checks that, in synthesis mode, the replacement text doesn't contain the standard redaction pattern.
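A rough sketch of what that span-based check could look like. The name `check_spans` and the `synthesis` flag are hypothetical, and it assumes each span is a dict with the keys mentioned above (`start`, `end`, `text`, `new_start`, `new_end`, `new_text`); how the SDK actually exposes spans may differ.

```python
import re

# Same pattern that check_dataset_str uses for tags such as [NAME_1].
REDACTION_PATTERN = r"\[([A-Z_]+)(?:_[a-zA-Z0-9]+)?\]"


def check_spans(original_text, dataset_str, spans, synthesis=False):
    """Hypothetical helper: validate span offsets against both strings."""
    # 1. At least one span must be reported.
    assert len(spans) > 0, "No spans returned"

    for span in spans:
        # 2. Original offsets must point at the original entity text.
        assert original_text[span["start"]:span["end"]] == span["text"], (
            f"Original offsets do not match span text '{span['text']}'"
        )
        # 3. New offsets must point at the replacement in the output.
        assert dataset_str[span["new_start"]:span["new_end"]] == span["new_text"], (
            f"New offsets do not match replacement '{span['new_text']}'"
        )
        # Optional: synthesized replacements should not look like redaction tags.
        if synthesis:
            assert re.search(REDACTION_PATTERN, span["new_text"]) is None, (
                f"Synthesized text '{span['new_text']}' looks like a redaction tag"
            )


# Illustrative data only; real spans would come from the API response.
check_spans(
    "My name is Alice.",
    "My name is [NAME_1].",
    [{"start": 11, "end": 16, "text": "Alice",
      "new_start": 11, "new_end": 19, "new_text": "[NAME_1]"}],
)
check_spans(
    "My name is Alice.",
    "My name is Carol.",
    [{"start": 11, "end": 16, "text": "Alice",
      "new_start": 11, "new_end": 16, "new_text": "Carol"}],
    synthesis=True,
)
```

Because it only relies on offsets, the same helper would cover both redaction and synthesis mode, and it can run alongside `check_dataset_str` rather than replace it.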