Add tests #26

Open
wants to merge 5 commits into
base: main

Conversation

ethan-tonic
Collaborator

No description provided.

Contributor

@gandersteele gandersteele left a comment

Looks good, but we need coverage of synthesis mode; see the inline comments. Once the env vars are added to secrets, we'll merge.

@@ -0,0 +1,12 @@
/* retrieve from 1pass. Note is called pytest env file */
Contributor

do we need this in the sdk repo? only relevant for backend

Contributor

i see, this is for the pipeline tests

Comment on lines +15 to +31
def check_dataset_str(original_text: str, dataset_str: str):
    # Extract all redacted portions using regex pattern for [ENTITY_TYPE_*]
    redaction_pattern = r"\[([A-Z_]+)(?:_[a-zA-Z0-9]+)?\]"
    redactions = re.findall(redaction_pattern, dataset_str)

    # Replace all redactions with empty string to get the non-redacted text
    non_redacted_text = re.sub(redaction_pattern, "", dataset_str)

    # Check if the non-redacted portions exist in the original text
    for segment in non_redacted_text.split():
        if segment.strip():  # Skip empty segments
            assert segment in original_text, (
                f"Non-redacted segment '{segment}' not found in original text"
            )

    # Ensure we found at least one redaction
    assert len(redactions) > 0, "No redactions found in the dataset string"
Contributor

this is good, but note that it doesn't apply in synthesis mode. i'd suggest a similar method that
1. asserts len(spans) > 0
2. asserts that original_text[span['start']:span['end']] == span['text']
3. asserts that dataset_str[span['new_start']:span['new_end']] == span['new_text']
this is a slightly different test than yours, so it can be done in addition to yours, but the main point is that it exercises synthesis mode as well. we could add additional checks that, in synthesis mode, the replacement text doesn't contain the standard redaction pattern
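
For illustration, the three checks suggested above could be sketched roughly as follows. This is only a sketch under assumptions: the function name `check_spans`, the `synthesis` flag, and the idea that each span is a plain dict are hypothetical and not taken from the PR; only the span keys (`start`, `end`, `text`, `new_start`, `new_end`, `new_text`) and the redaction regex come from the conversation.

```python
import re

# Same pattern as in check_dataset_str: [ENTITY_TYPE] or [ENTITY_TYPE_suffix].
REDACTION_PATTERN = r"\[([A-Z_]+)(?:_[a-zA-Z0-9]+)?\]"


def check_spans(original_text, dataset_str, spans, synthesis=False):
    """Hypothetical helper; `spans` is assumed to be a list of dicts."""
    # 1. At least one entity must have been detected.
    assert len(spans) > 0, "No spans returned"

    for span in spans:
        # 2. The original offsets must point at the detected text.
        assert original_text[span["start"]:span["end"]] == span["text"], (
            f"Original offsets do not match span text {span['text']!r}"
        )
        # 3. The new offsets must point at the replacement in the output.
        assert dataset_str[span["new_start"]:span["new_end"]] == span["new_text"], (
            f"New offsets do not match replacement text {span['new_text']!r}"
        )
        # Optional extra check: synthesized replacements should not look
        # like a standard redaction token such as [NAME_GIVEN_x1].
        if synthesis:
            assert re.search(REDACTION_PATTERN, span["new_text"]) is None, (
                f"Synthesized text {span['new_text']!r} looks like a redaction"
            )


# Hand-made example data (not real SDK output).
original = "My name is Alice."
synthesized = "My name is Carol."
example_spans = [
    {"start": 11, "end": 16, "text": "Alice",
     "new_start": 11, "new_end": 16, "new_text": "Carol"},
]
check_spans(original, synthesized, example_spans, synthesis=True)
```

Because the checks rely only on offsets and span text, the same helper should work for both redaction and synthesis output, with `synthesis=True` enabling the additional pattern check.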

2 participants