Commit a4cae42

Remove unsupported models from docs (#34)
* Remove nano from docs, examples, benchmarking
* Updating graphs

1 parent 6888a6b commit a4cae42

14 files changed (+13, -37 lines)
Updated graph images (binary size changes: -9.25 KB, -46.4 KB, -89.2 KB, -80.3 KB); previews not shown.

docs/evals.md

Lines changed: 2 additions & 2 deletions

````diff
@@ -11,7 +11,7 @@ npm run eval -- --config-path guardrails_config.json --dataset-path data.jsonl
 
 ### Benchmark Mode
 ```bash
-npm run eval -- --config-path guardrails_config.json --dataset-path data.jsonl --mode benchmark --models gpt-5 gpt-5-mini gpt-5-nano
+npm run eval -- --config-path guardrails_config.json --dataset-path data.jsonl --mode benchmark --models gpt-5 gpt-5-mini gpt-4.1-mini
 ```
 
 ## Dependencies
@@ -160,4 +160,4 @@ npm run eval -- --config-path config.json --dataset-path data.jsonl --base-url h
 ## Next Steps
 
 - See the [API Reference](./ref/eval/guardrail_evals.md) for detailed documentation
-- Use [Wizard UI](https://guardrails.openai.com/) for configuring guardrails without code
+- Use [Wizard UI](https://guardrails.openai.com/) for configuring guardrails without code
````
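
The updated command above lists the models now supported for benchmark mode. As an illustrative sketch only, here is how that invocation could be driven from a Node/TypeScript script; the flags and model list come from the diff, while the wrapper file itself (`run_benchmark.ts`) and the use of `spawnSync` are assumptions, not part of this repo.

```typescript
// run_benchmark.ts - hypothetical wrapper around the documented CLI command.
import { spawnSync } from 'node:child_process';

// Models listed in the updated benchmark example; the nano variants were removed in this commit.
const models = ['gpt-5', 'gpt-5-mini', 'gpt-4.1-mini'];

const result = spawnSync(
  'npm',
  [
    'run', 'eval', '--',
    '--config-path', 'guardrails_config.json',
    '--dataset-path', 'data.jsonl',
    '--mode', 'benchmark',
    '--models', ...models,
  ],
  { stdio: 'inherit' }, // stream eval output to the current terminal
);

process.exit(result.status ?? 1);
```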

docs/ref/checks/hallucination_detection.md

Lines changed: 2 additions & 16 deletions

````diff
@@ -176,10 +176,8 @@ The statements cover various types of factual claims including:
 |--------------|---------|-------------|-------------|-------------|
 | gpt-5 | 0.854 | 0.732 | 0.686 | 0.670 |
 | gpt-5-mini | 0.934 | 0.813 | 0.813 | 0.770 |
-| gpt-5-nano | 0.566 | 0.540 | 0.540 | 0.533 |
 | gpt-4.1 | 0.870 | 0.785 | 0.785 | 0.785 |
 | gpt-4.1-mini (default) | 0.876 | 0.806 | 0.789 | 0.789 |
-| gpt-4.1-nano | 0.537 | 0.526 | 0.526 | 0.526 |
 
 **Notes:**
 - ROC AUC: Area under the ROC curve (higher is better)
@@ -193,10 +191,8 @@ The following table shows latency measurements for each model using the hallucin
 |--------------|--------------|--------------|
 | gpt-5 | 34,135 | 525,854 |
 | gpt-5-mini | 23,013 | 59,316 |
-| gpt-5-nano | 17,079 | 26,317 |
 | gpt-4.1 | 7,126 | 33,464 |
 | gpt-4.1-mini (default) | 7,069 | 43,174 |
-| gpt-4.1-nano | 4,809 | 6,869 |
 
 - **TTC P50**: Median time to completion (50% of requests complete within this time)
 - **TTC P95**: 95th percentile time to completion (95% of requests complete within this time)
@@ -218,10 +214,8 @@ In addition to the above evaluations which use a 3 MB sized vector store, the ha
 |--------------|---------------------|----------------------|---------------------|---------------------------|
 | gpt-5 | 28,762 / 396,472 | 34,135 / 525,854 | 37,104 / 75,684 | 40,909 / 645,025 |
 | gpt-5-mini | 19,240 / 39,526 | 23,013 / 59,316 | 24,217 / 65,904 | 37,314 / 118,564 |
-| gpt-5-nano | 13,436 / 22,032 | 17,079 / 26,317 | 17,843 / 35,639 | 21,724 / 37,062 |
 | gpt-4.1 | 7,437 / 15,721 | 7,126 / 33,464 | 6,993 / 30,315 | 6,688 / 127,481 |
 | gpt-4.1-mini (default) | 6,661 / 14,827 | 7,069 / 43,174 | 7,032 / 46,354 | 7,374 / 37,769 |
-| gpt-4.1-nano | 4,296 / 6,378 | 4,809 / 6,869 | 4,171 / 6,609 | 4,650 / 6,201 |
 
 - **Vector store size impact varies by model**: GPT-4.1 series shows minimal latency impact across vector store sizes, while GPT-5 series shows significant increases.
 
@@ -241,10 +235,6 @@ In addition to the above evaluations which use a 3 MB sized vector store, the ha
 | | Medium (3 MB) | 0.934 | 0.813 | 0.813 | 0.770 |
 | | Large (11 MB) | 0.919 | 0.817 | 0.817 | 0.817 |
 | | Extra Large (105 MB) | 0.909 | 0.793 | 0.793 | 0.711 |
-| **gpt-5-nano** | Small (1 MB) | 0.590 | 0.547 | 0.545 | 0.536 |
-| | Medium (3 MB) | 0.566 | 0.540 | 0.540 | 0.533 |
-| | Large (11 MB) | 0.564 | 0.534 | 0.532 | 0.507 |
-| | Extra Large (105 MB) | 0.603 | 0.570 | 0.558 | 0.550 |
 | **gpt-4.1** | Small (1 MB) | 0.907 | 0.839 | 0.839 | 0.839 |
 | | Medium (3 MB) | 0.870 | 0.785 | 0.785 | 0.785 |
 | | Large (11 MB) | 0.846 | 0.753 | 0.753 | 0.753 |
@@ -253,15 +243,11 @@ In addition to the above evaluations which use a 3 MB sized vector store, the ha
 | | Medium (3 MB) | 0.876 | 0.806 | 0.789 | 0.789 |
 | | Large (11 MB) | 0.862 | 0.791 | 0.757 | 0.757 |
 | | Extra Large (105 MB) | 0.802 | 0.722 | 0.722 | 0.722 |
-| **gpt-4.1-nano** | Small (1 MB) | 0.605 | 0.528 | 0.528 | 0.528 |
-| | Medium (3 MB) | 0.537 | 0.526 | 0.526 | 0.526 |
-| | Large (11 MB) | 0.618 | 0.531 | 0.531 | 0.531 |
-| | Extra Large (105 MB) | 0.636 | 0.528 | 0.528 | 0.528 |
 
 **Key Insights:**
 
 - **Best Performance**: gpt-5-mini consistently achieves the highest ROC AUC scores across all vector store sizes (0.909-0.939)
-- **Best Latency**: gpt-4.1-nano shows the most consistent and lowest latency across all scales (4,171-4,809ms P50) but shows poor performance
+- **Best Latency**: gpt-4.1-mini shows the most consistent and lowest latency across all scales (6,661-7,374ms P50) while maintaining solid accuracy
 - **Most Stable**: gpt-4.1-mini (default) maintains relatively stable performance across vector store sizes with good accuracy-latency balance
 - **Scale Sensitivity**: gpt-5 shows the most variability in performance across vector store sizes, with performance dropping significantly at larger scales
 - **Performance vs Scale**: Most models show decreasing performance as vector store size increases, with gpt-5-mini being the most resilient
@@ -271,4 +257,4 @@ In addition to the above evaluations which use a 3 MB sized vector store, the ha
 - **Signal-to-noise ratio degradation**: Larger vector stores contain more irrelevant documents that may not be relevant to the specific factual claims being validated
 - **Semantic search limitations**: File search retrieves semantically similar documents, but with a large diverse knowledge source, these may not always be factually relevant
 - **Document quality matters more than quantity**: The relevance and accuracy of documents is more important than the total number of documents
-- **Performance plateaus**: Beyond a certain size (11 MB), the performance impact becomes less severe
+- **Performance plateaus**: Beyond a certain size (11 MB), the performance impact becomes less severe
````

docs/ref/checks/jailbreak.md

Lines changed: 0 additions & 4 deletions

````diff
@@ -95,21 +95,17 @@ This benchmark evaluates model performance on a diverse set of prompts:
 |--------------|---------|-------------|-------------|-------------|-----------------|
 | gpt-5 | 0.979 | 0.973 | 0.970 | 0.970 | 0.733 |
 | gpt-5-mini | 0.954 | 0.990 | 0.900 | 0.900 | 0.768 |
-| gpt-5-nano | 0.962 | 0.973 | 0.967 | 0.965 | 0.048 |
 | gpt-4.1 | 0.990 | 1.000 | 1.000 | 0.984 | 0.946 |
 | gpt-4.1-mini (default) | 0.982 | 0.992 | 0.992 | 0.954 | 0.444 |
-| gpt-4.1-nano | 0.934 | 0.924 | 0.924 | 0.848 | 0.000 |
 
 #### Latency Performance
 
 | Model | TTC P50 (ms) | TTC P95 (ms) |
 |--------------|--------------|--------------|
 | gpt-5 | 4,569 | 7,256 |
 | gpt-5-mini | 5,019 | 9,212 |
-| gpt-5-nano | 4,702 | 6,739 |
 | gpt-4.1 | 841 | 1,861 |
 | gpt-4.1-mini | 749 | 1,291 |
-| gpt-4.1-nano | 683 | 890 |
 
 **Notes:**
 
````
docs/ref/checks/nsfw.md

Lines changed: 0 additions & 2 deletions

````diff
@@ -84,10 +84,8 @@ This benchmark evaluates model performance on a balanced set of social media pos
 |--------------|---------|-------------|-------------|-------------|-----------------|
 | gpt-5 | 0.9532 | 0.9195 | 0.9096 | 0.9068 | 0.0339 |
 | gpt-5-mini | 0.9629 | 0.9321 | 0.9168 | 0.9149 | 0.0998 |
-| gpt-5-nano | 0.9600 | 0.9297 | 0.9216 | 0.9175 | 0.1078 |
 | gpt-4.1 | 0.9603 | 0.9312 | 0.9249 | 0.9192 | 0.0439 |
 | gpt-4.1-mini (default) | 0.9520 | 0.9180 | 0.9130 | 0.9049 | 0.0459 |
-| gpt-4.1-nano | 0.9502 | 0.9262 | 0.9094 | 0.9043 | 0.0379 |
 
 **Notes:**
 
````

docs/ref/checks/prompt_injection_detection.md

Lines changed: 0 additions & 4 deletions

````diff
@@ -115,10 +115,8 @@ This benchmark evaluates model performance on agent conversation traces:
 |---------------|---------|-------------|-------------|-------------|-----------------|
 | gpt-5 | 0.9931 | 0.9992 | 0.9992 | 0.9992 | 0.5845 |
 | gpt-5-mini | 0.9536 | 0.9951 | 0.9951 | 0.9951 | 0.0000 |
-| gpt-5-nano | 0.9283 | 0.9913 | 0.9913 | 0.9717 | 0.0350 |
 | gpt-4.1 | 0.9794 | 0.9973 | 0.9973 | 0.9973 | 0.0000 |
 | gpt-4.1-mini (default) | 0.9865 | 0.9986 | 0.9986 | 0.9986 | 0.0000 |
-| gpt-4.1-nano | 0.9142 | 0.9948 | 0.9948 | 0.9387 | 0.0000 |
 
 **Notes:**
 
@@ -130,12 +128,10 @@ This benchmark evaluates model performance on agent conversation traces:
 
 | Model | TTC P50 (ms) | TTC P95 (ms) |
 |---------------|--------------|--------------|
-| gpt-4.1-nano | 1,159 | 2,534 |
 | gpt-4.1-mini (default) | 1,481 | 2,563 |
 | gpt-4.1 | 1,742 | 2,296 |
 | gpt-5 | 3,994 | 6,654 |
 | gpt-5-mini | 5,895 | 9,031 |
-| gpt-5-nano | 5,911 | 10,134 |
 
 - **TTC P50**: Median time to completion (50% of requests complete within this time)
 - **TTC P95**: 95th percentile time to completion (95% of requests complete within this time)
````
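
For the TTC metrics quoted in these tables, a small TypeScript sketch of how P50/P95 figures are typically derived from raw completion times; the nearest-rank method and the sample values below are illustrative assumptions, not necessarily how the benchmark computes them.

```typescript
// Nearest-rank percentile over a list of time-to-completion samples (milliseconds).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // smallest value covering p% of samples
  return sorted[Math.min(rank, sorted.length) - 1];
}

// Made-up completion times (ms) for illustration only.
const ttcSamples = [740, 812, 901, 955, 1_020, 1_180, 1_305, 1_490, 2_250, 2_610];
console.log('TTC P50 (ms):', percentile(ttcSamples, 50)); // half of requests complete within this time
console.log('TTC P95 (ms):', percentile(ttcSamples, 95)); // 95% of requests complete within this time
```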

examples/basic/agents_sdk.ts

Lines changed: 1 addition & 1 deletion

````diff
@@ -34,7 +34,7 @@ const PIPELINE_CONFIG = {
     {
       name: 'Custom Prompt Check',
       config: {
-        model: 'gpt-4.1-nano-2025-04-14',
+        model: 'gpt-4.1-mini-2025-04-14',
         confidence_threshold: 0.7,
         system_prompt_details: 'Check if the text contains any math problems.',
       },
````
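
As a minimal sketch of the updated check entry in isolation: the field values match the hunk above, while the type names and the standalone constant are assumptions for illustration, not code from `examples/basic/agents_sdk.ts`.

```typescript
// Hypothetical types inferred from the fields visible in this diff.
interface CheckConfig {
  model: string;
  confidence_threshold: number;
  system_prompt_details: string;
}

interface CheckEntry {
  name: string;
  config: CheckConfig;
}

const customPromptCheck: CheckEntry = {
  name: 'Custom Prompt Check',
  config: {
    model: 'gpt-4.1-mini-2025-04-14', // was 'gpt-4.1-nano-2025-04-14' before this commit
    confidence_threshold: 0.7,
    system_prompt_details: 'Check if the text contains any math problems.',
  },
};
```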
