TeleAntiFraud-28k

TeleAntiFraud-28k is the first open-source audio-text slow-thinking dataset specifically designed for automated telecom fraud analysis. This dataset integrates audio signals with reasoning-oriented textual analysis, providing high-quality multimodal training data for telecom fraud detection research.

Dataset Overview

Total Samples: 28,511 rigorously processed speech-text pairs
Total Audio Duration: 307 hours
Unique Feature: Detailed annotations for fraud reasoning
Task Categories: Scenario classification, fraud detection, fraud type classification

Dataset Construction Strategies

1. Privacy-preserved Text-Truth Sample Generation

Using ASR-transcribed call recordings (with anonymized original audio)
Ensuring real-world consistency through TTS model regeneration
Strict adherence to privacy protection standards
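
As a rough illustration of the text-side anonymization step, the sketch below masks long digit runs (such as phone or card numbers) in an ASR transcript before it would be handed to a TTS model. The regular expression, the `[NUMBER]` placeholder, and the function name are assumptions made for this example; they are not taken from the project's actual pipeline, which is likely to need broader rules (names, addresses, and so on).

```python
import re

# Hypothetical rule: a run of 7 or more digits, optionally separated by single
# spaces or hyphens, is treated as a sensitive number.
SENSITIVE_NUMBER = re.compile(r"\d(?:[\s-]?\d){6,}")


def mask_sensitive_numbers(transcript: str, placeholder: str = "[NUMBER]") -> str:
    """Replace long digit sequences in an ASR transcript with a placeholder."""
    return SENSITIVE_NUMBER.sub(placeholder, transcript)


if __name__ == "__main__":
    print(mask_sensitive_numbers("Please call me back at 138-0013-8000 before 5 pm."))
```

Short numbers such as times or room numbers are left untouched, which keeps the regenerated speech natural while removing the most obvious identifiers.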

2. Semantic Enhancement

LLM-based self-instruction sampling on authentic ASR outputs
Expanding scenario coverage to improve model generalization
Enriching the diversity of conversational contexts

3. Multi-agent Adversarial Synthesis

Simulation of emerging fraud tactics
Generation through predefined communication scenarios and fraud typologies
Enhancing dataset adaptability to new fraud techniques
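
The turn-taking idea behind multi-agent synthesis can be sketched as a loop in which two role-playing agents alternate utterances. This is a minimal sketch under stated assumptions: the `Agent` class, the `simulate_dialogue` function, and the stubbed reply functions are hypothetical and stand in for LLM-backed agents; the repository's own implementation may be organized differently.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A reply function maps the dialogue history to the next utterance.
# In a real setup this would wrap an LLM call; here it is a stub.
History = List[Tuple[str, str]]
ReplyFn = Callable[[History], str]


@dataclass
class Agent:
    role: str
    reply: ReplyFn


def simulate_dialogue(caller: Agent, callee: Agent, max_turns: int = 4) -> History:
    """Alternate between two agents, starting with the caller."""
    history: History = []
    speakers = (caller, callee)
    for turn in range(max_turns):
        speaker = speakers[turn % 2]
        history.append((speaker.role, speaker.reply(history)))
    return history


if __name__ == "__main__":
    scammer = Agent("caller", lambda h: f"[scripted fraud line {len(h) // 2 + 1}]")
    user = Agent("callee", lambda h: f"[scripted response {len(h) // 2 + 1}]")
    for role, text in simulate_dialogue(scammer, user):
        print(f"{role}: {text}")
```

Swapping the stub lambdas for model-backed functions conditioned on a scenario and a fraud type would give one dialogue per (scenario, type) pair.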

TeleAntiFraud-Bench

We have constructed TeleAntiFraud-Bench, a standardized evaluation benchmark comprising proportionally sampled instances from TeleAntiFraud-28k, to facilitate systematic testing of model performance and reasoning capabilities on telecom fraud detection tasks.
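
Proportional sampling of this kind can be illustrated with a small stratified sampler that draws the same fraction from every label group. The function name, the `(sample_id, label)` record layout, and the 10% fraction below are assumptions chosen for the example, not the benchmark's actual construction script or split ratio.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Record = Tuple[str, str]  # (sample_id, label); a hypothetical layout


def proportional_sample(items: Sequence[Record], fraction: float, seed: int = 0) -> List[Record]:
    """Draw roughly `fraction` of the records from each label group."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    by_label: Dict[str, List[Record]] = defaultdict(list)
    for item in items:
        by_label[item[1]].append(item)
    sampled: List[Record] = []
    for label in sorted(by_label):
        group = by_label[label]
        k = max(1, round(len(group) * fraction))  # keep at least one per label
        sampled.extend(rng.sample(group, k))
    return sampled


if __name__ == "__main__":
    data = [(f"fraud_{i}", "fraud") for i in range(20)]
    data += [(f"normal_{i}", "normal") for i in range(80)]
    bench = proportional_sample(data, fraction=0.1)
    print(len(bench), "records sampled")
```

Because each group is sampled at the same rate, the label distribution of the subset mirrors that of the full set.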

Model Contribution

We contribute a production-optimized supervised fine-tuning (SFT) model based on Qwen2-Audio, trained on the TeleAntiFraud training set.

Examples

Explore our dataset examples to better understand the telecom fraud detection capabilities:

Case 1: Normal Conversation Analysis - Detailed analysis of a legitimate phone conversation
Case 2: Fraud Conversation Analysis - Step-by-step reasoning for detecting a fraudulent call
Evaluation Sample - Representative sample from our evaluation benchmark
Model Output: Normal Conversation - Our model's reasoning process on a legitimate call
Model Output: Fraud Detection - Model's analysis and detection of a fraudulent call

Multi-Agent Data Collection

To collect fraudulent conversation data:

Insert your SiliconFlow API key in multi-agents-tools/AntiFraudMatrix/main.py
Run the following command to generate fraudulent dialog text:
```
python multi-agents-tools/AntiFraudMatrix/main.py
```
Results will be saved in the result directory

For normal conversation data:

Use multi-agents-tools/AntiFraudMatrix-normal/main.py following the same process
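
If you would rather not keep a key in a source file, one common alternative is to read it from an environment variable. The helper below is a hedged sketch: the variable name `SILICONFLOW_API_KEY` and the function `load_api_key` are hypothetical, and the scripts would need a small edit to call it, since the repository as described expects the key to be written into main.py.

```python
import os


def load_api_key(env_var: str = "SILICONFLOW_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running main.py")
    return key
```

With this in place, the key can be exported once in the shell session instead of being committed alongside the code.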

Voice Synthesis with ChatTTS

To synthesize speech from the collected text:

Install the necessary dependencies

Run the API server:

fastapi dev ChatTTS/examples/api/main_new_new.py --host 0.0.0.0 --port 8006

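Choose one of the scripts matching ChatTTS/examples/api/normal_run*.sh or ChatTTS/examples/api/run*.sh

Modify the port in the chosen script if it differs from the server port (8006 in the command above), then run it, substituting the script's actual file name for the run*.sh pattern: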
```
bash ChatTTS/examples/api/run*.sh
```
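
Before launching a synthesis script, it can help to confirm that the API server is accepting connections. The check below is a minimal sketch using only the Python standard library; the default port mirrors the 8006 used in the server command above, and the function name is an assumption for illustration. It only tests that the TCP port is open, not that the ChatTTS endpoints respond correctly.

```python
import socket


def server_is_listening(host: str = "127.0.0.1", port: int = 8006, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    status = "up" if server_is_listening() else "not reachable"
    print(f"ChatTTS API server on port 8006 is {status}")
```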

Open-Source Resources

TeleAntiFraud-28k dataset
TeleAntiFraud-Bench evaluation benchmark
Data processing framework (supporting community-driven dataset expansion)
TeleAntiFraud-Qwen2-Audio SFT model

Key Contributions

Establishing a foundational framework for multimodal anti-fraud research
Addressing critical challenges in data privacy and scenario diversity
Providing high-quality training data for telecom fraud detection
Open-sourcing data processing tools to enable community collaboration

Citation

@inproceedings{Ma2025TeleAntiFraud28kAA,
  title={TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection},
  author={Zhiming Ma and Peidong Wang and Minhua Huang and Jingpeng Wang and Kai Wu and Xiangzhao Lv and Yachun Pang and Yin Yang and Wenjie Tang and Yuchen Kang},
  year={2025},
  url={https://api.semanticscholar.org/CorpusID:277467703}
}
