TeleAntiFraud-28k is the first open-source audio-text slow-thinking dataset specifically designed for automated telecom fraud analysis. This dataset integrates audio signals with reasoning-oriented textual analysis, providing high-quality multimodal training data for telecom fraud detection research.
- Total Samples: 28,511 rigorously processed speech-text pairs
- Total Audio Duration: 307 hours
- Unique Feature: Detailed annotations for fraud reasoning
- Task Categories: Scenario classification, fraud detection, fraud type classification
- Using ASR-transcribed call recordings (with anonymized original audio)
- Ensuring real-world consistency through TTS model regeneration
- Strict adherence to privacy protection standards
- LLM-based self-instruction sampling on authentic ASR outputs
- Expanding scenario coverage to improve model generalization
- Enriching the diversity of conversational contexts
- Simulation of emerging fraud tactics
- Generation through predefined communication scenarios and fraud typologies
- Enhancing dataset adaptability to new fraud techniques
We have constructed TeleAntiFraud-Bench, a standardized evaluation benchmark comprising proportionally sampled instances from TeleAntiFraud-28k, to facilitate systematic testing of model performance and reasoning capabilities on telecom fraud detection tasks.
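As an illustration of what "proportionally sampled" can mean in practice, the following is a minimal sketch of stratified sampling that keeps each category's share roughly unchanged. It is not the procedure used to build TeleAntiFraud-Bench; the function name `proportional_sample`, the `label` field, and the 10% fraction are assumptions made for the example.

```python
import random
from collections import defaultdict


def proportional_sample(records, key, fraction, seed=0):
    """Draw roughly `fraction` of the records from every group defined by `key`.

    `records` is a list of dicts; `key` names the field used for grouping
    (a hypothetical field such as "label"). Each group contributes at least
    one record so that rare categories are not dropped.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    sample = []
    for label in sorted(groups):  # sorted for a deterministic group order
        members = groups[label]
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Toy data: 80 normal calls and 20 fraud calls (made-up counts).
toy = [{"id": i, "label": "normal"} for i in range(80)]
toy += [{"id": 80 + i, "label": "fraud"} for i in range(20)]
bench = proportional_sample(toy, key="label", fraction=0.1)
```

With the toy data above, the sample keeps the 4:1 ratio of the full set: 8 normal records and 2 fraud records.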
We contribute a production-optimized supervised fine-tuning (SFT) model based on Qwen2-Audio, trained on the TeleAntiFraud training set.
Explore our dataset examples to better understand the telecom fraud detection capabilities:
- Case 1: Normal Conversation Analysis - Detailed analysis of a legitimate phone conversation
- Case 2: Fraud Conversation Analysis - Step-by-step reasoning for detecting a fraudulent call
- Evaluation Sample - Representative sample from our evaluation benchmark
- Model Output: Normal Conversation - Our model's reasoning process on a legitimate call
- Model Output: Fraud Detection - Our model's analysis and detection of a fraudulent call
To collect fraudulent conversation data:
- Insert your API key in `multi-agents-tools/AntiFraudMatrix/main.py` (uses a SiliconFlow API key)
- Run the following command to generate fraudulent dialog text: `python multi-agents-tools/AntiFraudMatrix/main.py`
- Results will be saved in the `result` directory
For normal conversation data:
- Use `multi-agents-tools/AntiFraudMatrix-normal/main.py`, following the same process
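The two generation steps can also be driven from a small wrapper. The sketch below runs one of the generator scripts and reports which files appeared afterwards. It assumes that the API key has already been inserted in the script, that the `result` directory is relative to the current working directory, and that the normal-conversation generator writes there as well; the helper names `list_results` and `generate` are hypothetical and not part of the repository.

```python
import subprocess
import sys
from pathlib import Path

# Script paths as given in the steps above.
GENERATORS = {
    "fraud": "multi-agents-tools/AntiFraudMatrix/main.py",
    "normal": "multi-agents-tools/AntiFraudMatrix-normal/main.py",
}


def list_results(result_dir):
    """Return every file under `result_dir`, or an empty list if it is missing."""
    root = Path(result_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*") if p.is_file())


def generate(kind, result_dir="result"):
    """Run one generator and return the files it added to `result_dir`.

    Assumption: output lands in `result_dir` relative to the working directory.
    """
    before = set(list_results(result_dir))
    subprocess.run([sys.executable, GENERATORS[kind]], check=True)
    return [p for p in list_results(result_dir) if p not in before]
```

Calling `generate("fraud")` from the repository root would then run the fraud generator and return the newly written files, if the assumptions above hold.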
To synthesize speech from the collected text:
- Install the necessary dependencies
- Run the API server: `fastapi dev ChatTTS/examples/api/main_new_new.py --host 0.0.0.0 --port 8006`
- Use any of the scripts matching `ChatTTS/examples/api/normal_run*.sh` or `ChatTTS/examples/api/run*.sh`. Modify the port in these scripts if needed, then run: `bash ChatTTS/examples/api/run*.sh`
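When a glob such as `run*.sh` matches more than one file, `bash` executes only the first match and passes the rest to it as arguments. The sketch below expands the pattern explicitly and runs the scripts one at a time, after checking that something is listening on the API port. Running them sequentially is an assumption, as are the helper names `server_is_up`, `find_scripts`, and `run_scripts`; none of them come from the repository.

```python
import glob
import socket
import subprocess


def server_is_up(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def find_scripts(pattern):
    """Expand a glob pattern into a sorted list of script paths."""
    return sorted(glob.glob(pattern))


def run_scripts(scripts, dry_run=False):
    """Run each script with bash, one after another.

    Returns the list of commands; with dry_run=True nothing is executed.
    """
    commands = [["bash", script] for script in scripts]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # stop at the first failing script
    return commands
```

A typical call, assuming the server was started on port 8006 as above, would first check `server_is_up("127.0.0.1", 8006)` and then call `run_scripts(find_scripts("ChatTTS/examples/api/run*.sh"))`.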
- TeleAntiFraud-28k dataset
- TeleAntiFraud-Bench evaluation benchmark
- Data processing framework (supporting community-driven dataset expansion)
- TeleAntiFraud-Qwen2-Audio SFT model
- Establishing a foundational framework for multimodal anti-fraud research
- Addressing critical challenges in data privacy and scenario diversity
- Providing high-quality training data for telecom fraud detection
- Open-sourcing data processing tools to enable community collaboration
@inproceedings{Ma2025TeleAntiFraud28kAA,
title={TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection},
author={Zhiming Ma and Peidong Wang and Minhua Huang and Jingpeng Wang and Kai Wu and Xiangzhao Lv and Yachun Pang and Yin Yang and Wenjie Tang and Yuchen Kang},
year={2025},
url={https://api.semanticscholar.org/CorpusID:277467703}
}