Using LLM-Generated Data To Create A Roman Urdu Scam Call Detector

Authors

  • Sameed Irfan, Department of Computer Science, Bloomfield Hall College, Multan, Pakistan
  • Aswad Sheeraz, Department of Computer Science, Beaconhouse College Program, Multan, Pakistan
  • Muhammad Hasnain, Department of Computer Science, Beaconhouse College Program, Multan, Pakistan

Keywords

LLM, Scam Call Detection, Machine Learning, Training Models with Synthetic Data, Urdu Scam Call Detector

Abstract

Scam calls are on the rise, with global losses expected to exceed $1 trillion in 2024. Although machine learning has proven effective at countering scam calls, the dominant models still have significant shortcomings. Most can detect scam calls in only one language, and LLM-based solutions, though they can be multilingual, are impractical because of the resources they require. Furthermore, scam call tactics change constantly, so many models can become outdated. To address these challenges, this paper proposes a framework in which a model is trained on LLM-generated data, yielding a dataset that is multilingual and easy to update. To test the accuracy of the resulting models, a small dataset of human-written scam and non-scam call dialogues was used. The models were trained on synthetic data and tested on real-world scam call data, achieving, on average, over 90% accuracy and F1 score.
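The train-on-synthetic, test-on-human-written setup described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the classifier, features, or data format used, so a small bag-of-words Naive Bayes classifier stands in for the model, and the Roman Urdu sentences in `SYNTHETIC_TRAIN` and `HUMAN_TEST` are invented toy placeholders, not samples from the paper's datasets. Labels are 1 for scam and 0 for non-scam.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_class, best_score = None, -math.inf
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            score = math.log(self.doc_counts[c] / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # words unseen in training are ignored
                    count = self.word_counts[c][word] + 1
                    score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_class, best_score = c, score
        return best_class


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical stand-in for LLM-generated training dialogues (toy placeholders).
SYNTHETIC_TRAIN = [
    ("aap ka atm card block ho gaya hai apna pin batayein", 1),
    ("aap ne inaam jeeta hai pehle fees bhejein", 1),
    ("bank se bol raha hoon apna otp code batayein", 1),
    ("kal shaam ko khana khane aa jao", 0),
    ("meeting kal subah das baje hai", 0),
    ("ammi pooch rahi hain ghar kab aao ge", 0),
]

# Hypothetical stand-in for the human-written test dialogues (toy placeholders).
HUMAN_TEST = [
    ("aap ka account block ho gaya hai otp batayein", 1),
    ("aaj shaam ko ghar aa jao", 0),
]


def evaluate(train, test):
    """Train only on synthetic data, then score on held-out human-written data."""
    model = NaiveBayes().fit([t for t, _ in train], [y for _, y in train])
    y_true = [y for _, y in test]
    y_pred = [model.predict(t) for t, _ in test]
    return accuracy(y_true, y_pred), f1_score(y_true, y_pred)


if __name__ == "__main__":
    acc, f1 = evaluate(SYNTHETIC_TRAIN, HUMAN_TEST)
    print(f"accuracy={acc:.2f} f1={f1:.2f}")
```

The key design point is that the human-written data never enters `fit`: it serves only as the test set, so the reported accuracy and F1 score measure how well a model trained on synthetic data transfers to real dialogues. Updating the detector for new scam tactics would then only require regenerating `SYNTHETIC_TRAIN` and retraining.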

Published

2025-11-08