Machine Learning · NLP · Production

Building Multilingual Address Parsing for BSES Delhi

August 20, 2024
6 min read

BSES needed addresses parsed accurately so field teams could route technicians without human triage. Legacy rule-based systems collapsed under free-form Hinglish, abbreviations, and missing punctuation.

Engineering a Robust Dataset

We curated 28k labeled addresses sourced from CRM exports, on-ground survey sheets, and public GIS registries. Every record was normalised, transliterated, and tagged for entities like building name, locality, and landmark.

  • IndicBERTv2-CRF model fine-tuned on Hindi, English, and code-mixed corpora
  • Character-level noise injection to mimic SMS typos
  • Custom evaluation harness that flagged regressions beyond ±1.5% F1
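
Two of the items above lend themselves to a short illustration. The following is a minimal sketch, not the code used in the project. The names `KEY_NEIGHBOURS`, `noisy_token`, `inject_noise`, and `flag_regressions` are hypothetical, the four typo operations are one plausible reading of "SMS typos", and the tolerance is assumed to mean 1.5 absolute F1 points per entity label. Noise is applied per token so the token count, and therefore the tag alignment, is preserved.

```python
import random
from typing import Dict, List

# Hypothetical keypad-neighbour map; a production table would cover every key.
KEY_NEIGHBOURS = {"a": "sq", "e": "wr", "i": "uo", "o": "ip", "n": "bm", "r": "et"}


def noisy_token(token: str, rate: float, rng: random.Random) -> str:
    """Corrupt each character with probability `rate` using a random typo op."""
    out = []
    i = 0
    while i < len(token):
        ch = token[i]
        if rng.random() >= rate:
            out.append(ch)  # keep the character unchanged
        else:
            op = rng.choice(("delete", "duplicate", "swap", "substitute"))
            if op == "duplicate":
                out.append(ch + ch)
            elif op == "swap" and i + 1 < len(token):
                out.append(token[i + 1] + ch)
                i += 1  # the next character was consumed by the swap
            elif op != "delete":
                # substitute; also the fallback when a swap has no right neighbour
                out.append(rng.choice(KEY_NEIGHBOURS.get(ch.lower(), ch)))
            # "delete" appends nothing
        i += 1
    return "".join(out)


def inject_noise(tokens: List[str], rate: float = 0.1, seed: int = 0) -> List[str]:
    """Return a noisy copy of `tokens` with the same length as the input."""
    rng = random.Random(seed)
    # Fall back to the clean token if every character was deleted.
    return [noisy_token(t, rate, rng) or t for t in tokens]


def flag_regressions(
    baseline: Dict[str, float], candidate: Dict[str, float], tolerance: float = 0.015
) -> Dict[str, float]:
    """Return the F1 delta for each label that moved by more than `tolerance`."""
    deltas = {label: candidate.get(label, 0.0) - f1 for label, f1 in baseline.items()}
    return {label: d for label, d in deltas.items() if abs(d) > tolerance}
```

Seeding the generator keeps augmented training sets reproducible between runs, which matters when the same harness is later used to compare a candidate model against a baseline.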

Making Accuracy Observable

Model quality is only as good as the signals you monitor. We built dashboards tracking precision and recall per district, latency against budget, and how often post-processing fallbacks fired. If confidence dipped below 0.82, we escalated the address to a human-in-the-loop review queue.
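
The escalation rule can be sketched as follows. This is an illustration under stated assumptions rather than the production logic: the article does not say how confidence was computed, so the sketch assumes each extracted entity carries its own score and takes the weakest one as the address-level confidence. `ParsedAddress`, `address_confidence`, and `route` are hypothetical names; only the 0.82 threshold comes from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple

REVIEW_THRESHOLD = 0.82  # threshold quoted in the article


@dataclass
class ParsedAddress:
    raw: str
    # (entity text, label, confidence); the per-entity score is an assumption
    entities: List[Tuple[str, str, float]]


def address_confidence(parsed: ParsedAddress) -> float:
    """Assumed aggregation: an address is only as reliable as its weakest entity."""
    if not parsed.entities:
        return 0.0  # nothing extracted, so treat as fully uncertain
    return min(conf for _, _, conf in parsed.entities)


def route(parsed: ParsedAddress) -> str:
    """Send low-confidence parses to the review queue, the rest straight through."""
    if address_confidence(parsed) < REVIEW_THRESHOLD:
        return "human_review"
    return "auto"
```

Taking the minimum is deliberately conservative: a single doubtful locality is enough to misroute a technician, so one weak entity sends the whole address to review. A mean or a calibrated sequence-level probability would escalate less often.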

The final pipeline achieved 94% F1 on the validation set, processed 60k requests per day with each request served in under 120 ms, and automatically synchronised corrections back into the training data. Operations now trust the parser enough to use it as the first touch point for customer tickets.