Text Normalization — Noisy-Text Standardisation

Text normalization is an NLP preprocessing pipeline that converts noisy, abbreviated, or misspelled text (such as SMS messages, social-media posts, and OCR output) into a clean, standardised form. It is a common first step ahead of downstream NLP tasks, because inconsistent surface forms can degrade tokenisation, parsing, and classification. The task received systematic academic treatment in Baldwin and Li (2015) and Sproat and Jaitly (2017).
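As a rough illustration of what such a pipeline can look like, the sketch below chains four simple rule-based steps: Unicode folding, lowercasing, collapsing elongated letters, and a dictionary lookup for abbreviations. The `ABBREVIATIONS` table and the `normalize` function are hypothetical names invented for this example, and the approach is a deliberately small stand-in for the lexicon-based or neural systems described in the sources, not a reproduction of them.

```python
import re
import unicodedata

# Hypothetical lookup table for illustration only; a practical system would
# rely on a much larger lexicon or a learned model.
ABBREVIATIONS = {
    "u": "you",
    "r": "are",
    "gr8": "great",
    "2moro": "tomorrow",
    "pls": "please",
    "thx": "thanks",
}


def normalize(text: str) -> str:
    """Apply a few rule-based normalization steps to noisy text."""
    # Fold compatibility characters (fullwidth forms, ligatures) to canonical ones.
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    # Collapse a letter repeated three or more times down to two ("sooooo" -> "soo").
    # Digits and punctuation are left alone so numbers such as "1000" are preserved.
    text = re.sub(r"([^\W\d_])\1{2,}", r"\1\1", text)
    # Split into word tokens and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Replace known abbreviations; unknown tokens pass through unchanged.
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in tokens]
    # Rejoin with spaces, then pull punctuation back onto the preceding token.
    return re.sub(r"\s+([^\w\s])", r"\1", " ".join(tokens))


print(normalize("U r gr8, see u 2moro pls"))
```

Note that collapsing elongation only shortens a word such as "sooooo" to "soo"; mapping that to "so" would need a further spelling-correction step, which this sketch leaves out.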

Sources

  1. Baldwin, T. & Li, Y. (2015). An In-depth Analysis of the Effect of Text Normalization in Social Media. NAACL-HLT 2015.
  2. Sproat, R. & Jaitly, N. (2017). RNN Approaches to Text Normalization: A Challenge. arXiv:1611.00068.

ScholarGate. Text Normalization (Noisy-Text Standardisation). Retrieved 2026-06-04 from https://scholargate.app/en/text-mining/text-normalization