Machine Translation

Machine translation is the automatic conversion of text or speech from one natural language into another. It is one of the oldest and most prominent applications of natural language processing.

Definition

Machine translation is the task of producing, for a sentence in a source language, an equivalent sentence in a target language. Systems may be rule-based, statistical, or neural, and their output is evaluated for adequacy (faithfulness to the source meaning) and fluency (naturalness in the target language).

Scope

This topic covers approaches to automatic translation: rule-based and interlingua systems, word- and phrase-based statistical machine translation with alignment models and language models, and the neural sequence-to-sequence paradigm. It also covers the central problems of word alignment, the trade-off between fluency and adequacy, and automatic evaluation with metrics such as BLEU. It addresses why translation is hard (ambiguity, divergence between languages, word order) and how quality is measured. General neural-network training methods belong to the machine-learning subfield.

Core questions

What makes translation difficult, given lexical ambiguity and structural divergence between languages?
How are word and phrase correspondences (alignments) learned from parallel text?
How do statistical and neural translation models trade off adequacy and fluency?
How is translation quality measured automatically and reliably?

Key concepts

source and target language
parallel corpora
word and phrase alignment
translation model and language model
statistical machine translation
neural sequence-to-sequence translation
adequacy and fluency
BLEU and automatic evaluation
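
Several of the concepts above fit together in the classical noisy-channel formulation of statistical MT. The following is a sketch in conventional notation, which is an assumption here and not taken from this page: f denotes the source sentence, e a candidate target sentence, and \hat{e} the chosen translation.

```latex
\hat{e}
  = \arg\max_{e} P(e \mid f)
  = \arg\max_{e} \frac{P(f \mid e)\, P(e)}{P(f)}
  = \arg\max_{e}\;
    \underbrace{P(f \mid e)}_{\text{translation model}}\;
    \underbrace{P(e)}_{\text{language model}}
```

The denominator P(f) can be dropped because it does not depend on e. Roughly, the translation model accounts for adequacy and the language model for fluency.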

Key theories

Statistical machine translation: Statistical MT treats translation as a search for the target sentence with the highest probability given the source sentence. That probability is decomposed into a translation model, learned from word and phrase alignments in parallel corpora, and a target-language model that rewards fluent output.
Word alignment: Word alignment is the task of learning, from parallel text, which source words correspond to which target words. The IBM alignment models address this task; the resulting alignments connect the two languages and support phrase extraction.
Automatic evaluation: Metrics such as BLEU compare system output against human reference translations by n-gram overlap. This enables rapid, repeatable evaluation, which drove progress in the field, although such metrics have known limitations relative to human judgment.
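
To make the word-alignment idea above concrete, here is a minimal sketch of expectation-maximization training in the style of IBM Model 1, the simplest of the IBM alignment models. The function names, the `<NULL>` token, the direction of the probability table (target word given source word), and the three-sentence toy corpus are all illustrative assumptions, not a reference implementation.

```python
from collections import defaultdict

NULL = "<NULL>"  # hypothetical placeholder for target words with no source counterpart


def train_ibm_model1(pairs, iterations=10):
    """Estimate t[(tgt_word, src_word)] = P(tgt_word | src_word) with EM.

    `pairs` is a list of (source_tokens, target_tokens) sentence pairs.
    """
    # Prepend the NULL word to every source sentence.
    pairs = [([NULL] + list(src), list(tgt)) for src, tgt in pairs]
    tgt_vocab = {wt for _, tgt in pairs for wt in tgt}

    # Uniform initialization over all word pairs that co-occur in some sentence pair.
    t = {(wt, ws): 1.0 / len(tgt_vocab)
         for src, tgt in pairs for wt in tgt for ws in src}

    for _ in range(iterations):
        count = defaultdict(float)  # expected number of (tgt_word, src_word) links
        total = defaultdict(float)  # expected number of links involving src_word
        # E-step: spread each target word over the source words of its sentence,
        # in proportion to the current translation probabilities.
        for src, tgt in pairs:
            for wt in tgt:
                norm = sum(t[(wt, ws)] for ws in src)
                for ws in src:
                    p = t[(wt, ws)] / norm
                    count[(wt, ws)] += p
                    total[ws] += p
        # M-step: renormalize the expected counts into probabilities.
        t = {pair: c / total[pair[1]] for pair, c in count.items()}
    return t


def best_translation(t, src_word):
    """Return the target word with the highest probability given src_word."""
    candidates = {wt: p for (wt, ws), p in t.items() if ws == src_word}
    return max(candidates, key=candidates.get)


# Toy German-English corpus, for illustration only.
corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
table = train_ibm_model1(corpus)
for word in ["das", "Haus", "Buch", "ein"]:
    print(word, "->", best_translation(table, word))
```

Model 1 ignores word order entirely, which keeps the E-step cheap. The later IBM models add position and fertility parameters on top of the same EM scheme.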

Practical relevance

Machine translation is among the most widely used AI technologies, integrated into search, communication, and content platforms, enabling cross-lingual access to information and powering tools for translators; its evaluation methodology also influenced evaluation across NLP.

History

Machine translation began with Weaver's 1949 memorandum and early rule-based systems, and weathered the skepticism that followed the 1966 ALPAC report. The field was then transformed twice: first by IBM's statistical models (Brown et al., 1993) and phrase-based SMT, and again by neural sequence-to-sequence and attention-based models from the mid-2010s. From 2002 onward, BLEU provided a standard metric for automatic evaluation.

Key figures

Peter F. Brown
Robert L. Mercer
Philipp Koehn
Kishore Papineni
Warren Weaver

Seminal works

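Brown et al. (1993), "The Mathematics of Statistical Machine Translation: Parameter Estimation"
Papineni et al. (2002), "BLEU: a Method for Automatic Evaluation of Machine Translation"
Koehn (2010), Statistical Machine Translation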

Frequently asked questions

What does the BLEU score measure? BLEU measures how closely a machine translation overlaps with one or more human reference translations, counted as matching word sequences (n-grams), and applies a penalty to output that is too short. It correlates reasonably well with human judgments and allows fast automatic comparison, but it does not fully capture meaning or fluency.
Why is machine translation considered hard? Languages differ in vocabulary, word order, morphology, and the distinctions they require, and individual words and sentences are often ambiguous. Producing a translation that is both faithful to the source meaning and natural in the target language requires resolving all of these issues at once.
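
The BLEU answer above can be illustrated with a small sketch: clipped n-gram precision against the references, combined by a geometric mean and multiplied by a brevity penalty. This is a simplified, unsmoothed, sentence-level version written for illustration. BLEU as originally defined is computed over a whole test corpus, and the function names and whitespace tokenization here are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed BLEU for one tokenized candidate against tokenized references."""
    if not candidate:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # For each n-gram, the most times it occurs in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, c in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], c)
        # Clipping: a candidate n-gram is credited at most max_ref[gram] times.
        clipped = sum(min(c, max_ref[gram]) for gram, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0:  # no matches (or candidate shorter than n): score is 0
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty, using the reference length closest to the candidate length
    # (ties broken in favor of the shorter reference).
    c_len = len(candidate)
    r_len = min((len(r) for r in references), key=lambda r: (abs(r - c_len), r))
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)

    # Geometric mean of the n-gram precisions with uniform weights.
    return bp * math.exp(sum(log_precisions) / max_n)


refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(sentence_bleu("the cat is on the mat".split(), refs))
print(sentence_bleu("the the the the the the the".split(), refs, max_n=1))
```

Clipping is what stops a degenerate output such as a repeated "the" from scoring well on unigram precision, and the brevity penalty stops very short outputs from scoring well on precision alone. Because any n-gram order with zero matches drives this unsmoothed score to 0, practical toolkits usually apply smoothing when scoring single sentences.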