Data Description and Summary Statistics
Data description and summary statistics is the part of biostatistics concerned with organising, condensing, and presenting a body of observations so that its essential features can be grasped at a glance. Before any inference is attempted, investigators describe how the data are distributed, where they are centred, how widely they spread, and what shape they take, using numerical summaries and graphical displays.
Definition
Data description and summary statistics comprises the numerical and graphical methods used to characterise a dataset's central location, dispersion, distributional shape, and structure, prior to and independent of inferential generalisation to a population.
Scope
This area orients the reader to the descriptive side of biostatistics: descriptive statistics as a whole, the distribution and normality of data, measures of central tendency, measures of variability, and data visualisation. It is a reference overview of how health data are summarised, not a prescription for analysis or clinical action.
Sub-topics
Core questions
- Where is the centre of the data, and which measure of location best represents it?
- How much do the observations vary, and how is that spread quantified?
- What is the shape of the distribution, and is it approximately normal?
- How can the data be displayed so that their pattern, skew, and outliers are visible?
Key concepts
- Descriptive versus inferential statistics
- Measures of central tendency (mean, median, mode)
- Measures of variability (range, variance, standard deviation, interquartile range)
- Distributional shape, skewness, and kurtosis
- Normality and its assessment
- Graphical summaries (histograms, box plots, scatter plots)
- Exploratory data analysis
Mechanisms
Description proceeds by reducing many observations to a few informative quantities and pictures. A measure of location (mean, median, or mode) summarises where the data sit, and a measure of dispersion (standard deviation, interquartile range, or range) summarises how far they scatter around that location. The two are chosen as a pair to match the shape of the distribution: the median and interquartile range are preferred for skewed data, and the mean and standard deviation for roughly symmetric data. Graphical displays such as histograms and box plots reveal shape, skew, and outliers that single numbers can hide. Together these tools form the exploratory stage that precedes formal inference.
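The pairing of location and dispersion can be sketched in a few lines of Python. This is a minimal illustration using only the standard library's statistics module, not a prescribed method. The function name describe and the length-of-stay values are hypothetical, chosen only to give a small right-skewed sample. Quartiles are computed with the "inclusive" convention, and other software may use a different convention and return slightly different values.

```python
import statistics


def describe(values):
    """Return both common location/dispersion pairs for a list of numbers."""
    # Quartiles under the "inclusive" convention (one of several in use).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),      # sample SD (n - 1 denominator)
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "range": max(values) - min(values),
    }


# Hypothetical data: hospital length of stay in days, with one long stay.
stay_days = [2, 3, 3, 4, 4, 5, 6, 7, 9, 30]
summary = describe(stay_days)

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

For these made-up values the mean (7.3 days) lies above the third quartile (6.75 days) because of the single 30-day stay, whereas the median (4.5 days) and interquartile range (3.5 days) describe the bulk of the sample.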
Clinical relevance
Almost every clinical study, audit, and surveillance report opens with descriptive summaries of its participants and measurements, so understanding these summaries is fundamental to reading the health-sciences literature. This area describes how data are characterised and is intended as background for evidence appraisal, not as a basis for individual diagnostic or treatment decisions.
Epidemiology
Descriptive summary is the first analytic step in epidemiologic and clinical research, used to characterise study populations, baseline tables, and the distribution of exposures and outcomes before associations are estimated. The choice of summary measures and displays directly affects how transparently a study's data are communicated.
History
Numerical summarisation of data has deep roots in eighteenth- and nineteenth-century astronomy and vital statistics, but the modern descriptive toolkit was consolidated in the twentieth century. John Tukey's Exploratory Data Analysis (1977) reframed description as an investigative activity in its own right and popularised displays such as the box plot, while statistical educators in the health sciences subsequently codified the standard summaries now reported in medical journals.
Debates
- When should the mean and standard deviation give way to the median and interquartile range?
- Because the mean and standard deviation are pulled by skew and outliers, there is a long-standing recommendation to summarise non-normal data with the median and interquartile range; the practical threshold for switching depends on distribution shape and sample size.
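The sensitivity at the heart of this debate can be illustrated with a small sketch. It assumes a hypothetical sample (the values and variable names below are invented for illustration) and compares how much the mean and the median move when one extreme value is added.

```python
import statistics

# Hypothetical, roughly symmetric sample.
base = [4, 5, 5, 6, 6, 7, 7, 8]
# The same sample with a single extreme observation appended.
with_outlier = base + [60]

mean_shift = statistics.mean(with_outlier) - statistics.mean(base)
median_shift = statistics.median(with_outlier) - statistics.median(base)
sd_ratio = statistics.stdev(with_outlier) / statistics.stdev(base)

print(f"mean shift:   {mean_shift}")    # mean moves from 6 to 12
print(f"median shift: {median_shift}")  # median stays at 6
print(f"SD ratio:     {sd_ratio:.1f}")
```

In this toy example one added value doubles the mean and inflates the standard deviation many times over, while the median does not move. Real data are rarely this clear-cut, which is why no single switching rule is offered here.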
Key figures
- John W. Tukey
- William S. Cleveland
- Douglas G. Altman
- J. Martin Bland
Related topics
Seminal works
- tukey-1977
- gupta-2019
Frequently asked questions
- What is the difference between descriptive and inferential statistics?
- Descriptive statistics summarise and display the data actually collected, whereas inferential statistics use those data to draw generalisations about a wider population. Description comes first and makes no probabilistic claim beyond the sample at hand.
- Why describe data before running tests?
- Summaries and plots reveal the distribution's shape, spread, and any outliers or errors, which determine whether later analyses are appropriate and how their results should be interpreted.
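As one illustration of how description can surface a possible error before any test is run, the sketch below applies the fence convention commonly used for box plots (values more than 1.5 interquartile ranges beyond the quartiles are flagged). The function name tukey_fences, the multiplier default, and the blood-pressure readings are assumptions made for this example. A flagged value is a prompt to check the record, not proof of an error.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences at k IQRs beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical systolic blood pressure readings (mmHg); the last value
# looks like a data-entry slip (for example, an extra digit).
readings = [118, 122, 125, 127, 130, 131, 134, 138, 141, 1300]

lower, upper = tukey_fences(readings)
flagged = [x for x in readings if x < lower or x > upper]

print(f"fences: {lower} to {upper}")
print(f"flagged for review: {flagged}")
```

With these invented readings the fences fall at 108.25 and 154.25 mmHg, so only the value 1300 is flagged for review.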