Secondary and Big Data

Administrative records, open data, digital traces

Secondary data is information collected for purposes other than the current study, including administrative records and open datasets. Big data extends this to large-scale digital traces (transactions, sensor readings, and social media activity) that exceed traditional datasets in volume, velocity, and variety. These sources offer substantial advantages in scale and timeliness, but they also raise distinct challenges: data quality and provenance, representativeness, record linkage, and privacy and ethics.

Defining the Concept

Secondary data is information produced by another institution or individual for a purpose other than the researcher's current study. Administrative records (civil registration, health systems, education databases), official statistics, and archival documents are classic examples. Big data extends this concept to encompass vast data streams generated as by-products of digital systems: credit card transactions, location traces, social media activity, and IoT sensor feeds. What unites both concepts is that the data was created outside the researcher's direct control and typically without any specific research intention.

Main Types and How They Are Used

Secondary and big data sources fall into three main categories. Administrative records are by-products of institutional processes, such as hospital files, tax returns, and school enrollment data. Open data refers to structured datasets released to the public by governments, international organizations, or research centers. Digital traces range from web clickstreams and satellite imagery to mobile search logs and online reviews, and are typically unstructured or semi-structured. Researchers use these sources for archival analysis, record linkage studies, longitudinal time-series research, and large-scale descriptive profiling of populations that would be impractical to survey directly.

A Concrete Application Example

Consider a public health researcher examining the effect of COVID-19 vaccination on workplace absenteeism. The researcher links administrative health records containing vaccination dates with social security absenteeism data and openly published air quality measurements. This approach enables analysis covering hundreds of thousands of individuals without the cost of a primary survey. Following the same logic, economists test consumer behavior models using millions of anonymized bank transactions, and sociologists analyze social media streams to understand urban migration patterns. Primary data collection could not practically achieve observation at these scales.
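The linkage step in this example can be sketched in a few lines. The following is a minimal illustration, not the method of any particular study: it assumes both sources already share a pseudonymous identifier (here called person_id), and every field name and value is invented. Real administrative sources often lack a shared key and need probabilistic matching on names, birth dates, or addresses instead.

```python
from datetime import date

# Hypothetical extracts; field names and values are invented for illustration.
vaccinations = [
    {"person_id": "p01", "vaccinated_on": date(2021, 3, 1)},
    {"person_id": "p02", "vaccinated_on": date(2021, 4, 15)},
    {"person_id": "p04", "vaccinated_on": date(2021, 5, 20)},
]
absences = [
    {"person_id": "p01", "absent_on": date(2021, 2, 10)},
    {"person_id": "p01", "absent_on": date(2021, 6, 5)},
    {"person_id": "p02", "absent_on": date(2021, 1, 20)},
    {"person_id": "p03", "absent_on": date(2021, 7, 1)},
]


def link_records(vaccinations, absences):
    """Deterministic linkage: exact match on a shared person identifier."""
    # Index the health records by identifier for constant-time lookup.
    vaccinated_on = {row["person_id"]: row["vaccinated_on"] for row in vaccinations}
    linked, unmatched = [], []
    for row in absences:
        vax_date = vaccinated_on.get(row["person_id"])
        if vax_date is None:
            # Keep unmatched rows: who fails to link is itself informative.
            unmatched.append(row)
            continue
        linked.append({
            "person_id": row["person_id"],
            "absent_on": row["absent_on"],
            "after_vaccination": row["absent_on"] >= vax_date,
        })
    return linked, unmatched


linked, unmatched = link_records(vaccinations, absences)
match_rate = len(linked) / len(absences)
print(f"linked {len(linked)} of {len(absences)} absence records ({match_rate:.0%})")
```

Reporting the match rate and inspecting the unmatched rows matters because linkage failures are rarely random. If certain groups are less likely to link, the linked dataset inherits that bias.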

Common Pitfalls and Good Practice

The most common pitfall is failing to question how and why the data was originally produced. Administrative records are designed for managerial purposes and may contain systematic measurement errors that undermine research validity. Big data can misrepresent populations because of the digital divide: people without internet access or smartphones are systematically absent. Privacy risk increases sharply when different datasets are linked, because records anonymized separately can become re-identifiable when combined. Good practice requires provenance transparency (data dictionaries, production process documentation), representativeness testing, ethics board approval, and compliance with data protection law such as the GDPR and with institutional data-sharing agreements.
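The re-identification risk described above can be made concrete with a small check. The sketch below uses k-anonymity, a measure this text does not itself name, as one possible way to quantify the risk: k is the size of the smallest group of records sharing the same quasi-identifier values, and k = 1 means at least one person is unique. The datasets, field names, and the helper smallest_group are hypothetical.

```python
from collections import Counter


def smallest_group(rows, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values())


# Hypothetical pseudonymized extracts; all values are invented.
health = [
    {"pid": "a", "birth_year": 1980, "district": "North"},
    {"pid": "b", "birth_year": 1980, "district": "North"},
    {"pid": "c", "birth_year": 1975, "district": "South"},
    {"pid": "d", "birth_year": 1975, "district": "South"},
]
employment = [
    {"pid": "a", "occupation": "nurse"},
    {"pid": "b", "occupation": "teacher"},
    {"pid": "c", "occupation": "nurse"},
    {"pid": "d", "occupation": "teacher"},
]

# Link the two sources on the shared pseudonymous identifier.
occupation_of = {row["pid"]: row["occupation"] for row in employment}
linked = [{**row, "occupation": occupation_of[row["pid"]]} for row in health]

k_health = smallest_group(health, ["birth_year", "district"])
k_employment = smallest_group(employment, ["occupation"])
k_linked = smallest_group(linked, ["birth_year", "district", "occupation"])

print(k_health, k_employment, k_linked)
```

In this toy data every record shares its attributes with one other record in each source taken separately, but every record is unique once the attributes are combined. A check of this kind, run before releasing or sharing a linked file, is one input to a privacy review and not a substitute for it.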

Key terms

Secondary Data
Data originally collected for a different purpose and reused by the researcher.
Administrative Records
Systematic data files generated by governments or institutions during routine administrative operations.
Data Provenance
The documented chain describing how, by whom, and for what purpose data was produced.
Record Linkage
The process of matching records referring to the same individual across different datasets.
Digital Traces
Residual data passively generated when individuals interact with digital systems.