Semistructured and Document Models
Semistructured and document data models represent data as self-describing, irregularly structured trees or nested objects, as in XML and JSON, where structure is carried with the data rather than fixed by a rigid schema.
Definition
Semistructured data is data that has some organizational structure but does not conform to a fixed schema, typically modeled as labeled trees or nested key-value objects; document models store such data as self-contained documents (commonly JSON or XML) rather than as rows in fixed tables.
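As a minimal sketch of this definition, the Python snippet below parses two JSON documents that share a collection but not a structure, then lists the label paths each one carries. The documents, their field names (`name`, `emails`, `address`), and the helper `field_paths` are hypothetical and chosen only for illustration.

```python
import json

# Two hypothetical documents in the same collection; field names are illustrative.
raw_docs = [
    '{"name": "Ada", "emails": ["ada@example.org", "ada@work.example"]}',
    '{"name": "Lin", "address": {"city": "Oslo", "zip": "0150"}}',
]
docs = [json.loads(text) for text in raw_docs]


def field_paths(value, prefix=""):
    """Return the set of dotted label paths present in one parsed document."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= field_paths(child, path)
    elif isinstance(value, list):
        # Repeated field: every element contributes under the same label path.
        for child in value:
            paths |= field_paths(child, prefix)
    return paths


# Each document describes its own structure; no shared schema is consulted.
structures = [field_paths(doc) for doc in docs]
shared = structures[0] & structures[1]

print(structures)
print("fields present in both documents:", shared)
```

The structure is recovered from the documents themselves, which is what "self-describing" means here: the first document has a repeated `emails` field, the second a nested `address`, and only `name` is common to both.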
Scope
This topic covers data models that relax the relational requirement of a uniform schema: tree- and graph-shaped semistructured data, XML with its DTDs and schemas, and JSON-based document models used by document stores. It treats nesting, optional and repeated fields, schema flexibility, and the path- and tree-oriented query languages (such as XPath and XQuery) that operate over them. It excludes the broader engineering of NoSQL systems and consistency models, which are covered in the big-data and NoSQL area.
Core questions
- How does self-describing, schema-flexible data differ from rigid relational tables?
- How are XML and JSON used to represent nested and irregular data?
- What role do optional schemas (DTDs, XML Schema, JSON Schema) play?
- How do path and tree query languages such as XPath and XQuery navigate the data?
- What are the trade-offs of document models versus the relational model?
Key concepts
- semistructured (tree/graph) data
- XML and DTD/XML Schema
- JSON and document stores
- nested and repeated fields
- schema-on-read versus schema-on-write
- XPath and XQuery
- self-describing data
- schema evolution
Key theories
- Self-describing tree-structured data
- Semistructured data is modeled as labeled trees or graphs in which structure is encoded alongside values, allowing missing, optional, and heterogeneous fields without a predefined schema.
- Schema flexibility versus schema enforcement
- Document and semistructured models trade the integrity and query guarantees of a fixed schema for flexibility and ease of evolution, optionally validating against schemas such as XML Schema or JSON Schema when stronger guarantees are needed.
- Path-based querying
- Languages such as XPath and XQuery select and transform parts of tree-structured documents by navigating paths and patterns, providing a query model suited to nested, irregular data.
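Path-based querying can be sketched with Python's standard `xml.etree.ElementTree` module, which implements only a limited subset of XPath and no XQuery. The catalog document, its element names, and its values below are hypothetical; the example shows path navigation over a tree in which one `book` lacks the `year` attribute and the `author` children that the other has.

```python
import xml.etree.ElementTree as ET

# Hypothetical, irregular document: the second book has no year and no authors.
xml_text = """
<catalog>
  <book year="2000">
    <title>Book A</title>
    <author>First Author</author>
    <author>Second Author</author>
  </book>
  <book>
    <title>Book B</title>
  </book>
</catalog>
""".strip()

root = ET.fromstring(xml_text)

# Child-axis path: the title of every book directly under the root.
titles = [t.text for t in root.findall("./book/title")]

# Predicate on an optional attribute: only books that carry a year.
dated = [b.findtext("title") for b in root.findall("./book[@year]")]

# Descendant axis: every author anywhere in the tree (a repeated field).
authors = [a.text for a in root.findall(".//author")]

print(titles)
print(dated)
print(authors)
```

Missing parts of the tree do not cause errors: a path simply matches fewer nodes, which is why path languages suit data with optional and repeated fields.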
Practical relevance
Semistructured and document models underpin web data interchange and modern application development: XML and JSON are among the most widely used formats for APIs, configuration, and messaging, and document databases store flexible, evolving data for web, mobile, and content-management systems where rigid relational schemas would be cumbersome.
History
Semistructured data emerged as a research topic in the 1990s to describe heterogeneous web and integration data that did not fit fixed schemas. XML became a W3C Recommendation in 1998; its companion query languages followed, with XPath 1.0 standardized in 1999 and XQuery 1.0 in 2007. JSON later became the lightweight de facto format for web APIs, and document databases popularized storing JSON documents directly, reviving and extending the semistructured tradition.
Key figures
- Serge Abiteboul
- Peter Buneman
- Dan Suciu
Related topics
Seminal works
- abiteboul2000
- garciamolina2008
Frequently asked questions
- Is a document model the same as having no schema?
- Not exactly. Document models are schema-flexible rather than schema-free: individual documents carry their own structure, and optional schemas (such as JSON Schema or XML Schema) can be applied for validation. The difference from the relational model is that structure is not required to be uniform across all records.
- When are document models preferable to relational tables?
- Document models fit data that is naturally nested, heterogeneous, or rapidly evolving (such as user profiles, catalog entries, or logged events), where forcing a uniform table schema would be awkward. Relational models remain preferable when data is regular and when strong multi-record integrity or complex joins are needed.
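The "schema-flexible rather than schema-free" point can be illustrated with a small sketch. The checker below is a hand-written, simplified stand-in for a real schema language; it is not JSON Schema or XML Schema, and `PROFILE_SCHEMA`, `validate`, and the sample documents are hypothetical names introduced only for this example.

```python
# Hypothetical, simplified schema: required and optional fields with expected types.
# This is an illustrative stand-in, not the JSON Schema vocabulary.
PROFILE_SCHEMA = {
    "required": {"name": str},
    "optional": {"emails": list, "address": dict},
}


def validate(doc, schema):
    """Return a list of problems; an empty list means the document conforms."""
    problems = []
    for field, expected in schema["required"].items():
        if field not in doc:
            problems.append(f"missing required field: {field}")
        elif not isinstance(doc[field], expected):
            problems.append(f"wrong type for {field}")
    for field, expected in schema["optional"].items():
        # Optional fields are checked only when they are present.
        if field in doc and not isinstance(doc[field], expected):
            problems.append(f"wrong type for {field}")
    return problems


docs = [
    {"name": "Ada", "emails": ["ada@example.org"]},
    {"name": "Lin", "address": {"city": "Oslo"}, "nickname": "L"},  # extra field
    {"emails": "not-a-list"},  # missing name, wrong type for emails
]

reports = [validate(doc, PROFILE_SCHEMA) for doc in docs]
for doc, problems in zip(docs, reports):
    print(doc, "->", problems or "ok")
```

The first two documents differ in shape yet both pass, because the schema constrains only what it names; the third is rejected. A store could run such a check at write time for stronger guarantees or defer it to read time, which is the schema-on-write versus schema-on-read choice.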