Molecular Representation and Descriptors
Computers need machine-readable encodings of molecules; line notations, chemical graphs, fingerprints, and numerical descriptors translate chemical structure into forms that can be stored, searched, and modeled.
Definition
The encodings and computed features that represent molecular structure digitally, ranging from canonical strings and graphs to fingerprint bit-vectors and numerical descriptors.
Scope
Covers the chemical-graph view of molecules, line notations such as SMILES and InChI, structural keys and hashed fingerprints, and the broad family of molecular descriptors that turn structure into numerical features for similarity and predictive modeling.
Core questions
- How are molecules represented as graphs and as canonical strings?
- How do structural keys, hashed fingerprints, and numerical descriptors differ?
- How is a unique, canonical identifier such as InChI generated?
- How does the choice of representation shape downstream searching and modeling?
Key theories
- Chemical graph and line notation
- Representing a molecule as a labeled graph of atoms and bonds, and serializing that graph into a compact line notation such as SMILES, together provide the basis for storage, exchange, and canonicalization.
- Descriptor and fingerprint encoding
- Transforming structure into fixed-length numerical descriptors or binary fingerprints enables quantitative comparison, similarity searching, and machine-learning models.
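The two ideas above can be illustrated with a minimal sketch in plain Python. It assumes a hand-built graph for ethanol (SMILES `CCO`) with implicit hydrogens omitted; the names `linear_paths`, `hashed_fingerprint`, and `simple_descriptors` are hypothetical helpers written for this illustration, not functions from any cheminformatics toolkit, and the CRC32 folding is a stand-in for whatever hash a real fingerprint implementation uses.

```python
import zlib
from collections import defaultdict

# Ethanol (SMILES: CCO) as a labeled graph, hydrogens left implicit.
# Nodes are atom indices labeled by element; edges are labeled by bond order.
atoms = {0: "C", 1: "C", 2: "O"}
bonds = {(0, 1): 1, (1, 2): 1}
BOND_SYMBOL = {1: "-", 2: "=", 3: "#"}


def adjacency(bonds):
    """Build a neighbor list {atom: [(neighbor, bond_order), ...]}."""
    adj = defaultdict(list)
    for (a, b), order in bonds.items():
        adj[a].append((b, order))
        adj[b].append((a, order))
    return adj


def linear_paths(atoms, bonds, max_atoms=3):
    """Enumerate labeled linear paths of up to max_atoms atoms."""
    adj = adjacency(bonds)
    found = set()

    def walk(path, tokens):
        # Keep the lexicographically smaller reading direction so that
        # a path and its reverse are recorded as the same fragment.
        found.add("".join(min(tokens, tokens[::-1])))
        if len(path) < max_atoms:
            for nbr, order in adj[path[-1]]:
                if nbr not in path:
                    walk(path + [nbr], tokens + (BOND_SYMBOL[order], atoms[nbr]))

    for start in atoms:
        walk([start], (atoms[start],))
    return found


def hashed_fingerprint(paths, n_bits=64):
    """Fold each path string onto a bit position; return the set of on-bits."""
    return {zlib.crc32(p.encode("utf-8")) % n_bits for p in paths}


def simple_descriptors(atoms, bonds):
    """A few numerical descriptors read directly off the graph."""
    return {
        "heavy_atoms": sum(1 for el in atoms.values() if el != "H"),
        "bonds": len(bonds),
        "heteroatoms": sum(1 for el in atoms.values() if el not in ("C", "H")),
    }


paths = linear_paths(atoms, bonds)
fingerprint = hashed_fingerprint(paths)
descriptors = simple_descriptors(atoms, bonds)
```

Here `paths` holds the fragments found in the graph, `fingerprint` is the set of bit positions those fragments hash to, and `descriptors` is a small feature dictionary. Because distinct fragments can hash to the same position, a hashed fingerprint may set fewer bits than there are fragments; production toolkits handle aromaticity, stereochemistry, and ring closure, which this sketch ignores.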
Clinical relevance
Robust molecular representations underpin cheminformatics workflows, from database deduplication and structure search to quantitative structure-activity relationship (QSAR) models that guide drug and materials discovery.
History
Building on early connection tables and Morgan's 1965 canonicalization algorithm, the field gained the SMILES notation in 1988 and later IUPAC's open InChI standard, alongside a proliferation of descriptors and fingerprints catalogued in dedicated reference handbooks.
Key figures
- David Weininger
- Roberto Todeschini
- Peter Willett
- Stephen Heller
Related topics
Seminal works
- weininger1988
- todeschini2009
Frequently asked questions
- What is the difference between SMILES and InChI?
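- SMILES is a compact, human-readable line notation in which one molecule can be written as many valid strings, and canonical SMILES forms can differ between toolkits. InChI is a standardized, algorithmically generated identifier designed to yield the same string for a given structure regardless of the software that produced it.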
- What is a molecular fingerprint?
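- It is a bit-vector in which each bit records the presence of a structural feature or fragment. Structural keys assign each bit to a predefined substructure, while hashed fingerprints map enumerated fragments to bit positions with a hash function. Comparing the sets of on-bits with a measure such as the Tanimoto coefficient gives a fast similarity score between molecules.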
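As a small illustration of a set-based similarity measure, the sketch below computes the Tanimoto (Jaccard) coefficient between two fingerprints stored as sets of on-bit indices. The bit values are made up for the example, and returning 1.0 for two empty fingerprints is a convention chosen here, not a universal rule.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    if not fp_a and not fp_b:
        return 1.0  # assumption: two empty fingerprints count as identical
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Hypothetical fingerprints, written as sets of on-bit positions.
fp_query = {1, 4, 7, 9}
fp_hit = {1, 4, 8, 9, 12}

similarity = tanimoto(fp_query, fp_hit)  # 3 shared bits out of 6 distinct bits
```

The score ranges from 0 (no bits in common) to 1 (identical bit sets), so a similarity search reduces to ranking database fingerprints by this value against the query.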