ScholarGate

Web Crawling and Link Structure

Web crawling is the automated process of discovering and downloading web pages by following hyperlinks, and the resulting link structure forms a graph that search systems both traverse and analyze.


Definition

Web crawling is the algorithmic traversal of the web: starting from a set of seed URLs, a crawler repeatedly fetches pages and extracts their outgoing links to discover further pages. Link structure refers to the directed graph whose nodes are pages and whose edges are the hyperlinks between them.

Scope

This topic covers how crawlers systematically fetch web pages and how the web's hyperlink graph is structured. It addresses crawler architecture, the URL frontier and politeness constraints, duplicate and near-duplicate detection, freshness and recrawl scheduling, and compliance with the robots exclusion protocol (robots.txt). It also covers empirical properties of the web graph, such as its bowtie macrostructure and heavy-tailed degree distributions, which inform both crawling and link analysis. The use of links for ranking is out of scope; it is treated under PageRank and HITS.
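
As an illustration of robots exclusion, the sketch below checks URLs against a robots.txt file using Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleBot` user agent, and the example.com URLs are hypothetical; the rules are parsed from an in-memory string so that the example needs no network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A polite crawler asks before every fetch.
allowed_public = parser.can_fetch("ExampleBot", "https://example.com/index.html")
allowed_private = parser.can_fetch("ExampleBot", "https://example.com/private/data.html")

# Crawl-delay is a non-standard but widely seen directive; None if absent.
delay_seconds = parser.crawl_delay("ExampleBot")
```

In a real crawler the rules would be fetched once per host and cached, and the delay would feed into the politeness policy of the URL frontier.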

Core questions

  • How does a crawler discover, prioritize, and schedule the pages it fetches?
  • How are politeness, robots exclusion, and server load respected during crawling?
  • How are duplicate and near-duplicate pages detected and handled?
  • How is crawl freshness maintained as pages change?
  • What large-scale structure does the web graph exhibit?
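
One common way to approach the near-duplicate question is to compare pages by their overlapping word sequences (shingles). The sketch below is a minimal version of that idea, not a production method: the shingle length `k`, the `threshold` value, and the function names are assumptions chosen for illustration, and large-scale systems typically replace the exact set comparison with compact sketches.

```python
from __future__ import annotations


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (as tuples) of a text."""
    words = text.lower().split()
    if len(words) < k:
        # Very short texts yield a single shingle, or none if empty.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(t1: str, t2: str, k: int = 3, threshold: float = 0.8) -> bool:
    """Treat two texts as near-duplicates if their similarity reaches threshold."""
    return jaccard(shingles(t1, k), shingles(t2, k)) >= threshold


# Two five-word texts that differ in the last word share 2 of 4 distinct shingles.
similarity = jaccard(
    shingles("the quick brown fox jumps"),
    shingles("the quick brown fox leaps"),
)
```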

Key concepts

  • web crawler / spider
  • URL frontier and seed set
  • crawl politeness and robots.txt
  • duplicate and near-duplicate detection
  • freshness and recrawl scheduling
  • the web graph
  • bowtie structure
  • in-degree and out-degree distributions
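
To make the degree concepts concrete, the following sketch tallies in-degree and out-degree for a toy web graph stored as an adjacency mapping. The `LINKS` data and the function name are hypothetical; a real study would compute the same counts over billions of pages and then examine the shape of the resulting histogram.

```python
from collections import Counter

# Hypothetical toy web graph: page -> pages it links to.
LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}


def degree_counts(links):
    """Return (in_degree, out_degree) dicts, counting each link once."""
    in_degree = Counter()
    out_degree = {}
    for page, targets in links.items():
        unique_targets = set(targets)
        out_degree[page] = len(unique_targets)
        in_degree[page] += 0  # ensure pages with no inlinks still appear
        for target in unique_targets:
            in_degree[target] += 1
            out_degree.setdefault(target, 0)  # target may have no outlinks
    return dict(in_degree), out_degree


in_degree, out_degree = degree_counts(LINKS)

# Histogram: how many pages have each in-degree value.
in_histogram = Counter(in_degree.values())
```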

Key theories

Crawler architecture and the URL frontier
A crawler maintains a frontier of URLs to fetch, applies prioritization and politeness policies, parses fetched pages to extract new links, and tracks visited pages, balancing coverage, freshness, and resource limits.
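
The loop described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the frontier is a plain FIFO queue (breadth-first order) with no prioritization or politeness, and `PAGES` and `fetch_links` are hypothetical stand-ins for real HTTP fetching and HTML link extraction.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> outgoing links found on that page.
PAGES = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}


def fetch_links(url):
    """Stand-in for fetching a page and parsing out its links."""
    return PAGES.get(url, [])


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were fetched."""
    frontier = deque(seeds)   # discovered but not yet fetched
    seen = set(seeds)         # every URL ever added to the frontier
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:   # only previously unseen URLs are queued
                seen.add(link)
                frontier.append(link)
    return order


crawl_order = crawl(["https://example.com/"], fetch_links)
```

The `max_pages` limit stands in for the resource limits mentioned above; swapping the queue for a priority queue is where prioritization policies would enter.
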
Macroscopic web graph structure
Empirical studies show that the web's link graph has a characteristic bowtie shape: a large strongly connected core, together with 'in' and 'out' components attached to it. In-degree follows a heavy-tailed distribution. Both properties constrain reachability and inform crawling strategy.
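
The bowtie components can be recovered from a graph with two reachability searches. The sketch below assumes that a node known to lie in the strongly connected core is given (`core_node`); the toy `GRAPH` and all names are hypothetical. Pages reachable from the core node and able to reach it form the core, pages that can only reach it form the 'in' component, and pages only reachable from it form the 'out' component.

```python
from collections import deque


def reachable(graph, start):
    """Set of nodes reachable from start by following edges (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(graph):
    """Graph with every edge direction flipped."""
    rev = {}
    for src, targets in graph.items():
        rev.setdefault(src, [])
        for dst in targets:
            rev.setdefault(dst, []).append(src)
    return rev


def bowtie(graph, core_node):
    """Split nodes into core / in / out / other relative to core_node."""
    forward = reachable(graph, core_node)             # core plus 'out'
    backward = reachable(reverse(graph), core_node)   # core plus 'in'
    core = forward & backward
    nodes = set(graph) | {d for targets in graph.values() for d in targets}
    return {
        "core": core,
        "in": backward - core,
        "out": forward - core,
        "other": nodes - forward - backward,  # tendrils and disconnected parts
    }


# Hypothetical toy graph: i1 feeds a 3-cycle core, which feeds o1.
GRAPH = {
    "i1": ["c1", "t1"],
    "c1": ["c2"],
    "c2": ["c3"],
    "c3": ["c1", "o1"],
    "x": ["y"],
}

parts = bowtie(GRAPH, "c1")
```

A crawl seeded at `c1` would fetch only the pages in `forward`, which is why pages in the 'in' component and in `other` stay undiscovered without additional seeds.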

Practical relevance

Crawling is the data-acquisition stage of every web search engine and of large-scale web analytics, archiving, and dataset construction. Understanding link structure guides efficient crawling, helps estimate coverage, and underpins the link-based authority measures used in ranking.

History

Web crawlers appeared in the early to mid-1990s, alongside the first web search indexes. In 1998, Cho, García-Molina, and Page studied efficient crawling through URL ordering, and the 2000 study 'Graph structure in the web' by Broder and colleagues revealed the web's bowtie macrostructure. As the web grew, crawling matured into a large-scale distributed-systems discipline emphasizing freshness, coverage, and politeness.

Key figures

  • Andrei Broder
  • Prabhakar Raghavan
  • Junghoo Cho
  • Hector García-Molina

Seminal works

  • broder2000
  • cho1998
  • manning2008

Frequently asked questions

What is the URL frontier in a crawler?
The URL frontier is the queue of URLs that have been discovered but not yet fetched. A crawler repeatedly selects URLs from the frontier according to its priority and politeness policies, fetches the pages, extracts their links, and adds any previously unseen URLs to the frontier.
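
One possible way to combine the frontier with a politeness policy is to keep a separate queue per host and never schedule two fetches to the same host closer together than a fixed delay. The sketch below is an assumption-laden illustration of that idea: the class name `PoliteFrontier`, the fixed `delay`, the simulated clock, and the example URLs are all hypothetical, and no real fetching or sleeping takes place.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """Frontier that spaces out fetches to the same host by a fixed delay."""

    def __init__(self, delay):
        self.delay = delay
        self.pending = {}       # host -> deque of URLs waiting to be fetched
        self.next_allowed = {}  # host -> earliest time of its next fetch
        self.heap = []          # (earliest time, host) for hosts with pending URLs
        self.seen = set()

    def add(self, url):
        if url in self.seen:
            return
        self.seen.add(url)
        host = urlsplit(url).netloc
        queue = self.pending.setdefault(host, deque())
        if not queue:  # host is not currently in the heap
            heapq.heappush(self.heap, (self.next_allowed.get(host, 0.0), host))
        queue.append(url)

    def pop(self, now):
        """Return (url, fetch_time) for the next fetch, or None if empty."""
        if not self.heap:
            return None
        ready_at, host = heapq.heappop(self.heap)
        fetch_time = max(now, ready_at)  # wait if the host is not ready yet
        url = self.pending[host].popleft()
        self.next_allowed[host] = fetch_time + self.delay
        if self.pending[host]:
            heapq.heappush(self.heap, (self.next_allowed[host], host))
        return url, fetch_time


frontier = PoliteFrontier(delay=2.0)
for u in ["https://a.example/1", "https://a.example/2", "https://b.example/1"]:
    frontier.add(u)

schedule = []
now = 0.0  # simulated clock, advanced to each fetch time
while (item := frontier.pop(now)) is not None:
    url, now = item
    schedule.append((url, now))
```

In this run the second URL on `a.example` is deferred until the delay has elapsed, while the URL on `b.example` is fetched in the meantime.
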
What does the 'bowtie' structure of the web mean?
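Large-scale studies found that the web graph has a large strongly connected core, an 'in' component of pages that can reach the core but are not reachable from it, an 'out' component of pages that are reachable from the core but do not link back into it, plus tendrils and disconnected parts. Drawn as a diagram, the result resembles a bowtie. This shape affects which pages a crawler can reach from given seeds: for example, a crawl seeded only in the core never discovers pages in the 'in' component.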
