Web Crawler (GitHub)

Web Crawler lets you crawl a website quickly and automatically to identify problems such as broken links and bad practices.

The crawler does not use regular expressions to find links, because that approach is unreliable. Instead, pages are parsed with AngleSharp, a parser that complies with the official W3C specification. This makes it possible to analyze pages the way a real browser does and to handle edge cases such as the <base> tag.
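
As an illustration, here is a minimal sketch of spec-compliant link extraction with AngleSharp. The sample markup, URLs, and class name are placeholders, not code from the project:

    using System;
    using AngleSharp.Html.Dom;
    using AngleSharp.Html.Parser;

    class BaseTagDemo
    {
        static void Main()
        {
            const string html = @"<html><head>
                <base href='https://example.com/docs/'>
              </head><body>
                <a href='guide.html'>Guide</a>
              </body></html>";

            // Spec-compliant parsing builds the same DOM a browser would.
            var document = new HtmlParser().ParseDocument(html);

            foreach (var element in document.QuerySelectorAll("a[href]"))
            {
                if (element is IHtmlAnchorElement anchor)
                {
                    // Href resolves the link like a browser does: the <base> tag
                    // is honored, so this should print
                    // https://example.com/docs/guide.html
                    Console.WriteLine(anchor.Href);
                }
            }
        }
    }

Because the DOM is fully built, relative URLs resolve through the document's base URI rather than through string manipulation.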

For HTML pages, URLs are extracted from the following attributes (a sketch of this extraction follows the list):

  • <a href="...">
  • <area href="...">
  • <audio src="...">
  • <iframe src="...">
  • <img src="..." srcset="...">
  • <link href="...">
  • <object data="...">
  • <script src="...">
  • <source src="..." srcset="...">
  • <track src="...">
  • <video src="..." poster="...">
  • <... style="..."> (see CSS section)
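
A table-driven pass over the parsed DOM can collect all of these in one loop. The selector table below is illustrative, not the project's actual list, and leaves out srcset and style, which need dedicated parsing:

    using System;
    using AngleSharp.Html.Parser;

    class AttributeExtraction
    {
        static void Main()
        {
            // (CSS selector, attribute to read) pairs mirroring the list above.
            var sources = new (string Selector, string Attribute)[]
            {
                ("a[href]", "href"),    ("area[href]", "href"), ("link[href]", "href"),
                ("img[src]", "src"),    ("script[src]", "src"), ("iframe[src]", "src"),
                ("audio[src]", "src"),  ("video[src]", "src"),  ("video[poster]", "poster"),
                ("source[src]", "src"), ("track[src]", "src"),  ("object[data]", "data"),
            };

            var document = new HtmlParser().ParseDocument(
                "<img src='logo.png'><script src='app.js'></script>");

            foreach (var (selector, attribute) in sources)
                foreach (var element in document.QuerySelectorAll(selector))
                    Console.WriteLine(element.GetAttribute(attribute));

            // srcset values are comma-separated candidate lists, and inline
            // style attributes go through the CSS path described below.
        }
    }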

For CSS files (and inline style attributes), URLs are extracted from the following declarations (see the sketch below):

  • any property whose value contains url(...)
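
Here is a minimal sketch of that extraction, assuming the companion AngleSharp.Css package. Its CssParser entry point is real, but the way declaration values are enumerated here is an assumption, not the project's code:

    using System;
    using AngleSharp.Css.Dom;
    using AngleSharp.Css.Parser;

    class CssUrlExtraction
    {
        static void Main()
        {
            const string css = "body { background: url('bg.png'); } h1 { color: red; }";

            // Parse the stylesheet into a rule tree instead of scanning with regex.
            var sheet = new CssParser().ParseStyleSheet(css);

            foreach (var rule in sheet.Rules)
            {
                if (rule is ICssStyleRule styleRule)
                {
                    foreach (var property in styleRule.Style)
                    {
                        // Assumption: the serialized value still contains url(...)
                        // for declarations that reference external resources.
                        if (property.Value.Contains("url("))
                            Console.WriteLine($"{property.Name} -> {property.Value}");
                    }
                }
            }
        }
    }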
