On Testing Website Links - benjamintoll.com

In revisiting some of my older posts, I noticed that some of the links were broken. Well, I can’t have that.

So, I wrote a tool that will crawl each article or directory of articles, gather all the links, and then make a HEAD request to each one. I wrote it in Go because of its concurrency model, implemented by goroutines, since the tool would be making tens, hundreds, perhaps thousands of HEAD requests over its little lifetime.

Before I go any further, I wanted to say that it’s not the greatest tool I’ve ever written, but it’s not too shabby, either. It accomplishes my main goals, which is good enough for now.

Here are the design goals:

- Should be a CLI tool and easy to run as a Git pre-commit hook and in CI pipelines.
- Should have a modicum of configuration. At the very least, the user should be able to provide:
    - Custom headers
    - Link regular expression
    - Skip pattern for links

The design goals have been met, it’s a working tool, and I’m happy. I’ll continue to revisit it, because I rarely write something from which I permanently move on. Since I’m always trying to get better at what I do, I’m sure I’ll come back to it and think to myself, “well, this certainly is a pile!”.

In the meantime, its working title is link-scanner, and it is up on my GitHub.

Here is the current usage:

$ ./link-scanner -h
Usage of ./link-scanner:
  -dir string
        Optional.  Searches every file in the directory for a match.  Non-recursive.
  -filename string
        Optional.  Takes precedence over directory searches.
  -filetype .html
        Only searches files of this type.  Include the period, i.e., .html (default ".md")
  -header string
        Optional.  Takes comma-delimited pairs of key:value
  -q    Optional.  Turns on quiet mode.
  -regex string
        Optional.  The search pattern (regex) used when gathering the links in an article. (default "(?:https?:\\/\\/[^<>].*\\.[^\\W\\s)\"<>]+[\\w\\.,$'%\\-/?=]*?)$")
  -skipPattern string
        Optional.  Will skip any gathered links matching this pattern. (default "\\.onion|example\\.com")
  -v    Optional.  Turns on verbose mode.

Since I wrote this primarily to shoehorn into daily use, let’s briefly look at the three ways I’ve worked it into my workflow, as a:

Binary
Git pre-commit hook
GitHub Action

Binary

Testing all the links in a particular file:

$ link-scanner -filename gpg.md

Testing all the links in a particular directory with custom headers:

$ link-scanner -dir content/post -header "User-Agent:Mozilla/5.0,Content-Type:application/json"
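
For the curious, here is one way the comma-delimited key:value string passed to -header could be turned into request headers. This is a sketch under my own assumptions, not the tool's actual parsing code: parseHeaders is a hypothetical helper, and the request is only built, never sent.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// parseHeaders turns "Key:Value,Key:Value" into an http.Header.
// Splitting on the first colon only lets values contain colons, but a value
// that itself contains a comma cannot be expressed in this format.
func parseHeaders(s string) (http.Header, error) {
	h := http.Header{}
	if strings.TrimSpace(s) == "" {
		return h, nil
	}
	for _, pair := range strings.Split(s, ",") {
		kv := strings.SplitN(pair, ":", 2)
		if len(kv) != 2 || strings.TrimSpace(kv[0]) == "" {
			return nil, fmt.Errorf("malformed header %q, want key:value", pair)
		}
		h.Add(strings.TrimSpace(kv[0]), strings.TrimSpace(kv[1]))
	}
	return h, nil
}

func main() {
	h, err := parseHeaders("User-Agent:Mozilla/5.0,Content-Type:application/json")
	if err != nil {
		fmt.Println("error:", err)
		return
	}

	// Build (but do not send) a HEAD request carrying the parsed headers.
	req, err := http.NewRequest(http.MethodHead, "https://example.com", nil)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	req.Header = h

	fmt.Println(req.Header.Get("User-Agent"))   // Mozilla/5.0
	fmt.Println(req.Header.Get("Content-Type")) // application/json
}
```

The comma limitation is worth knowing about, since many real User-Agent strings contain commas. A repeatable flag, one header per occurrence, would be one way around that.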

Git pre-commit hook

In my breathtaking article on how I use the Git pre-commit hook, I explicate its setup, so I won’t go over it now in full.

Briefly, you could uncomment one or more of the Git hooks specified in the install.sh script in the dotfiles repository, change directory to the location of your local repository, and then run that script.

Or, copy both the pre-commit runner and pre-commit.d/link-scanner.sh into the .git/hooks directory at the top level of your repository.

In the root of the repository:

$ cd ./.git/hooks
$ wget --no-clobber https://github.com/btoll/dotfiles/raw/master/git-hub/hooks/pre-commit
$ wget --no-clobber --directory-prefix pre-commit.d https://github.com/btoll/dotfiles/raw/master/git-hub/hooks/pre-commit.d/link-scanner.sh

The hook will run automatically on every git commit. To skip it for a particular commit, simply add the --no-verify option:

$ git commit --no-verify -am 'derpy'

GitHub Action

The GitHub Action is very straightforward and probably doesn’t need any explanation, as it’s immediately identifiable by its similarity to the obnoxious number of “IaC” cloud CI/CD tools that the cool kids are constantly yammering on about.

I used GitHub Actions because I just did a short stint at GitHub as a consultant and liked the experience. Also, although I’m not doing it here, I like that you can “bring your own” agent. I know of at least one other platform that does this (Buildkite), but since all of my repositories are hosted by GitHub, I’ll use GitHub Actions.

scan.yml

name: Link Scanner

on:
  push:
    branches:
      - master

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v2

      - name: Scan Links
        uses: docker://btoll/link-scanner:latest
        with:
          args: -dir content/post -v