Tired of seeing your documents out of date? Don’t want to manually review them?
pdf-link-checker is a simple tool that parses a PDF document and checks for broken hyperlinks. It does this by sending simple HTTP requests to each link found in a given document.
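The idea behind the check can be sketched in a few lines of Python. This is a simplified illustration of the technique, not pdf-link-checker's actual implementation, and the URLs in the example are hypothetical stand-ins for links extracted from a document:

```python
# Minimal sketch of the core idea: send an HTTP request per link and
# flag responses that indicate a broken link. Illustration only, not
# pdf-link-checker's real code.
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

def is_broken(status):
    """Treat 4xx and 5xx HTTP status codes as broken links."""
    return status >= 400

def check_url(url, timeout=10):
    """Return (url, status); status is the HTTP code, or None on a network error."""
    try:
        req = Request(url, method="HEAD")
        with urlopen(req, timeout=timeout) as resp:
            return url, resp.status
    except HTTPError as err:
        return url, err.code
    except URLError:
        return url, None

if __name__ == "__main__":
    # Hypothetical URLs standing in for links found in a PDF
    for url in ["https://example.com/", "https://example.com/missing"]:
        url, status = check_url(url)
        print(url, "broken" if status is None or is_broken(status) else "ok")
```

A HEAD request is enough here because only the status code matters, not the page contents.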
- External references can be a very valuable part of your documents. Broken links reduce both their usefulness and the impression they make, and suggest that your documents are older and less maintained than they really are.
- Web sites evolve frequently. Having an automated way of detecting obsolete links is essential to keeping your documents up to date.
pdf-link-checker is free software (GNU GPLv2 license).
We are using pdf-link-checker to make sure that our Android, embedded Linux, and kernel training materials are always up to date. They contain references to useful resources on the Internet, but such resources can disappear or move to other locations. Our training materials are created from LaTeX source code, but instead of implementing a broken link checker for LaTeX, we preferred to develop a checker for the exported PDF file. This is a much more generic solution, which could interest billions of users!
pdf-link-checker can be used to check hyperlinks in most document formats. All you need is a utility that converts your document format to PDF while preserving hyperlinks. We recommend opening your documents with the excellent and free LibreOffice office suite (available for GNU/Linux, MacOS X and Windows), which makes it very easy to export to PDF. This way, you can use pdf-link-checker to find broken links in any text or presentation document, such as LibreOffice documents and slides, Microsoft Word (doc / docx) and PowerPoint (ppt / pptx) files, RTF documents and HTML pages.
Installing pdf-link-checker is very easy. First, we recommend installing the pip Python package installer if you don't have it yet:
- sudo apt-get install python-pip on Debian based systems (such as Ubuntu)
- sudo yum install python-pip on RPM based systems (Red Hat, Fedora, Suse, Mandriva…)
Then, installing pdf-link-checker along with its dependencies is easy:
$ pip install pdf-link-checker
Running pdf-link-checker is even easier:
$ pdf-link-checker my-awesome-doc.pdf
$ ./pdf-link-checker --help
Usage: pdf-link-checker [options] [PDF document files]

Reports broken hyperlinks in PDF documents

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -v, --verbose         display progress information
  -s, --status          store check status information in a .checked file
  -d, --debug           display debug information
  -t MAX_THREADS, --max-threads=MAX_THREADS
                        set the maximum number of parallel threads to create
  -r MAX_REQUESTS_PER_HOST, --max-requests-per-host=MAX_REQUESTS_PER_HOST
                        set the maximum number of parallel requests per host
  -x EXCLUDE_HOSTS, --exclude-hosts=EXCLUDE_HOSTS
                        ignore urls which host name belongs to the given list
  -m TIMEOUT, --timeout=TIMEOUT
                        set the timeout for the requests
  --check-url=CHECK_URL
                        checks given url instead of checking PDF (debug)
The -t (--max-threads) option specifies the maximum number of allowed threads (default: 100). To speed up the run, pdf-link-checker will launch several threads in order to check several links in parallel. This option allows you to set a limit on the number of threads.
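The effect of this limit can be sketched with Python's standard thread pool. This is an illustration of the bounded-parallelism idea, not the tool's internals; fake_check is a stand-in for the real HTTP request:

```python
# Sketch of bounded parallelism: at most max_threads checks run at once.
from concurrent.futures import ThreadPoolExecutor

MAX_THREADS = 100  # pdf-link-checker's default limit

def fake_check(url):
    # Placeholder for the real HTTP request; classify by name instead.
    return url, not url.endswith("/missing")

def check_all(urls, max_threads=MAX_THREADS):
    # The executor never creates more than max_threads worker threads,
    # so at most that many links are checked in parallel.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return dict(pool.map(fake_check, urls))

results = check_all(["https://example.com/", "https://example.com/missing"])
```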
The -r (--max-requests-per-host) option specifies the maximum number of allowed requests per host. Several URLs may belong to the same host, and since pdf-link-checker can check many URLs at the same time, you may want to set a limit on the number of parallel requests per host. Otherwise, some hosts may mistake the check for a DoS attack.
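One way such a per-host cap can work is to group URLs by host name and guard each host with its own semaphore. This is a sketch of the general technique, not pdf-link-checker's actual code; the limit value and helper names are hypothetical:

```python
# Sketch of a per-host request cap: each host gets its own semaphore,
# so no host ever sees more than MAX_REQUESTS_PER_HOST simultaneous
# requests. Illustration only.
import threading
from collections import defaultdict
from urllib.parse import urlparse

MAX_REQUESTS_PER_HOST = 5  # hypothetical limit

host_locks = defaultdict(lambda: threading.Semaphore(MAX_REQUESTS_PER_HOST))

def group_by_host(urls):
    """Group URLs by host name, as a per-host limiter would see them."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).hostname].append(url)
    return dict(groups)

def check_with_limit(url, do_request):
    """Run do_request(url) while holding the host's semaphore."""
    with host_locks[urlparse(url).hostname]:
        return do_request(url)

groups = group_by_host([
    "https://example.com/a",
    "https://example.com/b",
    "https://example.org/c",
])
```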
The -s (--status) option allows creating a .input-file.checked file in case no broken hyperlink was found. This allows scripts to skip running pdf-link-checker on documents which have already been validated.
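A build script could use the marker file along these lines. This is our own sketch of the skip logic described above, with hypothetical helper names that are not part of pdf-link-checker:

```python
# Sketch of skip logic built around the ".input-file.checked" marker
# convention. Helper names are hypothetical.
import os

def checked_marker(pdf_path):
    """Return the marker path for a PDF, e.g. doc.pdf -> .doc.pdf.checked."""
    directory, name = os.path.split(pdf_path)
    return os.path.join(directory, "." + name + ".checked")

def needs_check(pdf_path):
    """Re-check if there is no marker, or the PDF changed since the last check."""
    marker = checked_marker(pdf_path)
    return (not os.path.exists(marker)
            or os.path.getmtime(marker) < os.path.getmtime(pdf_path))
```

Comparing modification times means a regenerated PDF is checked again even if an old marker is still lying around.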
- pdf-link-checker won’t detect and check URLs which are not properly declared as hyperlinks.
- It doesn’t support checking internal links yet. This feature is on our todo list though.
- It doesn’t support checking links that require authentication yet. The plan is to ignore such URLs.
Getting help and helping out
Please use GitHub’s resources for reporting issues, asking questions, etc.
Patches and pull requests are welcome of course! Browse our Git repository and feel free to contribute!