COOol

Checks LibreOffice / OpenOffice.org documents for bad Links

cOOol is a simple Python script that looks for broken hyperlinks in LibreOffice / OpenOffice.org documents.

  • cOOol only supports documents in the OpenDocument format.
  • cOOol is fast: it doesn’t start LibreOffice / OpenOffice.org and runs link checks in parallel threads.
  • cOOol supports most kinds of hyperlinks, including links within the documents.
  • cOOol is easy to use. Just download the script and run it!
  • cOOol is free. It is released under the terms of the GNU General Public License.
cOOol logo

Here is why an automatic link checker for your documents is useful:

  • External references can be a very valuable part of your documents. Broken links reduce their usefulness as well as the impression they make. They also give the feeling that your documents are outdated and older than they are.
  • Web sites evolve frequently. Having an automated way of detecting obsolete links is essential to keeping your documents up to date.
  • You may be much more familiar with your target websites than your readers. They may not be able to find a new location by themselves. You’d better be aware of the change and do this for them!
  • When you rename a page (for example), LibreOffice and OpenOffice.org don’t update all the references to it.

Usage

Usage: coool [options] [OpenOffice.org document files]

Checks OpenOffice.org documents for broken Links

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -v, --verbose         display progress information
  -d, --debug           display debug information
  -t MAX_THREADS, --max-threads=MAX_THREADS
                        set the maximum number of parallel threads to create
  -r MAX_REQUESTS_PER_HOST, --max-requests-per-host=MAX_REQUESTS_PER_HOST
                        set the maximum number of parallel requests per host
  -x EXCLUDE_HOSTS, --exclude-hosts=EXCLUDE_HOSTS
                        ignore urls which host name belongs to the given list

When a broken link is found, open the document in OpenOffice.org and use the search facility to look for the link text.

Configuration file

Rather than configuring cOOol from the command line, it is possible to
define the same settings in a ~/.cooolrc file.

Example:

# Configuration file for cOOol

verbose = True
exclude_hosts = "lxr.bootlin.com www.example.com"
max_threads = 200

You can see that configuration file settings have the same name as long
options, except that dash (-) characters are replaced by
underscores (_).

Usage through a proxy

cOOol can be used through a proxy. The Python classes it uses rely on standard Unix environment variables for proxy definition, as in the below bash example:

export http_proxy="proxy.server.com:8080"
export ftp_proxy="proxy.server.com:8080"

Prerequisites

You first need to install the configparse module.

Downloads

cOOol can be found in our odf-tools git tree.

Screenshot

cOOol screenshot

Implementation

cOOol parses the xml components of each document file, looking for hyperlinks.

It would have been cleaner and safer to use the OpenOffice.org API to explore the documents. However, there are also benefits in a standalone Python implementation:

  • No need to start OpenOffice.org and load documents in memory. This saves a lot of time and RAM!
  • No need to have an OpenOffice.org install. Nice if you need to implement a validation server using cOOol.
  • Last but not least, no need to understand OpenOffice.org’s API and the internal structure of documents! By the way, that’s what makes exchange formats like XML attractive. However, we would be delighted if somebody could come up with a simpler and safer implementation based on the API, that could be run within OpenOffice.org user interface!

Testing cOOol

We are using 2 documents to make sure that cOOol finds all the kinds of broken links it is supposed to support:

Limitations and possible improvements

  • cOOol doesn’t check for e-mail links. It could at least check that the corresponding domain is valid.
  • cOOol doesn’t give you page numbers for broken links. You have to open the document and use the search facilities to locate each link.
  • cOOol still crashes on some documents with Unicode strings (for example with Chinese text).
  • cOOol has trouble with link text containing quotes, as in what’s new. The text it outputs is truncated.

Author: Michael Opdenacker

Michael Opdenacker is the founder of Bootlin, and was its CEO until 2021. He is best known for all the free embedded Linux and kernel training materials that he created together with Thomas Petazzoni. He is always looking for ways to increase performance, reduce size and boot time, and to maximize Linux' world domination. More details...

Leave a Reply