Validating Links in Web Pages - Notes

Your task

Your task is to write a tool for carrying out a basic check on the validity of links within a tree of web pages.

Introduction

Once created, a web site will typically contain lots of links: both internal (links within the web site itself) and external (links to other websites). Maintaining all these links is an ongoing problem. Even internally pages can move or get deleted, leaving dangling links (links to pages which no longer exist). Your task is to read in a file containing data extracted from a tree of web pages, to store the information in that file in a pre-defined data structure, and then to write a tool which examines the stored data to find dangling links. Your tool will help the human responsible for maintaining the web site to ensure that all links lead somewhere.

Rather than dealing with the full complexity of link syntax and semantics, the exercise deals simply with the local `path' component of the link.

Aims of the assignment

The main aim of the assignment is to give you practice at implementing Lisp functionality according to a detailed specification of an API (Application Programmers Interface).

The experience gained through this assignment will be particularly valuable for those who have not programmed to a detailed specification before.

General instructions

In the accompanying notes you will find both a problem description, and a detailed API specification. Read these BEFORE you attempt any coding.

You will be provided with some tools to help you in your task. You must take time to understand what they do and how to use them before you start to write any code yourself, or your first attempts will probably be wrong. If you have difficulty understanding them, seek help before you continue.

We have provided compiled versions of the Lisp tools you will need to use in your own solutions, plus compiled versions of all the functions you are asked to write yourself. See the instructions provided in the assignment directory and make yourself familiar with the directory contents before you continue.

When you are ready to start programming, take a copy of the template file provided. Edit it so that its header contains your name, etc. Load in the compiled file provided, and you can start. Don't forget to take frequent backup copies of your file!

You may attempt the exercise in any order though your solutions should be presented in the order specified in this document. When you have finished defining a function, your own version will shadow the version provided; you can therefore test it in the normal way.

Don't forget that the compiler will provide useful feedback for you in the form of its error messages and warnings; compile your function when you think you have completed it, and listen to what the compiler tells you. Marks will be deducted if your file either fails to load or fails to compile (or compiles with many warnings).

Remember that all the tasks are worth equal marks. Some of the tasks are more difficult than others; if you find a particular task very difficult, therefore, you should move on to other tasks and attempt to complete the majority of the tasks.

Should you find a particular function difficult and wish to revert to the version provided, you will need to reload the version provided to you. As you complete a function, add its name to the list of completed functions at the start of your file, as shown in the example in the template file.

Follow the enclosed specification exactly. Your mark will depend on how closely you follow it. Imagine you are part of a project team and that your program has to integrate with the programs written by the rest of the team; it is of critical importance that you call their programs correctly, and that your program can be used exactly as documented in the specification.

Ensure that you comment your code appropriately. Note that your source code and comments should be readable on any standard terminal (or printed in plain text on standard size paper), so lines should be no more than 78 characters long.

Your solution will be tested using an automatic test harness written in Lisp. You will be marked down if you do not follow the specifications exactly. For example, if you are asked to write a function called ROUTINE call it just that, not MY-ROUTINE; if you are asked to return an integer, return an integer and not a float; if you are asked to signal an error, signal an error rather than returning an error string, and so on.

When you are ready to submit, check that the header of your source file is correct. Ensure that your file contains only Lisp source code and comments. Comment out any test data. If you wish to submit a partial solution for any function, you may do so. You must comment it to state that it is not fully working. If it will not load or compile you must submit it as a comment, and if you wish it to count towards your mark you should include a comment at the start of your file as shown in the template source file provided stating that you wish it to be considered.

Problem description

In this section you will find a description of the input file format, of the data structures you will use in the exercise, and so on. Make sure you read it before you start programming. We will begin with a high-level description of the problem.

A simple Web site will often consist of a directory tree containing a number (often a large number) of HTML documents within it. Each of these documents may contain many links.

Checking over the tree to ensure that each link leads somewhere - that there are no dangling pointers - is a time-consuming and boring task. However, it is an important one, as finding a dangling link gives a very poor impression of a site to anyone browsing it. It is therefore your task to write a tool which will read in data extracted from a tree of web pages, using the API functions provided, and create a list of all the dangling links which appear in the tree. For the purposes of the exercise we will simplify the world of the World Wide Web, by making certain assumptions.

Links in an HTML file appear as URLs - Universal Resource Locators. The full syntax and semantics of a URL is quite complicated. For our purposes, however, we will adopt a very simplified version. We will also assume that all links are internal links (i.e. links to our own web site).

The players

This section gives a brief overview of the types of objects that exist in the exercise, without going into details of representation and algorithm. Further details can be found in the following sections.

Input

The input to the program is sequential data representing web pages. For each page, the input data specifies: the pathname of the page, and the pathnames of all other pages to which links exist from that web page.

URLPATHs

URLPATHs are data which store the elements of input pathnames, and are manipulated by the program. Each URLPATH contains two elements: the name of a pathname, and a specification of the directory of the pathname.

So, if a pathname appears in the input data as "/tmp/wibble" (the file "wibble" in the "/tmp" directory), then a URLPATH created from this pathname by the program will have "wibble" as its name, and a directory specification representing the "/tmp" directory.

PAGE-ATTRIBUTES

PAGE-ATTRIBUTEs are data created and manipulated by the program which store information about each of the web pages that appear in the input data.

Each PAGE-ATTRIBUTE contains two elements: a URLPATH containing information about the pathname of the web page, and a number of links, each of which is stored as a URLPATH containing information about the pathname of one of the links which appeared in the input data.

URLPATH trees

URLPATH trees are created and manipulated by the program. They are used to store URLPATHs.

URLPATH trees are built using two node types: directory nodes and file nodes. Directory nodes have children; a child directory node of a parent directory node represents a subdirectory of the directory represented by the parent directory node. When a URLPATH is stored in a URLPATH tree, it will be stored as a file node, at a place in the tree specified by the directory specification component of the URLPATH.

So if a URLPATH has been created to store the pathname "/usr/bin/foo", and it is stored in a URLPATH tree, then a file node called "foo" will be created. If there is not already a "usr" directory node as an immediate child of the root of the URLPATH tree, then one will be created; if there is not already a "bin" directory node as an immediate child of the "usr" directory node, one will be created. The new "foo" file node will be stored in the "bin" directory node which is a child of the "usr" directory node which is an immediate child of the root of the URLPATH tree.

While directory nodes have children, file nodes have values. For the purposes of this assignment, the value of a file node (which will be created to represent a web page file) will be a PAGE-DESCRIPTION in which is stored the input data about that web page.

Input file syntax

You will be provided with sample data. Assume that it has been generated automatically from a tree of web pages. Each datum in the file is a page description which represents a single html file, as follows:

(filename link1 link2 ... linkn)

For example, here:

("/grib/wibble/index.html" "/grib/wibble/foo.html" "/grib/wibble/baz.html")

is a page description for a file called

/grib/wibble/index.html which contains links to /grib/wibble/foo.html and /grib/wibble/baz.html

URLPATH syntax

A URLPATH type has been defined for you; use this to represent a URL. You have been provided with functions for creating and accessing these. URLPATHs are a distinct type -- you can test for something being a URLPATH with TYPEP for instance.

A URLPATH contains two elements:

A directory specification is a list whose fist element is a type and the rest of whose elements are either strings naming directory components, or the symbol :BACK. This tail of a directory specification is referred to informally as a `dirlist'. Directory types can be :ABSOLUTE, or :RELATIVE. A :RELATIVE directory specification describes how to find the directory in relative terms; e.g. From the current directory, look in subdirectory "foo"; so, if the current directory is

(:ABSOLUTE "tmp") ; e.g. "/tmp/"

a specification of

(:RELATIVE "foo") ; e.g. "foo/"

will mean the same directory as

(:ABSOLUTE "tmp" "foo") ; e.g. "/tmp/foo/"

A dirlist can contain the keyword :BACK; this means to go back up the tree. For example, the directory specification

(:ABSOLUTE "foo" :BACK "bar") ; e.g. "/foo/../bar/"

is the same as the directory specification

(:ABSOLUTE "bar") ; e.g. "/bar/"

URLPATHs are implemented as Lisp structures.

PAGE-ATTRIBUTES Syntax

You will be asked to store the data about a page (taken from the page description) using a pre-defined type PAGE-ATTRIBUTES. This is implemented as a Lisp structure, e.g.

(defstruct page-attributes urlpath links)

The value of the URLPATH slot should be an object of type URLPATH, representing the name of the HTML file. The value of the LINKS slot should be a list of objects of type URLPATH, each one representing a link contained in the file.

URLPATH Trees

You are provided with functionality to create and store URLPATHs in trees.

You are also provided with a program which will create a window on your screen containing a graph of a URLPATH tree. This program is written using CLIM, the Common Lisp Interface Manager.

CLIM is not part of Common Lisp. However, implementations of it are available from several Lisp vendors, and its specification was created as the result of a joint effort. In order to use CLIM, you have to use a Lisp image which has CLIM loaded into it. See the instructions in the directory for how to run it.

Authors: Gail Anderson (ga@cley.com), Tim Bradshaw(tfb@cley.com), Cley Limited.
Copyright 1999–2003 Cley Ltd. & Franz Inc.