URI 101

An obvious place to start with institutional identifiers is to look at what URIs actually are, how they work, and what current best practice is for designing them.

URI stands for Uniform Resource Identifier: a string of characters which uniquely identifies a specific resource on the internet. A URI consists of two parts: a scheme, which defines how the resource can be contacted, followed by a colon and then a scheme-specific part which uniquely identifies the resource. An example of a URI might be http://google.com, where http is the scheme and google.com identifies the resource. There are many different URI schemes available, each of which handles a specific type of content or handles it in a specific way. A few examples of common schemes are http, https, ftp and mailto, although there are a great many more available.

Linking You focusses specifically on the HTTP (and by extension HTTPS) scheme, since this is the default scheme over which web content is transferred. However, a great many of the suggestions in this toolkit will apply to other schemes.

Breaking Down URIs

An HTTP URI is made of two parts, one of which (generally) identifies the server and the other of which (generally) identifies a resource on that server. Exceptions to this will always exist due to the fluid nature of the web (for example, where an institution uses complex load balancing), but the assumption holds in the vast majority of cases. An example URI would be http://example.com/section/resource.htm. In this URI we can begin by looking at the scheme, in this case http. This leaves example.com/section/resource.htm as the unique part of the identifier: the part which identifies a resource. Everything up to the first slash is the server address or domain, in this case example.com. Everything after that slash identifies the resource within the domain, in this instance section/resource.htm.
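The breakdown described above can be reproduced with Python's standard urllib.parse module. This is only an illustrative sketch: the break_down function and its dictionary keys are names made up for this example, not part of the toolkit.

```python
from urllib.parse import urlsplit


def break_down(uri: str) -> dict:
    """Split an HTTP URI into the parts discussed in the text.

    The key names ("scheme", "domain", "resource") are illustrative only.
    """
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,               # e.g. "http"
        "domain": parts.netloc,               # the server address
        "resource": parts.path.lstrip("/"),   # everything after the first slash
    }


print(break_down("http://example.com/section/resource.htm"))
# {'scheme': 'http', 'domain': 'example.com', 'resource': 'section/resource.htm'}
```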

In the case of academic institutions the domain is mostly unchangeable, being inextricably linked with the institution it belongs to. A change of domain name for an established institution would be extremely rare, and would only happen if the institution substantially changed its name (such as the University of Lincolnshire and Humberside becoming the University of Lincoln, and ulh.ac.uk becoming lincoln.ac.uk). However, the resource identifier is easily altered by changing the configuration of a web server. This toolkit therefore focusses on the resource identifier part of the URI, with the exception of the Domains section.

Best Practices

Developing a website, small or large, isn't a trivial undertaking. Below are a number of best practices to bear in mind when developing its URI structure.

Persistent and Permanent

One of the most fundamental philosophies behind a URI is that it represents a data object on the Internet. The URI must be unique so that there is a one-to-one match: one URI per data object.

While this is always the goal, there are times when it is very difficult or impossible to accomplish. Canonical URL tags (a link element with rel="canonical" in the head of a page) were invented to help reduce the amount of duplicate content seen by a search engine. While not a complete solution, canonical URLs are strongly recommended, as large search engines such as Google pay attention to them.

URIs should also be permanent (i.e. choose the URI once and leave it at that). This calls for good URI design before a site is launched, with the URIs being carefully planned. There will still come a time when you want to improve on your choices or otherwise must change your URI structure. When this becomes a necessity, HTTP 301 (Moved Permanently) redirects should be set up. These tell browsers and search engines the new location of the content and will also preserve any Google PageRank (and other search engine rankings) that the old URI has accumulated.
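As a minimal sketch of the idea, the function below decides how to answer a request for a path that may have moved. The MOVED table, the /courses/computing path and the respond function are hypothetical names invented for this example; in practice the redirects would normally be configured in the web server or CMS itself.

```python
# Hypothetical table of retired paths and their permanent replacements.
MOVED = {
    "/courses/computing": "/school/computing",
}


def respond(path: str) -> tuple[int, dict[str, str]]:
    """Return an (HTTP status, headers) pair for the requested path."""
    if path in MOVED:
        # 301 tells browsers and search engines that the move is permanent.
        return 301, {"Location": MOVED[path]}
    # Anything not in the table is served as normal.
    return 200, {}


print(respond("/courses/computing"))
print(respond("/school/computing"))
```

Keeping every old path in a single table like this makes it easier to check that no retired URI has been left without a redirect.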

Consistent

URIs across a site must be consistent in format. Once you pick your URI structure, be consistent and follow it! Having a good URI structure for only part of the site still means that you have a poor structure overall. In order for a user to trust that URIs work a certain way on a site, the format must be consistent. If you must switch structure (maybe you’re updating a poorly designed site), use 301 redirects as previously mentioned.

An example of poor consistency would be having undergraduate information located at /undergraduate and postgraduate information at /study/postgraduate.

Consistent structure = maintainable websites.

Readability

A URI can simply be used as a 'click to' point on the internet. There’s nothing stopping an HEI putting a page on courses in the School of Computing at http://example.ac.uk/bcwi83b. You plug it into a link, people click the link and off you go. Technically this is sound, but only in the same sense that you can technically address a letter to something like "10, SW1A 2AA". Yes it’s compact and yes it works, but it conveys absolutely nothing in terms of context. It’s also a real pain to remember, and requires you to use additional bits of your brain if you’re ever writing it down for later reference or typing it into a browser address bar.

Imagine for a second that a prospectus had the following:

Find out more about Computing at http://example.ac.uk/bcwi83b

And then compare it with a ‘human’ address:

Find out more about Computing at http://example.ac.uk/school/computing

Now, try to remember the first one without looking at it. In short, URIs should describe your content (but in a short and succinct manner).

Whatever method you use to create your website, it must be able to generate human-readable URIs. Increasingly, web browsers allow people to search through their history by typing part of a URI, meaning that while a URI such as /computing will be easily found, /_depts/cs won’t be. Even worse would be the style of URI which is often created by an incorrectly configured CMS, such as ac.uk/content/027463.

Remove unnecessary keywords

Following on from the above point about readability, it is possible to take clean URIs too far by including unnecessary information in the URI.

For example, an about page with the URI /about_the_university, whilst accurate and readable, could and should be shortened to /about, because it doesn't lose any of its meaning and still describes the content well.

Watch out for content management systems generating URIs based on page titles. A URI should be made up of keywords, but only the absolute minimum number of keywords in order to describe the content.
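To illustrate the point about page titles, here is a rough sketch of how a title might be reduced to a short keyword-based path. The STOP_WORDS list, the three-word limit and the use of hyphens as separators are all assumptions made for this example rather than recommendations from the toolkit, and the result will usually still need trimming by hand (a person would shorten the first example further to /about).

```python
import re

# Assumed list of filler words; adjust it to suit your own content.
STOP_WORDS = {"a", "an", "and", "at", "for", "in", "of", "the", "to"}


def slugify(title: str, max_words: int = 3) -> str:
    """Reduce a page title to a short path made only of keywords."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    keywords = [word for word in words if word not in STOP_WORDS]
    # Keep only the first few keywords so the URI stays short.
    return "/" + "-".join(keywords[:max_words])


print(slugify("About the University"))
print(slugify("Undergraduate Courses in the School of Computing"))
```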

Query strings should be for filtering and pagination only

"Dynamic" URIs, i.e. URIs containing query string arguments such as /content/page.php?id=1234&output=1, should be kept to an absolute minimum, and even then should only be used (if necessary) for filtering and paginating content.

Dynamic URIs are less readable to both humans and search engines and therefore could be seen as less trustworthy because they don't necessarily describe the content to the user.

If you need to use query strings then try to ensure they are descriptive:

e.g. /undergraduate/courses?years=2 or /events?page=2
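A small sketch of building such URIs with descriptive parameter names, using the standard urllib.parse module. The with_query helper is a made-up name for this example.

```python
from urllib.parse import urlencode


def with_query(path: str, **params) -> str:
    """Append descriptive filter or pagination parameters to a clean path."""
    if not params:
        # No filtering or pagination needed, so keep the URI clean.
        return path
    return f"{path}?{urlencode(params)}"


print(with_query("/undergraduate/courses", years=2))  # /undergraduate/courses?years=2
print(with_query("/events", page=2))                  # /events?page=2
```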

Hashbangs are bad, pushState is good

A number of websites, including Twitter and Gawker Media (Lifehacker, Gizmodo, etc.), have recently re-architected their websites to make use of hashbang URIs, e.g. https://twitter.com/#!/unilincoln. The hashbang was recommended by Google as a way for search engines to crawl AJAX-powered websites.

There are a number of problems with hashbangs:

  • In order to decide which content to render based on a hashbang URI, a hashbang-enabled website relies on the user having a modern, JavaScript-enabled web browser.
  • Hashbangs are invisible to the server, because browsers do not send anything after the # as part of the request. If someone visits http://example.ac.uk/#!/badurl (which ought to result in a 404), the error will not appear in your server logs.
  • Hashbangs are forever, so if you go hashbang you can't go back. You can control the links on your website, but you can't control other people's links to your website. If people start linking to your hashbanged URIs then you're going to have to support the parsing of hashbangs even if you implement "fixed" URIs again.
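The second problem can be seen by splitting a hashbang URI into its parts. The sketch below uses Python's standard urllib.parse module to show what a browser would actually ask the server for; request_target is a hypothetical helper name for this example.

```python
from urllib.parse import urlsplit


def request_target(uri: str) -> str:
    """Return the path (and query) that is sent to the server.

    The fragment, i.e. everything after the '#', is dropped because
    browsers never include it in the HTTP request.
    """
    parts = urlsplit(uri)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


print(request_target("http://example.ac.uk/#!/badurl"))         # /
print(request_target("http://example.ac.uk/school/computing"))  # /school/computing
```

Both a valid and an invalid hashbang URI reach the server as a request for /, which is why the bad one never shows up as a 404 in the logs.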

If you want to be all modern and exciting you should make use of the new pushState JavaScript features being introduced in the latest versions of browsers.

For example, if you want to move from http://example.ac.uk/undergraduate to http://example.ac.uk/undergraduate/courses you'd provide a link which, when clicked, would load the content of the other page via AJAX and update the URI in the address bar. For a user with an older browser, the link would just load the other page as normal.

Limit the number of subdomains

There may occasionally be a valid reason for some content to live on a different subdomain from the rest of the site, and with appropriate linking between the two this should be okay. However, if you split your content across lots of subdomains, e.g. home.example.ac.uk (as opposed to example.ac.uk/home), and the correct 301 redirects aren't all in place, then users will get confused and frustrated, and search engines will just give up on your site.
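If content does need to be folded back from subdomains into the main site, the target of each 301 redirect can be worked out mechanically. The following is a rough sketch under several assumptions: MAIN_DOMAIN and consolidate are names invented for this example, every subdomain is assumed to map to a top-level folder of the same name, and any port number in the original URI is ignored.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed main domain for the institution in this example.
MAIN_DOMAIN = "example.ac.uk"


def consolidate(uri: str) -> str:
    """Map a URI on a subdomain to the equivalent path on the main domain."""
    parts = urlsplit(uri)
    host = parts.hostname or ""
    suffix = "." + MAIN_DOMAIN
    if not host.endswith(suffix):
        # Already on the main domain (or on an unrelated one): leave it alone.
        return uri
    subdomain = host[: -len(suffix)]
    # The subdomain becomes the first segment of the path.
    new_path = ("/" + subdomain + parts.path).rstrip("/")
    return urlunsplit((parts.scheme, MAIN_DOMAIN, new_path, parts.query, parts.fragment))


print(consolidate("http://home.example.ac.uk/"))      # http://example.ac.uk/home
print(consolidate("http://home.example.ac.uk/news"))  # http://example.ac.uk/home/news
```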

There is a very well known principle called KISS (Keep It Short and Simple) that is very appropriate in this situation.
