Around ten months ago, a project on the Exalted wiki started, aimed towards getting “orphaned” pages connected in to the main navigation pages. This was made much harder than it needed to be by virtue of this particular wiki using some very old software. Even worse, it’s not even using the *latest version*. Though this software was originally used for Wikipedia, it has been vastly outpaced by its replacement, MediaWiki, which has many features lacking in this old software. In particular, MediaWiki gives you a list of which pages are not linked to by anything else. This old software cannot. So, DivNull produced software to find these “orphaned” pages, which (hopefully) accelerated the process, which finished today.

The only reason to even mention something this trivial is that this code actually used from real computer science, in particular work done with directed graphs. Having already scraped the wiki source text for the whole site (for other reasons), it was fairly simple to build a directed graph of page links, representing which pages linked to others.

Once you have this graph, though, how do you answer questions like “what is the shortest path between this page and that one”? One good thing here is that the wiki has a single main entry page, and the idea was to find a path from that page to every other page in a short number of steps (no more than five steps turned out to be the most useful, given the way the wiki is organized). This simplifies the problem from finding the shortest path between two arbitrary nodes to finding the shortest path between one arbitrary node and one specific one. This still sounds computationally expensive (and it is), but also smells like a problem that someone has probably already found a good answer to.

A solution came in the form of some algorithms in the Perl Graph module, which is chock full of graph-based utility (and can even export to dot files). In particular, it contains an implementation of Dijkstra’s single-source shortest path problem algorithm, which solves this exact problem. Using these routines, generating the orphan list was fairly simple.

Producing diagrams of the wiki graph, however, was not. Though the graph is easily represented in the aforementioned .dot format, there are so many nodes that existing rendering routines choke on them. And, even if they did complete, the result was horrible to gaze upon. DivNull searched for some other approaches and found some interesting approaches to rendering large, sparse graphs.

One idea was to draw represent the page links as a sparse matrix. The University of Florida seems to do a lot of research in this area. Drawing them, however, seems fairly straight forward, so will probably be done as some future point.

Another idea uses a force-based approach specifically for very large graphs. A paper details a system that could easily handle a data set this size, but experimenting with it for the wiki will have to wait.

Naturally, new orphans will surface in no time, but its good for now.

I’m not sure why you need a shortest path algorithm to find orphans. Wouldn’t a depth-first search find anything reachable from the root page faster than Dijkstra’s?

@Alec: Depth first doesn’t actually work, because the data here is a network, not a tree. That is, going “down” can easily cycle you back to where you started. It would be possible to do a modified “depth first-like” search that, for example, ignored nodes already visited. If you also added the path length constraint used in this project (i.e. no path greater than five links) you would also be able terminate paths (which you basically have to do, since in a network of links like a wiki, any given page can ultimately lead to pretty much all the other pages eventually). Doing this, you could build a list of everything you found, then compare that to a list of all nodes to get the orphans. I doubt this would be much faster than Dijkstra’s, which does one long calculation on a base node, then can thereafter provide the shortest path from the base to any other node in constant time.

It also would have required custom coding. Since I already had the graph set up using the Graph.pm module, it was less effort just to use what the tool already provided.