As part of a much larger project I have in mind, I'm working on a bit of code to manage the graph of links in a website that has the potential to grow quite large. I started out doing everything in a relational database, but it soon became clear that my ideas for managing links (edges) between pages (vertices), as well as pages grouped by topic (collections), weren't going to work with any reasonable amount of speed if I kept everything in the database: I would be making far too many queries if the site grew larger than a few pages.
So I decided to create something new to manage these links. Storing the connections between documents in the RDBMS itself would leave me with two bad options: either the number of queries per pageview would depend on how many links pointed towards that page, away from it, or both ways, as well as on how many collections it belonged to, or I would need a lot more code to extrapolate the results from the database. Instead, I'm building a separate backend for the whole thing. What I've started work on is what I guess one could call an RGM: relational graph manager.
The idea started out quite nice, really: I would have one circularly linked list each for vertices, edges, and collections. Each vertex instance would have circularly linked lists of the edges pointing towards and away from that vertex, as well as a list of the collections to which that vertex belongs. Each edge instance would have pointers to both the vertex it originated from and the vertex it was pointing to, along with a boolean declaring whether it was directed or not. Each collection instance would simply have a list of pointers to the vertices that belonged to it. No structure would be duplicated - everything would work on pointers. It was all so clean. No one would be searching for what was in a node with this program, since none of the actual data is stored with it. And since it would only be accessed by the larger backend program, it wouldn't have to worry about whether or not a node existed - it would be kept up to date with every vertex, edge, and collection created or destroyed.
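To make that layout concrete, here is a minimal sketch in Python. It is only an illustration under a few assumptions: Python sets and dicts stand in for the circularly linked lists, object references stand in for pointers, and every name (`Vertex`, `Edge`, `Collection`, `Graph`, `links_from`) is made up for the example.

```python
from dataclasses import dataclass, field


# eq=False keeps identity-based hashing, so instances can be stored in sets
# the way pointers would be stored in a linked list.
@dataclass(eq=False, repr=False)
class Vertex:
    vid: int
    edges_in: set = field(default_factory=set)     # edges pointing towards this vertex
    edges_out: set = field(default_factory=set)    # edges pointing away from it
    collections: set = field(default_factory=set)  # collections it belongs to


@dataclass(eq=False, repr=False)
class Edge:
    source: Vertex
    target: Vertex
    directed: bool = True


@dataclass(eq=False, repr=False)
class Collection:
    name: str
    members: set = field(default_factory=set)


class Graph:
    def __init__(self):
        self.vertices = {}     # vid -> Vertex
        self.edges = set()
        self.collections = {}  # name -> Collection

    def add_vertex(self, vid):
        return self.vertices.setdefault(vid, Vertex(vid))

    def add_edge(self, src_id, dst_id, directed=True):
        src, dst = self.vertices[src_id], self.vertices[dst_id]
        edge = Edge(src, dst, directed)
        self.edges.add(edge)
        src.edges_out.add(edge)
        dst.edges_in.add(edge)
        if not directed:
            # An undirected edge counts as incoming and outgoing at both ends.
            dst.edges_out.add(edge)
            src.edges_in.add(edge)
        return edge

    def add_to_collection(self, name, vid):
        coll = self.collections.setdefault(name, Collection(name))
        vertex = self.vertices[vid]
        coll.members.add(vertex)
        vertex.collections.add(coll)

    def remove_vertex(self, vid):
        # Drop the vertex and every reference to it, so nothing dangles.
        vertex = self.vertices.pop(vid)
        for edge in vertex.edges_in | vertex.edges_out:
            self.edges.discard(edge)
            for end in (edge.source, edge.target):
                end.edges_in.discard(edge)
                end.edges_out.discard(edge)
        for coll in vertex.collections:
            coll.members.discard(vertex)

    def links_from(self, vid):
        # IDs of the vertices reachable in one step from `vid`.
        v = self.vertices[vid]
        return sorted({(e.target if e.source is v else e.source).vid
                       for e in v.edges_out})
```

Nothing is copied here: an edge is one object referenced from both of its endpoints, so answering "what does this page link to?" is a walk over `edges_out` rather than a query.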
Until I realized that, if the site started to get fairly popular, I'd take up all the memory on the machine with my instance after instance of vertices, edges, and collections. Well, shit.
I started having all sorts of doubts about the worthiness of this program then:
A) I'm reinventing the wheel. I checked SourceForge, but..
B) I'm going about this the hard way.
C) Speed's not that big of a deal unless this gets huge, in which case, see D.
D) I'm having delusions of grandeur.
I'm not going to drop it just yet, though. What I'm asking is whether it would be possible, easy enough, efficient resource-wise, and just plain worth it to store part of the graph on disk once it grows past a certain size. One way to do this would be to add a 'weight' property to each structure - a simple integer that would be incremented every pageview (or every x pageviews)* - and keep only the most heavily weighted instances in memory (a fixed number of them, a percentage of the total, or however many fit in a set amount of memory), or even just the edges, and store the rest on disk. The portion in memory could even be chosen dynamically - each pageview could load the adjacent nodes and edges into memory, anticipating that a user would click a link away from that vertex.**
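Roughly, the weighting idea could look like the sketch below. Everything in it is an assumption for illustration: the `WeightedCache` name, the `load` and `store` callables standing in for whatever reads and writes the disk, the record shape (a dict with an `"out"` list of neighbour IDs), and the choice to cap memory by a fixed number of records rather than a percentage or byte size.

```python
class WeightedCache:
    """Keep at most `capacity` vertex records in memory; spill the least viewed."""

    def __init__(self, capacity, load, store):
        self.capacity = capacity
        self.load = load    # hypothetical: vid -> record, reads a vertex from disk
        self.store = store  # hypothetical: (vid, record) -> None, writes it back
        self.records = {}   # vid -> record currently held in memory
        self.weights = {}   # vid -> pageview count (kept for every viewed vertex)

    def _fetch(self, vid):
        if vid not in self.records:
            self.records[vid] = self.load(vid)
        return self.records[vid]

    def view(self, vid):
        """Count a pageview, prefetch the adjacent vertices, then trim memory."""
        self.weights[vid] = self.weights.get(vid, 0) + 1
        record = self._fetch(vid)
        for neighbour in record["out"]:
            self._fetch(neighbour)  # the user may click through to these next
        self._evict(keep={vid})
        return record

    def _evict(self, keep):
        # Lowest weight goes first; the page being viewed is never evicted.
        candidates = sorted((v for v in self.records if v not in keep),
                            key=lambda v: self.weights.get(v, 0))
        while len(self.records) > self.capacity and candidates:
            victim = candidates.pop(0)
            self.store(victim, self.records.pop(victim))
```

One catch this makes visible: freshly prefetched neighbours start with a weight of zero, so when the cache is already full they are the first things evicted again. A real version would probably want to give prefetched records a grace period or a small starting weight.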
A simple way to do this would be to make it a server type thing and give each graph a directory. In that directory, each vertex could get a file named after its ID (an integer), in a simple-to-read format, containing its label (if I even need that in there), its edges-in and edges-out, and all of the collections to which it belongs.
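A sketch of what the one-file-per-vertex layout might look like, assuming JSON as the simple-to-read format and a `.json` suffix on the ID-named files. Both are my picks for the example, as are the function names and field keys.

```python
import json
from pathlib import Path


def save_vertex(graph_dir, vid, label, edges_in, edges_out, collections):
    """Write one vertex to <graph_dir>/<vid>.json and return the path."""
    record = {
        "label": label,
        "in": sorted(edges_in),    # IDs of vertices linking to this one
        "out": sorted(edges_out),  # IDs of vertices this one links to
        "collections": sorted(collections),
    }
    path = Path(graph_dir) / f"{vid}.json"
    path.write_text(json.dumps(record, indent=2))
    return path


def load_vertex(graph_dir, vid):
    """Read a vertex record back from its file."""
    return json.loads((Path(graph_dir) / f"{vid}.json").read_text())
```

Since each edge shows up in two files (the source's `"out"` and the target's `"in"`), adding or removing a link means rewriting both, which is the price of being able to load a single vertex without touching anything else.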
<em>Edit:</em> Figured out how to make it work in an RDBMS, though pride makes me loath to implement it that way.
<small>* Ooh, or, you know what, I could keep a private record of which page was most popular, and make each instance's weight a percentage of that most popular page's weight *scribble* Whatever.
** This, of course, requires that I figure out when to clear out the loaded stuff. Maybe have some maximum number of instances loaded at a time..</small>