MagicStats 2.0 Architecture Guide

Abstract:

MagicStats is a flexible package for managing and analyzing web site statistics. This flexibility brings power, and power brings customizability. This document describes the overall architecture of the MagicStats system, and describes the Plugin API in enough detail to allow C++ programmers to get started. This talks about the lowest level architecture and explains some of why MagicStats is the way it is...
  1. MagicStats Architecture Overview:
  2. MagicStats Plugin System:
  3. MagicStats Data Flow:
    1. FileStream Plugins
    2. AccessFormat Plugins
    3. AccessFilter Plugins
    4. Page Plugins
  4. Platform Abstraction Layer:
  5. Conclusion:

MagicStats Architecture Overview:

In essence, MagicStats is simply a plugin system that allows creating arbitrary applications through a powerful and fast pluggable interface. MagicStats the log analyzer is the first (and so far only) system that builds off of this base.

When MagicStats starts up, it locates an "EnginePlugin" named "MSEngine". This plugin starts up the system, loads other plugins, and starts analyzing log files.

Since understanding the plugin system is intrinsic to understanding the system as a whole, I'll digress and talk about the plugin system...

MagicStats Plugin System:

MagicStats makes use of a powerful plugin system to do all of its dirty work. Plugins may either be compiled into the main application binary (such as the above mentioned MSEngine plugin), or they may be dynamically loaded, like the plugins in the /Plugins/ directory.

This plugin system works similarly to the CORBA system, except that it is much slimmer, higher performance, and simplified. It currently only supports C++, does not support network transparent objects, and does not require an ORB. These restrictions are in place for several reasons:

  1. Complexity. MagicStats tries to reduce complexity to allow anyone to contribute to the system.
  2. Performance. By eliminating the need to marshal and unmarshal parameters, to worry about endianness, or to follow a specified protocol, MagicStats can attain very high performance.
  3. Flexibility. MagicStats has specialized "constructor" methods to create objects of a specific type, depending on context. For example, given a filename, FileStreamPlugin::OpenFileStream(...) returns a sequence of FileStreamPlugin's that can decode the file in a predictable way (i.e., fetch it from the network, uncompress it, decrypt it, etc.), all in an easy-to-use way.

To reduce implementation complexity and redundant code, a template based system is used to implement plugins. This makes the plugin mechanism slightly harder to understand, but extremely easy to use (we'll see why later).

Given this background in how pieces are put together, we'll talk about the actual data flow patterns in the MagicStats system.

MagicStats Data Flow:

The accompanying diagram roughly shows how data flows through the MagicStats system. The input, of course, is one or more log files of various types. These files may reside over the network (reachable via FTP or HTTP), or could be local files (as shown). These files are processed through one or more FileStreamPlugins that convert the file in one or more ways.

In the example shown, all three logs are read from the disk by the FSNormal plugin. In the case of Log #1, this stream is then uncompressed by the FSGZipped file stream plugin (which is a wrapper around the wonderful zlib library).

After the files are suitably loaded/downloaded/decompressed/decrypted/etc, they are passed to an AccessFormatPlugin that decodes the actual lines of the log file. The three (hypothetical) examples shown decode the standard CLF web server log format (CommonLogFormat plugin), merge split log files, and parse an FTP transfer log. Note that the data from these log files is canonicalized into an internal format, which is represented as the AccessFormatPlugin.

At the next stage, the AccessLogManager repeatedly selects the AccessFormatPlugin whose current access has the earliest date stamp (in other words, it streams accesses out of the three pipelines in date-sorted order).
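
The date-sorted streaming can be pictured as a k-way merge. The following is only a sketch under stated assumptions, not the actual AccessLogManager code: the Access struct, the use of a plain integer timestamp, and the MergeByDate function are all hypothetical stand-ins, and each input log is assumed to be sorted by time already.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a parsed access: a timestamp plus the URL hit.
struct Access {
  long Time;          // Assumed representation: seconds since some epoch
  std::string URL;
};

// Merge several logs (each already sorted by time) into one sorted stream.
std::vector<Access> MergeByDate(const std::vector<std::vector<Access> > &Logs) {
  // Heap entries are (time, log index); the earliest time is always on top.
  typedef std::pair<long, std::size_t> Entry;
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > Heap;
  std::vector<std::size_t> Next(Logs.size(), 0);   // Read position per log

  for (std::size_t i = 0; i < Logs.size(); ++i)
    if (!Logs[i].empty()) Heap.push(Entry(Logs[i][0].Time, i));

  std::vector<Access> Out;
  while (!Heap.empty()) {
    std::size_t Log = Heap.top().second;           // Log with earliest access
    Heap.pop();
    Out.push_back(Logs[Log][Next[Log]++]);
    if (Next[Log] < Logs[Log].size())              // Re-queue its next access
      Heap.push(Entry(Logs[Log][Next[Log]].Time, Log));
  }
  return Out;
}
```

Only one pending access per log is held in the heap at a time, which is what lets the real system stream rather than load every log into memory first.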

Each access that is streamed from the input pipeline is run through a user-defined sequence of AccessFilterPlugin's. These plugins perform tasks such as converting escaped characters into the characters they represent (UnEscapeURL plugin; for example, %20 becomes a space character), adding a domain name to the accesses in a log file (Domainify plugin), or performing arbitrary regex substitutions on accesses (Rewrite plugin).

After the accesses are in their final format, they are passed to each of the active PagePlugin's. These page plugins do a variety of analysis activities, such as counting the number of hits per day (HourlyGraph plugin), finding the most popular pages (PageCount plugin), or counting what types of errors are occurring on the site (Errors plugin). The page plugins in use are determined by the themes that the user has configured.
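
A filter chain of this kind can be sketched roughly as follows. This is an illustration only: the Access struct, the AccessFilter base class, the PrefixRewrite and DropHost filters, and RunFilters are hypothetical names, and the rewrite here is a plain prefix match rather than a regex. The one convention taken from the real API is that a filter returning a true value causes the access to be discarded.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a canonicalized access record.
struct Access {
  std::string Host, URL;
};

class AccessFilter {
public:
  virtual ~AccessFilter() {}
  // Returns true if the access should be discarded.
  virtual bool FilterAccess(Access &A) = 0;
};

// Hypothetical filter: replaces a URL prefix (plain match, not a regex).
class PrefixRewrite : public AccessFilter {
  std::string From, To;
public:
  PrefixRewrite(const std::string &F, const std::string &T) : From(F), To(T) {}
  bool FilterAccess(Access &A) {
    if (A.URL.compare(0, From.size(), From) == 0)
      A.URL = To + A.URL.substr(From.size());
    return false;                       // Rewriting never discards
  }
};

// Hypothetical filter: discards every access from one host.
class DropHost : public AccessFilter {
  std::string Host;
public:
  explicit DropHost(const std::string &H) : Host(H) {}
  bool FilterAccess(Access &A) { return A.Host == Host; }
};

// Runs the filters in order; stops as soon as one discards the access.
bool RunFilters(const std::vector<AccessFilter *> &Chain, Access &A) {
  for (std::size_t i = 0; i < Chain.size(); ++i)
    if (Chain[i]->FilterAccess(A)) return true;
  return false;
}
```

Because each filter modifies the access in place, the order of the chain matters: a later filter sees the output of every earlier one.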

When it becomes time to write HTML files out to disk so that the web server can serve them, the OutputManager gets the plugins to emit the HTML that represents their current state, and extracts HTML from the themes in use. The composite result is written to disk.

FileStream Plugins

At the time of this writing, two FileStream plugins are implemented (FSNormal and FSGZipped). It would be nice to have facilities to fetch a log through FTP or off another web server, but that is currently not implemented. One chain of FileStream plugins is created for each element of the AccessLog entry in the user's MagicStats.cfg file.

FileStream plugins are instantiated through the FileStreamPlugin::OpenFileStream(name, opencode) static method. This method searches for a FileStream plugin that is capable of opening the requested filename based on a priority order (so GZipped streams have higher priority than normal disk streams, for example). Here is the API that a FileStreamPlugin must implement (note that this illustrates the statement "This makes the plugin mechanism slightly harder to understand, but extremely easy to use"):

class {classname} : public FileStreamPluginImpl<{classname}> {
public:
  virtual ~{classname}();   // Virtual Destructor to clean things up...
  // Constructors for class
  {classname}(int &CreateError);
  {classname}(const char *Filename, int OpenCode, int &CreateError);
  virtual int seek(long Position, int From = SeekSet);
  virtual long tell();
  virtual FileStreamPlugin *getline(char *Buffer, int BufferLength, char Delimiter = '\n');
  virtual int eof();
  static unsigned int GetCurrentVersion();
  static const char *GetPluginName();
  // Return the priority class of filestreams of this type
  static int GetPriority();
};
As you can see above, the API is a pretty simple streams API, with operations like seeking a stream, reading a line, and testing for EOF. Note that the constructors take a CreateError parameter that is set if the plugin did not load the file correctly and is in an invalid state. If this flag is set, the plugin is immediately destroyed.
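
To make the priority search and the CreateError convention concrete, here is a minimal sketch. It is not the real OpenFileStream implementation: the Stream, NormalStream, GZippedStream, Factory, Make, and OpenStream names are hypothetical, the ".gz" suffix check is an assumed way of recognizing compressed files, and the sketch returns a single stream rather than a chain of wrapped streams.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stream interface, reduced to the bare minimum.
class Stream {
public:
  virtual ~Stream() {}
  virtual const char *Kind() const = 0;
};

class NormalStream : public Stream {
public:
  NormalStream(const std::string &, int &CreateError) { CreateError = 0; }
  const char *Kind() const { return "normal"; }
};

class GZippedStream : public Stream {
public:
  GZippedStream(const std::string &Name, int &CreateError) {
    // Assumed check: only accept file names ending in ".gz".
    bool IsGz = Name.size() >= 3 &&
                Name.compare(Name.size() - 3, 3, ".gz") == 0;
    CreateError = IsGz ? 0 : 1;
  }
  const char *Kind() const { return "gzipped"; }
};

struct Factory {
  int Priority;
  Stream *(*Create)(const std::string &, int &);
};

template <class T>
Stream *Make(const std::string &Name, int &CreateError) {
  return new T(Name, CreateError);
}

static bool HigherPriority(const Factory &A, const Factory &B) {
  return A.Priority > B.Priority;
}

// Try each factory from highest to lowest priority; keep the first success.
Stream *OpenStream(std::vector<Factory> Factories, const std::string &Name) {
  std::sort(Factories.begin(), Factories.end(), HigherPriority);
  for (std::size_t i = 0; i < Factories.size(); ++i) {
    int CreateError = 0;
    Stream *S = Factories[i].Create(Name, CreateError);
    if (!CreateError) return S;
    delete S;        // Invalid state: destroy the plugin immediately
  }
  return 0;          // Nothing could open the file
}
```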

In addition to the obvious methods, FileStreamPlugin's have functionality that is common across all plugins. For example:

  1. GetCurrentVersion() returns the current version number*100. Thus, 1.0 is returned as 100. This is to allow versioning capabilities in the future...
  2. GetPluginName() returns the name of the plugin as shown by MagicStats2 -P.
If you are interested in implementing a FileStreamPlugin, take a look at one of the two examples that already exist...

AccessFormat Plugins

AccessFormatPlugin's are used to parse a record from a log file and present it to the MagicStats system. Subclasses of this interface define a canonicalized representation of records suitable for processing. The interface used looks like this:

class {classname} : public AccessFormatPluginImpl<{classname}> {
public :
  {classname}(int &CreateError);
  {classname}(const String &ExLine, int &CreateError);
  virtual ~{classname}();
  static unsigned int GetCurrentVersion();
  static const char *GetPluginName();
  static int GetPriority();
  virtual void operator=(const AccessFormatPlugin &F);
  virtual int ParseAccess(String &Line);
  virtual const String &GetHost()       const;
  virtual       String &GetHost()            ;
  virtual const String &GetAuth()       const;
  virtual       String &GetAuth()            ;
  virtual const Date   &GetDate()       const;
  virtual       Date   &GetDate()            ;
  virtual const String &GetRetType()    const;
  virtual       String &GetRetType()         ;
  virtual const String &GetURL()        const;
  virtual       String &GetURL()             ;
  virtual const String &GetProtocol()   const;
  virtual       String &GetProtocol()        ;
  virtual const String &GetDomain()     const;
  virtual       String &GetDomain()          ;
  virtual const String &GetReferrer()   const;
  virtual       String &GetReferrer()        ;
  virtual const String &GetBrowser()    const;
  virtual       String &GetBrowser()         ;
  virtual       int GetStatusCode()     const;
  virtual       int GetLength()         const;
  virtual void SetStatusCode(int S);
  virtual void SetLength(int L);
  virtual void SetDomain(const String &d);
};
This class works by parsing a line (whenever the ParseAccess method is called) into private data members, and returning those members whenever the specific field is requested. This isn't terribly difficult, but has proven to be a very general and powerful mechanism for handling a variety of access log types.

If you would like to see an example of this type of plugin (which demonstrates that it isn't totally tedious to write a class like this), check out the CommonLogFormat plugin.
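
As a rough illustration of the parse-into-members idea, here is a standalone sketch for a Common Log Format line, whose layout is host ident authuser [date] "request" status bytes. This is not the CommonLogFormat plugin itself: the CLFAccess class is hypothetical, it uses std::string instead of MagicStats' String and Date classes, and the convention that ParseAccess returns 0 on success is an assumption.

```cpp
#include <cstdlib>
#include <string>

class CLFAccess {
public:
  CLFAccess() : StatusCode(0), Length(0) {}

  // Returns 0 on success, nonzero for a malformed line (assumed convention).
  int ParseAccess(const std::string &Line) {
    typedef std::string::size_type Pos;
    const Pos None = std::string::npos;

    Pos HostEnd = Line.find(' ');
    Pos DateBegin = Line.find('[');
    if (HostEnd == None || DateBegin == None) return 1;
    Pos DateEnd = Line.find(']', DateBegin);
    if (DateEnd == None) return 1;
    Pos ReqBegin = Line.find('"', DateEnd);
    if (ReqBegin == None) return 1;
    Pos ReqEnd = Line.find('"', ReqBegin + 1);
    if (ReqEnd == None) return 1;

    Host = Line.substr(0, HostEnd);
    Date = Line.substr(DateBegin + 1, DateEnd - DateBegin - 1);

    // The request has the form "METHOD URL PROTOCOL".
    std::string Request = Line.substr(ReqBegin + 1, ReqEnd - ReqBegin - 1);
    Pos Sp1 = Request.find(' ');
    Pos Sp2 = Request.rfind(' ');
    if (Sp1 == None || Sp1 == Sp2) return 1;
    URL = Request.substr(Sp1 + 1, Sp2 - Sp1 - 1);
    Protocol = Request.substr(Sp2 + 1);

    // The trailing fields are the status code and the byte count.
    const char *Tail = Line.c_str() + ReqEnd + 1;
    char *End = 0;
    StatusCode = static_cast<int>(std::strtol(Tail, &End, 10));
    if (End == Tail) return 1;
    Length = static_cast<int>(std::strtol(End, 0, 10));  // "-" yields 0
    return 0;
  }

  const std::string &GetHost() const { return Host; }
  const std::string &GetDate() const { return Date; }
  const std::string &GetURL() const { return URL; }
  const std::string &GetProtocol() const { return Protocol; }
  int GetStatusCode() const { return StatusCode; }
  int GetLength() const { return Length; }

private:
  std::string Host, Date, URL, Protocol;
  int StatusCode, Length;
};
```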

AccessFilter Plugins

AccessFilterPlugin's are a powerful mechanism to do filtering and cleaning up of accesses as they are loaded into the system. The user specifies which AccessFilterPlugin's to use with the Filters setting in their MagicStats.cfg file. Each filter can take a list of parameters to shape its behavior.

AccessFilter's are extremely important to MagicStats because they can really help clean up the input coming into the system. Here is an example that I use in my MagicStats.cfg file:

Filters : [ ["UnEscapeURL"],
            ["RemoveIndexFilename", "index.html",
                                    "index.shtml",
                                    "index.htm",
                                    "index.cgi",
                                    "/"],
# Use a rewrite rule to change the /~sabre symlink into /sabre
            ["Rewrite", "URL", "^/~sabre", "/sabre"],
# Use a rewrite rule to merge all ip189.uni-com.net hosts...
            ["Rewrite", "HOST", "^ip....uni-com.net",
                                "dhcp.uni-com.net"],
            ["Domainify", "*", ""]
          ];
This rule chain instantiates five AccessFilterPlugin's that do the following in order:

  1. UnEscapeURL: Remove escapes from the URL field of the accesses (such as %20 into a space character).
  2. RemoveIndexFilename: This plugin strips the specified strings off the ends of URLs. This is quite useful because references to http://www.nondot.org/~sabre/ and http://www.nondot.org/~sabre/index.html both refer to the same document and should be counted as such.
  3. Rewrite: This is used twice because:
    • I have a symlink from /sabre to /~sabre on my web server... and I want the hits to be considered equivalent. This merges the two.
    • I want all DHCP hosts on my local network merged into one so that I can filter them away in one easy pass.
  4. Domainify: This prepends each URL with the domain of the access log that it comes from. This isn't necessary, I just prefer the "http://" look. On sites with multiple domains being hosted, this is a necessity, so that links resolve correctly.
AccessFilterPlugin's have one of the simplest APIs in MagicStats:

class {classname} : public PluginTemplate<AccessFilterPlugin,{classname}> {
public:
  {classname}(int &CreateError);
  static unsigned int GetCurrentVersion();
  static const char *GetPluginName();
  // Override this method if your AccessFilter takes arguments...
  virtual int Initialize(const VTListExp *Params);
  // The Filter method does the actual filtering for each Access.  If it 
  //   returns a true value, the access is discarded.
  virtual int FilterAccess(AccessFormatPlugin &A);
};

Example AccessFilterPlugin:

Because it is so short, here is a complete example of an AccessFilterPlugin (the UnEscapeURL plugin):
//  AFUnEscapeURL Class:
//
//    This class provides the UnescapeURL plugin that is used to remove the 
//  escape codes from the URL that the browser references.
//
class AFUnEscapeURL : public PluginTemplate<AccessFilterPlugin,AFUnEscapeURL> {
public:
  inline AFUnEscapeURL(int &CE) { CE = 0; }
  inline static unsigned int GetCurrentVersion() {
    return 100;                       // Version 1.00
  }
  inline static const char *GetPluginName() {
    return "UnEscapeURL";
  }
  // The Filter method does the actual filtering for each Access.  If it 
  //   returns a true value, the access is discarded.
  virtual int FilterAccess(AccessFormatPlugin &A) {
    String &URL = A.GetURL();
    int i, j = 0, Len = URL.Length();
    for (i = 0; i < Len; i++) {
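      if (URL[i] == '%' && i+2 < Len) {   // Only decode a complete %XX escape
        // Write the decoded character at the output position j (not i)
        URL[j] = 16*String::Hex2Int(URL[i+1]) + String::Hex2Int(URL[i+2]);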
        i += 2;
      } else if (URL[i] == '+') {
        URL[j] = ' ';
      } else {
        URL[j] = URL[i];
      }
      j++;
    }
    URL.Left(j);   // Trim URL to shortened length...
    return 0;
  }
};
INIT_PLUGIN(AFUnEscapeURL);
//
// End of plugin
Although simple, this plugin illustrates a number of important points of the plugin system. Let's run through how it works...

The constructor is a really simple function that has no state to initialize. Because it can't fail, it simply sets the CreateError argument to false, to indicate that an error has not occurred.

The GetCurrentVersion method simply returns the constant version number. In this case, 100, which represents version 1.00.

The GetPluginName method returns the name of the plugin. This too is a constant, which it simply returns.

The FilterAccess method does all of the hard work. It scans for '%' characters (each of which is replaced by the character whose hex code follows it) and '+' characters (which are mapped to spaces), transforming the URL in place. When the end of the string is reached, the result string is truncated, because it can be no longer than the original.

The last interesting part of this plugin is something that we have not talked about before: plugin registration. The INIT_PLUGIN macro expands to a sequence of "stuff" that makes sure that the plugin system is notified that the plugin is available for use by the system. If you don't do this, then MagicStats will ignore your newly defined class. Note that this should only be compiled once... so if you put your class declarations into a header file, this should go into one .cpp file... not into the header.
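
The source does not show what INIT_PLUGIN actually expands to, so the following is only a sketch of one common way to get this behavior: a file-scope static object whose constructor adds the plugin to a registry before main() runs. Every name here (Plugin, Registry, Registrar, PluginTemplateSketch, INIT_PLUGIN_SKETCH, HelloPlugin) is hypothetical. It also hints at why a template base class keeps plugins short: the template supplies the boilerplate from the static methods the plugin defines.

```cpp
#include <map>
#include <string>

class Plugin {
public:
  virtual ~Plugin() {}
  virtual const char *Name() const = 0;
};

typedef Plugin *(*PluginCreator)();

// A function-local static avoids initialization-order problems across files.
std::map<std::string, PluginCreator> &Registry() {
  static std::map<std::string, PluginCreator> R;
  return R;
}

// Template base: fills in boilerplate from the derived class's static methods.
template <class Base, class Derived>
class PluginTemplateSketch : public Base {
public:
  const char *Name() const { return Derived::GetPluginName(); }
  static Plugin *Create() { return new Derived(); }
};

// Its constructor performs the registration as a side effect.
struct Registrar {
  Registrar(const char *Name, PluginCreator C) { Registry()[Name] = C; }
};

// Hypothetical macro: defines one static object per plugin, so it must be
// compiled exactly once (in a .cpp file, not in a header).
#define INIT_PLUGIN_SKETCH(Class) \
  static Registrar Class##_Registrar(Class::GetPluginName(), &Class::Create)

class HelloPlugin : public PluginTemplateSketch<Plugin, HelloPlugin> {
public:
  static const char *GetPluginName() { return "Hello"; }
};
INIT_PLUGIN_SKETCH(HelloPlugin);
```

With this arrangement, a plugin that omits the macro still compiles, but nothing ever adds it to the registry, which matches the "MagicStats will ignore your class" behavior described above.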

Page Plugins

PagePlugins are the most visible plugins to the end user. They are responsible for transforming preprocessed input into visually appealing graphs and charts. PagePlugin's are instantiated directly from the Skel files that make up the themes that the user has configured. The API for PagePlugin's looks like this:

class {classname} : public PluginTemplate<PagePlugin,{classname}> {
public:
  static unsigned int GetCurrentVersion();
  static const char *GetPluginName();
  {classname}(int &CreateError);
  void LoadSettings(VarTable *List); // Process parameters to plugin...
  void ProcessAccess(AccessFormatPlugin &Access);
  void ResetState();
  void OutputHTML(ostream &OutputStream, VarTable &Settings);  
  void LoadState(Serialize &);  // Save and restore state across program
  void SaveState(Serialize &);  // runs...
  void GetFullName(VarTable &Params, String &FullName, int UpdateFreq);  // this will probably change in the future
};
Conceptually, PagePlugin's go through several stages of "life". When MagicStats starts up, the plugins are created to receive data from the theme's pages. Plugins provide a method to save and a method to load their state (SaveState and LoadState, aptly enough), so that between runs of MagicStats, they don't forget everything they know. The ResetState method is used to tell the plugin when to forget everything it knows, for example at the end of a day.

In addition to these state manipulation methods, there are two fundamental methods, ProcessAccess and OutputHTML. The ProcessAccess method is called on each access that is streamed in from an access log. This is the place that the plugin has to decide what to do with the access. The OutputHTML method is called by OutputManager when it is necessary to send the state of the plugin to file as HTML. This allows the plugin to render any HTML that it feels like.

Example PagePlugin:

The best way to explain this is to look at an example in detail. Before we look at code, let's look at how this plugin (PageCountPlugin) is used in a theme:

Here are two example instantiations of the PageCountPlugin (from the Dejavu theme):

<MSPlugin=PageCount
  Filter=($PageFilter)
  Graph.Format=Graph
  Graph.Width=500 Graph.Height=300
  Graph.BGColor=($BGColor)
  Graph.MaxListLength=20
  Graph.Java.BarColor="#0000FF"
  Graph.Java.BarEndColor="#FFFF00"
  Graph.Java.BarStyle=2
  Graph.Java.BarCenter=128
  Graph.Java.GraphWidth=500
  Graph.Java.GraphHeight=300
  Graph.ASCII.MaxBarLength=70
  Graph.Image.MaxBarLength=500>
Although this looks complicated, it really isn't. Basically the plugin is told to use the Filter object that is passed into the theme in the PageFilter variable, and it tells the plugin to draw as a graph (the Graph.Format parameter), and explicitly tells it how to do so.

<MSPlugin=PageCount
  Filter=($PageFilter)
  Graph.Format=Table
  Graph.MaxListLength=20
  Graph.Table.ColumnHeadColor="#FFFF00">
This example is much shorter, and tells the plugin to draw in Table mode with the same filter. It also specifies a color to use in the column headers.

So now that we see how this powerful plugin is used, let's take a look at its source (this also shows an example of splitting a plugin across .h and .cpp files):

// From the PageCountPlugin.h file...
//
#include "PagePlugin.h"
#include "AccessFormatPlugin.h"
#include "Table.h"
#include "VarTable.h"
class PageCountPlugin : public PluginTemplate<PagePlugin,PageCountPlugin> {
public:
  inline static unsigned int GetCurrentVersion() {
    return 100;                       // Version 1.00
  }
  inline static const char *GetPluginName() {
    return "PageCount";
  }
  PageCountPlugin(int &CE);
  void ProcessAccess(AccessFormatPlugin &);
  void ResetState();
  void OutputHTML(ostream &, VarTable &);  
  void LoadState(Serialize &);  // Save and restore state across program
  void SaveState(Serialize &);  // runs...
  void GetFullName(VarTable &Params, String &FullName, int UpdateFreq);
private :
  Table<String, int> CountTable;  // mapping between URL and count
};
This is a pretty straightforward implementation of the interface described above. Here all that is filled in is the version and plugin name... which are trivial. Okay, let's look at the implementation file now...

// From the PageCountPlugin.cpp file...
//
#include "PageCountPlugin.h"
#include "DataGraph.h"
#include "SerializeEx.h"
// Initialize the plugin, as we discussed before...
INIT_PLUGIN(PageCountPlugin);
PageCountPlugin::PageCountPlugin(int &CE) {
  CE = 0;                    // Construction cannot fail, so report no error
  Flags = FAutoFilter | FIgnoreErrors;
  ResetState();
}
void PageCountPlugin::GetFullName(VarTable &Params, String &FullName, 
                                  int UpdateFreq) {
  String Filt;
  if (Params["Filter"] != 0) {
    Filt = Params["Filter"]->GetStringValue();
  } else {
    Filt = DefFilter.ToParamStr();
  }
  FullName = String::IntToStr(UpdateFreq) + Name + "\t" + Filt;
}
void PageCountPlugin::ProcessAccess(AccessFormatPlugin &A) {
  CountTable[A.GetURL()]++;
}
void PageCountPlugin::ResetState() {
  CountTable.Clear();        // Free all of the data in the table...
  CountTable.SetDefault(0);
}
void PageCountPlugin::SaveState(Serialize &O) {
  O << GetCurrentVersion();  // Serialize the Version number
  O << CountTable;
}
void PageCountPlugin::LoadState(Serialize &I) {
  unsigned int Version = 0;
  I >> Version;
  // Check version number...
  switch (Version) {
  case 100:                  // Version 1.00
    I >> CountTable;
    break;
  default:
    cout << "Serialized " << Name << " Page plugin is of a version that cannot "
         << "be deserialized\nwith this version of the plugin.  Resetting state.\n";
    ResetState();            // Actually discard the state, as the message says
    break;
  }
}
void PageCountPlugin::OutputHTML(ostream &O, VarTable &Params) {
  DataGraph Data(Params);
  typedef DataPair<String,int> Pair;
  LinkedList<Pair> List;
  LinkedList<String> RowLabels, Pages, Count;
  // Copy the data from the table into the linked list...
  CountTable.ConvertToList(List);
  // Sort the data list by the number of hits.
  List.Sort(&Pair::SortBySecondary);
  // Go through the list and convert URL's to links...
  LinkedList<Pair>::Iterator I = List.GetIterator();
  while (I) {
    String &URL = I->P;
    URL = "<a href='" + URL + "'>" + URL + "</a>";
    I++;
  }
  Data.SetRowLabelFormat(1);     // Increasing integer labels
  Data.SetRowLabels(RowLabels);
  Data.AddColumn(Pages);
  Data.AddColumn(Count);
  RowLabels.AddToTail("#");
  Pages.AddToTail("Web Page URL:");
  Count.AddToTail("Hits:");
  I = List.GetIterator();
  while (I) {
    Pages.AddToTail(I->P);
    Count.AddToTail(String::IntToStr(I->S));
    I++;
  }
  Data.WriteHTML(O);
}
Okay... to start with, let's look at that constructor. The constructor is setting an instance variable that all PagePlugin's have named 'Flags'. This is used to optimize and simplify the task of creating PagePlugin's. It is the bitwise OR of the following possible values:

  1. FAutoFilter: If this flag is set, the plugin automatically recognizes the Filter parameter and only recognizes accesses that match the specified Filter. This is exactly what we want for this plugin, so we enable this. This removes the need to do custom filtering...
  2. FIgnoreErrors: This is specified as a shorthand to say that we don't want to process any error accesses. This is because we want to see the most popular pages... not the most popular errors... :)
  3. FNoAccesses: This flag is rarely used, but basically means that ProcessAccess should never be called... Take a look at the DateEnum plugin for an example where this is useful.
After setting up the 'Flags' variable, the ResetState method is called to zero out our data table.
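
The flag mechanics can be sketched as follows. The bit values and the ShouldDeliver helper are assumptions made for illustration, as is treating status codes of 400 and above as "error accesses"; the real PagePlugin base class may define all of these differently.

```cpp
// Assumed bit values: each flag occupies its own bit so they can be OR'ed.
enum PagePluginFlags {
  FAutoFilter   = 1 << 0,
  FIgnoreErrors = 1 << 1,
  FNoAccesses   = 1 << 2
};

// Hypothetical helper: decides whether ProcessAccess should be called for
// an access with the given status code and filter match result.
bool ShouldDeliver(int Flags, int StatusCode, bool MatchesFilter) {
  if (Flags & FNoAccesses) return false;                       // Never called
  if ((Flags & FIgnoreErrors) && StatusCode >= 400) return false;
  if ((Flags & FAutoFilter) && !MatchesFilter) return false;
  return true;
}
```

Doing these checks once in the caller, driven by the flags, is what spares each plugin from repeating the same filtering code in its own ProcessAccess method.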

The GetFullName method is the one that will probably be disappearing soon... because of this, I'm not going to talk about it much. :) Basically, it returns a string that distinguishes instances of this plugin that are configured differently.

The ProcessAccess method is pretty simple in this case... it basically just increments the count for the URL that got hit. This makes use of one of MagicStats' templated data structures: the Table class (which implements a Hash table... aka an associative array).

The ResetState method is supposed to forget the entire state of the plugin. This is easily accomplished by clearing the hashtable. This also sets the default value for unseen entries in the hash table to 0... so that the first time an entry is used, that is the value it gets initialized to.

The SaveState method saves the entire state of the plugin to the "Serialize" parameter. This Serialize object is basically a fancy bit bucket for collecting objects that makes it easy to save and restore stuff... Here we save our version number and the state of our table.

The LoadState method is the opposite of the SaveState method. Here we load the version number, then check to make sure that we are not trying to deserialize something that we don't understand. If things are cool and we understand the format, then we go ahead and grab our table back again.
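
The version-then-data pattern can be shown in isolation with standard streams. This is a sketch, not the Serialize class: the text format, the free SaveState and LoadState functions, and the CountTable typedef are all hypothetical, and URLs are assumed to contain no whitespace.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

typedef std::map<std::string, int> CountTable;

// Write the version first, then the entry count, then one entry per line.
void SaveState(std::ostream &O, const CountTable &T) {
  O << 100 << '\n' << T.size() << '\n';          // 100 means version 1.00
  for (CountTable::const_iterator I = T.begin(); I != T.end(); ++I)
    O << I->second << ' ' << I->first << '\n';
}

// Returns true if the state was loaded. On an unknown version or damaged
// input, the table is left empty so the plugin starts from a clean state.
bool LoadState(std::istream &I, CountTable &T) {
  T.clear();
  unsigned Version = 0;
  std::size_t N = 0;
  if (!(I >> Version) || Version != 100 || !(I >> N)) return false;
  for (std::size_t k = 0; k < N; ++k) {
    int Count = 0;
    std::string URL;
    if (!(I >> Count >> URL)) { T.clear(); return false; }
    T[URL] = Count;
  }
  return true;
}
```

A std::stringstream works as a stand-in for a state file when trying this out: save into it, then load from the same stream.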

The OutputHTML method is where all the action is... this class has really straightforward analysis, but potentially complex data formatting guidelines. Here we make use of the DataGraph to do all the fancy formatting and processing of all the Graph.* parameters to the plugin. The DataGraph class takes a series of columns of data to display. In this case, we are presenting it as a column of URLs, and a column of hits.

Before we get that far, we need to convert the data into two parallel lists (one for each column) that are sorted by the number of hits (so that the most common things get placed at the top). Labels are added to the graph, then the graph writes its HTML to the output stream... and it's done!
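
The count-then-sort step looks roughly like this with standard containers. This sketch uses std::map and std::vector in place of MagicStats' Table and LinkedList classes, and the PageHits, MoreHits, and TopPages names are hypothetical. Note that std::map's operator[] starts unseen keys at 0, which plays the role of SetDefault(0).

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> PageHits;   // (URL, hit count)

static bool MoreHits(const PageHits &A, const PageHits &B) {
  return A.second > B.second;
}

// Copy the table into a list and sort it by descending hit count.
// stable_sort keeps pages with equal counts in their original (URL) order.
std::vector<PageHits> TopPages(const std::map<std::string, int> &Counts) {
  std::vector<PageHits> List(Counts.begin(), Counts.end());
  std::stable_sort(List.begin(), List.end(), MoreHits);
  return List;
}
```

Counting an access is then just Counts[URL]++, and walking the sorted list once yields the two columns (URLs and hits) in matching order.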

Page plugins are extremely important to MagicStats' theme authors. Providing enough capability to be flexible is very important: This allows theme authors to be maximally creative without having to touch C++! MagicStats ALWAYS needs new and different plugins for different types of analysis.

Platform Abstraction Layer:

Unfortunately it seems that operating systems from various vendors are not very compatible... because of this, MagicStats has a small PlatFormAbstraction class that is used to do things like load shared objects, get the preferred directory separator on that platform, and even compensate for compiler bugs/features. If you are looking for something like that, look here. :)
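
As a small illustration of the idea (not the actual PlatFormAbstraction class, whose interface is not shown here), the directory separator case might look like this. DirectorySeparator and JoinPath are hypothetical names.

```cpp
#include <string>

// Pick the separator at compile time based on the target platform.
inline char DirectorySeparator() {
#ifdef _WIN32
  return '\\';
#else
  return '/';
#endif
}

// Join a directory and a file name without doubling the separator.
inline std::string JoinPath(const std::string &Dir, const std::string &File) {
  if (Dir.empty() || Dir[Dir.size() - 1] == DirectorySeparator())
    return Dir + File;
  return Dir + DirectorySeparator() + File;
}
```

Keeping the #ifdef in one place means the rest of the code can build paths without knowing which platform it is running on.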

Conclusion:

MagicStats is a powerful system, but it needs to grow. Even novice C++ programmers should be able to "scratch their itch" and implement a plugin that does something cool. This document is one step towards that.

This document grew to be much longer than I expected it to be. It is now a pretty complete guide to authoring plugins of various different types, and hopefully shows why MagicStats is so flexible (at least at the bottom most layer). I would really, really, really like feedback on what you think is terrible about this document, what you think is good, what needs to be explained more and expanded upon, and if I'm a total fool, please tell me!

-Chris


Copyright © 2000 Chris Lattner