May 24, 2008

MAML Migration: The Next Step in the Evolution of Help Authoring

Are you part of that herd of developers that is used to documenting applications by writing help topics in raw HTML?  The power of it is nice, being able to add a pinch of bold here, a splash of italics there, some CSS for different layouts, a floating image, several nested tables, an abundance of hyperlinks, embedded Flash and media players, and even some JavaScript to boot.  What more could we want?  Or maybe a better question is, what could possibly make us want to give any of that up?

XML Documentation Comments

Well, XML documentation comments that may be added to code modules (I'm assuming everyone's familiar with this stuff by now) were one thing that prompted .NET developers to start documenting their code without using HTML.  It's nice to be able to apply a bit of XML structure to our documentation, isn't it?

Commonly used semantics for describing an API may be expressed in a universal way with XML tags such as summary, remarks and example, and the compiler builds an XML documentation file that contains the code comments found in each module when we build our project.  If you take a look at the contents of this file you may see a repeating pattern - a schema - that seems like it could be used by some other tools to do, well, other things with it...
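To make that repeating pattern concrete, here's roughly what a compiler-generated XML documentation file looks like.  This is only a sketch: the assembly, namespace and member names (MyLibrary, MyNamespace.MyClass, Parse) are hypothetical, and only the overall doc/assembly/members/member shape is the point.

```xml
<?xml version="1.0"?>
<doc>
  <assembly>
    <name>MyLibrary</name>
  </assembly>
  <members>
    <!-- Hypothetical type; the "T:" prefix identifies a type. -->
    <member name="T:MyNamespace.MyClass">
      <summary>Parses delimited text.</summary>
    </member>
    <!-- Hypothetical method; the "M:" prefix identifies a method. -->
    <member name="M:MyNamespace.MyClass.Parse(System.String)">
      <summary>Parses the specified text.</summary>
      <param name="text">The text to parse.</param>
      <remarks>Leading white-space is ignored.</remarks>
    </member>
  </members>
</doc>
```

Every documented API becomes one member element keyed by a prefixed ID string, which is what makes the file easy for other tools to consume.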

Compare XML documentation to your legacy HTML help topics and what do you notice?  The XML comments that you add to APIs do not typically contain much layout or formatting, whereas your HTML topics are chock-full of <b>'s, <i>'s, and <u>'s, and a whole mess of other HTML to describe the document's layout and formatting.  Ok, ok, if you've done it correctly then you've made judicious use of CSS - applying class names to all of those <h*>'s, <div>'s, <span>'s, <p>'s, <a>'s, <td>'s, <tr>'s, <ol>'s, <ul>'s, <dl>'s, and certainly many other HTML tags that only add to the confusion when authoring topics (as opposed to designing them).

Now you might be thinking, "Dave, it's not entirely true that XML documentation is without formatting.  What about the para, code and c elements?".  And to that my reply would be, "Ok, so then what exactly do they look like?".  If you look in the XML documentation files that are produced by your compilers, you'll see the markup exactly as it appears in your code modules.  In other words, no HTML and no CSS - nothing more than semantic usage: paragraph, code block and in-line code.  (If you were thinking something more like, "leading white-space, use pre formatting, code coloring and a fixed font", then you're getting ahead of yourself, so slow down!)
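As a small illustration, here's how a remarks element that uses para, c and code might appear in the XML documentation file.  The member name and the sample call are hypothetical.  Notice that nothing here says anything about fonts, colors or spacing.

```xml
<!-- Hypothetical member; the markup is copied from the code comments as-is. -->
<member name="M:MyNamespace.MyClass.Parse(System.String)">
  <remarks>
    <para>Pass <c>null</c> to get an empty result.</para>
    <para>A typical call looks like this:</para>
    <code>
var result = MyClass.Parse("a,b,c");
    </code>
  </remarks>
</member>
```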

My point is that the semantics for the aforementioned XML documentation tags are clear (i.e., what the tags represent), but their appearance is not yet defined (i.e., their style and format).  Take a look at the other Recommended Tags for Documentation Comments and you won't find anything out of the ordinary.  Each tag has an obvious reason for its existence - to mark up regions of text that serve a particular purpose in the documentation.  But how do they look?  Nobody knows!  ;)

Sandcastle

Now's probably a good time to introduce Sandcastle.  For those of you that aren't familiar with it yet, Sandcastle is Microsoft's tool set for producing HTML help topics dynamically by inspecting managed assemblies and incorporating the markup from XML documentation.  From the assemblies that you provide to Sandcastle, it automatically infers a table of contents (TOC), various pseudo-topics such as Properties and Methods, and also generates many individual topics to cover the entire API.  The documentation you've written within XML documentation tags, such as summary, remarks and example, is automatically added to the generated topics in the appropriate places.

The result of running Sandcastle on your assemblies and XML documentation is a set of files that are web-ready HTML help topics for your project.  This is typically referred to as reference documentation, since it provides a reference for developers that use your API.  These topic files can be used as input to a tool such as HTML Help Workshop (Help 1.x) to produce a stand-alone compiled help file (.chm) that may be distributed with your application as an external help module.  The .NET Framework even gets in on the action by providing helpful APIs for integrating context-sensitive help and Help 1.x navigation into your managed applications.  (See the Help class for more information.)

Presentation Styles

Sandcastle provides three presentation styles that it can produce for your documentation out-of-the-box.  Each one consists of a set of XSL transformation files that convert XML documentation into XML-based HTML (not XHTML, however).  They also contain resources such as icons and, of course, CSS style sheets.

For an example of a Sandcastle presentation style, look no further than the documentation for Visual Studio and the .NET Framework on MSDN.  The appearance that MSDN uses is similar to the VS2005 presentation style in Sandcastle.  I believe that Microsoft actually uses a customized version to build their internal documentation, even for Visual Studio 2008.  The other, experimental styles that ship with Sandcastle are Prototype and Hana.

For more information about Sandcastle, see my Sandcastle Help article on CodePlex.

From XML Documentation Comments to Reference Documentation

So the process is actually quite simple.  As developers we can easily document our source code using XML documentation, which allows us to concentrate more on writing the content instead of having to worry about formatting it with HTML.  When we build our project, the compiler will produce an XML documentation file that can be passed to Sandcastle, which then inspects our assemblies and automatically generates reference documentation that includes the comments that we added to our source code, but in a pretty HTML/CSS-based style that looks very similar to MSDN.  Nice!

User Documentation

Sandcastle can automatically generate reference documentation that is useful to other developers, but what about user documentation?  I mean things like How To, Sample, Walk-through, Overview, etc. - stuff that an end-user would want to have.  Well don't expect Sandcastle to know what you're thinking - we still, unfortunately, have to get concepts out of our heads and into help topics manually.  (At least for the time being, until someone invents HAL ;)

Conceptual documentation (how Sandcastle refers to user documentation) is often much harder to write than XML documentation comments since it requires a more in-depth understanding of the application being documented.  It's easy to look at the source code, notice that an exception is being thrown and then add an exception element to the XML documentation comments for that API.  Or to notice that a particular algorithm is being used and to add a comment in the remarks element that mentions it.  But to understand and be able to express the purpose of different user interface (UI) elements, how to perform various UI-related tasks, and how the individual APIs and components fit into the designs of other high-level processes in an enterprise-level application, is certainly more difficult and typically requires an understanding of many different aspects of the application.  So the bigger the application the harder it is to write conceptual documentation, and not just because it's more time consuming but also because it's more complex.

So if writing conceptual documentation can be more time consuming and harder to accomplish than writing XML documentation comments, why do people still insist on writing conceptual documentation in HTML?  Maybe the advantages of XML documentation comments can be applied to conceptual documentation as well.

The Perils of Writing HTML Help

I started this post by pointing out one very common way of writing help: raw HTML.  We've all done it, and I know that each time I do I end up reinventing the wheel all over again.  A new HTML layout, CSS styles, some new and strange way of cross-referencing, JavaScript for collapsible sections, etc., must all be redeveloped.  (Yeah, some companies are too cheap to buy a tool that does this automatically - and so am I. :)

Creating a new help topic starts with copying an existing HTML file that is used as a template, of which there's usually only one kind containing a header, with style sheet links and scripts, a body that's empty, and a footer.  Writing a help topic requires having to look through the other topics quite often to find out which HTML tags and CSS class names I should be using for various styles.  This is especially annoying when I have a good idea that I simply want to put down quickly and be done with it.  Uh oh, that hyperlink to an HTML topic that I've been copying and pasting throughout my documentation is actually misspelled - time to do a search and replace.  Hmm, I'm not sure that I like the format that I've been using for laying out tables - oh well, it's not worth the effort to fix it now.

Is There a Better Way?

Technical writers, I can only assume, take help authoring more seriously than that.  They get paid to worry about things such as structure, readability and maintenance, so it shouldn't surprise us to know that there's a much better way to write help than simply using raw HTML.  As developers, we could probably learn a thing or two from them when writing our own documentation, whether it's for an API or conceptual topics.

Lucky for us, Microsoft has a huge library of documentation and employs technical writers to write their "official" help, which is then published to the web on MSDN.  (Sorry about that horrible reference for technical writers, but I couldn't find anything better.  I know that I've seen someone from Microsoft, probably Anand, state that they don't use developer code comments internally and instead have professional authors write it.)  This means that over the years they've had to come up with a solution that makes authoring help manageable, which is a huge task for such a large documentation set.  They also needed a way to manage file names and links for cross-referencing help topics (think, See Also section).  Since the look and feel of MSDN changes from time to time, the ability to write documentation that is absolutely independent of any one style or format was imperative as well.

So we have an invaluable example to which we can aspire.  A whole plethora of documentation written with clarity and precision using standardized techniques.  If you take a look at the documentation on MSDN, you should see a crisp and clean style that, when compared to your raw HTML help topics, probably looks far more professional.  This is nothing new to us though - we've been referencing it for quite a long time now as .NET developers.  Many people, predating .NET, have even watched MSDN documentation improve dramatically over the years, and most developers that need to write their own documentation seem to want to reproduce the same look and feel.  Many tools have even helped us to generate reasonable facsimiles in the past (such as NDoc).

But have we finally come to the point where we can write our own help topics without having to remember abstract HTML tags and CSS class names?  Can it be transformed automatically into documentation that looks the same as MSDN, or any other style for that matter?  Is there a way to simply specify a unique identifier for another topic and have hyperlinks generated automatically?  What about linking to reference topics?  Is there a way to ensure that topics of a similar nature will all share the same exact structure?

The answer to all of these questions, of course, is yes.  (But wouldn't it be funny if it was no?  I'd probably take a nap.)

Microsoft Assistance Markup Language (MAML)

Microsoft uses Sandcastle internally to generate help topics for the .NET Framework, so it's no wonder that Sandcastle also provides a way to apply structured authoring techniques to conceptual documentation, in much the same way that XML documentation comments are used by developers to write reference documentation.  In Sandcastle, conceptual topics are written in MAML.

MAML is an XML schema that defines various high-level document types, such as How To, Walkthrough, Sample, Glossary, Whitepaper, Troubleshooting and many others.  These document types provide the structure of a help topic, which doesn't change.  What can change though, is how Sandcastle presents this structure when it generates HTML topics.  This means that, for example, the markup in all of your How To topics will look similar, regardless of the presentation style that you choose.  As a matter of fact, the markup in your How To topics will be similar to mine, even if we choose to produce HTML help output in very different styles.
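For instance, a bare-bones How To topic might look something like the following sketch.  The topic ID is a placeholder GUID and the text is made up; the element names are my best understanding of the How To document type, so let IntelliSense confirm the exact structure against the schema.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Placeholder topic ID; a real topic needs its own unique GUID. -->
<topic id="00000000-0000-0000-0000-000000000000" revisionNumber="0">
  <developerHowToDocument xmlns="http://ddue.schemas.microsoft.com/authoring/2003/5"
                          xmlns:xlink="http://www.w3.org/1999/xlink">
    <introduction>
      <para>Explains how to add a CAPTCHA to a web page.</para>
    </introduction>
    <procedure>
      <title>Adding the control</title>
      <steps class="ordered">
        <step>
          <content>
            <para>Open the page in the designer.</para>
          </content>
        </step>
        <step>
          <content>
            <para>Drag the control onto the page.</para>
          </content>
        </step>
      </steps>
    </procedure>
    <relatedTopics />
  </developerHowToDocument>
</topic>
```

Every How To topic follows this same introduction, procedure and related topics outline, no matter who writes it or which presentation style renders it.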

The schema also defines various XML elements that mark up text using a semantic approach.  For example, the ui tag is applied to text that corresponds to a user interface element, such as the text on a button.  Another example is alert, which also requires an attribute named class that indicates the type of alert, such as note, caution, tip, warning, and others.  Another is country, which, as you may have already guessed, describes a country!  You would surround text with an application element when it represents the name of an application.  I think you get the idea...  By my count there are well over 40 elements that you can choose from.  And with Visual Studio's XML editor you can actually have IntelliSense tell you what they all are and where it's appropriate, within the topic's structure, to use them.
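Here's a quick sketch of a few of those elements in use.  The sentences are invented for illustration, and I've wrapped them in a content element only so that the fragment has a single root.

```xml
<content xmlns="http://ddue.schemas.microsoft.com/authoring/2003/5">
  <para>
    In <application>Notepad</application>, click <ui>Save</ui> to store
    the file.
  </para>
  <!-- The class attribute states what kind of alert this is. -->
  <alert class="note">
    <para>The file is saved as plain text.</para>
  </alert>
</content>
```

Nothing in the markup says bold, boxed or indented; it only says what each piece of text is.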

The beauty of all this is that the Sandcastle presentation style that you choose controls the HTML layout of the MAML document type used by your topic.  It also defines how all of the MAML elements will appear in the HTML.  For example, alert is transformed into an HTML table layout, while ui and application are simply bolded.  Special formatting is not actually applied to text that is specified as being the name of a country, but you could update the transformation to change the HTML markup or possibly just add a CSS rule to apply the formatting that you want, without having to update the actual topic itself.
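As a rough sketch of that last idea, a template along these lines could be added to a presentation style's transformations so that country names get their own hook for styling.  This is an assumption on my part rather than something taken from the shipped styles: the ddue prefix, the span element and the country class name are all hypothetical choices.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:ddue="http://ddue.schemas.microsoft.com/authoring/2003/5"
                exclude-result-prefixes="ddue">
  <!-- Hypothetical override: wrap country names in a span that CSS can target. -->
  <xsl:template match="ddue:country">
    <span class="country">
      <xsl:apply-templates />
    </span>
  </xsl:template>
</xsl:stylesheet>
```

A CSS rule for span.country in the style sheet would then control the appearance, and none of the topics would have to change.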

A MAML Example

Here's a small portion of the Glossary help topic that I've written for my Auto-Input Protection (AIP) project.

<?xml version="1.0" encoding="utf-8"?>
<topic id="14790228-f45b-42d5-9b3e-f6b4ab932b9e" revisionNumber="0">
  <developerGlossaryDocument xmlns="http://ddue.schemas.microsoft.com/authoring/2003/5" 
                             xmlns:xlink="http://www.w3.org/1999/xlink">
    <glossary>
      <title>Glossary</title>
      <glossaryEntry>
        <terms>
          <term>AIP</term>
        </terms>
        <definition>
          <para>
            An acronym that stands for Auto-Input Protection.
          </para>
        </definition>
      </glossaryEntry>
      <glossaryEntry>
        <terms>
          <term>Answer</term>
        </terms>
        <definition>
          <para>
            A user's or bot's response to a challenge.  In AIP, the correct answer is a 
            string of text that matches the text on the CAPTCHA image.  An incorrect 
            answer does not match.
          </para>
        </definition>
      </glossaryEntry>
      <glossaryEntry>
        <terms>
          <term>CAPTCHA</term>
        </terms>
        <definition>
          <para>
            An acronym that stands for Completely Automated Public Turing test to tell 
            Computers and Humans Apart, trademarked by Carnegie Mellon University according 
            to the following article: <externalLink>
              <linkText>CAPTCHA. (2008, March 26).</linkText>
              <linkUri>http://en.wikipedia.org/w/index.php?title=CAPTCHA&amp;oldid=201120981</linkUri>
            </externalLink> In Wikipedia, The Free Encyclopedia. Retrieved 09:01, March 27, 2008.
          </para>
        </definition>
      </glossaryEntry>
      <glossaryEntry>
        <terms>
          <term>Challenge</term>
          <term>Test</term>
        </terms>
        <definition>
          <para>
            A CAPTCHA image, being displayed on a web page, to which a user must respond 
            with an answer by entering the text that they see on the image.  The result 
            is pass or fail.
          </para>
        </definition>
      </glossaryEntry>
    </glossary>
  </developerGlossaryDocument>
</topic>

The following image shows the results of the glossary transformation into HTML, built by DocProject (a tool that I've written to automate Sandcastle inside Visual Studio).  The VS2005 presentation style was used for this example.

[Image: the glossary topic rendered with the VS2005 presentation style]

And now here's the same exact topic file after being transformed into HTML using the Hana presentation style.

[Image: the same glossary topic rendered with the Hana presentation style]

There are a few things to point out about all of this.

First of all, notice that the topic that I've written only uses some very basic XML, yet the output obviously contains additional layout and style, which differs depending upon the presentation style that I've chosen.  In the Hana version I've even left in the default header that warns about pre-release documentation.

You may have also noticed the letter bar and the individual letter sub headers.  Where'd they come from?  These features are not actually part of Sandcastle, but Eric Woodruff and I have added them to the presentation styles by modifying the XSL transformations that convert the MAML Glossary document type into HTML.  The additional behavior automatically detects the glossary terms in the topic and creates the letter bar and headers dynamically.  All of the terms are sorted alphabetically as well (although it's not obvious in my example because they're already in alphabetical order in my topic file).

Pretty cool, right?  You'll be able to get these Glossary updates from the new Sandcastle Styles project on CodePlex, which should go public within a few days after the next Sandcastle release.  This project was started by Paul Selormey, Eric Woodruff and me.  In the last week we've been diligently working on preparations for our first release, so please check it out when we go live and let us know what you think :)

Linking in MAML

The last thing that I want to point out about the previous example is that it contains a hyperlink to an external web site.  As you can see from my topic, MAML supports an externalLink element that accepts text in a linkText element and a URI in a linkUri element.  It also accepts alternate text in a linkAlternateText element, but that's optional.
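For example, a link that includes the optional alternate text might look like this sketch.  The alternate text is made up, and I'm assuming that linkAlternateText sits between linkText and linkUri, so check the schema if the editor complains.

```xml
<externalLink xmlns="http://ddue.schemas.microsoft.com/authoring/2003/5">
  <linkText>DocProject</linkText>
  <!-- Optional; a presentation style would typically use it as the tooltip. -->
  <linkAlternateText>The DocProject home page on CodePlex</linkAlternateText>
  <linkUri>http://www.codeplex.com/DocProject</linkUri>
</externalLink>
```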

Instead of linking to external URIs, you can also link to any of the other topics being documented.  To do that you would use a very simplified version of the XLink specification on a link element, as in the following example:

<link xlink:href="37852294-410f-4bb2-9008-c5fa9dfb4347">Part II</link>

Right, topics are identified by GUIDs.  Currently, Sandcastle also requires that all conceptual topic files be named with a GUID and an .xml extension.  A bit annoying at first, but if you use DocProject it provides a Topic Explorer tool window that makes it easy to find the topic that you're looking for without having to open all of them :)

Notice that in my example the value in the href does not have an .xml file extension specified.  That's because link doesn't reference files, it references topics.  This is important to realize because it's not the same as the way linking works in HTML - this is actually dynamic.  If Sandcastle cannot find a topic that is associated with the specified GUID, then it doesn't generate a hyperlink at all.

This is a bit different from what we're used to in HTML, which allows us to link to anything under the sun using only one tag: a.  So why such weirdness in MAML?  I think the answer to that question is actually quite simple, although for some reason it's easy to miss when first starting out with MAML.  The MAML schema defines elements that apply structure and semantics to text, instead of format and style, like HTML.  For this reason, you wouldn't see a tag named simply a in MAML because it's not descriptive at all.  Link, on the other hand, is very descriptive.  And since an HTML anchor is meant to provide the source point of a diametric link, its use is actually more limited than XLink.  The XLink specification actually provides a way to establish relationships between one or more resources (at least that's my interpretation of it), which would offer much more flexibility.  So MAML provides a mechanism to link to other topics, not just external URIs, and the XLink implementation provides an explicit way to describe links as being special - they must be processed by Sandcastle.  Currently, Sandcastle doesn't actually seem to use any of XLink's features, though, aside from what has been deemed as "simple" usage, but maybe that'll change in the future.

But that's not all.  If you want to create a link to an API in your reference documentation, you would use the codeEntityReference element instead.  Yikes!  So now we've got yet another way to link.  But again, keep in mind that MAML is much more expressive than HTML, and that's why we've got different tags for linking to different things.  The benefit being that our intentions are clear when we write our topics so that different styles of linking can be handled differently.

The following XML snippet illustrates all three approaches to linking in MAML topics.  Each example is a child of the relatedTopics element, which, in the Sandcastle world, will eventually become your topic's See Also section.

<relatedTopics>
  <codeEntityReference>T:MyNamespace.MyClass</codeEntityReference>
  <codeEntityReference>P:MyNamespace.MyClass.MyProp</codeEntityReference>
  <codeEntityReference>M:System.IO.File.OpenText(System.String)</codeEntityReference>
  <externalLink>
    <linkText>DocProject</linkText>
    <linkUri>http://www.codeplex.com/DocProject</linkUri>
  </externalLink>
  <link xref="home">My Home Page</link>
  <link xref="Contact Us"/>
  <link vref="/related.aspx">Related web page</link>
  <link xlink:href="14790228-f45b-42d5-9b3e-f6b4ab932b9e">Part II</link>
</relatedTopics>

Notice that there are also two more link types in the example above that I didn't mention previously: link elements with xref and vref attributes.  This type of linking is used instead of externalLink so that only an ID must be specified instead of an entire URL.  The ID is part of an ID-to-URL mapping that is configured elsewhere.  This feature is not actually part of Sandcastle though; it's provided by a custom build component that I've written which, for the next release of DocProject, has been modified to support conceptual builds as well.  The component is called ResolveExternalLinksComponent and it's available as a separate download or as part of DocProject.  Without this build component xref and vref do nothing.

Conclusion

HTML is out.  MAML is in.

Well, it's not actually as substantial a change as I'm implying - HTML is still being used extensively as the final output for compiling help; however, we no longer have to author help topics in HTML, which is a huge benefit.

So all this stuff might seem really wonderful in print, but I feel that I must warn you: it actually took me a few weeks before I finally started to get rid of that itch to lace my topics with bold and italic phrases where they didn't actually add any value.  When you first start writing MAML it can feel very restrictive.  And it is, compared to HTML, in terms of how quickly you can apply new styles, since to do that you have to leave the actual topic and modify files in the Sandcastle presentation style.  But it's actually much more expressive in terms of describing information, and that's what we should be concentrating on when we write help topics - the information.

What I've learned from writing topics in MAML is that using elements such as ui, userInput, math, date, and many others, as well as externalLink, codeEntityReference and link for linking, ultimately accomplishes the same thing as HTML but in a much better way - no more CSS class names to remember or abstract HTML tags like b and i (or strong and em too).  Instead, I can specify exactly what a phrase represents and continue writing.  The format and style are already defined for me by the presentation style that I choose, even if I haven't chosen it yet!  However, if I've already chosen one that mostly fits my needs but I'm not happy with a particular style, I can apply some HTML and CSS to the different MAML elements without having to update anything in the topics themselves.  By reusing the same common tags throughout my documentation, it looks much more professional, it's easier to manage and it's even portable since it's all XML, so if in the future I want to generate Open XML documents instead of HTML, I won't even have to change anything in my topics.

Note that if you want to convert all of your existing HTML topics to MAML in a batch process, I've got a tool called DocToMaml.  It's currently in beta, but it does work.  Any feedback on it will be appreciated :)

For the next version of DocProject 2008 (Beta 3) I'm working on a MAML WYSIWYG editor that is integrated into Visual Studio, so keep your eyes open for that.

If you have any feedback about how MAML and Sandcastle's conceptual build process can be improved please let the Sandcastle team know by submitting a request to the Sandcastle Issue Tracker on CodePlex.
