PreTI is driven by rules. The rules syntax is briefly discussed in "PreTI Plug-in User's Guide"; we included the description in the User's Guide to make it easier for the user to understand how PreTI operates. However, the rules are not supposed to be created manually; instead, we have provided a set of tools to produce them automatically.
The two sources for PreTI rules are markup schemas and sample documents. A markup schema defines syntactic properties of the markup, the way its elements are allowed to be nested and ordered. A schema language is used to describe a schema; examples of schema languages are XML DTD, RELAX NG, W3C XML Schema, and others.
What remains beyond the scope of a markup definition expressed by a schema is the appearance of text units. For example, titles and subtitles usually consist of a single sentence and do not end with a period; list items begin with a dash and often end with a semicolon; among inline elements, one can distinguish class and variable names or book titles in the text flow without the aid of a markup; abbreviated names, WWW links, e-mail addresses can also be recognized easily. Hints aiding the recognition, while related to human perception, can be reconstructed from sample documents. Sources for the patterns are already marked up, and it just remains to discover common characteristics and differences.
Document markup grammars are often being developed in an attempt to cover a wide range of documents and subject areas, and are consequently quite extensive. In fact, many institutions and individual authors are wise enough to restrict their vocabularies to smaller grammar subsets. Subsetting is a good thing:
compact subsets are easy to keep in one's memory — and the author should not be distracted from the essence of the document by the need to consult a markup reference book;
restricting choices helps achieve consistency, both in the form and in semantics;
following common guidelines helps several authors working on a common set of documents understand each other's intents and co-edit common parts.
PreTI, too, works better with reasonably restricted subsets than with complete grammars, and for similar reasons.
Preparing a grammar subset for PreTI is the only completely manual step; it is also a very straightforward one. To define a subset, one creates a rules file with all three sections (inline, block, group) and an element entry for every tag in the appropriate section. The elements' contents (that is, when, first-in and follows) should be omitted at this stage.
The subset serves several purposes:
it defines which elements should be included in the markup rules;
it assigns a class (inline, block or group) to each of the elements;
it serves as a template for the rules file.
For each element in the subset, its class should be chosen according to its use, not necessarily according to the rules of the grammar. For example, term in DocBook is almost always the only member of a paragraph, and making it a block-level element is natural. On the other hand, while footnote is a group-level element according to its content model, it is used inside text lines, and most footnotes are relatively short, seldom exceeding one sentence in length. It is convenient in many applications to declare footnote as an inline element.
A good way to explain a process is to follow an example. As a sample markup language, we have chosen Darwin Information Typing Architecture (DITA); an introduction to the markup and the technology behind it is available at http://www-106.ibm.com/developerworks/xml/library/x-dita1/index.html. DITA's lower-level elements are similar to those of DocBook and XHTML; however, the hierarchical structure and certain other features are different. This makes the language both easy to grasp for those familiar with either of the other two, and interesting to implement in PreTI.
The subset is chosen so that it both shows the technique and avoids cluttering the explanation with needless details; many more elements can be added without the loss of convenience. The subset, in the rules file's syntax, has the following form:
<?xml version="1.0" encoding="utf-8"?>
<rules xmlns="http://davidashen.net/PreTI/rules">
  <inline>
    <element name="cite"/>
    <element name="q"/>
    <element name="term"/>
    <element name="keyword"/>
    <element name="varname"/>
    <element name="filepath"/>
    <element name="xref"/>
    <element name="i"/>
    <element name="b"/>
    <element name="u"/>
    <element name="tt"/>
    <element name="sup"/>
    <element name="sub"/>
  </inline>
  <block>
    <element name="p"/>
    <element name="title"/>
    <element name="sli"/>
    <element name="li"/>
    <element name="dd"/>
    <element name="dt"/>
    <element name="pre"/>
    <element name="lq"/>
    <element name="note"/>
    <element name="stentry"/>
    <element name="linktext"/>
  </block>
  <group>
    <element name="dita"/>
    <element name="topic"/>
    <element name="section"/>
    <element name="body"/>
    <element name="concept"/>
    <element name="conbody"/>
    <element name="dlentry"/>
    <element name="dl"/>
    <element name="ul"/>
    <element name="ol"/>
    <element name="sl"/>
    <element name="strow"/>
    <element name="sthead"/>
    <element name="simpletable"/>
    <element name="related-links"/>
    <element name="link"/>
  </group>
</rules>
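As an illustration only, the following Python sketch reads a subset in the syntax shown above and reports the class assigned to each element. The function name element_classes and the shortened SUBSET sample are assumptions made for this example; they are not part of the PreTI distribution.

```python
import xml.etree.ElementTree as ET

# Namespace of the rules file, as it appears in the subset.
NS = "{http://davidashen.net/PreTI/rules}"

def element_classes(rules_xml):
    """Map each element name in a subset to its class (inline, block or group)."""
    root = ET.fromstring(rules_xml)
    classes = {}
    for section in ("inline", "block", "group"):
        node = root.find(NS + section)
        if node is None:
            raise ValueError("missing <%s> section" % section)
        for element in node.findall(NS + "element"):
            name = element.get("name")
            if name in classes:
                raise ValueError("%s is listed in more than one section" % name)
            classes[name] = section
    return classes

# A shortened, hypothetical subset used only to exercise the function.
SUBSET = """<rules xmlns="http://davidashen.net/PreTI/rules">
  <inline><element name="cite"/><element name="term"/></inline>
  <block><element name="p"/><element name="title"/></block>
  <group><element name="topic"/><element name="body"/></group>
</rules>"""

classes = element_classes(SUBSET)
print(classes["p"], classes["topic"])
```

A check of this kind can catch a tag that was accidentally placed in two sections before the subset is passed to the generation tools.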
A wide enough range of inline elements is chosen to demonstrate both the fixed patterns and the dynamic learning. The block elements are sufficient to represent common constructs, such as plain paragraphs, various lists, titles and simple tables. The group elements correspond to the block-level elements chosen, and also provide two alternative upper-level containers, concept and topic, to illustrate generation of tag selection rules in various contexts. The subset is not where DITA shines most; but it helps explain how, in general, rules for PreTI should be built.
Group rules reflect the way group and block elements are nested and ordered, and define when and which group tags to offer to the user. They are generated from the schema. We've chosen RELAX NG, http://relaxng.org/, as the schema language for PreTI: a tool, Trang, exists to convert to this language from most others (with the notable exception of XML Schema), and parsing and manipulation of the schema can be implemented with an acceptable amount of effort. Actually, a parser from Jing has been used; having an API for access to RELAX NG would be helpful though.
The makegroup utility takes the subset, rules-dita.xml, and the schema, ditabase.rng, and produces a list of group rules.
java makegroup rules-dita.xml ditabase.rng > rules-dita.group
For each block- or group-level element, it generates an entry. The entry has two kinds of clauses, first-in and follows. The first-in clause specifies that an element can be offered as the opening tag just before the current element. The follows clause advises that the containing element can be closed just before the current one.
For example, the group rules for p, the tag for plain paragraphs, have the following form:
<element name="p">
  <first-in name="conbody">
    <when test="inside(concept)"/>
  </first-in>
  <first-in name="body">
    <when test="inside(topic)"/>
  </first-in>
  <follows name="dl">1</follows>
  <follows name="ul">1</follows>
  <follows name="ol">1</follows>
  <follows name="sl">1</follows>
  <follows name="simpletable">1</follows>
  <follows name="dlentry">1</follows>
  <follows name="strow">1</follows>
  <follows name="section">1</follows>
</element>
The rules state that a p begins conbody when met inside concept and body inside topic; it also closes lists, tables, and sections. Upon closer investigation, the last condition is not one that helps enter documents; each time a new paragraph is typed, PreTI will offer to close the current section. It happens because the schema allows a mix of sections and paragraphs inside a body, and paragraphs may follow sections.
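To make the reading of the p rule concrete, here is a minimal Python sketch of how such an entry might be interpreted. It is not PreTI's implementation: the function suggestions is hypothetical, only tests of the form inside(a,b,...) are handled, and it is assumed that inside(x) holds when the innermost open element is x. P_RULE is an abridged copy of the rule quoted above.

```python
import re
import xml.etree.ElementTree as ET

# Abridged copy of the p rule; only two of the follows clauses are kept.
P_RULE = """<element name="p">
  <first-in name="conbody"><when test="inside(concept)"/></first-in>
  <first-in name="body"><when test="inside(topic)"/></first-in>
  <follows name="ul">1</follows>
  <follows name="section">1</follows>
</element>"""

def suggestions(rule_xml, open_elements):
    """Return (containers to open, containers that may be closed) for a new block.

    open_elements lists the currently open elements, outermost first.
    Assumption: inside(a,b) holds when the innermost open element is a or b.
    """
    rule = ET.fromstring(rule_xml)
    current = open_elements[-1] if open_elements else None
    to_open = []
    for first_in in rule.findall("first-in"):
        for when in first_in.findall("when"):
            match = re.fullmatch(r"inside\(([^()]*)\)", when.get("test", ""))
            if match and current in match.group(1).split(","):
                to_open.append(first_in.get("name"))
                break
    to_close = [f.get("name") for f in rule.findall("follows")
                if f.get("name") == current]
    return to_open, to_close

in_concept = suggestions(P_RULE, ["dita", "concept"])
after_section = suggestions(P_RULE, ["topic", "body", "section"])
print(in_concept, after_section)
```

With the section entry present, a paragraph typed inside a section always yields an offer to close that section, which is the inconvenience discussed here.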
While this rule can be manually removed from the final rules file, this particular feature of the markup language is questionable. Few documentation styles allow unstructured text after a section or chapter; specific efforts would be required to visually separate the paragraph from the section which precedes it. Thus, it makes sense, both to make the rules more convenient and the markup more consistent, to modify the schema by making it stricter and disallowing paragraphs after sections. When this modification is made to the schema, the last rule goes away.
A similar thing happens with title. It would be natural to expect that a title inside a body starts a new section. However, because the schema allows a mix of block-level elements inside section without restricting the ordering of the elements, title cannot be used as a section start. Declaring title as the mandatory initial element of section solves the problem and makes entering text easier.
The rule for the title element is:
<element name="title">
  <first-in name="section">
    <when test="inside(body,conbody,dl,ul,ol,sl,simpletable,dlentry,strow)"/>
    <when test="(inside(topic,concept) and follows(title))"/>
    <when test="(inside(section) and not(follows(-)))"/>
  </first-in>
  <first-in name="concept">
    <when test="inside(-,dita,related-links,body,link,conbody,dl,ul,ol,sl,simpletable,dlentry,strow)"/>
    <when test="(inside(topic,concept,section) and not(follows(-)))"/>
  </first-in>
  <first-in name="topic">
    <when test="inside(-,dita,related-links,body,link,conbody,dl,ul,ol,sl,simpletable,dlentry,strow)"/>
    <when test="(inside(topic,concept,section) and not(follows(-)))"/>
  </first-in>
</element>
Additionally, makegroup generates context constraints for block-level elements. They are then merged with recognition patterns based on features of block contents; their generation is discussed in the next chapter.
In DocBook, para and other block-level elements can precede sections inside sections or chapters, but cannot follow them.
Rules for block-level elements help choose appropriate markup candidates for paragraphs of text and order them so that the most likely choices appear at the top of the list. They include rules of two kinds: based on the current context (containing and preceding elements) and on the contents of the paragraph.
Context rules are generated by makegroup. Another utility, makeblock, processes sample documents and outputs contents-based rules for each block element. Additionally, it orders the elements by decreasing frequency of appearance in the samples, so that among equally rated tags the more frequently used ones appear first in the list of choices.
java makeblock rules-dita.xml sample1.xml sample2.xml sample3.xml ...
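The frequency ordering can be illustrated with a short Python sketch. It is a simplified stand-in, not the makeblock source: the function block_frequencies and the SAMPLE document are assumptions made for this example.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def block_frequencies(samples, block_names):
    """Order block element names by decreasing frequency in the samples."""
    counts = Counter()
    for sample in samples:
        for node in ET.fromstring(sample).iter():
            if node.tag in block_names:
                counts[node.tag] += 1
    # most_common() sorts by count, highest first.
    return [name for name, _ in counts.most_common()]

# A tiny hypothetical sample document.
SAMPLE = ("<topic><title>T</title><body><p>a</p><p>b</p>"
          "<ul><li>x</li><li>y</li></ul><p>c</p></body></topic>")

order = block_frequencies([SAMPLE], {"p", "title", "li"})
print(order)
```

In this sample p occurs three times, li twice and title once, so p would be listed first among equally rated candidates. The contents-based statistics that makeblock also derives are described next.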
With a representative set of samples, makeblock generates reliable statistical characteristics for each of the block-level elements, based on the number of words and sentences in a paragraph, as well as on initial and final substrings. For example, for title it produces:
<element name="title">
  <when test="sentence-count > 2">-10</when>
  <when test="sentence-count == 1"/>
  <when test="word-count > 7">-10</when>
  <when test="word-count &lt;= 5"/>
  <when test="ends-with('?','.','!',';',',',':')">-10</when>
</element>
and for p:
<element name="p">1
  <when test="ends-with('.')"/>
</element>
This means that a title can have two sentences at most, and most likely a single sentence only; never more than 7 and preferably not more than 5 words; and it never ends with a punctuation character. p is the most frequent block-level element, thus it gets one point by default; an additional point is added if it ends with a period.
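The effect of these two rules can be sketched in Python. The sketch rests on assumptions that are not stated in the rules themselves: an empty when clause is taken to add one point, an element without a default value starts at zero, and sentences are counted by a naive split after '.', '!' or '?'. The names features, score, TITLE_RULES and P_RULES are hypothetical.

```python
import re

def features(text):
    """Compute the two counters used by the rules (naive sentence splitting)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return {"sentence-count": len(sentences), "word-count": len(text.split())}

# Each rule is (test, points); an empty <when/> is assumed to be worth +1.
TITLE_RULES = [
    (lambda f, t: f["sentence-count"] > 2, -10),
    (lambda f, t: f["sentence-count"] == 1, 1),
    (lambda f, t: f["word-count"] > 7, -10),
    (lambda f, t: f["word-count"] <= 5, 1),
    (lambda f, t: t.endswith(("?", ".", "!", ";", ",", ":")), -10),
]
P_RULES = [(lambda f, t: t.endswith("."), 1)]

def score(text, base, rules):
    """Add the points of every matching rule to the element's default value."""
    text = text.strip()
    f = features(text)
    return base + sum(points for test, points in rules if test(f, text))

heading = "Generating group rules"
sentence = ("The makegroup utility takes the subset and the schema "
            "and produces group rules.")
for text in (heading, sentence):
    print(score(text, 0, TITLE_RULES), score(text, 1, P_RULES))
```

Under these assumptions the short heading rates higher as title than as p, while the full sentence is penalized as a title for its length and final period and rates higher as p.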
Certain rules cannot be inferred from sample data, but help increase efficiency of the recognition. For example, we find it useful to specify:
<when test="starts-with('- ','* ')"/>
for listitem, and then replace the initial dash or asterisk with a corresponding value of the mark attribute during postprocessing.
Yet another utility, makeinline, generates patterns for inlines. It takes the rules file, to determine which tags are for inline markup, and one or more sample documents to derive the patterns from inlines found in them. The syntax for the patterns may look cryptic, but is actually very simple. A pattern consists of two parts (each ending with character z):
a set of traits which must be present in an inline to be marked up with the tag (starts with =);
a set of traits which are allowed to be present (starts with +).
The traits are:
An uppercase letter is the first character (U).
A lowercase letter is the first character (L).
A digit is the first character (D).
An uppercase letter occurs in the inline (u).
A lowercase letter occurs in the inline (l).
A digit occurs in the inline (d).
A non-alphanumeric character occurs in the inline (the character is literally included into the template).
For example, =Uz+luz stands for 'an inline starting with an uppercase letter and consisting of one or more letters'. The syntax is deliberately limited so that simple algorithms can be used to accumulate instances into a pattern, which is the basis for the learning algorithm employed by PreTI. Full regular expressions would be more powerful, but much more difficult to implement efficiently.
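A minimal Python sketch of the pattern syntax follows. It is one possible reading of the description, not the code used by makeinline or PreTI. In particular, it assumes that the occurrence traits (u, l, d) are computed over the characters after the first one, and that accumulation keeps the traits shared by all instances as mandatory and the remaining ones as allowed. The function names traits, parse_pattern, matches and accumulate are hypothetical.

```python
def traits(inline):
    """Return the set of traits found in an inline string."""
    result = set()
    for index, ch in enumerate(inline):
        if ch.isupper():
            result.add("U" if index == 0 else "u")
        elif ch.islower():
            result.add("L" if index == 0 else "l")
        elif ch.isdigit():
            result.add("D" if index == 0 else "d")
        else:
            result.add(ch)  # other characters are literal traits
    return result

def parse_pattern(pattern):
    """Split '=<required>z+<allowed>z' into two sets of traits."""
    required, sep, allowed = pattern.partition("z+")
    if not (pattern.startswith("=") and sep and allowed.endswith("z")):
        raise ValueError("malformed pattern: %r" % pattern)
    return set(required[1:]), set(allowed[:-1])

def matches(pattern, inline):
    """All required traits are present, and nothing outside required or allowed."""
    required, allowed = parse_pattern(pattern)
    found = traits(inline)
    return required <= found and found <= required | allowed

def accumulate(instances):
    """Build a pattern from sample inlines (assumed learning step)."""
    sets = [traits(i) for i in instances]
    required = set.intersection(*sets)
    allowed = set.union(*sets) - required
    return "=" + "".join(sorted(required)) + "z+" + "".join(sorted(allowed)) + "z"

print(matches("=Uz+luz", "Darwin"), matches("=Uz+luz", "x-dita1"))
print(accumulate(["Darwin", "DITA"]))
```

Because a pattern is just two sets of traits, adding a new instance only requires a set intersection and a set union, which is the kind of simple accumulation the restricted syntax is meant to allow.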
However, in contrast to the other two groups of rules, inline rules can be left empty. A PreTI implementation constructs patterns for the tags used during text input, and thus performs ad hoc what makeinline does with sample data. In our experience, the only patterns that should be included in the rules file are those for easily distinguishable objects, such as URLs and e-mail addresses. The use of most other tags depends on the style of a particular author and, unless consistency in markup is required among several writers, should only depend on the dynamic learning algorithm.
A makefile, Make.rules, for the UNIX make utility is included in the distribution. It illustrates the sequence of actions required to generate rules from a subset and a schema. A call
make -f Make.rules IDENT=dita RNG=ditabase.rng SAMPLES="dita-samples/*.xml"
will generate file rules-dita.all from the subset stored in file rules-dita.xml and sample DITA documents matching dita-samples/*.xml, according to RELAX NG schema ditabase.rng.
The generated rules can then be edited and tested using the PreTI plug-in for jEdit. To add a new mode for DITA markup, one can copy rules-dita.all to .jedit/preti/rules-dita.xml and add the following lines to .jedit/properties:
preti-modes=docbook xhtml dita
preti-dita-rules=rules-dita.xml
preti-dita-extension=dit
mode.dita.customSettings=true
mode.dita.maxLineLen=80
mode.dita.wrap=soft
On subsequent runs of jEdit, the default mode for files with extension .dit will be dita. Alternatively, one can set the buffer mode in the Buffer Options dialog, or specify it inside the document in an XML comment according to jEdit's rules.