PreTI Plug-in User's Guide

Version 1.2

David Tolpin

Table of Contents

1. Introduction

2. Quick Start

3. The Document Model

4. The User Interface

Block Mode
Inline Mode
Tag Name Completion
The Status Line
Configuration and Settings

5. Rules

Inline Rules
Block Rules
Group Rules

6. Supported Formats

DocBook
XHTML

7. Licensing

Chapter 1. Introduction

The PreTI Plug-in for jEdit aids the markup of XML documents. It helps by augmenting text with structural tags, offering appropriate markup alternatives in the right place. The user just types the text and chooses tags which reflect the structure of the document.

The difference with this editor is the way in which it operates. Type a block of text; when you're finished, hit 'Enter'. The editor chooses the most likely tag. You accept it (hit 'Enter') or scroll, by pressing 'Tab', through alternatives.

The design is based on the assumptions that:

a written document conveys information about its structure by means of the human language;
the role of markup is to follow the document's structure, not to define it;
the tags are for computers; no one would read a text marked in DocBook, instead of a plain flow of words, sentences, and paragraphs, just to understand it.

While the plug-in can be successfully used in its current phase of development, its main objective is to illustrate the principles of operation. Other features beyond the core paradigm can and will be added to make it even more convenient; we will do that in subsequent releases. This release supports markup for inlines, paragraphs and groups of paragraphs.

It is simply a matter of time to add support for schema-driven input of tags and attributes where a top-down approach is more appropriate. However, we think that this principal implementation provides a clear illustration of the technique. Handling of inline-level markup is addressed in this version for the first time and is still experimental.

Chapter 2. Quick Start

Open a document, set the buffer's mode to docbook in 'Buffer Options' or use a file name matching *.dbx.
Begin typing text; hit enter at the end of a title or a paragraph.
Accept a tag name with 'Enter', choose another one with 'Tab'. To reject inserting a tag, press 'Esc'. Type in the name if it is not among the choices.

To mark an inline (a word or a sequence of words inside a paragraph), press 'Ctrl-Shift-Space' just after the inline.

While this is enough to start playing with PreTI, we still advise reading the rest of the guide. It is written to make the use of PreTI efficient and rewarding.

Chapter 3. The Document Model

A text document possesses a structure. The structure is determined by expressive features of the document's language. For many computer-related tasks, the structure is underlined using a formal markup language. The document is augmented with markup tags which outline and join its parts.

A document consists of paragraphs. Paragraphs fall into several classes, such as

general-purpose paragraphs,
document, chapter, and section titles,
list items,
notes.

Paragraphs are marked with block-level tags.

Sequences of paragraphs are grouped into lists, sections and chapters. These higher-level structures can also be grouped into sections, chapters or documents. Group-level tags are used to denote groups of paragraphs.

Text in paragraphs is often decorated: certain words and word sequences are underlined, printed in italic, bold, or using a different font. This idea is expressed here using inline-level tags.

Document markup languages usually provide these three groups of markup tags (block, group and inline tags). Auxiliary information can be added to tags through the use of attributes. Other classes of tags serve the purpose of data-oriented input, and, while important, are beyond the subject of this chapter.

Chapter 4. The User Interface

Table of Contents

Block Mode
Inline Mode
Tag Name Completion
The Status Line
Configuration and Settings

Block Mode

The plug-in reveals itself through custom handling of the 'Enter' key and additional key bindings. The basic actions, all of which are bound to keystrokes, are:

offering a tag for the current paragraph;
automatic closure with a matching tag;
jumping between opening and closing tags of an element.

Each time the user signals the end of a paragraph by pressing 'Enter', the program adds a pair of tags denoting the current paragraph and displays a best guess as the element's name. The user can

pass to other feasible (from the program's point of view) options by pressing 'Tab',
accept a choice by pressing 'Enter' or 'Space',
or just type in the name of a tag^[1], if the right choice is not amongst the proposed names.

Several keystrokes have been added for an author's convenience.

'<' followed by '/': Automatically closes the current element by adding the tag's name and the closing angle bracket '>'.
'Ctrl-Shift-Enter': Surrounds the paragraph under the caret with tags (in the same way 'Enter' does for the paragraph just entered), which makes it possible to add tags to a paragraph anywhere in the document.
'Ctrl-Shift-<', 'Ctrl-Shift->': Move the caret to the opening or closing angle bracket of the matching tag.
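The automatic closure triggered by '</' can be pictured as a scan for the innermost element that is still open before the caret. The following is only a minimal sketch in Python, not the plug-in's actual code; the function names are hypothetical, and comments, CDATA sections and processing instructions are ignored for brevity.

```python
import re

# Matches opening, closing and self-closing tags.
TAG = re.compile(r"<(/?)([A-Za-z_][\w:.-]*)[^<>]*?(/?)>")


def innermost_open_element(text_before_caret):
    """Return the name of the innermost element still open, or None."""
    stack = []
    for match in TAG.finditer(text_before_caret):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue  # <br/> opens and closes at once
        if closing:
            # Pop back to the matching opening tag, if there is one.
            if name in stack:
                while stack.pop() != name:
                    pass
        else:
            stack.append(name)
    return stack[-1] if stack else None


def complete_closing_tag(text_before_caret):
    """Given text ending in '</', append the element name and '>'."""
    if not text_before_caret.endswith("</"):
        return text_before_caret
    name = innermost_open_element(text_before_caret[:-2])
    if name is None:
        return text_before_caret
    return text_before_caret + name + ">"
```

With the text `<chapter><title>Intro</title><para>Some text</` before the caret, the sketch appends `para>`, because para is the only element opened after the last closed one.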

Just after an element is chosen for a block of text, the program can begin a new group or close the currently open one. Whenever the user chooses an element's name, he can press 'Esc' to reject inserting a tag at that point, or enter a tag name manually.

Inline Mode

When the user presses 'Ctrl-Shift-Space', the word under or just before the caret is surrounded by tags. Exactly as in the block mode, the user can either choose or type in a tag name. If several words must be outlined, the user can press:

'Ctrl-Left' or 'Ctrl-Right' to move the left margin of the region being tagged by one word;
'Left' or 'Right' to move the left margin by one character.

Additionally, if there is a selection on the current line, the selected region is tagged, instead of the word under the caret.

If a tag is entered manually, PreTI tries to build a pattern from the tag and use it to determine a markup for later inlines; thus, it is able to offer the right markup for inline elements when a few similar elements are already marked up in the same document. A useful operation is bound to 'Ctrl-Shift-!': the program will build patterns for the inline markup from samples found in the current buffer. It is a good idea to begin the editing of an existing document with this keystroke.

Tag Name Completion

When the user types a tag name, a list of possible completions is displayed; 'Tab' can be used to insert the first entry from the completion list. Since the document's structure can be incomplete, names of all elements in the given class (inline, block or group) are included in the completion list; however, typing keys should be an infrequent activity with PreTI.
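Completion over all element names of a class amounts to a prefix filter. The sketch below is an illustration in Python under stated assumptions: the `ELEMENTS` table is hypothetical (the real names come from the rules file), and the order of the completion list is assumed to be the order in which the names are listed.

```python
# Hypothetical element names per class; the real lists come from the rules file.
ELEMENTS = {
    "block": ["para", "title", "subtitle", "author", "listitem", "term", "note"],
    "inline": ["emphasis", "email", "filename", "function", "literal"],
}


def completions(prefix, tag_class, elements=ELEMENTS):
    """All element names of the class that start with the typed prefix."""
    return [name for name in elements[tag_class] if name.startswith(prefix)]


def complete_with_tab(prefix, tag_class, elements=ELEMENTS):
    """'Tab' inserts the first entry of the completion list, if any."""
    matches = completions(prefix, tag_class, elements)
    return matches[0] if matches else prefix
```

Typing `e` for an inline would list emphasis and email here, and 'Tab' would insert emphasis.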

The Status Line

The status line at the bottom of the jEdit window displays the list of available choices. It mostly serves for debugging purposes and illustrates the algorithms, but can be used to indicate the choices when scrolling through the options with the 'Tab' key.

Configuration and Settings

A few properties of PreTI affect the plug-in's behavior and can be configured by the user. In the current version, they can only be set in .jedit/properties. We are going to provide an options pane in a later version of the PreTI plug-in, but believe that their values should be left unchanged, and probably not even revealed to the user: the fewer options the user can change, the better his life is.

preti-first-first: When true, opening group-level tags are created first; otherwise, groups preceding the current block are closed, then new groups are opened. Correctly generated rules should work with either setting of this property.
preti-learned-first: When true, the hypotheses for inline tags obtained from dynamically generated patterns precede those based on the default rules. This is the default mode, and it appears to be rather convenient; however, when the default rules are deemed correct, and the markup should be employed in accordance with them, appending results from custom patterns to the end of the list of hypotheses would ensure a more stable style.
preti-learn-exact: When true, PreTI remembers each inline with the tag used to mark it up and presents the tag as the first alternative in the list of hypotheses if the same inline is met again. This is the default setting.
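The effect of preti-learned-first and preti-learn-exact on the list of hypotheses for an inline can be sketched as a simple merge. This is an illustration in Python, not the plug-in's code: the function, its parameters and the removal of duplicate tags are assumptions made for the example.

```python
def order_hypotheses(inline, default, learned, exact,
                     learned_first=True, learn_exact=True):
    """Merge tag hypotheses for an inline into one ordered list.

    default -- tags proposed by the default rules, best first
    learned -- tags proposed by dynamically generated patterns, best first
    exact   -- dict mapping previously tagged inlines to the tag used
    """
    ordered = learned + default if learned_first else default + learned
    if learn_exact and inline in exact:
        # A remembered inline puts its tag in front of everything else.
        ordered = [exact[inline]] + ordered
    # Drop duplicates while keeping the first occurrence of each tag
    # (an assumption; the source does not say how duplicates are handled).
    result = []
    for tag in ordered:
        if tag not in result:
            result.append(tag)
    return result
```

For example, with default hypotheses emphasis and literal and a learned hypothesis application, the default settings yield application, emphasis, literal; with preti-learned-first set to false, application moves to the end of the list.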

Additionally, key bindings for the plug-in's commands can be changed too; please refer to PreTI.props, included as a resource in PreTI.jar, for the names of the corresponding properties. The only two shortcuts which should be left unchanged are those for preti-mark-last-para (bound to 'Enter') and preti-close-tag ('/'), since they trigger the plug-in's actions during normal text input.

^[1]'BackSpace' deletes the last character; 'Ctrl-BackSpace' deletes all characters.

Chapter 5. Rules

Table of Contents

Inline Rules
Block Rules
Group Rules

The reader may safely skip this chapter on first reading; it explains the syntax of PreTI rules, which you do not need to understand in order to use PreTI for editing. A separate document, "Creating Rules for PreTI", is dedicated to the creation and modification of rules files.

The plug-in is governed by a set of rules. There are three kinds of rules: inline, block, and group rules. The rules file uses XML syntax which is defined in file rules.rng (resource net/davidashen/preti/rules.rng in PreTI.jar).

Inline Rules

Inline rules provide templates for selection of inline tags. Currently, they are based on a simple regular syntax; the syntax allows easy manipulation and automatic generation of the templates. Additionally, the plug-in uses the names of inline tags listed in the rules file to choose tags for the automatic generation of patterns; thus, it makes sense to list all inline markup tags intended for use in documents even if good templates cannot be generated for some of them.

Each element may contain initial points and one or more when clauses. By default, initially an element gets 0 points, and each successful when clause adds 1 point. Optionally, other values can be specified (see rules.rng).

Test conditions for when clauses in inline elements are templates. A template consists of two parts (each ending with character z):

a set of traits which must be present in an inline to be marked up with the tag (starts with =);
a set of traits which are allowed to be present (starts with +).

The traits are:

An uppercase letter is the first character (U).
A lowercase letter is the first character (L).
A digit is the first character (D).
An uppercase letter occurs in the inline (u).
A lowercase letter occurs in the inline (l).
A digit occurs in the inline (d).
A non-alphanumeric character occurs in the inline (the character is literally included into the template).

For example, =Uz+luz stands for 'an inline starting with an uppercase letter and consisting of one or more letters'. The syntax is limited so that it allows the use of simple algorithms for the accumulation of instances in a template, which is the basis for the learning algorithm employed by PreTI. Full regular expressions would be more powerful, but much more difficult to implement efficiently.
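The template syntax can be made concrete with a small matcher. The following Python sketch is an illustration only. It assumes that the first character also counts as 'occurring' in the inline, that an inline matches when it has every required trait and no trait outside the two sets, and that a template is accumulated from samples by taking the traits common to all samples as required and the remaining ones as allowed. None of these details is confirmed by the guide, and the function names are hypothetical.

```python
def traits(inline):
    """Return the set of trait characters present in an inline."""
    found = set()
    first = inline[:1]
    if first.isupper():
        found.add("U")
    elif first.islower():
        found.add("L")
    elif first.isdigit():
        found.add("D")
    for ch in inline:
        if ch.isupper():
            found.add("u")
        elif ch.islower():
            found.add("l")
        elif ch.isdigit():
            found.add("d")
        elif not ch.isalnum():
            found.add(ch)  # non-alphanumeric characters stand for themselves
    return found


def parse_template(template):
    """Split '=<required>z+<allowed>z' into two sets of traits."""
    end = template.index("z")  # 'z' is not a trait, so the first one ends part one
    if (len(template) < end + 3 or template[0] != "="
            or template[end + 1] != "+" or not template.endswith("z")):
        raise ValueError("malformed template: %r" % template)
    return set(template[1:end]), set(template[end + 2:-1])


def matches(template, inline):
    """True if the inline has all required traits and no unexpected ones."""
    required, allowed = parse_template(template)
    found = traits(inline)
    return required <= found and found <= (required | allowed)


def learn(samples):
    """Accumulate samples into one template (assumed accumulation rule)."""
    sets = [traits(sample) for sample in samples]
    required = set.intersection(*sets)
    allowed = set.union(*sets) - required
    return "=" + "".join(sorted(required)) + "z+" + "".join(sorted(allowed)) + "z"
```

Under these assumptions, `matches("=Uz+luz", "Hello")` holds, while `hello` (no initial uppercase letter) and `Hello2` (a digit, which is not allowed) are rejected.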

Block Rules

Block rules describe block-level markup tags and their probability for each paragraph, based on the position of the paragraph in the document hierarchy and on the number of words and sentences in the paragraph. Each trait raises or lowers a tag's probability, which is expressed as the number of points amassed so far; the tags are then sorted in descending order by the total number of points received, and those with a positive sum are included in the list of choices presented to the user.
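The point-based ranking can be sketched in a few lines of Python. This is an illustration, not the plug-in's implementation: the data layout, the sample rules and their point values are hypothetical, and conditions are written as plain Python predicates rather than in the rules syntax.

```python
def rank_tags(rules, paragraph):
    """Return tags with a positive score, highest score first.

    rules maps a tag name to (initial_points, [(condition, points), ...]),
    where each condition is a predicate over the paragraph description.
    """
    scores = {}
    for tag, (initial, clauses) in rules.items():
        scores[tag] = initial + sum(
            points for condition, points in clauses if condition(paragraph))
    ranked = sorted(scores, key=lambda tag: scores[tag], reverse=True)
    return [tag for tag in ranked if scores[tag] > 0]


# Hypothetical rules: a short single sentence without final punctuation
# looks like a title; anything else is most likely a plain paragraph.
RULES = {
    "title": (0, [
        (lambda p: p["sentence_count"] == 1, 1),
        (lambda p: not p["text"].endswith((".", ",", ";")), 1),
        (lambda p: p["word_count"] > 12, -2),
    ]),
    "para": (1, [
        (lambda p: p["sentence_count"] > 1, 1),
    ]),
}
```

For a paragraph such as "Quick Start" (two words, one sentence), title collects 2 points and para 1, so title is offered first and para is reachable with 'Tab'.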

This part of the rules is partly based on statistical data gathered from sample documents, partly derived from a schema. The ability to manually tweak the rules used to be helpful at earlier stages of development, but seems to be of less importance with the current state of the rules generation algorithm.

One or more disable-when elements may be specified at the beginning of block rules, listing conditions under which tags should not be inserted automatically. An example of such condition is a context inside an element which denotes literal layout, such as pre in XHTML or literallayout in DocBook.

In the same way as for inline rules, each element in block rules may contain initial points and one or more when clauses. Test conditions for when clauses in block rules correspond to the following syntax:

  condition ::=  orexp .
  orexp ::= andexp | orexp "or" andexp .
  andexp ::= pred | andexp "and" pred .
  pred ::= func | var rel NUMBER | "(" orexp ")" | "not" pred .
  func ::=
       "inside"      "(" elements ")"
     | "follows"     "(" elements ")"
     | "starts-with" "(" literals ")"
     | "ends-with"   "(" literals ")" .
  literals ::= LITERAL | literals "," LITERAL .
  elements ::= ELEMENT | elements "," ELEMENT .
  rel ::= "<" | ">" | "<=" | ">=" | "=" | "==" | "!=" .
  var ::= "word-count" | "sentence-count" .

  ELEMENT =~ \w[\w:-]*
  NUMBER =~ [0-9]+
  LITERAL =~ ".*"|'.*'

Here, word-count and sentence-count are the numbers of words and sentences in a paragraph, starts-with() and ends-with() test for initial and final substrings, and inside() and follows() check the parent and the preceding sibling, respectively, in the current markup context. For example, a condition for title would be "sentence-count = 1 and not ends-with('.',',',';')".
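The grammar above is small enough to evaluate with a recursive-descent parser. The sketch below, in Python, is one possible reading of it and not the plug-in's parser. It assumes that inside() tests the parent element and follows() the preceding sibling, that literals contain no embedded quotes of their own kind, and that the context is a dictionary with the hypothetical keys `text`, `parent`, `previous`, `word-count` and `sentence-count`.

```python
import operator
import re

TOKEN = re.compile(
    r"""\s*(?:("[^"]*"|'[^']*')|(<=|>=|==|!=|<|>|=)|([(),])|(\w[\w:-]*))""")
RELS = {"<": operator.lt, ">": operator.gt, "<=": operator.le,
        ">=": operator.ge, "=": operator.eq, "==": operator.eq,
        "!=": operator.ne}
VARS = ("word-count", "sentence-count")
FUNCS = {
    "inside": lambda ctx, names: ctx["parent"] in names,
    "follows": lambda ctx, names: ctx["previous"] in names,
    "starts-with": lambda ctx, lits: ctx["text"].startswith(tuple(lits)),
    "ends-with": lambda ctx, lits: ctx["text"].endswith(tuple(lits)),
}


def tokenize(condition):
    tokens, pos = [], 0
    condition = condition.rstrip()
    while pos < len(condition):
        match = TOKEN.match(condition, pos)
        if not match:
            raise ValueError("unexpected input at position %d" % pos)
        tokens.append(match.group(match.lastindex))
        pos = match.end()
    return tokens


def unquote(token):
    return token[1:-1] if token[0] in "\"'" else token


class Evaluator:
    def __init__(self, tokens, ctx):
        self.tokens, self.pos, self.ctx = tokens, 0, ctx

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise ValueError("expected %r, got %r" % (expected, token))
        self.pos += 1
        return token

    def orexp(self):
        value = self.andexp()
        while self.peek() == "or":
            self.take()
            value = self.andexp() or value  # right side is always parsed
        return value

    def andexp(self):
        value = self.pred()
        while self.peek() == "and":
            self.take()
            value = self.pred() and value
        return value

    def pred(self):
        token = self.take()
        if token == "not":
            return not self.pred()
        if token == "(":
            value = self.orexp()
            self.take(")")
            return value
        if token in VARS:
            rel = RELS[self.take()]
            return rel(self.ctx[token], int(self.take()))
        if token in FUNCS:
            self.take("(")
            args = [self.take()]
            while self.peek() == ",":
                self.take()
                args.append(self.take())
            self.take(")")
            return FUNCS[token](self.ctx, [unquote(arg) for arg in args])
        raise ValueError("unexpected token %r" % token)


def evaluate(condition, ctx):
    parser = Evaluator(tokenize(condition), ctx)
    value = parser.orexp()
    if parser.peek() is not None:
        raise ValueError("trailing input: %r" % parser.peek())
    return value
```

With this sketch, the title condition quoted above evaluates to true for a one-sentence paragraph "Quick Start" and to false for "It rains.", which ends with a period.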

Group Rules

Group rules describe grouping of block-level tags. Each element can, under certain conditions, begin a new group or end the current one.

These rules are derived from a markup schema. The basic algorithm is extended to provide appropriate choices where elements can be grouped in more than one way.

A rule for each element can have up to two clauses, first-in and follows. The former clause describes conditions under which the current element can begin a new group; the latter lists conditions for closing the current group. The syntax for when clauses in group rules is the same as for block rules, except that the only functions which are actually used are inside() and follows().

Chapter 6. Supported Formats

Table of Contents

DocBook
XHTML

The current version contains rules for small subsets of DocBook and XHTML. These rules are very basic, but still useful. This documentation was and continues to be written in jEdit+PreTI without manually entering even a single XML tag. The rules are triggered by the docbook mode (files matching *.dbx) and the xhtml mode (files matching *.xht). Please bear in mind that jEdit comes with an XML mode by default, and the 'first line' rule of jEdit overrides globbing (see "Chapter 9. Mode Definition Syntax" in the jEdit User's Guide). If you choose to manually prepend the XML declaration to docbook or xhtml files for use with PreTI, you will either have to change the definition for the XML mode or set the buffer's mode manually in the 'Buffer Options' dialog.

The rules are included into PreTI.jar as resources net/davidashen/preti/rules-dbx.xml and net/davidashen/preti/rules-xhtml.xml. They are copied to .jedit/preti; the user can modify and adjust the rule files in that directory. To reload the rules without restarting jEdit, the user can choose 'Reload Rules' from the plug-in's menu or press 'Ctrl-Shift-@'.

DocBook

A small subset of DocBook, http://docbook.sourceforge.net/, is included with the plug-in. While the number of tags is kept to the minimum, the subset is sufficient to mark up the structure of this document. It includes:

block tags: para, title, subtitle, author, listitem, term, note;
group tags: book, article, bookinfo, articleinfo, abstract, chapter, section, itemizedlist, orderedlist, variablelist, varlistentry;
inline tags: classname, constant, email, emphasis, filename, indexterm, keysym, literal, parameter, property, sgmltag, ulink, varname, acronym, citetitle, function, application, footnote.

The group rules are built automatically from a schema; the inline and block rules are generated from sample data. A few adjustments have been made to the original classification:

term (originally inline) is declared to be a block-level tag. In fact, most variablelists contain exactly one term per varlistentry, and it is rendered as a paragraph, not in a line with something else.
listitem is also made block-level; its content is then wrapped into para in the customization stylesheet;
footnote is made an inline-level element; it is easier to enter short footnotes this way, and the content is then also wrapped into para.

Many documents are marked up using a small part of the full DocBook. This sample subset is a working example; it is actually good enough to write elaborate software documentation. A sample stylesheet is provided to convert documents entered using PreTI to valid DocBook documents (included in PreTI.jar as resource net/davidashen/preti/dbx2xml.xsl).
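The wrapping step mentioned for listitem and footnote can be illustrated outside XSLT as well. The Python sketch below is a stand-in for that single step and makes no claim about what dbx2xml.xsl actually contains; the function name and the choice to wrap the whole content into exactly one para are assumptions.

```python
import xml.etree.ElementTree as ET


def wrap_in_para(root, names=("listitem", "footnote")):
    """Wrap the content of each named element into a single para child."""
    # Collect the elements first, since the tree is modified in the loop.
    for element in list(root.iter()):
        if element.tag in names:
            para = ET.Element("para")
            para.text = element.text
            children = list(element)
            for child in children:
                element.remove(child)
            para.extend(children)  # tails of the children travel with them
            element.text = None
            element.append(para)
    return root
```

Applied to `<listitem>First <emphasis>item</emphasis>.</listitem>`, the sketch produces a listitem whose only child is a para holding the original text and the emphasis element.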

XHTML

Tags for the XHTML subset have been chosen along the lines of the DocBook subset just described. Included elements are listed below; one important modification to XHTML is that instead of having six elements for section headings (h1 to h6), one element h is included. Since the user can enter a tag's name manually, and the input is appended to the current choice, the correct suffix can just be entered.

block tags: dd, dt, h, h1, li, p, pre, title;
group tags: body, dl, head, html, ol, ul;
inline tags: a, abbr, acronym, b, big, cite, code, dfn, em, i, kbd, q, samp, small, strong, tt, var.

Few templates are defined for inline-level elements; their use is more fuzzy than in DocBook and varies between documents. Since PreTI builds patterns for inline elements automatically and can learn from existing data, inline elements can still be conveniently used.

Chapter 7. Licensing

The PreTI jEdit Plug-in is released under the BSD License. Here is the text of the License:

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of Davidashen nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS", AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.