Text Input Extensions in XSL Transformations
Version 1.0 (Draft)
12 July 2024

Abstract

This specification defines an extension of XSL Transformations (XSLT). It describes a set of XSLT extension elements and functions designed to process source data represented in text format.

Status of this document

This is the first working draft of the specification. It was developed as part of the Unicorn XSL World (UXSL) programme.

Comments on this specification may be sent to [email protected]


1 Introduction

This specification defines an extension of XSL Transformations (XSLT). It describes a set of XSLT extension elements and functions designed to process source data represented in text format. These elements and functions belong to a specific namespace and are integrated into the XSLT transformation environment using the extension mechanism specified in the W3C XSLT recommendation.

Extension elements defined by this specification may be embedded in XSLT stylesheets and intermixed with the standard XSLT instructions. A special version of the XSLT processor is required to process such stylesheets. When an instruction corresponding to a text input extension element is encountered, it is instantiated as described in this specification.

This specification also defines a set of extension functions designed to access the results of text data processing. These functions may be used in XSLT expressions and attribute value templates in all contexts where functions are allowed. Values returned by data access extension functions may be inserted into generated content or used as intermediate parameters during the transformation.

2 Extension Namespace

The extension namespace assigned to all XSLT text input extension elements and functions is:

https://www.unicorn-enterprises.com/XSLT/Extensions/TextInput/1.0

In this specification, the prefix txt: is used for referring to text input extension elements and functions. Authors of XSLT stylesheets are free to use any other prefix, provided that the appropriate namespace declarations are supplied.
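As an illustration of the prefix freedom described above, the following minimal sketch binds the extension namespace to the prefix ti: instead of txt:. The prefix, the format name SAMPLE and the file name sample.txt are assumptions made for this sketch only. The prefix is also listed in extension-element-prefixes, as the XSLT extension mechanism requires for extension elements.

```xml
<?xml version='1.0'?>
<!-- Sketch only: "ti", "SAMPLE" and "sample.txt" are hypothetical names. -->
<xsl:stylesheet
    version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ti="https://www.unicorn-enterprises.com/XSLT/Extensions/TextInput/1.0"
    extension-element-prefixes="ti">
  <xsl:template match="/">
    <!-- Same elements and functions as txt:format, txt:read, txt:field-value -->
    <ti:format name="SAMPLE" separator=";" header="yes"/>
    <rows>
      <ti:read href="sample.txt" format="SAMPLE">
        <row><xsl:value-of select="ti:field-value(1)"/></row>
      </ti:read>
    </rows>
  </xsl:template>
</xsl:stylesheet>
```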

3 Extension Elements

Extension elements defined by this specification are used to implement text data access operations. They may be embedded in XSLT stylesheets and intermixed with the standard XSLT instructions.

The following extension elements implement extension instructions:

  • txt:format
  • txt:read

When any of these instructions is encountered, it is instantiated as described in this specification.

The following extension elements are not treated as instructions:

  • txt:field

These elements are used to supply additional information for text data access extension instructions.

3.1 Defining a Source Format

The txt:format extension element is used to specify the source data format. Source data are represented as sequences of text lines. Each line corresponds to a single data record and contains data fields. Fields are identified by names. Depending on the data format, all fields might have either fixed or variable width.

When all fields have fixed width, the exact width for each field must be specified. In this case field names as well as the number of fields must be known in advance and must be specified.

When all fields have variable width, it is assumed that fields are separated by a single character. This character is called the separator and must be supplied as part of the format specification. Field values may be enclosed in double quotes; if the separator character is encountered within a quoted value, it is interpreted as part of the field data.
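The treatment of quoted values can be illustrated with a small fragment, intended to be placed inside a template. The file name quoted.txt, its content and the field names are assumptions made for this sketch. The specification does not state whether the enclosing quotes themselves are part of the returned value, so the sketch makes no claim about that.

```xml
<!-- Hypothetical source file "quoted.txt", three fields per record:
       A100,"Desktop, tower",499
       A200,Notebook,899
-->
<txt:format name="QUOTED" separator=",">
  <txt:field name="CODE"/>
  <txt:field name="NAME"/>
  <txt:field name="PRICE"/>
</txt:format>
<txt:read href="quoted.txt" format="QUOTED">
  <!-- In the first record the comma after "Desktop" is inside a quoted
       value, so it belongs to NAME and does not start a new field. -->
  <item code="{txt:field-value('CODE')}" name="{txt:field-value('NAME')}"/>
</txt:read>
```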

When all fields have variable width, field names as well as the number of fields may be either known in advance or obtained from the source data stream. In the latter case the first line of the data stream is interpreted as the header line. It must contain a list of field names, separated by the separator character. The format definition must specify whether the source data contain a header line.

The format definition created using the txt:format extension element is available while instantiation of the parent element of txt:format is in progress. If needed, this format may be accessed from the content of other template rules activated in the course of instantiation of the parent element of txt:format.

The txt:format extension element has the following syntax:

<txt:format
  name = string
  separator = char
  header = "yes" | "no">
  <!-- Content: (txt:field*) -->
</txt:format>

The required attribute name is used to specify the format name. Format names need not be unique. A newly defined format shadows any format definitions with the same name that were created before the instantiation of the parent element of txt:format was started. It is an error, however, if two sibling txt:format elements define formats with the same name.

The optional attribute separator is used to specify the separator character. If this attribute is specified, the corresponding source data are interpreted as having variable-width fields. Otherwise, fixed-width fields are assumed.

If the tabulation character (hexadecimal code 9) is used as the separator, the special value #tab should be specified for the separator attribute instead of the tabulation character itself. This is necessary because the XML processor performs attribute value normalization and replaces tabulation characters with space characters.
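A tab-separated source might be declared as in the following fragment. The format name TSV and the file name table.tsv are assumptions made for this sketch.

```xml
<!-- "#tab" stands for the tabulation character (hexadecimal code 9) -->
<txt:format name="TSV" separator="#tab" header="yes"/>
<txt:read href="table.tsv" format="TSV">
  <row first="{txt:field-value(1)}"/>
</txt:read>
```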

The optional attribute header is used to specify whether the source data contain a header line. This attribute is allowed only for variable field width formats, and may be specified only if separator is specified. If header is not specified, it is assumed that source data contain no header line.

Source data fields may be specified using txt:field children that form content of the txt:format element. Field specification is required if source data contain no header line. Field specification is not allowed if source data contain a header line.
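The attribute and content rules above allow three kinds of format definition. The fragment below sketches one of each; all format names, field names and widths are hypothetical.

```xml
<!-- Fixed width: no separator, every txt:field carries a width -->
<txt:format name="FIXED">
  <txt:field name="CODE" width="8"/>
  <txt:field name="NAME" width="30"/>
</txt:format>

<!-- Variable width, no header line: fields are listed, width is not allowed -->
<txt:format name="LISTED" separator=";" header="no">
  <txt:field name="CODE"/>
  <txt:field name="NAME"/>
</txt:format>

<!-- Variable width with a header line: txt:field children are not allowed -->
<txt:format name="HEADED" separator=";" header="yes"/>
```

The three definitions use different names because sibling txt:format elements must not define formats with the same name.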

3.2 Defining Source Fields

The txt:field extension element is used to specify a particular field of source data. If source data contain no header line, a separate txt:field element is required to specify each source data field.

The txt:field extension element has the following syntax:

<txt:field
  name = { string }
  width = number-expression />

The required name attribute is used to specify the field name. It is interpreted as an attribute value template.

The optional width attribute is used to specify the field width. This attribute is required if the source fields have fixed width. It is not allowed if the source fields have variable width. The attribute value is an expression. The result of the expression evaluation is converted to a number. The conversion result is rounded and interpreted as the width of the corresponding field.
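Because name is an attribute value template and width is an expression, both may be computed. The following sketch assumes two hypothetical variables, prefix and unit, and a hypothetical format name COMPUTED.

```xml
<xsl:template match="/">
  <!-- Hypothetical variables used to compute field names and widths -->
  <xsl:variable name="prefix" select="'ITEM'"/>
  <xsl:variable name="unit" select="10"/>
  <txt:format name="COMPUTED">
    <!-- name is an attribute value template: the field is named ITEM_CODE -->
    <txt:field name="{$prefix}_CODE" width="$unit"/>
    <!-- width is an expression: 10 * 3 gives a width of 30 -->
    <txt:field name="{$prefix}_NAME" width="$unit * 3"/>
    <!-- a string result is converted to a number, then rounded: 7.6 becomes 8 -->
    <txt:field name="{$prefix}_PRICE" width="'7.6'"/>
  </txt:format>
</xsl:template>
```

Note that a literal width such as width="50" is simply the shortest form of a number expression.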

3.3 Reading Source Data

The txt:read extension element is used to access source data. When this element is instantiated, the specified source data file is accessed in accordance with the specified format definition. The stylesheet processor reads source data records; for each record the content of txt:read is instantiated. During the instantiation of the content, information about data fields may be obtained using extension functions.

The txt:read extension element has the following syntax:

<txt:read
  source-id = string
  href = { uri-reference }
  encoding = string
  format = string>
  <!-- Content: template -->
</txt:read>

The optional source-id attribute is used to specify the source identifier assigned to a data source. If this attribute is not supplied, an empty string is used as the source identifier value.

The source identifier is valid while the instantiation of the corresponding txt:read element is in progress. Source identifiers need not be unique. When several txt:read elements with the same source identifier are being instantiated at the same time, the identifier references the element whose instantiation was started most recently. Empty strings are allowed as source identifiers.

A source identifier corresponding to a txt:read element whose instantiation is in progress may be referenced by stylesheet elements that need not be descendants of the parent of the txt:read element that defined this identifier.
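The scoping of source identifiers can be sketched with two nested txt:read elements and a named template. The file names, the identifiers categories and models, and the header names CODE, CATEGORY and NAME are assumptions made for this sketch.

```xml
<xsl:template match="/">
  <txt:format name="CSV" separator="," header="yes"/>
  <report>
    <txt:read source-id="categories" href="categories.txt" format="CSV">
      <category code="{txt:field-value('CODE', 'categories')}">
        <!-- The inner source is read once per category record -->
        <txt:read source-id="models" href="models.txt" format="CSV">
          <xsl:call-template name="model"/>
        </txt:read>
      </category>
    </txt:read>
  </report>
</xsl:template>

<!-- This template is not a descendant of either txt:read element, but both
     identifiers are valid while it is called from inside them. -->
<xsl:template name="model">
  <xsl:if test="txt:field-value('CATEGORY', 'models') =
                txt:field-value('CODE', 'categories')">
    <model name="{txt:field-value('NAME', 'models')}"/>
  </xsl:if>
</xsl:template>
```

Distinct identifiers are used here so that the current record of each source remains addressable inside the inner txt:read.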

The required href attribute is used to specify the URI of the data source. This attribute is interpreted as an attribute value template. Relative URI references are resolved using the base URI of the txt:read stylesheet element.

The optional attribute encoding specifies the text data encoding. The attribute value is interpreted case-insensitively. If this attribute is not present, the value iso-8859-1 is assumed.

The required format attribute is used to specify the name of the format definition that should be applied to interpret the data source content.
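The following fragment sketches a txt:read element that uses all of the attributes discussed above except source-id. The parameter data-dir, the file name list.txt and the format name LIST are hypothetical, and it is assumed that the processor supports the UTF-8 encoding name.

```xml
<!-- Hypothetical top-level parameter holding a relative directory name -->
<xsl:param name="data-dir" select="'data'"/>

<xsl:template match="/">
  <txt:format name="LIST" separator=";" header="yes"/>
  <list>
    <!-- href is an attribute value template; the resulting relative URI
         is resolved against the base URI of this txt:read element -->
    <txt:read href="{$data-dir}/list.txt" encoding="UTF-8" format="LIST">
      <entry><xsl:value-of select="txt:field-value(1)"/></entry>
    </txt:read>
  </list>
</xsl:template>
```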

4 Extension Functions

Extension functions may appear in XSLT expressions and attribute value templates in all contexts where functions are allowed. These functions are typically used to access information about the current text record.

4.1 Fetching the Number of Fields

While the instantiation of the txt:read element is in progress, the number of fields in source records may be obtained using the extension function txt:field-count. This function may be useful when field definitions are contained in the header line of the source data file and therefore the number of fields is not known in advance. If the source data file contains no header line, this function returns the number of fields specified in the corresponding format definition. If the source data file contains a header line, this function returns the number of fields specified in the header line. This function cannot be used to obtain the actual number of fields supplied within a particular record, which may differ from the specified number of fields.

The txt:field-count function has the following syntax:

number txt:field-count(string?)

The argument is optional. It specifies the source identifier associated with the appropriate source data stream. If this argument is not supplied, the empty string is assumed as the source identifier. If this argument has a type other than string, it is converted to string prior to the function evaluation.

The txt:field-count function obtains and returns the specified number of fields in the source data file.
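A minimal usage sketch follows. The format name HDR, the identifier src and the file name data.txt are assumptions made for this sketch.

```xml
<txt:format name="HDR" separator="," header="yes"/>
<txt:read source-id="src" href="data.txt" format="HDR">
  <!-- The value is the same for every record: it is the number of names
       in the header line, not the number of values in the current record -->
  <record declared-fields="{txt:field-count('src')}"/>
</txt:read>
```

Because the txt:read element above is given the identifier src, the identifier is passed explicitly; a call without an argument would refer to the source whose identifier is the empty string.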

4.2 Fetching a Field Name

While the instantiation of the txt:read element is in progress, the name of each source field may be obtained using the extension function txt:field-name. This function may be useful when field definitions are contained in the header line of the source data file and therefore field names are not known in advance. If the source data file contains no header line, this function returns the field names specified in the corresponding format definition. If the source data file contains a header line, this function returns the field names specified in the header line.

The txt:field-name function has the following syntax:

string txt:field-name(number, string?)

The first argument is required. It specifies the field number. The first field has number 1. If the argument has a type other than number, it is converted to number prior to the function evaluation.

The second argument is optional. It specifies the source identifier associated with the appropriate source data stream. If this argument is not supplied, the empty string is assumed as the source identifier. If this argument has a type other than string, it is converted to string prior to the function evaluation.

The txt:field-name function obtains and returns the name of the specified field in the source data file.
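As a small sketch, the names of the first and the last field can be obtained by combining txt:field-name with txt:field-count. The format name HDR and the file name data.txt are hypothetical.

```xml
<txt:format name="HDR" separator="," header="yes"/>
<txt:read href="data.txt" format="HDR">
  <!-- Field numbers start at 1; the last field number equals the count -->
  <columns first="{txt:field-name(1)}"
           last="{txt:field-name(txt:field-count())}"/>
</txt:read>
```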

4.3 Fetching a Field Value

While the instantiation of the txt:read element is in progress, the value of each source field in the current record may be obtained using the extension function txt:field-value. The field may be identified either by name or by number. Field identification by number may be useful when field definitions are contained in the header line of the source data file and therefore field names are not known in advance.

The txt:field-value function has the following syntax:

string txt:field-value(object, string?)

The first argument is required. If this argument has the number type, it specifies the field number. The first field has number 1. If the argument has a type other than number, it is interpreted as a field name. In this case, if the argument has a type other than string, it is converted to string prior to the function evaluation.
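Since the type of the first argument decides how the field is identified, a field number held in a string must be converted explicitly. The following sketch uses hypothetical variable names; the file and format names are borrowed from Example 3 for illustration.

```xml
<txt:format name="TEXT03" separator="," header="yes"/>
<txt:read href="text03.txt" format="TEXT03">
  <xsl:variable name="as-number" select="2"/>
  <xsl:variable name="as-string" select="'2'"/>
  <!-- Number argument: the value of the second field -->
  <by-number><xsl:value-of select="txt:field-value($as-number)"/></by-number>
  <!-- String argument converted explicitly: again the second field -->
  <converted><xsl:value-of select="txt:field-value(number($as-string))"/></converted>
  <!-- Passing $as-string directly would instead look up a field named "2" -->
</txt:read>
```

The same caution applies to a variable whose value is given as element content rather than by a select attribute: such a value is a result tree fragment, not a number, and is therefore converted to a string and treated as a field name.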

The second argument is optional. It specifies the source identifier associated with the appropriate source data stream. If this argument is not supplied, the empty string is assumed as the source identifier. If this argument has a type other than string, it is converted to string prior to the function evaluation.

The txt:field-value function obtains and returns the value of the specified field in the source data file. Field values are interpreted as having string type. If the source data file contains fixed-width fields, trailing spaces are removed from the result value.

5 Notation

The specification of each text input extension element contains a summary of its syntax. The notation is similar to that used in the XSL Transformations (XSLT) Version 1.0 recommendation.

The names of required attributes are given in bold. Strings that occur in place of attribute values specify the value type of those attributes. Strings surrounded by curly braces indicate that the corresponding attribute values are interpreted as attribute value templates. Elements that are allowed to have content contain a comment specifying the allowed content.


Appendices

A References

World Wide Web Consortium. XSL Transformations (XSLT) Version 1.0. W3C Recommendation. See http://www.w3.org/TR/1999/REC-xslt-19991116

B Element Syntax Summary

<txt:format
  name = string
  separator = char
  header = "yes" | "no">
  <!-- Content: (txt:field*) -->
</txt:format>

<txt:field
  name = { string }
  width = number-expression />

<txt:read
  source-id = string
  href = { uri-reference }
  encoding = string
  format = string>
  <!-- Content: template -->
</txt:read>

C Reference Implementation

Unicorn XSLT Processor, Standard Edition (version 1.02.10 or higher), a software product developed by Unicorn Enterprises SA, serves as the reference implementation for this specification.

The latest release of this product is available at https://www.unicorn-enterprises.com.

D Examples

This section contains XSLT stylesheets that demonstrate the use of the extension elements defined in this specification. These examples can also be found in the distribution of Unicorn XSLT Processor, Standard Edition (version 1.02.10 or higher), available at https://www.unicorn-enterprises.com.

All examples in this section are designed to process the same set of source data represented using different formats.

Example 1.

This example demonstrates a stylesheet that reads a source data file with fixed-width fields. An XML representation of the source document is generated.

<?xml version='1.0'?>
<xsl:stylesheet 
    version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:txt="https://www.unicorn-enterprises.com/XSLT/Extensions/TextInput/1.0"
    extension-element-prefixes="txt">
  <xsl:output
      method="xml"
      indent="yes"
      encoding="UTF-8"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="/">
    <txt:format name="TEXT01">
      <txt:field name="CODE" width="50"/>
      <txt:field name="CATEGORY" width="50"/>
      <txt:field name="NAME" width="50"/>
      <txt:field name="CPU" width="50"/>
      <txt:field name="RAM" width="50"/>
      <txt:field name="HARD_DISK" width="50"/>
      <txt:field name="MONITOR" width="50"/>
      <txt:field name="OS" width="50"/>
      <txt:field name="PRICE" width="50"/>
    </txt:format>
    <root>
      <txt:read href="text01.txt" format="TEXT01">
        <model>
          <code><xsl:value-of select="txt:field-value('CODE')"/></code>
          <category><xsl:value-of select="txt:field-value('CATEGORY')"/></category>
          <name><xsl:value-of select="txt:field-value('NAME')"/></name>
          <cpu><xsl:value-of select="txt:field-value('CPU')"/></cpu>
          <ram><xsl:value-of select="txt:field-value('RAM')"/></ram>
          <hard-disk><xsl:value-of select="txt:field-value('HARD_DISK')"/></hard-disk>
          <monitor><xsl:value-of select="txt:field-value('MONITOR')"/></monitor>
          <os><xsl:value-of select="txt:field-value('OS')"/></os>
          <price><xsl:value-of select="txt:field-value('PRICE')"/></price>
        </model>
      </txt:read>
    </root>
  </xsl:template>
</xsl:stylesheet>

Example 2.

This example demonstrates a stylesheet that reads a source data file with comma-separated variable-width fields. An XML representation of the source document is generated.

<?xml version='1.0'?>
<xsl:stylesheet 
    version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:txt="https://www.unicorn-enterprises.com/XSLT/Extensions/TextInput/1.0"
    extension-element-prefixes="txt">
  <xsl:output
      method="xml"
      indent="yes"
      encoding="UTF-8"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="/">
    <txt:format name="TEXT02" separator=",">
      <txt:field name="CODE"/>
      <txt:field name="CATEGORY"/>
      <txt:field name="NAME"/>
      <txt:field name="CPU"/>
      <txt:field name="RAM"/>
      <txt:field name="HARD_DISK"/>
      <txt:field name="MONITOR"/>
      <txt:field name="OS"/>
      <txt:field name="PRICE"/>
    </txt:format>
    <root>
      <txt:read href="text02.txt" format="TEXT02">
        <model>
          <code><xsl:value-of select="txt:field-value('CODE')"/></code>
          <category><xsl:value-of select="txt:field-value('CATEGORY')"/></category>
          <name><xsl:value-of select="txt:field-value('NAME')"/></name>
          <cpu><xsl:value-of select="txt:field-value('CPU')"/></cpu>
          <ram><xsl:value-of select="txt:field-value('RAM')"/></ram>
          <hard-disk><xsl:value-of select="txt:field-value('HARD_DISK')"/></hard-disk>
          <monitor><xsl:value-of select="txt:field-value('MONITOR')"/></monitor>
          <os><xsl:value-of select="txt:field-value('OS')"/></os>
          <price><xsl:value-of select="txt:field-value('PRICE')"/></price>
        </model>
      </txt:read>
    </root>
  </xsl:template>
</xsl:stylesheet>

Example 3.

This example demonstrates a stylesheet that reads a source data file with comma-separated variable-width fields. An XML representation of the source document is generated.

It is assumed that the source document contains a header line that specifies the actual field names. It is also assumed that the number of fields and the field names are not known in advance, but that there are no more than 10 fields.

For each field in the source document a separate element is generated in XML output. Field names are used as element tags.

<?xml version='1.0'?>
<xsl:stylesheet 
    version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:txt="https://www.unicorn-enterprises.com/XSLT/Extensions/TextInput/1.0"
    extension-element-prefixes="txt">
  <xsl:output
      method="xml"
      indent="yes"
      encoding="UTF-8"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="/">
    <txt:format name="TEXT03" separator="," header="yes"/>
    <root>
      <txt:read href="text03.txt" format="TEXT03">
        <model>
          <xsl:if test="txt:field-count() >= 1">
            <xsl:element name="{txt:field-name(1)}">
              <xsl:value-of select="txt:field-value(1)"/>
            </xsl:element>
          </xsl:if>
          <xsl:if test="txt:field-count() >= 2">
            <xsl:element name="{txt:field-name(2)}">
              <xsl:value-of select="txt:field-value(2)"/>
            </xsl:element>
          </xsl:if>
          <xsl:if test="txt:field-count() >= 3">
            <xsl:element name="{txt:field-name(3)}">
              <xsl:value-of select="txt:field-value(3)"/>
            </xsl:element>
          </xsl:if>
          <xsl:if test="txt:field-count() >= 4">
            <xsl:element name="{txt:field-name(4)}">
              <xsl:value-of select="txt:field-value(4)"/>
            </xsl:element>
          </xsl:if>
          <xsl:if test="txt:field-count() >= 5">
            <xsl:element name="{txt:field-name(5)}">
              <xsl:value-of select="txt:field-value(5)"/>
            </xsl:element>
          </xsl:if>
          <xsl:if test="txt:field-count() >= 6">
            <xsl:element name="{txt:field-name(6)}">
              <xsl:value-of select="txt:field-value(6)"/>
            </xsl:element>
          </xsl:if>
          <xsl:if test="txt:field-count() >= 7">
            <xsl:element name="{txt:field-name(7)}">
              <xsl:value-of select="txt:field-value(7)"/>
            </xsl:element>
          </xsl:if>
          <xsl:if test="txt:field-count() >= 8">
            <xsl:element name="{txt:field-name(8)}">
              <xsl:value-of select="txt:field-value(8)"/>
            </xsl:element>
          </xsl:if>
          <xsl:if test="txt:field-count() >= 9">
            <xsl:element name="{txt:field-name(9)}">
              <xsl:value-of select="txt:field-value(9)"/>
            </xsl:element>
          </xsl:if>
          <xsl:if test="txt:field-count() >= 10">
            <xsl:element name="{txt:field-name(10)}">
              <xsl:value-of select="txt:field-value(10)"/>
            </xsl:element>
          </xsl:if>
        </model>
      </txt:read>
    </root>
  </xsl:template>
</xsl:stylesheet>
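
The limit of 10 fields in Example 3 comes from the repeated xsl:if blocks. As an alternative sketch, not part of the distributed examples, the same output could be produced for any number of fields with a recursive named template. The template name emit-fields and its parameter index are assumptions made for this sketch. As in Example 3, field names taken from the header line must be valid XML element names.

```xml
<?xml version='1.0'?>
<xsl:stylesheet
    version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:txt="https://www.unicorn-enterprises.com/XSLT/Extensions/TextInput/1.0"
    extension-element-prefixes="txt">
  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
  <xsl:template match="/">
    <txt:format name="TEXT03" separator="," header="yes"/>
    <root>
      <txt:read href="text03.txt" format="TEXT03">
        <model>
          <!-- Start the recursion at the first field of the current record -->
          <xsl:call-template name="emit-fields"/>
        </model>
      </txt:read>
    </root>
  </xsl:template>

  <!-- Hypothetical helper: emits one element per field, then calls itself
       for the next field until the declared field count is exceeded. -->
  <xsl:template name="emit-fields">
    <xsl:param name="index" select="1"/>
    <xsl:if test="$index &lt;= txt:field-count()">
      <xsl:element name="{txt:field-name($index)}">
        <xsl:value-of select="txt:field-value($index)"/>
      </xsl:element>
      <xsl:call-template name="emit-fields">
        <!-- $index + 1 is a number, so it is treated as a field number -->
        <xsl:with-param name="index" select="$index + 1"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

The called template is not a descendant of txt:read, but the extension functions remain usable in it because the instantiation of txt:read is still in progress when it is called.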