inline2XSF.xsl

inline2XSF.xsl - Transformation of inline annotations to XSF

Version 17.02.2011, 15:39 (GMT)

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this program (file 'lgpl-3.0.txt'). If not, see http://www.gnu.org/licenses/.

The transformation into XStandoff requires an input XML file, ideally one whose elements are bound to XML namespaces. Each namespace produces one layer in the XStandoff output, containing the elements from that namespace. The default (or empty) namespace is treated like the named ones, so inline annotations without explicit namespace declarations can also be processed (in this case a namespace is generated).

The conversion of an inline annotation to XStandoff proceeds in two steps. In the first step, segments are built on the basis of the elements that occur in the input. There are two ways of mapping element boundaries to character positions in the textual content. The preferred way is to use a primary data file which contains the bare text without annotations. The name of this file can be supplied in the transformation call via the stylesheet parameter primary-data. If the location of a primary data file is provided, its content is compared with the textual content of the input file, which guarantees primary data identity.

If no primary data location is provided, the textual content of the input file is used to build the primary data. This has certain disadvantages, such as the lack of line breaks, since these cannot easily be inferred from the textual content of an XML file. Furthermore, the automatic conversion of the textual content into primary data relies on heuristics which have to guess whitespace. Because of the complexity of this task, undesirable results are possible.
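The mapping of element boundaries to character positions can be illustrated with a minimal Python sketch. It is not the stylesheet's implementation: the function name element_spans, the sample document, and the convention of zero-based offsets with an exclusive end are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

def element_spans(root):
    """Return (tag, start, end) for every element, as offsets into the text content.

    Assumption: offsets are zero-based and the end offset is exclusive.
    """
    spans = []
    pos = 0  # running character position in the concatenated text

    def walk(elem):
        nonlocal pos
        start = pos
        pos += len(elem.text or "")
        for child in elem:
            walk(child)
            pos += len(child.tail or "")
        # An element is recorded once all of its content has been counted.
        spans.append((elem.tag, start, pos))

    walk(root)
    return spans

# Hypothetical input; the text content doubles as primary data here.
doc = ET.fromstring("<s><w>The</w> <w>sun</w> <w>shines</w></s>")
primary_data = "".join(doc.itertext())
spans = element_spans(doc)
```

In this sketch the primary data is derived from the input itself, which corresponds to the fallback case in which no primary data file is supplied.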

After the segments have been built, the second step of the transformation returns layers on the basis of namespaces. For every namespace, the corresponding elements are extracted from the initial inline annotation and copied into the layer, preserving their embedding relations. At the same time, the elements in the XStandoff layer are connected to the corresponding segments by ID/IDREF binding. In this manner one segment can serve as a reference for elements from different layers.
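How one segment can be shared by elements from different layers can be sketched in Python as follows. This is an illustration under stated assumptions, not the stylesheet's code: the names split_tag and build_layers, the segment IDs of the form seg1, seg2, and the example namespaces are all hypothetical, and the embedding relations between elements are not reproduced.

```python
import xml.etree.ElementTree as ET

def split_tag(tag):
    """Split an ElementTree tag '{uri}local' into (uri, local); uri is '' if absent."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag

def build_layers(root):
    segments = {}  # (start, end) -> segment ID, shared across all layers
    layers = {}    # namespace URI -> list of (local name, segment ID)
    pos = 0        # running character position in the text content

    def walk(elem):
        nonlocal pos
        start = pos
        pos += len(elem.text or "")
        for child in elem:
            walk(child)
            pos += len(child.tail or "")
        # Reuse an existing segment if another element covers the same span.
        seg_id = segments.setdefault((start, pos), f"seg{len(segments) + 1}")
        uri, local = split_tag(elem.tag)
        layers.setdefault(uri, []).append((local, seg_id))

    walk(root)
    return segments, layers

# Hypothetical input with two namespaces, i.e. two layers.
xml = ('<a:s xmlns:a="http://example.org/a" xmlns:b="http://example.org/b">'
       '<b:np><a:w>sun</a:w></b:np> <a:w>shines</a:w></a:s>')
segments, layers = build_layers(ET.fromstring(xml))
```

In this example the elements b:np and a:w cover the same characters, so both refer to the segment seg1 although they belong to different layers.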

Input XML file: file containing inline annotation containing namespaces (used to generate layers)

Additional file (required): a second XSF instance which should be merged with the first one

Stylesheet parameters: see individual documentation

Execution via command line (XSLT processor Saxon9): java -jar saxon9.jar [optional Saxon Parameters] -o [XML output filename] [XML input filename] inline2XSF.xsl [optional Stylesheet Parameters, see below]

Author:
Daniel Jettka; daniel.jettka@uni-bielefeld.de; Project Sekimo (A2); DFG Research Group 437
Copyright:
GNU Lesser General Public License, see below for details

Parameters Summary

xs:string ANALYZE-BUGS - source

En-/Disable extended messaging

xs:string all-layer - source

Whether or not elements which occur in all layers shall be returned in a separate layer

Level of returned messages for bug analysis

Determines whether or not primary data should be copied to element xsf:primaryData

True ('1') has effect that all nodes from the XML input are included in a single xsf-layer

If set to true() then all pure whitespace text nodes are removed from input XML and positions are inferred completely heuristically.

Value true includes optional elements into the result of the transformation

True includes segments for whitespaces into the result of the transformation

Determines which empty elements should get @xsf:segment; '#all' means that all empty elements get a segment reference; e.g. 'chs,cnx' means that only elements with the prefixes chs and cnx get a segment reference

xs:boolean local-xsd - source

If true, local XSDs are used; if false, the XSDs from http://www.xstandoff.net/2009/xstandoff/1.1... are used

XPath expression meta-root - source

Determines the location of the metadata which is to be copied to element xsf:meta (in the Saxon call, give an XPath for the respective element: meta-root="//*:meta")

xs:string pd-check - source

Strength of primary data check: lax, middle (default) or strict.

  • lax: Minimum of primary data identity, i.e. non-whitespace characters must be identical in the XML annotation's text content and the primary data file. The only whitespaces from the primary data which must be present in the XML are those in the XML annotation's terminal text nodes.
  • middle: Like 'lax', but all characters, including whitespaces, from the primary data file must be present in the XML annotation's text nodes.
  • strict: Like 'middle', but indents and additional whitespaces are only allowed within mixed content and at the boundaries of terminal text nodes.
xs:string primary-data - source

Name of a txt-file containing the primary data. If omitted then primary data is generated from the text data of XML input

XPath expression virtual-root - source

XPath expression identifying the element which serves as root of the actual annotation to be included in layers (default: document node). Sample Saxon call: virtual-root=//*:text

xs:string xsfVersion - source

Version of returned XSF

Variables Summary

xs:string baseFileName - source

Name of the input xml file, e.g. used for @xml:id in element xsf:corpusData

Namespaces which can be found in input XML file

Supportive variable for processing empty elements in replace-non-mC-text

There are conditions under which an all-layer should not be returned

xs:integer iter - source

Saxon assignable variable for iteration

element()+ layerSEG - source

Variable containing information about different layers (especially element names)

element()+ layers - source

Determination of different layers with respect to namespaces in XML input file ($distinct-namespaces)

xs:string log-directory - source

Directory for result-documents for bug analysis

Variable containing the element which is provided by stylesheet parameter meta-root

Saxon assignable variable for storing textual content of segments and later replacement by position information

Saxon assignable variable for storing textual content of text nodes in mixed content sections which precede the non-mC-text-node in question, but do not precede the first preceding non-mC text node

Making $primary-data usable as URI for document()

Saving primary data (either from file provided by stylesheet parameter or from textual data of XML input)

xs:string? primaryData_noWS - source

Primary data without whitespaces

No short description available
xs:string root - source
Referencing the original root node of the input document

This assignable variable contains the segments for the input XML, built on the basis of the replaced text XML ($segmentation-base)

This assignable variable allows replacing of textual content (which is in mixed content section) from $replaced-non-mC-text with position information

Finding segments which reference elements occurring in all namespaces

This assignable variable contains the textual content of the element determined by stylesheet parameter virtual-root (without whitespaces)

This assignable variable contains the textual content of the element determined by stylesheet parameter virtual-root (with whitespaces)

xs:string textData_noWS - source

This variable contains the textual content of the element determined by stylesheet parameter virtual-root (without whitespaces)

Variable containing the element which is provided by stylesheet parameter virtual-root

xs:string xsd-location - source

Defines the directory of the schema location with respect to the stylesheet parameter $local-xsd

Keys Summary

elem-by-id (match: element(), use: @*:id) - source

Find elements by their ID attribute; no matter which namespace @id is bound by

elem-by-namespace-uri (match: element(), use: namespace-uri(.)) - source

Find elements by their specific namespace URIs they are connected to

element-by-positions (match: element(), use: concat(descendant::xsf:c[1]/@p, '-', descendant::xsf:c[last()]/@p)) - source

Find elements by position information provided by descending xsf:c/@p

segment-by-positions (match: xsf:segment, use: concat(@start, '-', @end)) - source

Find XSF segments by their start and end positions

Match Templates Summary

Template matching the root of the document to be transformed.

element() | text() | processing-instruction() | comment() (param: namespace-uri; mode: copy-segmentation-base) - source

The template performs a copy of $segmentation-base (input XML with textual content replaced by position information) without positioning elements.

Template for removing pure whitespace text nodes from input XML.

The template performs a copy of $segmentation-base (input XML with textual content replaced by position information) and returns position info for empty elements.

Template for replacing mixed-content characters of the input annotation by corresponding position information.

Template for replacing non-mixed-content characters of the input annotation by corresponding position information.

No short description available

Named Templates Summary

xs:string build-primary-data-from-XML (param: node()+ XMLdata) - source

This template generates a text on the basis of the textual content of the provided $XMLdata. The text serves as primary data and therefore as a reference of the start and end position of elements.

element()* copy-nodes-by-namespace-uri (param: xs:anyURI* uri) - source

Recursive copy of the elements descending from the context node; elements are copied iff they are in the provided namespace ($uri)

Normalization of XML data, see also: SVN/software/tools/trunk/normalize2.xsl

changed version of the (invalid) original from http://dpawson.co.uk/xsl/sect2/N8321.html#d12364e18

Functions Summary

attribute()* attr:return-atts-for-elems-from-namespace (param: xs:anyURI* namespaces) - source

Function for getting names of elements from a specific namespace

xs:boolean bool:include-empty-element (param: xs:string* ns-prefix) - source

Testing whether to include an empty element using stylesheet parameter $levels-empty-elements

xs:boolean bool:self-in-mC (param: self) - source

Tests whether a node (text() or element()) is in a mixed content section

xs:boolean bool:twoStringsFormClitic (param: xs:string string1, xs:string string2) - source

Testing whether or not two strings ($string1 and $string2) form a clitic together.

Deriving position information from preceding item of context item in $pd-replaced-positions

xs:integer* int:substring-positions (param: xs:string string, xs:string substring) - source

Returns the position (start and end) of the first occurrence of a substring in string

xs:string string:ID-comform (param: xs:string input-string) - source

A valid ID is returned on the basis of the value of the input-string; necessary changes being made

string:compare-strings (param: xs:string text-node, xs:string pd-data, xs:string matching-string, xs:integer removed-leading-pd-ws) - source

Compares two strings while ignoring whitespaces; the result is a sequence consisting of (a) the matching substring, and (b) the rest of the $pd-data which does not match

xs:string string:escape4regex (param: xs:string input) - source

Function escapes certain chars of an input string by putting '\' in front

xs:string+ string:first-distinct-char (param: xs:string string1, xs:string string2, xs:integer position) - source

Deriving first distinct char of two unequal strings

xs:string string:infiltrate-string (param: xs:string input-string, xs:string infiltrator) - source

Insert $infiltrator (e.g. '\s*') into string (after every single char) and escape special regex chars, e.g. input: 'hello.' - output: '\s*h\s*e\s*l\s*l\s*o\s*\.\s*'

xs:string string:int2string (param: xs:integer+ intSEQ) - source
No short description available
xs:string? string:replace-all (param: xs:string inputString, xs:string+ regexSEQ, xs:string+ replaceSEQ) - source

This function replaces all instances from the regex sequence (regexSEQ) by the corresponding substitution values (same position in replaceSEQ)

undef:ANALYZE-BUGS (param: xs:string terminate, xs:integer level, xs:string message) - source

The function receives message strings, which are output depending on the values of $level and $ANALYZE-BUGS

Checking primary data identity on the basis of primary data and textual content of XML both without whitespaces

xs:integer undef:iterate-non-ws (param: xs:string? string, xs:string? non-ws, xs:integer count) - source
No short description available

Returning messages containing stylesheet parameters and their supplied values

undef:segments2inline (param: element()* segments) - source

Copy of elements into an inline annotation by reference to segments

Parameters Detail

xs:string ANALYZE-BUGS - source

En-/Disable extended messaging

xs:string all-layer - source

Whether or not elements which occur in all layers shall be returned in a separate layer

Level of returned messages for bug analysis

Determines whether or not primary data should be copied to element xsf:primaryData

True ('1') has effect that all nodes from the XML input are included in a single xsf-layer

If set to true() then all pure whitespace text nodes are removed from input XML and positions are inferred completely heuristically.

Value true includes optional elements into the result of the transformation

True includes segments for whitespaces into the result of the transformation

Determines which empty elements should get @xsf:segment; '#all' means that all empty elements get a segment reference; e.g. 'chs,cnx' means that only elements with the prefixes chs and cnx get a segment reference

xs:boolean local-xsd - source

If true, local XSDs are used; if false, the XSDs from http://www.xstandoff.net/2009/xstandoff/1.1... are used

XPath expression meta-root - source

Determines the location of the metadata which is to be copied to element xsf:meta (in the Saxon call, give an XPath for the respective element: meta-root="//*:meta")

xs:string pd-check - source

Strength of primary data check: lax, middle (default) or strict.

  • lax: Minimum of primary data identity, i.e. non-whitespace characters must be identical in the XML annotation's text content and the primary data file. The only whitespaces from the primary data which must be present in the XML are those in the XML annotation's terminal text nodes.
  • middle: Like 'lax', but all characters, including whitespaces, from the primary data file must be present in the XML annotation's text nodes.
  • strict: Like 'middle', but indents and additional whitespaces are only allowed within mixed content and at the boundaries of terminal text nodes.
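The core condition of the 'lax' level, identity of all non-whitespace characters, can be sketched in Python as follows. The function names are hypothetical, and the sketch deliberately covers only this one condition; the additional whitespace rules of 'lax', 'middle' and 'strict' are not modelled.

```python
import re

def strip_ws(text):
    """Remove every whitespace character from the text."""
    return re.sub(r"\s+", "", text)

def lax_identical(primary_data, xml_text):
    """True if both texts agree once all whitespace has been removed."""
    return strip_ws(primary_data) == strip_ws(xml_text)

# Differences in line breaks and spacing are tolerated ...
same = lax_identical("The sun\nshines.", "The  sun shines.")
# ... but a differing non-whitespace character is not.
different = lax_identical("The sun", "The son")
```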
xs:string primary-data - source

Name of a txt-file containing the primary data. If omitted then primary data is generated from the text data of XML input

XPath expression virtual-root - source

XPath expression identifying the element which serves as root of the actual annotation to be included in layers (default: document node). Sample Saxon call: virtual-root=//*:text

xs:string xsfVersion - source

Version of returned XSF

Variables Detail

xs:string baseFileName - source

Name of the input xml file, e.g. used for @xml:id in element xsf:corpusData

Namespaces which can be found in input XML file

Supportive variable for processing empty elements in replace-non-mC-text

There are conditions under which an all-layer should not be returned

xs:integer iter - source

Saxon assignable variable for iteration

element()+ layerSEG - source

Variable containing information about different layers (especially element names)

element()+ layers - source

Determination of different layers with respect to namespaces in XML input file ($distinct-namespaces)

xs:string log-directory - source

Directory for result-documents for bug analysis

Variable containing the element which is provided by stylesheet parameter meta-root

Saxon assignable variable for storing textual content of segments and later replacement by position information

Saxon assignable variable for storing textual content of text nodes in mixed content sections which precede the non-mC-text-node in question, but do not precede the first preceding non-mC text node

Making $primary-data usable as URI for document()

Saving primary data (either from file provided by stylesheet parameter or from textual data of XML input)

xs:string? primaryData_noWS - source

Primary data without whitespaces

No short description available
xs:string root - source
Referencing the original root node of the input document

This assignable variable contains the segments for the input XML, built on the basis of the replaced text XML ($segmentation-base)

This assignable variable allows replacing of textual content (which is in mixed content section) from $replaced-non-mC-text with position information

Finding segments which reference elements occurring in all namespaces

This assignable variable contains the textual content of the element determined by stylesheet parameter virtual-root (without whitespaces)

This assignable variable contains the textual content of the element determined by stylesheet parameter virtual-root (with whitespaces)

xs:string textData_noWS - source

This variable contains the textual content of the element determined by stylesheet parameter virtual-root (without whitespaces)

Variable containing the element which is provided by stylesheet parameter virtual-root

xs:string xsd-location - source

Defines the directory of the schema location with respect to the stylesheet parameter $local-xsd

Keys Detail

elem-by-id (match: element(), use: @*:id) - source

Find elements by their ID attribute; no matter which namespace @id is bound by

elem-by-namespace-uri (match: element(), use: namespace-uri(.)) - source

Find elements by their specific namespace URIs they are connected to

element-by-positions (match: element(), use: concat(descendant::xsf:c[1]/@p, '-', descendant::xsf:c[last()]/@p)) - source

Find elements by position information provided by descending xsf:c/@p

segment-by-positions (match: xsf:segment, use: concat(@start, '-', @end)) - source

Find XSF segments by their start and end positions

Match Templates Detail

Template matching the root of the document to be transformed.

The initial template is naturally the starting point of the transformation. In this special case, several tasks are involved:

  • Returning messages on supplied stylesheet parameters → undef:param-messages()
  • Checking for primary data identity → undef:check-primary-data()
  • Replacing non-mixed-content characters by position information (template mode: replace-non-mC-text)
  • Replacing mixed-content characters by position information (template mode: replace-mC-text)
  • Deriving a segmentation on the basis of position information
  • Returning the XSF annotation by processing individual layers
Detailed information on the individual processing steps can be found in the documentation of the templates and functions called.

element() | text() | processing-instruction() | comment() (param: namespace-uri; mode: copy-segmentation-base) - source

The template performs a copy of $segmentation-base (input XML with textual content replaced by position information) without positioning elements.

Parameters:
namespace-uri - determines the namespace whose elements shall be copied.

Template for removing pure whitespace text nodes from input XML.

This template removes all pure whitespace text nodes from the input and thus prepares it for a heuristic determination of whitespace positions.

The template performs a copy of $segmentation-base (input XML with textual content replaced by position information) and returns position info for empty elements.

Template for replacing mixed-content characters of the input annotation by corresponding position information.

A pointer leads through the textual content (in this case only mixed-content) of the input annotation and outputs position information where applicable.

The information is returned by replacing the textual content by elements like the following: <xsf:c p="{current character position}">

Template for replacing non-mixed-content characters of the input annotation by corresponding position information.

A pointer (incrementing $iter) leads through the textual content (in this case only non-mixed-content) of the input annotation and outputs position information where applicable.

The information is returned by replacing the textual content by elements like the following: <xsf:c p="{current character position}">

No short description available

Named Templates Detail

xs:string build-primary-data-from-XML (param: node()+ XMLdata) - source

This template generates a text on the basis of the textual content of the provided $XMLdata. The text serves as primary data and therefore as a reference of the start and end position of elements.

Parameters:
node()+ XMLdata -
element()* copy-nodes-by-namespace-uri (param: xs:anyURI* uri) - source

Recursive copy of the elements descending from the context node; elements are copied iff they are in the provided namespace ($uri)

Parameters:
xs:anyURI* uri - Defines the namespace URI whose elements are copied during the recursive template call

Normalization of XML data, see also: SVN/software/tools/trunk/normalize2.xsl

changed version of the (invalid) original from http://dpawson.co.uk/xsl/sect2/N8321.html#d12364e18

Functions Detail

attribute()* attr:return-atts-for-elems-from-namespace (param: xs:anyURI* namespaces) - source

Function for getting names of elements from a specific namespace

Parameters:
xs:anyURI* namespaces - Namespace URIs whose elements shall be listed in the declared attribute output format
xs:boolean bool:include-empty-element (param: xs:string* ns-prefix) - source

Testing whether to include an empty element using stylesheet parameter $levels-empty-elements

Parameters:
xs:string* ns-prefix - Namespace prefix(es) of empty elements which are to be included in XSF
xs:boolean bool:self-in-mC (param: self) - source

Tests whether a node (text() or element()) is in a mixed content section

Parameters:
self - Element or text node for which a test is run whether $self is located within a mixed content section
xs:boolean bool:twoStringsFormClitic (param: xs:string string1, xs:string string2) - source

Testing whether or not two strings ($string1 and $string2) form a clitic together.

Parameters:
xs:string string1 -
xs:string string2 -

Deriving position information from preceding item of context item in $pd-replaced-positions

xs:integer* int:substring-positions (param: xs:string string, xs:string substring) - source

Returns the position (start and end) of the first occurrence of a substring in string

Parameters:
xs:string string - Input string in which a certain substring could be found
xs:string substring - Substring that is expected and searched within the input string
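The behaviour described for int:substring-positions can be approximated by the following Python sketch. It assumes 1-based, inclusive positions (as is usual in XPath) and an empty result when the substring does not occur; both are assumptions, since the documentation does not state them.

```python
def substring_positions(string, substring):
    """Return (start, end) of the first occurrence, or () if there is none.

    Assumption: positions are 1-based and the end position is inclusive.
    """
    idx = string.find(substring)
    if idx < 0 or not substring:
        return ()
    start = idx + 1
    return (start, start + len(substring) - 1)

positions = substring_positions("The sun shines", "sun")
missing = substring_positions("The sun shines", "moon")
```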
xs:string string:ID-comform (param: xs:string input-string) - source

A valid ID is returned on the basis of the value of the input-string; necessary changes being made

Parameters:
xs:string input-string - As the name says, this is the string which is made to conform to xs:ID
string:compare-strings (param: xs:string text-node, xs:string pd-data, xs:string matching-string, xs:integer removed-leading-pd-ws) - source

Compares two strings while ignoring whitespaces; the result is a sequence consisting of (a) the matching substring, and (b) the rest of the $pd-data which does not match

Parameters:
xs:string text-node - First string to be compared to second one
xs:string pd-data - Second string (to be compared to first)
xs:string matching-string - Holding the respective temporary result which is enhanced by recursive call of the function
xs:integer removed-leading-pd-ws -
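A whitespace-insensitive comparison of this kind might look like the following Python sketch. It is an iterative approximation, not the recursive function documented above: the name compare_ignoring_ws is hypothetical, the parameters matching-string and removed-leading-pd-ws are not modelled, and the exact treatment of whitespace at the boundary is an assumption.

```python
def compare_ignoring_ws(text_node, pd_data):
    """Return (matching prefix of pd_data, unmatched rest), ignoring whitespace."""
    wanted = [c for c in text_node if not c.isspace()]
    i = 0        # index into pd_data
    matched = 0  # number of non-whitespace characters matched so far
    while i < len(pd_data) and matched < len(wanted):
        c = pd_data[i]
        if c.isspace():
            i += 1            # whitespace in the primary data is skipped
        elif c == wanted[matched]:
            matched += 1      # next non-whitespace character agrees
            i += 1
        else:
            break             # first mismatch ends the comparison
    return pd_data[:i], pd_data[i:]

match, rest = compare_ignoring_ws("The sun", "The\n sun shines")
```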
xs:string string:escape4regex (param: xs:string input) - source

Function escapes certain chars of an input string by putting '\' in front

Parameters:
xs:string input - String whose special 'regex' chars are to be escaped
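As an illustration, escaping of this kind can be sketched in Python as follows. The set of characters treated as special is an assumption; the documentation does not list which characters the function escapes.

```python
import re

# Assumed set of special regex characters; the stylesheet's own set may differ.
SPECIAL = set(r"\.^$|?*+()[]{}")

def escape4regex(text):
    """Put a backslash in front of every character from SPECIAL."""
    return "".join("\\" + c if c in SPECIAL else c for c in text)

escaped = escape4regex("1+1=2?")
# The escaped string matches the original text literally.
literal_match = re.fullmatch(escaped, "1+1=2?") is not None
```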
xs:string+ string:first-distinct-char (param: xs:string string1, xs:string string2, xs:integer position) - source

Deriving first distinct char of two unequal strings

Parameters:
xs:string string1 - First string which shall be compared to second
xs:string string2 - Second string; to be compared to $string1
xs:integer position - Position counter for recursive calling of this function on substrings of the two strings
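Finding the first point of difference between two strings can be sketched in Python as follows. The sketch is iterative rather than recursive, and its return value (a 1-based position together with the differing character from each string) is an assumption; the documentation only states that the function returns one or more strings.

```python
from itertools import zip_longest

def first_distinct_char(string1, string2):
    """Return (position, char1, char2) for the first difference, or None if equal.

    Assumption: the position is 1-based; a string that has already ended
    contributes the empty string.
    """
    pairs = zip_longest(string1, string2, fillvalue="")
    for position, (c1, c2) in enumerate(pairs, start=1):
        if c1 != c2:
            return position, c1, c2
    return None

difference = first_distinct_char("The sun", "The son")
```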
xs:string string:infiltrate-string (param: xs:string input-string, xs:string infiltrator) - source

Insert $infiltrator (e.g. '\s*') into string (after every single char) and escape special regex chars, e.g. input: 'hello.' - output: '\s*h\s*e\s*l\s*l\s*o\s*\.\s*'

Parameters:
xs:string input-string - String that is to be 'infiltrated' by another one
xs:string infiltrator - String which is inserted before and after every single position of the input string
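The documented example ('hello.' with the infiltrator '\s*') can be reproduced with a short Python sketch. The function below is an illustration, not the stylesheet's code, and the set of characters that are escaped is an assumption.

```python
import re

# Assumed set of special regex characters; the stylesheet's own set may differ.
SPECIAL = set(r"\.^$|?*+()[]{}")

def infiltrate_string(input_string, infiltrator):
    """Insert the infiltrator before and after every character, escaping as needed."""
    escaped = ("\\" + c if c in SPECIAL else c for c in input_string)
    return infiltrator + "".join(e + infiltrator for e in escaped)

pattern = infiltrate_string("hello.", r"\s*")
# The resulting pattern finds the text regardless of interspersed whitespace.
found = re.fullmatch(pattern, "h e l\nlo .") is not None
```

A pattern built this way makes it possible to locate a string in primary data even when the whitespace differs between the annotation and the primary data.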
xs:string string:int2string (param: xs:integer+ intSEQ) - source
No short description available
Parameters:
xs:integer+ intSEQ -
xs:string? string:replace-all (param: xs:string inputString, xs:string+ regexSEQ, xs:string+ replaceSEQ) - source

This function replaces all instances from the regex sequence (regexSEQ) by the corresponding substitution values (same position in replaceSEQ)

Parameters:
xs:string inputString - Input string whose corresponding characters are to be replaced
xs:string+ regexSEQ - Sequence of regular expressions used to find substrings in the input
xs:string+ replaceSEQ - Sequence of substituting strings which shall replace matching regex strings in the input string
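The pairing of regular expressions and replacements by position can be sketched in Python as follows. That the pairs are applied one after the other, each on the result of the previous replacement, is an assumption; the function name is hypothetical.

```python
import re

def replace_all(input_string, regex_seq, replace_seq):
    """Apply each (regex, replacement) pair in order to the running result."""
    result = input_string
    for regex, replacement in zip(regex_seq, replace_seq):
        result = re.sub(regex, replacement, result)
    return result

# Whitespace runs are normalized first, then 'b' is replaced.
replaced = replace_all("a  b\tc", [r"\s+", "b"], [" ", "B"])
```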
undef:ANALYZE-BUGS (param: xs:string terminate, xs:integer level, xs:string message) - source

The function receives message strings, which are output depending on the values of $level and $ANALYZE-BUGS

Parameters:
xs:string terminate - value of @terminate of message
xs:integer level - The level (importance) of the bug message; if lower than $bug-message-level then message is returned
xs:string message - String value of the message

Checking primary data identity on the basis of primary data and textual content of XML both without whitespaces

xs:integer undef:iterate-non-ws (param: xs:string? string, xs:string? non-ws, xs:integer count) - source
No short description available
Parameters:
xs:string? string -
xs:string? non-ws -
xs:integer count -

Returning messages containing stylesheet parameters and their supplied values

undef:segments2inline (param: element()* segments) - source

Copy of elements into an inline annotation by reference to segments

Parameters:
element()* segments - XSF segments that form the basis for the resulting inline annotation