XSF2inline.xsl

XSF2inline.xsl - Transformation of a standoff XSF instance into an inline XSF instance

Version 17.02.2011, 15:41 (GMT) for SGF 1.0/ XSF 1.1

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this program (file 'lgpl-3.0.txt'). If not, see http://www.gnu.org/licenses/.

The inline2XSF.xsl stylesheet has a counterpart: XSF2inline.xsl creates an inline annotation from an XStandoff instance. Overlapping markup is handled by representing the overlapping structures as milestone elements in the resulting inline annotation. The first task of the stylesheet is therefore to detect segments whose start and end positions constitute an overlap. These segments are split into segments representing milestones, so that the segment list becomes an adequate basis for building an inline annotation. The linear list of segments is then processed recursively, starting with the currently outermost segments (those that are not included in the span of any other segment, where a span is determined by its start and end positions in the character range of the primary data). The elements from the XStandoff layers that are referenced by the outermost segments are copied into the inline annotation. The segments embedded in the outermost segments are then processed in the same way.
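The overlap test described above can be illustrated outside of XSLT. The following Python sketch is not taken from the stylesheet; it assumes that segments are given as (start, end) character offsets and that the segment IDs are invented for the example. Two spans overlap when they share characters but neither contains the other.

```python
from itertools import combinations


def overlaps(a, b):
    """True if spans a and b cross: they share characters, but neither contains the other."""
    (a_start, a_end), (b_start, b_end) = a, b
    return a_start < b_start < a_end < b_end or b_start < a_start < b_end < a_end


def find_overlaps(segments):
    """Return all pairs of segment IDs whose spans overlap."""
    return [
        (id_a, id_b)
        for (id_a, span_a), (id_b, span_b) in combinations(segments.items(), 2)
        if overlaps(span_a, span_b)
    ]


# Hypothetical segments: seg2 and seg3 cross each other, seg1 contains both.
segments = {"seg1": (0, 24), "seg2": (0, 10), "seg3": (5, 15)}
print(find_overlaps(segments))  # [('seg2', 'seg3')]
```

Segments that merely contain one another (seg1 and seg2 here) are not reported, since they can be nested directly in an inline annotation.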

However, copying the elements from the XStandoff layers has to be controlled, because elements from different layers may reference the same segment. Such elements share the same positions for their start and end tags, so a decision has to be made about the order in which they are nested into one another. The optional stylesheet parameter sort-by addresses this situation. Its default value is 'measure', which means that the stylesheet performs a statistical analysis that determines the embedding relations of all occurring element types by frequency. An expected result would be that elements representing sentence boundaries are embedded in those for paragraphs, because this embedding relation is more frequent than the reverse. The approach is inadequate only if there are elements from different layers for which no definite statistical result can be obtained. For instance, there could be elements whose boundaries always share the same character positions, but which can clearly be assigned to a certain embedding on semantic grounds. This case cannot be covered by the statistical method.
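The frequency-based heuristic can be sketched as follows. This Python example is only an illustration of the idea, not the stylesheet's implementation: the tuple layout (name, start, end), the function names and the tie-breaking rule are assumptions made for the example.

```python
from collections import Counter


def inclusion_measures(elements):
    """Count how often an element of one name properly contains an element of another name.

    elements: list of (name, start, end) tuples (an assumed representation).
    """
    counts = Counter()
    for p_name, p_start, p_end in elements:
        for c_name, c_start, c_end in elements:
            contains = p_start <= c_start and c_end <= p_end
            same_span = (p_start, p_end) == (c_start, c_end)
            # Identical spans carry no information about the nesting direction.
            if p_name != c_name and contains and not same_span:
                counts[(p_name, c_name)] += 1
    return counts


def outer_first(name_a, name_b, counts):
    """Order two element names so that the more frequent container comes first."""
    if counts[(name_a, name_b)] >= counts[(name_b, name_a)]:
        return [name_a, name_b]  # on a tie the input order is kept (arbitrary choice)
    return [name_b, name_a]


elements = [
    ("p", 0, 40), ("s", 0, 18), ("s", 19, 40),  # a paragraph with two sentences
    ("p", 41, 60), ("s", 41, 60),               # a paragraph and a sentence on one span
]
counts = inclusion_measures(elements)
print(counts[("p", "s")], counts[("s", "p")])  # 2 0
print(outer_first("s", "p", counts))           # ['p', 's']
```

If two element types always share the same spans, both counts stay at zero and the result depends on the arbitrary tie-breaking rule, which is the case the statistical method cannot decide.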

In addition, the embedding can be based on the priority attribute of the level element (in SGF version 1.0) or of the layer element (in XStandoff version 1.1, cf. the section called “The development of SGF to XStandoff”). This strategy is selected by specifying the value 'priority' for the parameter sort-by. Elements with low values of the priority attribute are nested deeper in the inline annotation than those with higher values. By this means the user can specify the embedding manually, but the values of the attribute have to be set correctly to get the desired result. This method rests on the assumption that a semantically grounded, definite decision for the embedding is possible. The most promising concept appears to be a combination of both approaches, which is left for future work.
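The priority rule (lower values are nested deeper) amounts to a descending sort when the outermost element is to be listed first. The following Python sketch illustrates this under the assumption that each element is represented by a dictionary with a name and the priority of its layer; the representation and the helper names are hypothetical.

```python
def nest_by_priority(elements):
    """Return element names outermost first: higher priority outside, lower priority deeper."""
    ordered = sorted(elements, key=lambda e: e["priority"], reverse=True)
    return [e["name"] for e in ordered]


def wrap(names, text):
    """Nest the text inside the given element names, outermost name first."""
    for name in reversed(names):
        text = f"<{name}>{text}</{name}>"
    return text


# Two elements from different layers that reference the same segment.
same_segment = [
    {"name": "s", "priority": 1},
    {"name": "p", "priority": 2},
]
order = nest_by_priority(same_segment)
print(order)                  # ['p', 's']
print(wrap(order, "Hello."))  # <p><s>Hello.</s></p>
```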

The additional optional stylesheet parameter return-segID helps to retain a connection between the XStandoff file and the resulting inline annotation. By default the value of this parameter is set to '1', which means that the segment attribute is retained throughout the conversion process. The parameter has been added mainly for control purposes.

Input XML file: XSF instance

Stylesheet parameters:

  • sort-by
  • return-segID

Execution via command line (XSLT processor Saxon9): java -jar saxon9.jar [optional Saxon Parameters] -o [inline XSF output filename] [XSF input filename] XSF2inline.xsl [optional Stylesheet Parameters]
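For scripted conversions, the argument order of the usage line above can be assembled programmatically. The Python sketch below only builds and prints the command; it does not run Saxon. The file names are hypothetical, and passing stylesheet parameters as name=value pairs is an assumption based on the usual Saxon command-line convention.

```python
import shlex


def build_saxon_command(xsf_input, inline_output, saxon_options=(), stylesheet_params=None):
    """Assemble the argument list in the order given in the usage line above."""
    command = [
        "java", "-jar", "saxon9.jar",
        *saxon_options,
        "-o", inline_output,
        xsf_input,
        "XSF2inline.xsl",
    ]
    # Assumption: stylesheet parameters are appended as name=value pairs.
    for name, value in (stylesheet_params or {}).items():
        command.append(f"{name}={value}")
    return command


command = build_saxon_command(
    "standoff.xml",   # hypothetical XSF input file
    "inline.xml",     # hypothetical output file
    stylesheet_params={"sort-by": "priority", "return-segID": "1"},
)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run() once the paths to saxon9.jar and the stylesheet are valid on the local system.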

Known problems:

  • changing namespaces (different URIs with same prefixes) within input XSF

Author:
Daniel Jettka; daniel.jettka@uni-bielefeld.de; Project Sekimo (A2), DFG Research Group 437
Copyright:
GNU Lesser General Public License, see below for details

Parameters Summary

xs:boolean return-segID - source

Whether or not segment information shall be maintained in the returned annotation. Default: true

xs:string sort-by - source

Determination of heuristics to nest elements with the same boundary position into one another.

Variables Summary

Segments sorted and overlapping hierarchies replaced by milestones

xs:integer pd-length - source

String length of the primary data

xs:string primary-data - source

Textual content of inline annotation (primary data)

Converting string value for primary data file path to URI

Root node for referencing

xs:string xsfVersion - source

Version of XSF

Keys Summary

elem-by-name (match: *, use: name()) - source

Find elements by their element name → name()

elem-by-segRef (match: *, use: @xsf:segment) - source

Find elements by their segment reference @xsf:segment

measure (match: measure, use: concat(@parent, @child)) - source

Find elements named "measure" by the concatenation of their attributes @parent and @child; applied to temporary tree created by doc:inclusion-measures()

seg-by-ID (match: xsf:segment, use: @xml:id) - source

Find <xsf:segment> by the value of their attribute @xml:id

Match Templates Summary

Creating the <xsf:inline> annotation by processing individual layers by undef:inline-annotations()

Functions Summary

attribute()? attr:get-ID-att (param: element()? node) - source

Getting the corresponding ID-attribute (intended for <xsf:segment> and <milestone>).

document-node() doc:inclusion-measures - source

Calculation of inclusion relations of elements.

xs:double double:multiply-int-sequence (param: xs:double* sequence, xs:double temp-result) - source

Multiplication of all numbers from a sequence.

element()* elem:sort-elems-by-nesting (param: element()* elems-same-seg, xs:string sort-type) - source

Sorting elements which refer to the same segment either by priority of their xsf:level ($sort-type='priority') or by measurement of their inclusion relations ($sort-type='measure').

element()* elem:sort-segments (param: element()* segments) - source

Sorting of segments by their @start and @end

element(text)* elem:text-segments - source

Building segments for the textual content of the input XML file.

undef:get-nested-elems (param: element()? segment) - source
No short description available
undef:inline-annotations (param: element()* segments, xs:integer ancestor-start-pos, xs:integer ancestor-end-pos) - source

Copy of elements into an inline annotation by reference to segments

undef:multiple-inline-annotations (param: element()* elems-this-segment, this-segment) - source

Nesting elements with same segment reference by heuristics given by func:sort-elems()

undef:overlaps2milestones (param: element(xsf:segment)* segments) - source

Replaces segments including overlapping position information with milestones.

Returning information on supplied stylesheet parameters.

Parameters Detail

xs:boolean return-segID - source

Whether or not segment information shall be maintained in the returned annotation. Default: true

xs:string sort-by - source

Determination of heuristics to nest elements with the same boundary position into one another.

The default value for the nesting heuristics is 'measure', which triggers a statistical analysis and nests elements according to the measured embedding relations. The value 'priority' triggers nesting by the attribute @priority, which is optional for every layer element. By this means the nesting can be determined manually.

Variables Detail

Segments sorted and overlapping hierarchies replaced by milestones

xs:integer pd-length - source

String length of the primary data

xs:string primary-data - source

Textual content of inline annotation (primary data)

Converting string value for primary data file path to URI

Root node for referencing

xs:string xsfVersion - source

Version of XSF

Keys Detail

elem-by-name (match: *, use: name()) - source

Find elements by their element name → name()

elem-by-segRef (match: *, use: @xsf:segment) - source

Find elements by their segment reference @xsf:segment

measure (match: measure, use: concat(@parent, @child)) - source

Find elements named "measure" by the concatenation of their attributes @parent and @child; applied to temporary tree created by doc:inclusion-measures()

seg-by-ID (match: xsf:segment, use: @xml:id) - source

Find <xsf:segment> by the value of their attribute @xml:id

Match Templates Detail

Creating the <xsf:inline> annotation by processing individual layers by undef:inline-annotations()

Functions Detail

attribute()? attr:get-ID-att (param: element()? node) - source

Getting the corresponding ID-attribute (intended for <xsf:segment> and <milestone>).

Parameters:
element()? node - Element whose ID is to be returned.
document-node() doc:inclusion-measures - source

Calculation of inclusion relations of elements.

xs:double double:multiply-int-sequence (param: xs:double* sequence, xs:double temp-result) - source

Multiplication of all numbers from a sequence.

Parameters:
xs:double* sequence - Sequence of numbers which are multiplied recursively.
xs:double temp-result - Storage for the temporary result while there are still numbers that have not yet been used for multiplication.
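The recursion with an accumulator described by these two parameters can be sketched in Python as follows. This is an illustration of the pattern, not a transcription of the XSLT function; the start value 1.0 for the temporary result is an assumption.

```python
def multiply_sequence(sequence, temp_result=1.0):
    """Multiply all numbers of a sequence recursively, carrying the running product along."""
    if not sequence:
        # No unused numbers are left: the temporary result is the final product.
        return temp_result
    # Use the first number, then recurse on the remaining ones.
    return multiply_sequence(sequence[1:], temp_result * sequence[0])


print(multiply_sequence([2.0, 3.0, 4.0]))  # 24.0
```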
element()* elem:sort-elems-by-nesting (param: element()* elems-same-seg, xs:string sort-type) - source

Sorting elements which refer to the same segment either by priority of their xsf:level ($sort-type='priority') or by measurement of their inclusion relations ($sort-type='measure').

Parameters:
element()* elems-same-seg - Elements referring to the same segment.
xs:string sort-type - There are two possible sort types: 'priority' and 'measure'. The value 'priority' triggers sorting by the @priority of the corresponding <xsf:level>. The value 'measure' triggers a statistical analysis of the overall embedding relations of the specific elements in $elems-same-seg.
element()* elem:sort-segments (param: element()* segments) - source

Sorting of segments by their @start and @end

Parameters:
element()* segments - Segments which are being sorted.
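A possible sort order is sketched below in Python. The description only states that segments are sorted by @start and @end; sorting by ascending start and, for equal starts, by descending end (so that enclosing segments precede the segments they contain) is an assumption, as are the dictionary representation and the segment IDs.

```python
def sort_segments(segments):
    """Sort by start position; for equal starts, the longer (enclosing) segment comes first."""
    return sorted(segments, key=lambda s: (s["start"], -s["end"]))


segments = [
    {"id": "seg3", "start": 5, "end": 15},
    {"id": "seg2", "start": 0, "end": 10},
    {"id": "seg1", "start": 0, "end": 24},
]
print([s["id"] for s in sort_segments(segments)])  # ['seg1', 'seg2', 'seg3']
```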
element(text)* elem:text-segments - source

Building segments for the textual content of the input XML file.

undef:get-nested-elems (param: element()? segment) - source
No short description available
Parameters:
element()? segment -
undef:inline-annotations (param: element()* segments, xs:integer ancestor-start-pos, xs:integer ancestor-end-pos) - source

Copy of elements into an inline annotation by reference to segments

Parameters:
element()* segments - Segments that reference elements from the annotation and serve as the basis for inlining the annotation.
xs:integer ancestor-start-pos - Start position of the ancestor of the current element.
xs:integer ancestor-end-pos - End position of the ancestor of the current element.
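The recursive inlining can be illustrated with a simplified Python sketch. It is not the stylesheet's code: it assumes that overlaps have already been removed (segments are properly nested or disjoint), that each segment is a (name, start, end) tuple, and that the element name stands in for the referenced annotation element.

```python
def inline_annotations(text, segments, ancestor_start, ancestor_end):
    """Build an inline annotation for the character range of the ancestor.

    segments: (name, start, end) tuples inside the ancestor range, properly
    nested or disjoint (assumed; overlaps must have been resolved beforehand).
    """
    ordered = sorted(segments, key=lambda s: (s[1], -s[2]))
    parts = []
    pos = ancestor_start
    i = 0
    while i < len(ordered):
        name, start, end = ordered[i]  # currently outermost segment
        # Collect the segments embedded in the outermost one.
        j = i + 1
        nested = []
        while j < len(ordered) and start <= ordered[j][1] and ordered[j][2] <= end:
            nested.append(ordered[j])
            j += 1
        parts.append(text[pos:start])  # primary data before the element
        inner = inline_annotations(text, nested, start, end)
        parts.append(f"<{name}>{inner}</{name}>")
        pos = end
        i = j
    parts.append(text[pos:ancestor_end])  # remaining primary data
    return "".join(parts)


primary_data = "Hello world. Bye."
segments = [("p", 0, 17), ("s", 0, 12), ("s", 13, 17), ("w", 0, 5)]
print(inline_annotations(primary_data, segments, 0, len(primary_data)))
# <p><s><w>Hello</w> world.</s> <s>Bye.</s></p>
```

The ancestor positions bound the stretch of primary data that each recursive call is responsible for, so text between and after the embedded elements is not lost.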
undef:multiple-inline-annotations (param: element()* elems-this-segment, this-segment) - source

Nesting elements with same segment reference by heuristics given by func:sort-elems()

Parameters:
element()* elems-this-segment - Elements that are referring to the same segment.
this-segment -
undef:overlaps2milestones (param: element(xsf:segment)* segments) - source

Replaces segments including overlapping position information with milestones.

Parameters:
element(xsf:segment)* segments - Segments on the basis of which overlaps are discovered and resolved by milestones.
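How an overlapping segment might be replaced by milestones is sketched below in Python. The description does not state which of two overlapping segments is converted, so the choice made here (the segment that comes later in the sort order) is an assumption, as are the dictionary representation, the "milestone" flag and the ID suffixes.

```python
def overlaps_to_milestones(segments):
    """Replace each segment that crosses an already kept segment by two milestones."""
    kept, result = [], []
    for seg in sorted(segments, key=lambda s: (s["start"], -s["end"])):
        # A crossing exists if seg starts inside a kept segment but ends after it.
        crosses = any(k["start"] < seg["start"] < k["end"] < seg["end"] for k in kept)
        if crosses:
            # Zero-width segments mark the start and the end of the original span.
            for suffix, pos in (("-start", seg["start"]), ("-end", seg["end"])):
                result.append({"id": seg["id"] + suffix, "start": pos, "end": pos,
                               "milestone": True})
        else:
            kept.append(seg)
            result.append(seg)
    return result


segments = [
    {"id": "seg1", "start": 0, "end": 24},
    {"id": "seg2", "start": 0, "end": 10},
    {"id": "seg3", "start": 5, "end": 15},  # crosses seg2
]
for seg in overlaps_to_milestones(segments):
    print(seg["id"], seg["start"], seg["end"])
# seg1 0 24
# seg2 0 10
# seg3-start 5 5
# seg3-end 15 15
```

After this step the remaining segments are properly nested or disjoint, which is the precondition for building the inline annotation recursively.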

Returning information on supplied stylesheet parameters.