mergeXSF.xsl

mergeXSF.xsl - Merging standoff XSF files

Version 04.03.2011, 16:50 (GMT)

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this program (file 'lgpl-3.0.txt'). If not, see http://www.gnu.org/licenses/.

Due to the frequent use of the ID/IDREF mechanism in XStandoff, merging XStandoff files manually is impractical. The XSLT stylesheet mergeXSF.xsl automates the merge: the first XStandoff file is provided as the input file of the transformation, and the name of the second file has to be supplied via the stylesheet parameter merge-with.

The main problem solved by the stylesheet is reconciling the segments of the two XStandoff files. On the one hand, segments in the different files may span the same string of the primary data but have distinct IDs; in this case the two segments have to be replaced by one. On the other hand, segments may share an ID but span different character positions; these have to receive new, distinct IDs. Merging XStandoff files therefore generally leads to a complete reorganization of the segment list, which makes it necessary to update the segment references of the elements in the XStandoff layers. Once the references are updated, the XStandoff layers are included in the new XStandoff file.
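
The segment reconciliation described above can be illustrated with a small Python sketch. It is not the stylesheet's implementation: the tuple representation of segments, the function name merge_segments and the ID scheme "seg1", "seg2", ... are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, int, int]  # (id, start, end); hypothetical representation


def merge_segments(xsf1: List[Segment], xsf2: List[Segment]):
    """Merge two segment lists and return the new list plus old-ID -> new-ID maps."""
    # Segments spanning the same (start, end) collapse into one entry.
    spans = sorted({(start, end) for _, start, end in xsf1 + xsf2})
    # Every distinct span receives a fresh ID (assumed naming scheme).
    new_id: Dict[Tuple[int, int], str] = {
        span: f"seg{i}" for i, span in enumerate(spans, start=1)
    }
    merged = [(new_id[span], span[0], span[1]) for span in spans]
    # The maps are what layer elements would need to update their references.
    map1 = {sid: new_id[(start, end)] for sid, start, end in xsf1}
    map2 = {sid: new_id[(start, end)] for sid, start, end in xsf2}
    return merged, map1, map2


xsf1 = [("seg1", 0, 5), ("seg2", 6, 11)]
xsf2 = [("seg1", 0, 11), ("seg2", 0, 5)]  # same IDs, different spans
merged, map1, map2 = merge_segments(xsf1, xsf2)
print(merged)
print(map1, map2)
```

In this example "seg1" of the first file and "seg2" of the second file cover the same span and end up as one segment, while the two segments named "seg1" are kept apart under different new IDs.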

The reorganization of the segment list can be disabled via the stylesheet parameter keep-segments. Specifying the value '1' preserves the segments of the input XStandoff file unchanged. The segments of the file provided by merge-with, however, still have to be reorganized.
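
A minimal Python sketch of the keep-segments behaviour follows, assuming segments are (id, start, end) tuples and that new IDs follow a "segN" pattern. Both are assumptions for illustration; the function name merge_keeping_first is hypothetical and the stylesheet's actual ID scheme is not documented here.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, int, int]  # (id, start, end); hypothetical representation


def merge_keeping_first(xsf1: List[Segment], xsf2: List[Segment]):
    """Keep all segments of xsf1 as they are; append unseen spans of xsf2."""
    span_to_id: Dict[Tuple[int, int], str] = {
        (start, end): sid for sid, start, end in xsf1
    }
    used_ids = set(span_to_id.values())
    merged = list(xsf1)          # first file's segments stay unchanged
    map2: Dict[str, str] = {}    # old ID in xsf2 -> ID in the merged list
    n = 0
    for sid, start, end in xsf2:
        span = (start, end)
        if span not in span_to_id:
            # Find the next free ID (assumed naming scheme "segN").
            n += 1
            while f"seg{n}" in used_ids:
                n += 1
            span_to_id[span] = f"seg{n}"
            used_ids.add(f"seg{n}")
            merged.append((f"seg{n}", start, end))
        map2[sid] = span_to_id[span]
    return merged, map2


xsf1 = [("seg1", 0, 5), ("seg2", 6, 11)]
xsf2 = [("seg1", 0, 11), ("seg2", 0, 5)]
merged, map2 = merge_keeping_first(xsf1, xsf2)
print(merged)
print(map2)
```

Only the references of the second file need rewriting in this mode, which is why the sketch returns a single ID map.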

In addition, the stylesheet handles the optional output of the 'all-layer'. If the stylesheet parameter all-layer is set to '1', the output includes a layer containing the elements that are present in all annotation layers. It is irrelevant whether an 'all-layer' was present in the input files. It may happen that no such layer is returned, namely if no elements in the layers share the required features.
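
The idea of the 'all-layer' can be sketched in Python as an intersection over layers. The sketch assumes that two elements count as "the same" when they share local name and segment reference (suggested by the key elem-by-local-name-and-segRef, but not confirmed as the exact criterion); the function name and data layout are hypothetical.

```python
from typing import List, Tuple

Elem = Tuple[str, str]  # (local name, segment reference); assumed identity criterion


def elements_in_all_layers(layers: List[List[Elem]]) -> List[Elem]:
    """Return the elements of the first layer that occur in every other layer."""
    if not layers:
        return []
    common = set(layers[0])
    for layer in layers[1:]:
        common &= set(layer)
    # Preserve the order of the first (base) layer.
    return [elem for elem in layers[0] if elem in common]


layer_a = [("s", "seg1"), ("w", "seg2")]
layer_b = [("s", "seg1"), ("np", "seg2")]
print(elements_in_all_layers([layer_a, layer_b]))
print(elements_in_all_layers([layer_a, [("np", "seg2")]]))
```

The second call returns an empty list, which corresponds to the case where no 'all-layer' is produced.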

Currently the stylesheet only supports merging two XStandoff files. This naturally allows more than two files to be merged successively. However, it would be more straightforward to merge more than two XStandoff files in a single transformation. An update of the stylesheet to support this is in preparation.

Input XML file: XSF instance

Additional file (required): a second XSF instance which should be merged with the first one

Stylesheet parameters:

  • merge-with=$xsf-file - name of the second XSF file
  • keep-segments=xs:boolean - segments of the input file (the first XSF instance) remain unchanged, new segments (of the second file) are appended at the end
  • all-layer=xs:boolean - determines the return of a layer containing elements that appear in all layers
  • nest-by=xs:string('measure' | 'priority') - heuristics for nesting elements with the same start and end positions; the default 'measure' nests elements by statistically measuring their embedding relations

Execution via command line (XSLT processor Saxon9): java -jar saxon9.jar [optional Saxon Parameters] -o [XML output filename] [XSF input filename] mergeXSF.xsl merge-with=[2nd XSF instance] [optional Stylesheet Parameters]
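
For scripted runs, the invocation pattern above can be assembled programmatically. The following Python sketch only builds the argument list and mirrors the pattern exactly as stated; the helper name saxon_command and the file names are hypothetical, and the value '1' for the boolean parameters follows the description of keep-segments and all-layer.

```python
from typing import List, Optional


def saxon_command(xsf_input: str, merge_with: str, output: str,
                  keep_segments: bool = False, all_layer: bool = False,
                  nest_by: Optional[str] = None) -> List[str]:
    """Build the argument list for the documented Saxon9 invocation."""
    cmd = ["java", "-jar", "saxon9.jar", "-o", output,
           xsf_input, "mergeXSF.xsl", f"merge-with={merge_with}"]
    # Optional stylesheet parameters are appended as name=value pairs.
    if keep_segments:
        cmd.append("keep-segments=1")
    if all_layer:
        cmd.append("all-layer=1")
    if nest_by is not None:
        cmd.append(f"nest-by={nest_by}")
    return cmd


cmd = saxon_command("first.xsf", "second.xsf", "merged.xsf", keep_segments=True)
print(" ".join(cmd))
```

The resulting list could be passed to subprocess.run(cmd, check=True) once Java and the Saxon jar are available in the working directory.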

Known problems:

  • different character encodings used for primary data file and input XML file (first XSF instance) can cause problems

Author:
Daniel Jettka; daniel.jettka@uni-bielefeld.de; Project Sekimo (A2), DFG Research Group 437
Copyright:
GNU Lesser General Public License, see below for details

Parameters Summary

xs:boolean all-layer - source

Determines the return of a layer containing elements that appear in all layers. Default: false

xs:boolean keep-segments - source

Whether to keep the segments of the transformation input (the first XSF instance) unchanged. Default: false

xs:string merge-with - source

Relative path to the second XSF instance, which is merged with the transformation input

xs:string nest-by - source

Selects the heuristics used to nest elements with the same boundary positions into one another.

Variables Summary

Root of the first XSF file to be merged (XML transformation input).

xs:anyURI+ XSF1-namespaces - source

Namespaces of the layers appearing in the XSF files (all but "all-layer") - input XML file

Primary data with automatic encoding detection (input XML file)

Root of the second XSF file to be merged (provided by $merge-with).

xs:anyURI+ XSF2-namespaces - source

Namespaces of the layers appearing in the XSF files (all but "all-layer") - $merge-with file

Primary data with automatic encoding detection (file from $merge-with)

Variable contains the "all-layer" which was created by elem:process-layer() if $all-layer=true()

URI of the "all-layer" namespace.

document-node() concurrent-IDs - source

Test for same IDs in input files. Contains the generated IDs of layers which contain ID-attributes which are also present in other layer(s).

xs:boolean is-schema-aware - source

Indicator if current transformation shall be schema-aware. Can be overwritten if "mergeXSF.xsl" is imported by "mergeXSF-sa.xsl".

xs:string merge-with-file - source

Making $merge-with usable as URI for document().

element()+ namespaces - source

The several namespaces present in the XSF files.

Contains segments occurring in both XSF files

Directory location of the current stylesheet.

Keys Summary

concurrent-IDs (match: concurrent-ID, use: @old-id) - source
Index of IDs which occur with same value in $XSF1 and $XSF2
elem-by-id (match: *, use: ((if($is-schema-aware) then attribute(*, xs:ID) else @*:id), generate-id())[1]) - source

Find elements by their ID attribute; no matter which namespace @id is bound by

elem-by-local-name-and-segRef (match: *, use: concat(local-name(), '#', @xsf:segment)) - source

Find elements by a combination of their local-name and the segment reference given by @xsf:segment

elem-by-name (match: *, use: name()) - source

Find elements by their element name → name()

elem-by-segRef (match: *, use: @xsf:segment) - source

Find elements by their segment reference @xsf:segment

measure (match: measure, use: concat(@parent, @child)) - source

Find elements named "measure" by the concatenation of their attributes @parent and @child; applied to temporary tree created by doc:inclusion-measures()

seg-by-XSF1ID (match: *:segment, use: @XSF1ID) - source
Find element by the value of its attribute @XSF1ID
seg-by-XSF2ID (match: *:segment, use: @XSF2ID) - source
Find element by the value of its attribute @XSF2ID
seg-containing-all-layer-elem (match: xsf:segment, use: @XSF1ID, @*[contains(., $all-layer-namespace)]/local-name()) - source
Find an xsf:segment element by the value of its attribute @XSF1ID and @*[contains(., $all-layer-namespace)]/local-name()
seg-id-by-start-and-end (match: xsf:segment/@xml:id, use: ../concat(@start, '#', @end)) - source
Find the @xml:id of an xsf:segment by the concatenation of its attributes @start and @end

Match Templates Summary

Template matching the root of the document to be transformed.

Named Templates Summary

copyNodes (param: xs:string source) - source

Template copies child elements of the context node recursively and replaces old IDs by new ones (from $newSegments)

Template called by initial template (matching '/').

Functions Summary

attribute() attr:avoid-concurrent-IDs (param: attribute() attribute, xs:string root-name-of-layer) - source

Test for concurrent ID/IDREF/IDREFS information in attributes and escaping by predefined prefix.

document-node() doc:inclusion-measures - source

Calculation of inclusion relations of elements

xs:double double:multiply-int-sequence (param: xs:double* sequence, xs:double temp-result) - source

Multiplication of all numbers from a sequence.

element(xsf:segment)* elem:copy-segments-with-new-ID-and-elem-info (param: element(xsf:segment)* segments) - source

Segments get new IDs and are returned; the temporary information (@XSF1ID and @XSF2ID) is kept; additionally, information on the elements spanning each segment is returned

Combining the segments from both XSF files

element()* elem:process-layer (param: element()* layer-content, xs:anyURI ns-to-be-inlined, xs:string root-name-of-layer) - source

Returns the contents of a layer. The @xsf:segment are updated.

element()* elem:return-elements-in-all-layer (param: element()* base-layer-elems, element(xsf:layer)+ ref-layers) - source

Returning of elements which are present in every layer into a common all-layer.

element()* elem:segments2inline (param: element(xsf:segment)* segments, xs:anyURI ns-to-be-inlined, xs:string root-name-of-layer, xs:string layer-prefix) - source

Creating an inline annotation on the basis of a segment list

element()* elem:sort-elems-by-nesting (param: element()* elems-same-seg, xs:string sort-type) - source

Sorting elements which refer to the same segment either by priority of their xsf:level ($sort-type='priority') or by measurement of their inclusion relations ($sort-type='measure').

element(xsf:segment)* elem:sort-segments (param: element(xsf:segment)* segments) - source

Sorting segments on the basis of their attributes @start and @end

xs:string+ string:first-distinct-char (param: xs:string string1, xs:string string2, xs:integer position) - source

Deriving first distinct char of two unequal strings

xs:string string:get-id (param: element() self) - source

Getting the string value of the ID of an element which can either be an attribute(*, xs:ID), in case of schema-aware transformation, or @*:id (basic processor), or the generation of an ID by generate-id()

Checking for layers from input annotations which have same IDs

undef:multiple-inline-annotations (param: element()* elems-this-segment, nested-elems, xs:anyURI ns-to-be-inlined, xs:string layer-prefix, xs:string root-name-of-layer) - source

Nesting elements with the same segment reference by the heuristics given by elem:sort-elems-by-nesting()

Returning messages containing information on supplied stylesheet parameters.

Checking for primary data identity by comparing the content of the <xsf:primaryData> of the input annotations

Parameters Detail

xs:boolean all-layer - source

Determines the return of a layer containing elements that appear in all layers. Default: false

xs:boolean keep-segments - source

Whether to keep the segments of the transformation input (the first XSF instance) unchanged. Default: false

xs:string merge-with - source

Relative path to the second XSF instance, which is merged with the transformation input

xs:string nest-by - source

Selects the heuristics used to nest elements with the same boundary positions into one another.

The default value for the nesting heuristics is 'measure', which runs a statistical analysis and nests elements into one another by measuring their embedding relations. The value 'priority' nests elements according to the attribute @priority, which is optional for every layer element. This allows the nesting to be determined manually.

Variables Detail

Root of the first XSF file to be merged (XML transformation input).

xs:anyURI+ XSF1-namespaces - source

Namespaces of the layers appearing in the XSF files (all but "all-layer") - input XML file

Primary data with automatic encoding detection (input XML file)

Root of the second XSF file to be merged (provided by $merge-with).

xs:anyURI+ XSF2-namespaces - source

Namespaces of the layers appearing in the XSF files (all but "all-layer") - $merge-with file

Primary data with automatic encoding detection (file from $merge-with)

Variable contains the "all-layer" which was created by elem:process-layer() if $all-layer=true()

URI of the "all-layer" namespace.

document-node() concurrent-IDs - source

Test for same IDs in input files. Contains the generated IDs of layers which contain ID-attributes which are also present in other layer(s).

xs:boolean is-schema-aware - source

Indicator if current transformation shall be schema-aware. Can be overwritten if "mergeXSF.xsl" is imported by "mergeXSF-sa.xsl".

xs:string merge-with-file - source

Making $merge-with usable as URI for document().

element()+ namespaces - source

The several namespaces present in the XSF files.

Contains segments occurring in both XSF files

Directory location of the current stylesheet.

Keys Detail

concurrent-IDs (match: concurrent-ID, use: @old-id) - source
Index of IDs which occur with same value in $XSF1 and $XSF2
elem-by-id (match: *, use: ((if($is-schema-aware) then attribute(*, xs:ID) else @*:id), generate-id())[1]) - source

Find elements by their ID attribute; no matter which namespace @id is bound by

elem-by-local-name-and-segRef (match: *, use: concat(local-name(), '#', @xsf:segment)) - source

Find elements by a combination of their local-name and the segment reference given by @xsf:segment

elem-by-name (match: *, use: name()) - source

Find elements by their element name → name()

elem-by-segRef (match: *, use: @xsf:segment) - source

Find elements by their segment reference @xsf:segment

measure (match: measure, use: concat(@parent, @child)) - source

Find elements named "measure" by the concatenation of their attributes @parent and @child; applied to temporary tree created by doc:inclusion-measures()

seg-by-XSF1ID (match: *:segment, use: @XSF1ID) - source
Find element by the value of its attribute @XSF1ID
seg-by-XSF2ID (match: *:segment, use: @XSF2ID) - source
Find element by the value of its attribute @XSF2ID
seg-containing-all-layer-elem (match: xsf:segment, use: @XSF1ID, @*[contains(., $all-layer-namespace)]/local-name()) - source
Find an xsf:segment element by the value of its attribute @XSF1ID and @*[contains(., $all-layer-namespace)]/local-name()
seg-id-by-start-and-end (match: xsf:segment/@xml:id, use: ../concat(@start, '#', @end)) - source
Find the @xml:id of an xsf:segment by the concatenation of its attributes @start and @end

Match Templates Detail

Template matching the root of the document to be transformed.

The initial template is naturally the starting point of the transformation. In this special case, it just calls the template named 'main' which incorporates the first steps of the transformation.

This strategy supports the schema-aware version of this stylesheet ('mergeXSF-sa.xsl'), which validates the input XSF instances and allows for the capture and handling of multiple IDs in the resulting merged XSF instance.

Named Templates Detail

copyNodes (param: xs:string source) - source

Template copies child elements of the context node recursively and replaces old IDs by new ones (from $newSegments)

Parameters:
xs:string source - Name of the root of the context element ('XSF1' or 'XSF2')

Template called by initial template (matching '/').

  • Returning messages on supplied stylesheet parameters → undef:param-messages()
  • Checking and returning of primary data → undef:test-and-return-pd()
  • Creation of a new merged segment list (<xsf:segmentation>) using $newSegments
  • Returning the layers of the two XSF instances in one <xsf:annotation>. To this end, the layers have to be compared for uniqueness and their elements have to be supplied with updated segment references.
Detailed information on the individual processing steps is available in the documentation of the templates and functions called.

Functions Detail

attribute() attr:avoid-concurrent-IDs (param: attribute() attribute, xs:string root-name-of-layer) - source

Test for concurrent ID/IDREF/IDREFS information in attributes and escaping by predefined prefix.

Parameters:
attribute() attribute - The attribute to be tested.
xs:string root-name-of-layer - Name of the root of the layer containing this attribute ('XSF1' vs. 'XSF2').
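
The escaping of concurrent IDs can be pictured with the following Python sketch. It is an assumption-laden illustration: the prefix format "<root-name>-<id>", the function name and the explicit set of concurrent IDs are hypothetical, whereas the stylesheet looks the IDs up via its key concurrent-IDs and uses its own predefined prefix.

```python
from typing import Set


def avoid_concurrent_ids(value: str, root_name_of_layer: str,
                         concurrent_ids: Set[str]) -> str:
    """Prefix every ID token that also occurs in the other XSF file."""
    # An IDREFS value is a whitespace-separated list; ID and IDREF are single tokens.
    tokens = value.split()
    escaped = [
        f"{root_name_of_layer}-{tok}" if tok in concurrent_ids else tok
        for tok in tokens
    ]
    return " ".join(escaped)


print(avoid_concurrent_ids("s1", "XSF2", {"s1"}))
print(avoid_concurrent_ids("s1 s2", "XSF2", {"s1"}))
```
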
document-node() doc:inclusion-measures - source

Calculation of inclusion relations of elements

xs:double double:multiply-int-sequence (param: xs:double* sequence, xs:double temp-result) - source

Multiplication of all numbers from a sequence.

Parameters:
xs:double* sequence - Sequence of numbers which are multiplied recursively.
xs:double temp-result - Stores the intermediate result while numbers remain that have not yet been multiplied.
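
The recursion with an accumulator described by these two parameters can be sketched in Python as follows. The initial value 1.0 for temp_result is an assumption; the documentation does not state what the stylesheet passes on the first call.

```python
from typing import Sequence


def multiply_sequence(sequence: Sequence[float], temp_result: float = 1.0) -> float:
    """Multiply all numbers of a sequence recursively, carrying the partial product."""
    if not sequence:
        # Nothing left to multiply: the accumulator holds the final product.
        return temp_result
    return multiply_sequence(sequence[1:], temp_result * sequence[0])


print(multiply_sequence([2.0, 3.0, 4.0]))
```
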
element(xsf:segment)* elem:copy-segments-with-new-ID-and-elem-info (param: element(xsf:segment)* segments) - source

Segments get new IDs and are returned; the temporary information (@XSF1ID and @XSF2ID) is kept; additionally, information on the elements spanning each segment is returned

Parameters:
element(xsf:segment)* segments - Segments which are supplied with updated IDs.

Combining the segments from both XSF files

element()* elem:process-layer (param: element()* layer-content, xs:anyURI ns-to-be-inlined, xs:string root-name-of-layer) - source

Returns the contents of a layer. The @xsf:segment are updated.

Parameters:
element()* layer-content - Descendant elements of the current layer.
xs:anyURI ns-to-be-inlined - Namespace URI of the elements which shall be included in the resulting inline annotation.
xs:string root-name-of-layer -
element()* elem:return-elements-in-all-layer (param: element()* base-layer-elems, element(xsf:layer)+ ref-layers) - source

Returning of elements which are present in every layer into a common all-layer.

Parameters:
element()* base-layer-elems - Layer serving as the basis of the all-layer. Because the all-layer is defined as containing the elements which are present in every single layer, this can be any layer from the input.
element(xsf:layer)+ ref-layers - Layers which are examined for elements that are present in all other layers.
element()* elem:segments2inline (param: element(xsf:segment)* segments, xs:anyURI ns-to-be-inlined, xs:string root-name-of-layer, xs:string layer-prefix) - source

Creating an inline annotation on the basis of a segment list

Parameters:
element(xsf:segment)* segments - Segments whose referenced elements shall be included in the resulting annotation.
xs:anyURI ns-to-be-inlined - Namespace URI of the elements which shall be included in the resulting inline annotation.
xs:string root-name-of-layer -
xs:string layer-prefix - Prefix of the namespace whose elements are to be inlined.
element()* elem:sort-elems-by-nesting (param: element()* elems-same-seg, xs:string sort-type) - source

Sorting elements which refer to the same segment either by priority of their xsf:level ($sort-type='priority') or by measurement of their inclusion relations ($sort-type='measure').

Parameters:
element()* elems-same-seg - Elements referring to the same segment.
xs:string sort-type - There are two possible sort types: 'priority' and 'measure'. The value 'priority' sorts by the @priority of the corresponding <xsf:level>. The value 'measure' runs a statistical analysis of the overall embedding relations of the specific elements in $elems-same-seg.
element(xsf:segment)* elem:sort-segments (param: element(xsf:segment)* segments) - source

Sorting segments on the basis of their attributes @start and @end

Parameters:
element(xsf:segment)* segments - Segments which are sorted in this function.
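
A possible ordering can be sketched in Python. The documentation only states that @start and @end are used; the concrete order below (ascending start, then descending end, so that enclosing segments precede the segments they contain) is an assumption, as is the tuple representation.

```python
from typing import List, Tuple

Segment = Tuple[str, int, int]  # (id, start, end); hypothetical representation


def sort_segments(segments: List[Segment]) -> List[Segment]:
    """Sort by start ascending, then by end descending (assumed order)."""
    return sorted(segments, key=lambda seg: (seg[1], -seg[2]))


segments = [("a", 6, 11), ("b", 0, 5), ("c", 0, 11)]
print(sort_segments(segments))
```
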
xs:string+ string:first-distinct-char (param: xs:string string1, xs:string string2, xs:integer position) - source

Deriving first distinct char of two unequal strings

Parameters:
xs:string string1 - First string which shall be compared to second
xs:string string2 - Second string; to be compared to $string1
xs:integer position - Position counter for recursive calling of this function on substrings of the two strings
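
The recursive comparison with a position counter can be sketched in Python. What exactly the stylesheet function returns is documented only as xs:string+; returning the pair of differing characters is an assumption, and positions are 0-based here while XPath string positions are 1-based.

```python
from typing import Tuple


def first_distinct_char(string1: str, string2: str, position: int = 0) -> Tuple[str, str]:
    """Return the first pair of characters in which two unequal strings differ."""
    # Slicing yields '' past the end, so a shorter string differs by the empty string.
    char1 = string1[position:position + 1]
    char2 = string2[position:position + 1]
    if char1 != char2:
        return (char1, char2)
    if not char1:
        # Both strings are exhausted without a difference: they were equal.
        raise ValueError("strings are equal")
    return first_distinct_char(string1, string2, position + 1)


print(first_distinct_char("abcd", "abxd"))
print(first_distinct_char("ab", "abc"))
```
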
xs:string string:get-id (param: element() self) - source

Getting the string value of the ID of an element which can either be an attribute(*, xs:ID), in case of schema-aware transformation, or @*:id (basic processor), or the generation of an ID by generate-id()

Parameters:
element() self - Element whose ID shall be returned.
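
The fallback chain can be sketched in Python with xml.etree.ElementTree. The sketch covers only the non-schema-aware branch (@*:id, then a generated ID); type-based lookup via attribute(*, xs:ID) has no counterpart here, and the "gen-" prefix of the generated ID is an assumption standing in for generate-id().

```python
import xml.etree.ElementTree as ET


def get_id(elem: ET.Element) -> str:
    """Return the value of any attribute with local name 'id', else a generated ID."""
    for name, value in elem.attrib.items():
        # Namespaced attribute names look like '{uri}local'; strip the namespace part.
        local_name = name.rsplit("}", 1)[-1]
        if local_name == "id":
            return value
    # Fallback comparable to generate-id(): unique per element object while it is alive.
    return f"gen-{id(elem)}"


print(get_id(ET.fromstring('<s xml:id="s1"/>')))
print(get_id(ET.fromstring('<w id="w7"/>')))
```
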

Checking for layers from input annotations which have same IDs

undef:multiple-inline-annotations (param: element()* elems-this-segment, nested-elems, xs:anyURI ns-to-be-inlined, xs:string layer-prefix, xs:string root-name-of-layer) - source

Nesting elements with the same segment reference by the heuristics given by elem:sort-elems-by-nesting()

Parameters:
element()* elems-this-segment - Elements referencing the same segment.
nested-elems - Elements which are descendants of the $elems-this-segment. They are simply copied into the $elems-this-segment when the embedding relations are resolved.
xs:anyURI ns-to-be-inlined - Namespace URI the elements of which shall be included in the resulting inline annotation.
xs:string layer-prefix - Prefix of the namespace whose elements are to be inlined.
xs:string root-name-of-layer -

Returning messages containing information on supplied stylesheet parameters.

Checking for primary data identity by comparing the content of the <xsf:primaryData> of the input annotations