Gauche XML and SXML

1. XML and SXML

1.1. Overview

SXML represents XML as S-expressions, so understanding SXML requires some knowledge of XML. Gauche Scheme provides SXML support based on Oleg Kiselyov's implementation.

1.2. XML structure

  1. Document Prolog
    • XML Declaration
    • Document Type Declaration
    • Entity Declaration
  2. Document Element

1.3. XML declaration

The XML declaration starts with <?xml and ends with ?>. It uses the same syntax as a processing instruction.

The version, encoding, and standalone parameters can be specified. version specifies the XML version, and encoding specifies the document encoding. standalone specifies whether the document relies on external resources such as DTDs. (This attribute does not affect the use of schemas.)

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>

1.4. Document type declaration

The document type declaration starts with <!DOCTYPE and ends with >. It is a declaration tag and does not nest. It can specify the name of the document element, a DTD resource, and entity declarations.

A DTD resource is specified with one of two kinds of identifier. A system identifier consists of SYSTEM followed by a path or URI. A public identifier consists of PUBLIC followed by a public identifier and a system identifier.

Entity declarations are enclosed in square brackets, [ ]. Each entity declaration starts with <!ENTITY and ends with >, and contains either a name and a value or a name and an identifier.

Entities are referenced by an ampersand, the entity name, and a semicolon, as in &my-name;. Each reference is replaced by the declared value.

<!DOCTYPE  emails
  SYSTEM "/xml-res/dtds/emails.dtd"
  [<!ENTITY my-name "Toshi">
   <!ENTITY my-address "toshi@example.com">
]>

1.5. Document Element

The document element is the main part of an XML document. Elements can be nested. Each element starts with <element-name> and ends with </element-name>. An empty element can be written as <element-name />. Each start tag can carry attributes, and attribute values must be quoted.

<emails>
  <mail from="&my-address;" to="tom@test.com" />
  <mail from="&my-address;" to="jim@test.com" />
</emails>

1.5.1. Namespace

Element names can have a prefix to distinguish elements with the same name that come from different sources. These prefixes need to be declared as namespaces.

A namespace is declared with an xmlns attribute in an element's start tag. A namespace can only be used in a limited region: the element that declares it and that element's descendants.

There are two forms of namespace declaration, the general form and the implicit form. The general form, xmlns:prefix="URI", binds a prefix to the namespace name. The implicit form, xmlns="URI", declares a namespace without a prefix. Elements without a prefix belong to the implicit namespace.

2. SXML

2.1. Overview

XML is a tree structure, so it can be expressed as an S-expression. This S-expression form is SXML. Oleg Kiselyov provides Scheme libraries to access and manipulate SXML.

The following is an SXML example.

'(*TOP*
  (emails
   (mail (@ (from "toshi@example.com")
            (to "tom@test.com")))
   (mail (@ (from "toshi@example.com")
            (to "jim@test.com")))))
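
Because SXML is ordinary list data, standard list procedures are enough to take it apart. The sketch below uses only (scheme base) and (scheme write); the names doc, root, first-mail and attributes are illustrative and not part of any library.

```scheme
(import (scheme base) (scheme write))

;; The SXML document from the example above.
(define doc
  '(*TOP*
    (emails
     (mail (@ (from "toshi@example.com")
              (to "tom@test.com")))
     (mail (@ (from "toshi@example.com")
              (to "jim@test.com"))))))

;; The document element is the first child of *TOP*.
(define root (cadr doc))

;; An element is (name child ...); its attributes, if any,
;; sit in a leading (@ ...) child.
(define first-mail (cadr root))
(define attributes (cdr (cadr first-mail)))

(write (car root)) (newline)                    ; emails
(write (cadr (assq 'to attributes))) (newline)  ; "tom@test.com"
```

In practice the SXPath procedures described later are more convenient than raw car and cdr, but they operate on exactly this list structure.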

2.2. Convert XML to SXML

ssax:xml->sxml is defined in (sxml ssax). It takes an input port, reads an XML document from it, and returns the document as an S-expression. The second argument is a list of namespace assignments, each written as (prefix . identifier); pass '() to omit them.

(import (sxml ssax))
(ssax:xml->sxml port '())
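
As a minimal sketch of the call above, the following parses XML held in a string, with open-input-string supplying the port. The sample documents, the namespace URI, and the names plain and prefixed are made up for illustration. The second call assumes that the namespace argument is a list of (prefix . URI-string) pairs.

```scheme
(import (scheme base) (scheme write) (sxml ssax))

;; Parse without any namespace prefix assignments.
(define plain
  (ssax:xml->sxml
   (open-input-string
    "<emails><mail to='tom@test.com'>Hello</mail></emails>")
   '()))

(write plain) (newline)
;; => (*TOP* (emails (mail (@ (to "tom@test.com")) "Hello")))

;; Parse a namespaced document, asking for the short prefix m
;; to stand for the (made-up) namespace URI in the result.
(define prefixed
  (ssax:xml->sxml
   (open-input-string
    "<m:emails xmlns:m='http://example.com/mail'><m:mail/></m:emails>")
   '((m . "http://example.com/mail"))))

(write prefixed) (newline)
```

With the prefix assignment, the element names should come out as m:emails and m:mail. Without it, SSAX is expected to build the element names from the full namespace URI instead, which makes paths much longer to write.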

3. XPath and SXPath

3.1. Overview

XPath is used to specify locations in an XML node structure. SXPath is the Scheme way to specify nodes in an SXML structure. SXPath offers both a low-level API and an abbreviated notation.

For the low-level functions, the source code is as useful as the documentation for understanding them. They are defined in sxpathlib.scm.

4. SXPath

The explanations below use the following libraries and SXML data. SXML converted from XML has a *TOP* element at the top. When specifying nodes in SXML, however, this element is omitted from the path.

(import (scheme base) (scheme list)
        (sxml ssax) (sxml sxpath))

(define xml-books
"<data>
  <books>
    <book id='1'>
      <title> The Hunger Games </title>
      <author> Suzanne Collins </author>
      <year> 2008 </year>
      <available />
    </book>
    <book id='2'>
      <title> Twilight: Twilight, Book 1 (Twilight Saga)</title>
      <author> Stephenie Meyer </author>
      <year> 2007 </year>
      <available />
    </book>
    <book id='3'>
      <title> Life of Pi </title>
      <author> Yann Martel </author>
      <year> 2001 </year>
    </book>
    <textbook id='4'>
      <title> Compilers: Principles, Techniques, and Tools </title>
      <author>  Alfred Aho </author>
      <year> 2006 </year>
      <available />
    </textbook>
    <textbook id='5'>
      <title> Structure and Interpretation of Computer Programs, second edition </title>
      <author>  Harold Abelson </author>
      <year> 1996 </year>
    </textbook>
  </books>
</data>")

(define sxml-books
  (ssax:xml->sxml (open-input-string xml-books) '()))

4.1. sxpath function

The sxpath function takes an SXPath expression written as a list and returns a converter, a procedure that extracts the matching nodes from SXML.
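
The following sketch shows that the value returned by sxpath is an ordinary procedure that can be stored and reused. The document doc and the name select-titles are illustrative; the data is hand-written SXML rather than the parsed sxml-books.

```scheme
(import (scheme base) (scheme write) (sxml sxpath))

;; A small hand-written SXML document (illustrative data).
(define doc
  '(*TOP*
    (data
     (books
      (book (@ (id "1")) (title "Life of Pi"))
      (book (@ (id "2")) (title "Twilight"))))))

;; sxpath builds the converter once ...
(define select-titles (sxpath '(data books book title)))

;; ... and the converter can then be applied to SXML data.
(write (procedure? select-titles)) (newline)
;; => #t
(write (select-titles doc)) (newline)
;; => ((title "Life of Pi") (title "Twilight"))
```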

4.2. SXPath: (<elem-name> <elem-name> … )

In XPath, an element hierarchy is written as element names joined with slashes. In SXPath, it is written as a list of element names.

The sxpath function takes an (abbreviated) SXPath and constructs a procedure that can be applied to SXML. The resulting procedure can also be combined with the low-level API described below.

In the low-level API, node-join joins procedures called converters. A converter is a function that takes a node and returns a nodeset. node-join applies the converters in sequence, passing each node of the intermediate nodeset to the next converter one at a time. This matters because converters that expect a whole nodeset as input, such as node-pos, do not work properly with node-join.

select-kids selects children of the current node. ntype?? and ntype-names?? build the predicates that decide which children are kept.

;; XPath : "<elem-name>/<elem-name>/..."
;; SXPath: (<elem-name> <elem-name> ... )
((sxpath '(data books book title)) sxml-books)

((node-join (sxpath '(data))
            (sxpath '(books))
            (sxpath '(book))
            (sxpath '(title)))
 sxml-books)

((node-join (select-kids (ntype?? 'data))
            (select-kids (ntype?? 'books))
            (select-kids (ntype?? 'book))
            (select-kids (ntype?? 'title)))
 sxml-books)

((node-join (select-kids (ntype-names?? '(data)))
            (select-kids (ntype-names?? '(books)))
            (select-kids (ntype-names?? '(book)))
            (select-kids (ntype-names?? '(title))))
 sxml-books)
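
To make the converter idea concrete, the sketch below applies converters directly to a single hand-written node. The data and the names books-node, kids-named-book, kids-any-book and book-texts are illustrative.

```scheme
(import (scheme base) (scheme write) (sxml sxpath))

;; A single node: a books element with three children (illustrative).
(define books-node
  '(books (book "A") (textbook "B") (book "C")))

;; A converter takes a node and returns a nodeset (a list of nodes).
(define kids-named-book (select-kids (ntype?? 'book)))
(write (kids-named-book books-node)) (newline)
;; => ((book "A") (book "C"))

;; ntype-names?? accepts a list of names instead of a single name.
(define kids-any-book (select-kids (ntype-names?? '(book textbook))))
(write (kids-any-book books-node)) (newline)
;; => ((book "A") (textbook "B") (book "C"))

;; node-join feeds each node of one result to the next converter
;; and collects everything into one nodeset.
(define book-texts
  (node-join kids-named-book (select-kids (ntype?? '*text*))))
(write (book-texts books-node)) (newline)
;; => ("A" "C")
```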

4.3. SXPath: (<elem-name> <pos>)

In SXPath, (<elem-name> <pos>) selects the node at a given position among the matching elements.

In the low-level API, node-pos selects a position in a nodeset. It expects a nodeset as input, but node-join passes it one node at a time. For this reason node-reduce is used for the composition: it applies each function to the whole nodeset in sequence.

;; XPath : "<elem-name>[position()=2]"
;; SXPath: (<elem-name> 2)
((sxpath '(data books (book 2) title)) sxml-books)

((node-join (sxpath '(data books))
            (node-reduce
             (sxpath '(book))
             (node-pos 2))
            (sxpath '(title)))
 sxml-books)
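
As a small illustration of why node-reduce is needed, the sketch below applies node-pos to a whole nodeset, first directly and then through node-reduce. The data and the names books-node, all-books and second-book are illustrative.

```scheme
(import (scheme base) (scheme write) (sxml sxpath))

(define books-node
  '(books (book "A") (book "B") (book "C")))

;; node-pos works on a whole nodeset and keeps the n-th node.
(define all-books ((sxpath '(book)) books-node))
(write ((node-pos 2) all-books)) (newline)
;; => ((book "B"))

;; node-reduce passes the whole intermediate nodeset along,
;; so node-pos sees all three books at once.
(define second-book
  (node-reduce (sxpath '(book)) (node-pos 2)))
(write (second-book books-node)) (newline)
;; => ((book "B"))
```

Replacing node-reduce with node-join here would hand node-pos one book at a time, so it would never see a nodeset with a second element.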

4.4. SXPath: (<elem-name> <condition>)

In SXPath, a condition can follow the element name. To require that a subelement exists, write it as a (nested) list.

In the low-level API, sxml:filter keeps the nodes that satisfy a condition. Converters such as select-kids can serve as the condition: a node is kept when the converter returns a non-empty nodeset.

;; XPath : "<elem-name>[<sub-elem-name>]"
;; SXPath: (<elem-name> (<sub-elem-name>))
((sxpath '(data books (book (available)) title)) sxml-books)

((node-join (sxpath '(data books book))
            (sxml:filter
              (select-kids (ntype?? 'available)))
            (sxpath '(title)))
 sxml-books)

In SXPath, an attribute is specified with @ followed by the attribute name. The two are written together in one list, as follows.

In the low-level API, @ is treated as a node of its own, and the attributes are its children.

;; XPath : "<elem-name>[@id]"
;; SXPath: (<elem-name> (@ id))
((sxpath '(data books (book (@ id )) title)) sxml-books)

((node-join (sxpath '(data books book))
            (sxml:filter
             (node-join
              (select-kids (ntype?? '@))
              (select-kids (ntype?? 'id))))
            (sxpath '(title)))
 sxml-books)

;; XPath : "<elem-name>[@id='3']"
;; SXPath: (<elem-name> (@ id (equal? "3")))
((sxpath '(data books (book (@ id (equal? "3"))) title)) sxml-books)

((node-join (sxpath '(data books book))
            (sxml:filter
             (node-join
              (select-kids (ntype?? '@))
              (select-kids (node-equal? '(id "3")))))
            (sxpath '(title)))
 sxml-books)
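
The sketch below shows the attribute structure on a small hand-written document: @ is stepped through like any other child, and *text* retrieves the attribute values. The data and the name has-id? are illustrative.

```scheme
(import (scheme base) (scheme write) (sxml sxpath))

(define doc
  '(*TOP*
    (books
     (book (@ (id "1")) (title "Life of Pi"))
     (book (title "Untitled Draft"))
     (book (@ (id "3")) (title "Twilight")))))

;; The @ node is an ordinary child of book, and id is a child of @.
(write ((sxpath '(books book @ id)) doc)) (newline)
;; => ((id "1") (id "3"))

;; *text* selects the string children, i.e. the attribute values.
(write ((sxpath '(books book @ id *text*)) doc)) (newline)
;; => ("1" "3")

;; sxml:filter keeps a node when the inner converter finds something.
(define has-id?
  (node-join (select-kids (ntype?? '@))
             (select-kids (ntype?? 'id))))
(write ((node-join (sxpath '(books book))
                   (sxml:filter has-id?)
                   (sxpath '(title)))
        doc))
(newline)
;; => ((title "Life of Pi") (title "Twilight"))
```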

4.5. SXPath: (or@ … ) (not@ … )

In SXPath, or@ matches any element name in a given set, and not@ matches any name outside the set. In the low-level API, sxml:invert negates a predicate.

;; SXPath: (or@ <elem-name> <elem-name> ... )
((sxpath '(data books (or@ book textbook) title)) sxml-books)

((node-join (sxpath '(data books))
            (select-kids (ntype-names?? '(book textbook)))
            (sxpath '(title)))
 sxml-books)

((node-join (sxpath '(data books))
            (node-or
             (select-kids (ntype-names?? '(book)))
             (select-kids (ntype-names?? '(textbook))))
            (sxpath '(title)))
 sxml-books)

;; SXPath: (not@ <elem-name> <elem-name> ... )
((sxpath '(data books (not@ book) title)) sxml-books)

((node-join (sxpath '(data books))
            (select-kids (sxml:invert (ntype-names?? '(book))))
            (sxpath '(title)))
 sxml-books)
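
The predicates behind or@ and not@ can also be tried on their own. The sketch below uses a hand-written node; the data and the name book-like? are illustrative.

```scheme
(import (scheme base) (scheme write) (sxml sxpath))

(define books-node
  '(books (book "A") (textbook "B") (magazine "C")))

;; (or@ book textbook) corresponds to a name-set predicate.
(define book-like? (ntype-names?? '(book textbook)))
(write ((select-kids book-like?) books-node)) (newline)
;; => ((book "A") (textbook "B"))

;; (not@ book textbook) corresponds to the inverted predicate.
(write ((select-kids (sxml:invert book-like?)) books-node)) (newline)
;; => ((magazine "C"))

;; The abbreviated forms are expected to select the same nodes.
(write ((sxpath '((or@ book textbook))) books-node)) (newline)
(write ((sxpath '((not@ book textbook))) books-node)) (newline)
```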

4.6. SXPath: (// … )

In SXPath, // selects the current node and all of its descendants. In the low-level API, node-closure selects the descendants of the input node that satisfy a condition, and node-self selects the current node if it satisfies the condition.

;; SXPath: (//)
((sxpath '(//)) sxml-books)

((node-or
  (node-self (ntype?? '*any*))
  (node-closure (ntype?? '*any*)))
 sxml-books)

;; SXPath: (// <elem-name> ... )
((sxpath '(// textbook title)) sxml-books)

((node-join (node-closure
             (ntype?? 'textbook))
            (sxpath '(title)))
 sxml-books)
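
The difference between a plain path and // shows up when the same element name occurs at several depths. The sketch below uses a hand-written document in which title appears at two levels; the data is illustrative.

```scheme
(import (scheme base) (scheme write) (sxml sxpath))

(define doc
  '(*TOP*
    (library
     (title "Catalog")
     (shelf
      (book (title "Life of Pi"))))))

;; A plain path only matches title at the stated depth.
(write ((sxpath '(library title)) doc)) (newline)
;; => ((title "Catalog"))

;; // reaches title elements at any depth.
(write ((sxpath '(// title)) doc)) (newline)
;; => ((title "Catalog") (title "Life of Pi"))

;; node-closure runs the same search among all descendants.
(write ((node-closure (ntype?? 'title)) doc)) (newline)
;; => ((title "Catalog") (title "Life of Pi"))
```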