SWI-Prolog SGML/XML parser

2 Bluffer's Guide

This package allows you to parse SGML, XML and HTML data into a Prolog data structure. The high-level interface defined in library(sgml) provides access at the file level, while the low-level interface defined in the foreign module works with Prolog streams. Please use the source of sgml.pl as a starting point for dealing with data from sources other than files, such as SWI-Prolog resources, network sockets, character strings, etc. The first example below loads an HTML file.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">

<html>
<head>
<title>Demo</title>
</head>
<body>

<h1 align=center>This is a demo</h1>

Paragraphs in HTML need not be closed.

This is called `omitted-tag' handling.
</body>
</html>

?- load_html('test.html', Term, []),
   pretty_print(Term).

[ element(html,
          [],
          [ element(head,
                    [],
                    [ element(title,
                              [],
                              [ 'Demo'
                              ])
                    ]),
            element(body,
                    [],
                    [ '\n',
                      element(h1,
                              [ align = center
                              ],
                              [ 'This is a demo'
                              ]),
                      '\n\n',
                      element(p,
                              [],
                              [ 'Paragraphs in HTML need not be closed.\n'
                              ]),
                      element(p,
                              [],
                              [ 'This is called `omitted-tag\' handling.'
                              ])
                    ])
          ])
].

The document is represented as a list, each element being either an atom representing CDATA or a term element(Name, Attributes, Content). Entities (e.g. &lt;) are expanded and included in the atom representing the element content or attribute value. [1]

[1] Up to SWI-Prolog 5.4.x, Prolog could not represent wide characters, and entities that did not fit in the Prolog character set were emitted as a term number(+Code). With the introduction of wide characters in the 5.5 branch this is no longer needed.
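
As a minimal sketch of working with this representation, the helper dom_text/2 below collects the CDATA of a parsed document in document order. The predicate name is hypothetical and not part of library(sgml). It assumes that content lists hold only element/3 terms and atomic CDATA, as described above; any other term is passed through as if it were CDATA.

```prolog
:- use_module(library(lists)).

% dom_text(+Content, -Atoms)
% Hypothetical helper (not part of library(sgml)).
% Atoms is the list of CDATA items found in Content, in document
% order, descending into the content of nested element/3 terms.
dom_text([], []).
dom_text([element(_Name, _Attributes, Children)|Rest], Atoms) :-
    !,
    dom_text(Children, Atoms0),
    dom_text(Rest, Atoms1),
    append(Atoms0, Atoms1, Atoms).
dom_text([CDATA|Rest], [CDATA|Atoms]) :-
    dom_text(Rest, Atoms).
```

For example:

```prolog
?- dom_text([element(p, [], ['Hello ', element(b, [], [world])])], Atoms).
Atoms = ['Hello ', world].
```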

2.1 `Goodies' Predicates

These predicates cover basic use of the library: converting entire, self-contained files in SGML, HTML, or XML into a structured term. They are based on load_structure/3.

load_sgml(+Source, -ListOfContent, :Options)
Calls load_structure/3 with the given Options, using the default option dialect(sgml).
load_xml(+Source, -ListOfContent, :Options)
Calls load_structure/3 with the given Options, using the default option dialect(xml).
load_html(+Source, -ListOfContent, :Options)
Calls load_structure/3 with the given Options, using the default option dialect(HTMLDialect), where HTMLDialect is html4 or html5 (default), depending on the Prolog flag html_dialect. Both dialects imply the option shorttag(false). The option dtd(DTD) is also passed, where DTD is the HTML DTD as obtained using dtd(html, DTD). See dtd/2.
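
The same predicates can be applied to data that does not come from a file. The sketch below parses XML held in a Prolog string by opening it as a stream and passing stream(In) as the Source, relying on load_structure/3 accepting a stream(Stream) term. The helper name xml_string_dom/2 is hypothetical, and the sketch assumes a SWI-Prolog version that provides open_string/2.

```prolog
:- use_module(library(sgml)).

% xml_string_dom(+String, -DOM)
% Hypothetical helper (not part of library(sgml)).
% Parse the XML text in String into a list of content terms.
% setup_call_cleanup/3 ensures the string stream is closed even
% if parsing raises an exception.
xml_string_dom(String, DOM) :-
    setup_call_cleanup(
        open_string(String, In),
        load_xml(stream(In), DOM, []),
        close(In)).
```

A call such as xml_string_dom("<a x='1'>hi</a>", DOM) is expected to bind DOM to a one-element list holding an element(a, Attributes, Content) term, in the representation described at the start of this section.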