Clojure Guides: Parsing XML in Clojure

This work is licensed under a Creative Commons Attribution 3.0 Unported License (including images & stylesheets). The source is available on Github.

What Version of Clojure Does This Guide Cover?

This guide covers Clojure 1.12 and Leiningen 2.x.

Overview

Try as you might, XML is difficult to avoid. This is particularly true in the Java ecosystem. This guide will show you how to parse XML with the minimum amount of pain using the excellent tools available in Clojure.

Parsing NZB files

For the purpose of the tutorial I have chosen a simple and fairly well known XML file format: NZB. An NZB file is used to describe files to download from NNTP servers. In this tutorial we will take a basic NZB document and turn it into a Clojure map.

Let us start by creating a new project (for details on using Leiningen, see this guide:

$ lein new nzb

Now edit project.clj to contain the following:

(defproject nzb "0.1.0-SNAPSHOT"
  :description ""
  :url ""
  :license {:name "Eclipse Public License"
            :url "https://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.12.0"]
                 [org.clojure/data.zip "1.0.0"]])

We are including a dependency on clojure.data.zip, which is a "system for filtering trees, and XML trees in particular".

Make a dir called dev-resources at the root of your project, and create a file named example.nzb inside of it. This will be the file we use to test our code (taken from wikipedia). dev-resources is by convention the location to store file resources you use during development / testing.

Put the following XML in the example.nzb file:

<?xml version="1.0" encoding="iso-8859-1" ?>
<!-- <!DOCTYPE nzb PUBLIC "-//newzBin//DTD NZB 1.1//EN" "http://www.newzbin.com/DTD/nzb/nzb-1.1.dtd"> -->
<nzb xmlns="http://www.newzbin.com/DTD/2003/nzb">
 <head>
   <meta type="title">Your File!</meta>
   <meta type="tag">Example</meta>
 </head>
 <file poster="Joe Bloggs &lt;bloggs@nowhere.example&gt;" date="1071674882" subject="Here's your file!  abc-mr2a.r01 (1/2)">
   <groups>
     <group>alt.binaries.newzbin</group>
     <group>alt.binaries.mojo</group>
   </groups>
   <segments>
     <segment bytes="102394" number="1">123456789abcdef@news.newzbin.com</segment>
     <segment bytes="4501" number="2">987654321fedbca@news.newzbin.com</segment>
   </segments>
 </file>
</nzb>

Note The eagle eyed among you will notice that I have commented out the DOCTYPE declaration, as this causes an Exception to be thrown. I will show you how to get around this towards the end of the tutorial.

Let's write a high level test to illustrate more clearly what we are trying to do. Open up the test/nzb/core_test.clj file and make enter the following:

(ns nzb.core-test
  (:use clojure.test
        nzb.core)
  (:require [clojure.java.io :as io]))

(deftest test-nzb->map
  (let [input (io/resource "example.nzb")]
    (is (= {:meta {:title "Your File!"
                   :tag "Example"}
            :files [{:poster "Joe Bloggs <bloggs@nowhere.example>"
                     :date 1071674882
                     :subject "Here's your file!  abc-mr2a.r01 (1/2)"
                     :groups ["alt.binaries.newzbin"
                              "alt.binaries.mojo"]
                     :segments [{:bytes 102394
                                 :number 1
                                 :id "123456789abcdef@news.newzbin.com"}
                                {:bytes 4501
                                 :number 2
                                 :id "987654321fedbca@news.newzbin.com"}]}]}
           (nzb->map input)))))

This should be fairly self-explanatory, I have directly translated the XML into Clojure data structures of maps and vectors. If we were to just use the clojure.xml library to parse the NZB file, we get a tree based representation. For example:

$ lein repl
...
user=> (require '[clojure.java.io :as io] '[clojure.xml :as xml])
nil
user=> (-> "example.nzb" io/resource io/file xml/parse clojure.pprint/pprint)
{:tag :nzb,
 :attrs {:xmlns "http://www.newzbin.com/DTD/2003/nzb"},
 :content
 [{:tag :head,
   :attrs nil,
   :content
   [{:tag :meta, :attrs {:type "title"}, :content ["Your File!"]}
    {:tag :meta, :attrs {:type "tag"}, :content ["Example"]}]}
  {:tag :file,
   :attrs
   {:subject "Here's your file!  abc-mr2a.r01 (1/2)",
    :date "1071674882",
    :poster "Joe Bloggs <bloggs@nowhere.example>"},
   :content
   [{:tag :groups,
     :attrs nil,
     :content
     [{:tag :group, :attrs nil, :content ["alt.binaries.newzbin"]}
      {:tag :group, :attrs nil, :content ["alt.binaries.mojo"]}]}
    {:tag :segments,
     :attrs nil,
     :content
     [{:tag :segment,
       :attrs {:number "1", :bytes "102394"},
       :content ["123456789abcdef@news.newzbin.com"]}
      {:tag :segment,
       :attrs {:number "2", :bytes "4501"},
       :content ["987654321fedbca@news.newzbin.com"]}]}]}]}
nil

That's great, and can sometimes be enough. But I would rather work with the representation I have in the test. To do that, we need a way of traversing this tree and picking out the pieces of information we require. The clojure.zip and clojure.data.zip libraries are perfect for this. The documentation for the data.zip library on github is nice, but it initially left me a little confused as to how to go about using the library (not being familiar with zippers).

A Simple Example

Zippers allow you to easily traverse a data structure. Let's play with it in a REPL and start with the root node of our NZB file:

(require '[clojure.java.io :as io])
(require '[clojure.xml :as xml])
(require '[clojure.zip :as zip])
(require '[clojure.data.zip.xml :as zip-xml])

(def root (-> "example.nzb" io/resource io/file xml/parse zip/xml-zip))

Now we have a zipper for the root element of our document, we can start traversing it for information. The two main functions we will use for this are xml-> and xml1->. The former returns a sequence of items based on the predicates given to it, the latter returning the first matching item. As an example, let's get the meta data from the NZB document root and create a Clojure map:

(into {}
      (for [m (zip-xml/xml-> root :head :meta)]
        [(keyword (zip-xml/attr m :type))
         (zip-xml/text m)]))
;; => {:title "Your File!", :tag "Example"}

A couple of things are happening here. First of all we use xml-> to return a sequence of <meta> tags that live under the <head> tag:

(zip-xml/xml-> root :head :meta)

We use the for list comprehension macro to evaluate each item in the sequence. For each item we find the contents of the :type attribute using the attr function:

(keyword (zip-xml/attr m :type))

This returns us the contents of the attribute as a string, which we turn into a keyword to use as the key in the map. We then use the text function to get the textual contents of the meta tag:

(zip-xml/text m)

We make a tuple of these values, and pass the resulting sequence to into to build the map.

Putting It Together

Using only these functions, we can parse the raw XML into the Clojure data structure from our unit test. If you like, open ./src/nzb/core.clj, and make the changes as you read along.

First let's define our nzb->map function from the test, and pull in the code we have already written for parsing the metadata of the NZB:

(ns nzb.core
  (:require [clojure.xml :as xml]
            [clojure.java.io :as io]
            [clojure.zip :as zip]
            [clojure.data.zip.xml :as zip-xml]))

(defn meta->map
  [root]
  (into {}
        (for [m (zip-xml/xml-> root :head :meta)]
          [(keyword (zip-xml/attr m :type))
           (zip-xml/text m)])))

(defn file->map
  [file]
  ;; TODO
)

(defn nzb->map
  [input]
  (let [root (-> input
                 io/input-stream
                 xml/parse
                 zip/xml-zip)]
    {:meta  (meta->map root)
     :files (mapv file->map (zip-xml/xml-> root :file))}))

The only new thing here is the use of io/input-stream to allow us to use anything as input that the io/input-stream supports. These are currently OutputStream, File, URI, URL, Socket, byte array, and String arguments. See the clojure.java.io docs for details.

Now let's fill in the file->map function:

(defn segment->map
  [seg]
  {:bytes  (parse-long (zip-xml/attr seg :bytes))
   :number (parse-long (zip-xml/attr seg :number))
   :id     (zip-xml/xml1-> seg zip-xml/text)})

(defn file->map
  [file]
  {:poster   (zip-xml/attr file :poster)
   :date     (parse-long (zip-xml/attr file :date))
   :subject  (zip-xml/attr file :subject)
   :groups   (vec (zip-xml/xml-> file :groups :group zip-xml/text))
   :segments (mapv segment->map
                   (zip-xml/xml-> file :segments :segment))})

Again, nothing new. We simply pick out the pieces of the document we wish to process using a combination of the xml1->, xml->, attr, and text functions. Run the test, and it should pass.

Prevent Parsing the DTD

Interestingly, if we uncomment the DTD declaration in the example.nzb file, our code now explodes with an Exception:

org.xml.sax.SAXParseException: The markup declarations contained or pointed to by the document type declaration must be well-formed

We can fix this by swapping out the SAXParserFactory and setting a feature to not validate the DTD. Here's how:

Update the ns declaration to include some required classes:

(ns nzb.core
  (:require [clojure.xml :as xml]
            [clojure.java.io :as io]
            [clojure.zip :as zip]
            [clojure.data.zip.xml :as zip-xml])
  (:import (javax.xml.parsers SAXParser SAXParserFactory)))

Define a function to switch out the SAXParserFactory:

(defn startparse-sax
  "Don't validate the DTDs, they are usually messed up."
  [s ch]
  (let [factory (SAXParserFactory/newInstance)]
    (.setFeature factory "http://apache.org/xml/features/nonvalidating/load-external-dtd" false)
    (let [^SAXParser parser (.newSAXParser factory)]
      (.parse parser s ch))))

Update our nzb->map definition to use it:

(defn nzb->map
  [input]
  (let [root (-> input
                 io/input-stream
                 (xml/parse startparse-sax)
                 zip/xml-zip)]
    {:meta  (meta->map root)
     :files (mapv file->map (zip-xml/xml-> root :file))}))

Yay, our test passes again.

$ lein test

lein test nzb.core-test

Ran 1 tests containing 1 assertions.
0 failures, 0 errors.

Query Predicates

There are a few other useful functions in the clojure.data.zip.xml ns we haven't yet looked at, namely: text=, attr=, and tag=. These functions allow you to construct query predicates to run against a given node. As an example, let's pull out the first file segment from the example.nzb file using the attr= function:

(zip-xml/xml1-> root
                :file
                :segments
                :segment
                (zip-xml/attr= :number "1")
                zip-xml/text)
"123456789abcdef@news.newzbin.com"

From the root node of the document we reach down into :file, :segments, and :segment in turn, then use the attr= query predicate to match a :segment with a value of "1".

Interestingly enough, the other two query predicates have shortcuts for their use. You have already been using the tag= query predicate every time you use a keyword to locate a tag. To use the text= predicate easily, just use a string. For example, to retrieve the second :segment based on its content of 987654321fedbca@news.newzbin.com:

(zip-xml/xml1-> root
                :file
                :segments
                :segment
                "987654321fedbca@news.newzbin.com")
;; ... the resulting node

Finally, you can combine these query predicates to match multiple things on a given node by using a vector:

(zip-xml/xml1-> root
                :file
                :segments
                :segment
                [(zip-xml/attr= :number "1")
                 (zip-xml/attr= :bytes "102394")]
                zip-xml/text)
"123456789abcdef@news.newzbin.com"

Here we are matching on both the :number attribute being "1", and the :bytes attribute being "102394". Obviously, you can use strings here to match against content too.

Creating New Predicates

OK, now let's suppose we want to use some kind of numerical comparison in our XML (like we might do with XPath). As it stands, we have no way to do that with the built-in functions but we can easily define our own.

Let's start with a general function for comparing attribute values:

(defn attr-fn
  [attrname f test-val & [conv-fn]]
  (fn [loc]
    (let [conv-fn (or conv-fn identity)
          val (conv-fn (zip-xml/attr loc attrname))]
      (f val test-val))))

This function takes an attribute name (attrname), a function for making a comparison (f), a value to test agains (test-val) and optionally a conversion function. Imagine our example.nzb file had 100 segments, and we only wanted to get segments over 75. We could now achieve this using our general function:

(zip-xml/xml-> root
               :file
               :segments
               :segment
               (attr-fn :number > 75 parse-long)
               zip-xml/text)

Let's provide a helper for this to make the syntax clearer:

(defn attr>
  [attrname val]
  (attr-fn attrname > val parse-long))

(zip-xml/xml-> root
               :file
               :segments
               :segment
               (attr> :number 75)
               zip-xml/text)

We could build a whole suite of helper functions for examining XML nodes, if we are unlucky enough to be required to do so :)

Conclusion

I hope these simple examples have given you an idea of the ease with which you can process XML using Clojure, and how simple it is to extend the tools already provided in interesting directions.

Contributors

Gareth Jones, 2012 (original author) Sean Corfield, 2023 (updated to Clojure 1.11 etc)

« Middleware in Clojure || Growing a DSL with Clojure »

Parsing XML in Clojure