Chad Perrin: SOB

26 March 2007

How to parse XML with regexen

Filed under: Geek @ 03:24

(Sterling here, guest blogging again. Sorry I missed the whole weekend.)

I’m working on a project that involves parsing OPML in PHP. I started out using the DOM, but I found that to be slow and the code was becoming ponderous. Then I thought, “when the going gets tough, the tough get regular expressions.”

It turns out that I needed just a couple of functions: one to search for an element and also parse out the attribute list and value (including child elements), and another to parse the attribute list for a specific attribute’s value.

Here they are in all their regexal gory glory:

function parseelement($text, $elementname, &$value, &$attributes)
{
    // Find the element: either the self-terminating form or an open/close pair
    preg_match('/<' . $elementname . '[^>]*\/>|<' . $elementname . '.*?>(.*?<' . $elementname . '[^\/]*?>.*<\/' . $elementname . '>.*?)*?.*?<\/' . $elementname . '>/s', $text, $matches);
    $element = isset($matches[0]) ? $matches[0] : ''; // Entire pattern matches the element
    preg_match('/<' . $elementname . '.*?>(.*)<\/' . $elementname . '>/s', $element, $matches);
    $value = isset($matches[1]) ? $matches[1] : ''; // First subpattern is the value, including child elements
    preg_match('/<' . $elementname . '(.*?)>/s', $element, $matches);
    $attributes = isset($matches[1]) ? $matches[1] : ''; // First subpattern is any attribute list
    return $element; // Return the whole element
}

function parseattributevalue($text, $attribute)
{
    preg_match('/(^|[\n\s])' . $attribute . '="(.*?)"/s', $text, $matches);
    return isset($matches[2]) ? $matches[2] : ''; // Second subpattern is the value, if any.
}

OK, I worked this all out without any assistance, so I should be able to explain it, right?

Let’s look at each of the regular expressions. The first one is used to find an element. Given the element name “elem”, it looks like this:

/<elem[^>]*\/>|<elem.*?>(.*?<elem[^\/]*?>.*<\/elem>.*?)*?.*?<\/elem>/s

The | delimits two choices here. To the left of the bar is the simpler, self-terminating form of an element: “<” followed by the element name, followed by anything up to “/>” except an intervening “>”. The “[^>]*” matches zero or more characters that are not “>”.

The right side is where it gets interesting. Here we need to find the matching end-element tag. So we have the open and close tags on either end, and in between we look for zero or more occurrences of anything (or nothing) followed by a nested pair of the same element, followed by anything else. Note the use of the non-greedy operator (?) to avoid extending our search beyond our matching end element. The exception to this is the content between the nested element’s open and close tags, where we want to get the whole thing.

The “s” on the end of the expression is a modifier for PHP’s regex processor that says “let the dot match newlines too,” so we don’t have to strip out newlines before we parse.
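Here's a minimal sketch of the element pattern at work, assuming the quantifiers read “[^>]*” and “.*?” as the description above implies. The sample strings and variable names are made up for illustration.

```php
<?php
// Element pattern for the name "elem", assuming the starred quantifiers described in the text.
$pattern = '/<elem[^>]*\/>|<elem.*?>(.*?<elem[^\/]*?>.*<\/elem>.*?)*?.*?<\/elem>/s';

// Left alternative: a self-terminating element.
preg_match($pattern, 'before <elem id="1"/> after', $selfclosing);

// Right alternative: open and close tags, with a newline inside the content.
$multiline = "before <elem id=\"2\">line one\nline two</elem> after";
preg_match($pattern, $multiline, $paired);

// Same pattern with the trailing "s" removed: "." no longer crosses the newline,
// so the multi-line element is not found at all.
$withoutS = preg_match(substr($pattern, 0, -1), $multiline);
```

With these inputs, $selfclosing[0] holds the self-terminating element, $paired[0] holds the whole two-line element, and $withoutS is 0.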

Once we have our full element’s text, we can then break out the value and attributes. I tried doing that within the first regular expression, but because of the way subpatterns are handled in preg_match, I couldn’t know which elements of the matches array I needed to look at, depending on which of the alternate patterns matched.
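To illustrate the problem, here is a sketch with a hypothetical combined pattern (not the one used in this post) that tries to capture attributes and value in a single pass. Which index holds the attribute list depends on which side of the | matched.

```php
<?php
// Hypothetical combined pattern: one capture group left of "|", two on the right.
$combined = '/<elem([^>]*)\/>|<elem(.*?)>(.*)<\/elem>/s';

// Left alternative matches: attributes land in group 1, and nothing follows.
preg_match($combined, '<elem a="1"/>', $selfclosing);

// Right alternative matches: group 1 is an empty placeholder,
// attributes land in group 2 and the value in group 3.
preg_match($combined, '<elem b="2">text</elem>', $paired);
```

The first call leaves two entries in the matches array and the second leaves four, so the calling code would have to work out which alternative fired before reading anything.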

For the value, the expression looks like this:

/<elem.*?>(.*)<\/elem>/s

Pretty simple. Non-greedy to the first “>” after the element name, then everything between there and the end element, applied only to the portion that matched our element text.
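A quick sketch of why the value group is greedy, assuming the value pattern is “(.*)” between the tags. The sample element is invented for illustration.

```php
<?php
// An element whose value contains a nested element of the same name.
$element = '<elem id="x"><elem>inner</elem> tail</elem>';

// Greedy value group: runs to the last closing tag, so the nested element is kept whole.
preg_match('/<elem.*?>(.*)<\/elem>/s', $element, $greedy);

// Non-greedy value group, for comparison: stops at the first closing tag.
preg_match('/<elem.*?>(.*?)<\/elem>/s', $element, $lazy);
```

Here $greedy[1] is the full inner content, while $lazy[1] is cut off after “inner”.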

For the attribute list:

/<elem(.*?)>/s

We have to do this separately from extracting the value, because the expression for that will not match a self-terminating element. In the case of a self-terminating element, we do get the “/” included, but who cares? The expression we’ll use to parse out attribute values will ignore it:

'/(^|[\n\s])' . $attribute . '="(.*?)"/s'

Here we introduce an alternate pattern at the beginning. Either “^” (beginning of the text) or a newline or whitespace character must precede the attribute name, so we don’t match attributes that end with the name we’re looking for. Then we just catch anything following the equal sign in quotes, non-greedy so we stop at the next quote.
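A small sketch of what the leading alternative buys us, assuming the value group reads “(.*?)”. The attribute string is invented, and it keeps the trailing “/” that a self-terminating element leaves behind.

```php
<?php
function parseattributevalue($text, $attribute)
{
    preg_match('/(^|[\n\s])' . $attribute . '="(.*?)"/s', $text, $matches);
    return isset($matches[2]) ? $matches[2] : ''; // Second subpattern is the value, if any.
}

// Attribute list as it would come from a self-terminating element, trailing "/" included.
$attributes = ' subtitle="Sub" title="Main"/';

// Anchored on preceding whitespace: skips "subtitle" and finds the real "title".
$anchored = parseattributevalue($attributes, 'title');

// Without the leading alternative, the first hit is the tail end of "subtitle".
preg_match('/title="(.*?)"/s', $attributes, $m);
$unanchored = $m[1];
```

The anchored version returns “Main”, while the unanchored pattern picks up “Sub” from the wrong attribute. The trailing “/” never enters into it.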

Given the above, I can hierarchically parse OPML by iteratively finding an “outline” element and parsing its attributes. If it has a value portion (which may or may not contain child elements), I recurse using just that portion.
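One way that loop might look, as a rough sketch rather than the project's actual code. The walkoutlines function, the sample OPML, and the substr-based way of stepping past each element are all assumptions made for illustration, and the two parsing functions are written with the quantifiers described above. The sample keeps its child outlines self-terminating, which is the only nesting checked here. A child with its own closing tag would likely end the parent's match early.

```php
<?php
function parseelement($text, $elementname, &$value, &$attributes)
{
    $value = '';
    $attributes = '';
    $pattern = '/<' . $elementname . '[^>]*\/>|<' . $elementname . '.*?>(.*?<' . $elementname
        . '[^\/]*?>.*<\/' . $elementname . '>.*?)*?.*?<\/' . $elementname . '>/s';
    if (!preg_match($pattern, $text, $matches)) {
        return ''; // No such element in this text
    }
    $element = $matches[0];
    if (preg_match('/<' . $elementname . '.*?>(.*)<\/' . $elementname . '>/s', $element, $matches)) {
        $value = $matches[1];
    }
    if (preg_match('/<' . $elementname . '(.*?)>/s', $element, $matches)) {
        $attributes = $matches[1];
    }
    return $element;
}

function parseattributevalue($text, $attribute)
{
    preg_match('/(^|[\n\s])' . $attribute . '="(.*?)"/s', $text, $matches);
    return isset($matches[2]) ? $matches[2] : '';
}

// Hypothetical driver: collect each outline's "text" attribute, indented by depth.
function walkoutlines($text, $depth = 0)
{
    $titles = array();
    while (true) {
        $element = parseelement($text, 'outline', $value, $attributes);
        if ($element === '') {
            break; // No more outline elements at this level
        }
        $titles[] = str_repeat('  ', $depth) . parseattributevalue($attributes, 'text');
        if ($value !== '') {
            // Recurse using just the value portion
            $titles = array_merge($titles, walkoutlines($value, $depth + 1));
        }
        // Step past the element we just handled
        $text = (string) substr($text, strpos($text, $element) + strlen($element));
    }
    return $titles;
}

$opml = '<body>'
    . '<outline text="News">'
    . '<outline text="Feed One" xmlUrl="http://example.com/one"/>'
    . '<outline text="Feed Two" xmlUrl="http://example.com/two"/>'
    . '</outline>'
    . '<outline text="Solo" xmlUrl="http://example.com/solo"/>'
    . '</body>';

$titles = walkoutlines($opml);
echo implode("\n", $titles), "\n";
```

For this sample, the result lists “News”, its two feeds indented beneath it, and then “Solo” back at the top level.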

We’ll file this under “Geek”. What’s more geeky than regexen?

Please forgive the indentation and styling of the code above. Something in the style sheet I guess. Maybe apotheon can sort it out when he returns.

All original content Copyright Chad Perrin: Distributed under the terms of the Open Works License