Chad Perrin: SOB

26 March 2007

How to parse XML with regexen

Filed under: Geek @ 03:24

(Sterling here, guest blogging again. Sorry I missed the whole weekend.)

I’m working on a project that involves parsing OPML in PHP. I started out using the DOM, but I found that to be slow and the code was becoming ponderous. Then I thought, “when the going gets tough, the tough get regular expressions.”

It turns out that I needed just a couple of functions: one to search for an element and also parse out the attribute list and value (including child elements), and another to parse the attribute list for a specific attribute’s value.

Here they are in all their regexal gory glory:

function parseelement($text, $elementname, &$value, &$attributes)
{
    // Find the element: either the self-terminating form or a matched open/close pair
    if (!preg_match('/<' . $elementname . '[^>]*\/>|<' . $elementname . '.*?>(.*?<' . $elementname . '[^\/]*?>.*<\/' . $elementname . '>.*?)*?.*?<\/' . $elementname . '>/s', $text, $matches)) {
        return null; // No such element in $text
    }
    $element = $matches[0]; // Entire pattern matches the element
    // This pattern finds nothing in a self-terminating element, so the value is then empty
    $value = preg_match('/<' . $elementname . '.*?>(.*)<\/' . $elementname . '>/s', $element, $matches) ? $matches[1] : ''; // First subpattern is the value, including child elements
    preg_match('/<' . $elementname . '(.*?)>/s', $element, $matches);
    $attributes = $matches[1]; // First subpattern is any attribute list
    return $element; // Return the whole element
}


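function parseattributevalue($text, $attribute)
{
    // The name must start the text or follow a newline or whitespace character
    if (!preg_match('/(^|[\n\s])' . $attribute . '="(.*?)"/s', $text, $matches)) {
        return null; // Attribute not present
    }
    return $matches[2]; // Second subpattern is the value, if any.
}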

OK, I worked this all out without any assistance, so I should be able to explain it, right?

Let’s look at each of the regular expressions. The first one is used to find an element. Given the element name “elem”, it looks like this:

/<elem[^>]*\/>|<elem.*?>(.*?<elem[^\/]*?>.*<\/elem>.*?)*?.*?<\/elem>/s

The | delimits two choices here. To the left of the bar is the simpler, self-terminating form of an element: “<” followed by the element name, followed by anything up to “/>” except an intervening “>”. The “[^>]*” matches zero or more characters that are not “>”.

The right side is where it gets interesting. Here we need to find the matching end-element tag. So we have the open and close tags on either end, and in between we look for zero or more occurrences of anything (or nothing) followed by a nested pair of the same element, followed by anything else. Note the use of the non-greedy operator (?) to avoid extending our search beyond our matching end element. The exception to this is the content between the nested element’s open and close tags, where we want to get the whole thing.

The “s” on the end of the expression is a modifier for PHP’s regex processor that makes the dot match newlines as well as every other character, so we don’t have to strip out newlines before we parse.
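Here is a quick sketch of what that modifier changes. The two-line sample element is made up for the comparison, and the pattern is a cut-down one rather than the full expression above.

```php
<?php
// Hypothetical two-line element used only for this comparison.
$text = "<elem>line one\nline two</elem>";

// Without "s" the dot stops at the newline, so the pattern cannot reach </elem>.
$without = preg_match('/<elem>(.*)<\/elem>/', $text);

// With "s" the dot also matches the newline and the element is found.
$with = preg_match('/<elem>(.*)<\/elem>/s', $text, $matches);

echo $without, ' ', $with, "\n"; // 0 1
```

With the modifier in place, $matches[1] holds both lines, newline included.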

Once we have our full element’s text, we can then break out the value and attributes. I tried doing that within the first regular expression, but because of the way subpatterns are handled in preg_match, I couldn’t know which elements of the matches array I needed to look at, depending on which of the alternate patterns matched.

For the value, the expression looks like this:

/<elem.*?>(.*)<\/elem>/s

Pretty simple. Non-greedy to the first “>” after the element name, then everything between there and the end element, applied only to the portion that matched our element text.
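To see why the value part is greedy, here is a small sketch on a made-up element with a nested child of the same name. The sample text and variable names are assumptions for illustration only.

```php
<?php
// Hypothetical sample: an element with a nested child of the same name.
$element = '<elem a="1">before <elem>inner</elem> after</elem>';

// Lazy up to the first ">", then greedy up to the last closing tag.
preg_match('/<elem.*?>(.*)<\/elem>/s', $element, $greedy);

// For contrast: a lazy value stops at the nested element's closing tag.
preg_match('/<elem.*?>(.*?)<\/elem>/s', $element, $lazy);

echo $greedy[1], "\n"; // before <elem>inner</elem> after
echo $lazy[1], "\n";   // before <elem>inner
```

Because the pattern is applied only to text that is already one whole element, running greedily to the last closing tag is safe here.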

For the attribute list:

/<elem(.*?)>/s

We have to do this separately from extracting the value, because the expression for that will not match a self-terminating element. In the case of a self-terminating element, we do get the “/” included, but who cares? The expression we’ll use to parse out attribute values will ignore it:

/(^|[\n\s])attrname="(.*?)"/s

Here we introduce an alternate pattern at the beginning. Either “^” (beginning of the text) or a newline or whitespace character must precede the attribute name, so we don’t match attributes that end with the name we’re looking for. Then we just catch anything following the equal sign in quotes, non-greedy so we stop at the next quote.
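A short sketch of the case that leading alternation guards against. The attribute list is invented, with a “subtext” attribute placed ahead of “text” on purpose.

```php
<?php
// Hypothetical attribute list in which "subtext" comes before "text".
$attributes = ' subtext="wrong one" text="right one" /';

// Without the leading alternation, the name also matches the tail of "subtext".
preg_match('/text="(.*?)"/s', $attributes, $loose);

// With it, the name must start the text or follow whitespace.
preg_match('/(^|[\n\s])text="(.*?)"/s', $attributes, $strict);

echo $loose[1], "\n";  // wrong one
echo $strict[2], "\n"; // right one
```

Note that the value moves from the first subpattern to the second once the alternation is added, which is why the function reads index 2.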

Given the above, I can hierarchically parse OPML by iteratively finding an “outline” element and parsing its attributes. If it has any value portion (which may or may not contain child elements), I recurse using just that portion.
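That loop might look roughly like the sketch below. Everything with a “sketch_” prefix is a hypothetical stand-in written for this example, the patterns are assumed to follow the descriptions above, and the OPML fragment and its URLs are invented. The sample keeps its leaf outlines self-terminating, as feed lists usually do; deeper or messier nesting is not exercised here.

```php
<?php
// Stand-in element finder (assumption: patterns as described in the post).
// Returns the whole element, or null when none is found.
function sketch_parse_element($text, $name, &$value, &$attributes)
{
    $open = '<' . $name;
    $close = '<\/' . $name . '>';
    $pattern = '/' . $open . '[^>]*\/>|' . $open . '.*?>(.*?' . $open
        . '[^\/]*?>.*' . $close . '.*?)*?.*?' . $close . '/s';
    if (!preg_match($pattern, $text, $matches)) {
        return null;
    }
    $element = $matches[0];
    // Empty value for a self-terminating element.
    $value = preg_match('/' . $open . '.*?>(.*)' . $close . '/s', $element, $m) ? $m[1] : '';
    preg_match('/' . $open . '(.*?)>/s', $element, $m);
    $attributes = $m[1];
    return $element;
}

// Stand-in attribute lookup; null when the attribute is absent.
function sketch_attribute($attributes, $name)
{
    return preg_match('/(^|[\n\s])' . $name . '="(.*?)"/s', $attributes, $m) ? $m[2] : null;
}

// Find each "outline" in turn, record its text, recurse into its value.
function sketch_walk($text, $depth = 0)
{
    $lines = array();
    while (($element = sketch_parse_element($text, 'outline', $value, $attributes)) !== null) {
        $lines[] = str_repeat('  ', $depth) . sketch_attribute($attributes, 'text');
        if (trim($value) !== '') {
            // Recurse using just the value portion.
            $lines = array_merge($lines, sketch_walk($value, $depth + 1));
        }
        // Continue with whatever follows the element just handled.
        $text = substr($text, strpos($text, $element) + strlen($element));
    }
    return $lines;
}

// Hypothetical OPML fragment.
$opml = <<<'XML'
<outline text="News">
  <outline text="Feed one" xmlUrl="http://example.com/one.xml"/>
  <outline text="Feed two" xmlUrl="http://example.com/two.xml"/>
</outline>
<outline text="Misc" xmlUrl="http://example.com/misc.xml"/>
XML;

$lines = sketch_walk($opml);
echo implode("\n", $lines), "\n";
// News
//   Feed one
//   Feed two
// Misc
```

Cutting the handled element off the front of the text is just one way to move on to the next sibling; tracking an offset would do the same job.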

We’ll file this under “Geek”. What’s more geeky than regexen?

Please forgive the indentation and styling of the code above. Something in the style sheet I guess. Maybe apotheon can sort it out when he returns.

23 March 2007

COPAcetic

Filed under: Liberty @ 01:20

Yesterday, U.S. District Judge Lowell Reed issued an order to permanently block enforcement of COPA on First and Fifth Amendment grounds (thanks, Kiltak). This law required content providers to verify the age of viewers if their content was “material that is harmful to minors”, defined as:

(6) Material that is harmful to minors.--The term `material that is harmful to minors' means any communication, picture, image, graphic image file, article, recording, writing, or other matter of any kind that is obscene or that--

(A) the average person, applying contemporary community standards, would find, taking the material as a whole and with respect to minors, is designed to appeal to, or is designed to pander to, the prurient interest;

(B) depicts, describes, or represents, in a manner patently offensive with respect to minors, an actual or simulated sexual act or sexual contact, an actual or simulated normal or perverted sexual act, or a lewd exhibition of the genitals or post-pubescent female breast; and

(C) taken as a whole, lacks serious literary, artistic, political, or scientific value for minors.

We should all be relieved. Regardless of how you personally feel about pornography, legislation of morality is by definition an attack on liberty. Morality should be governed by the conscience of the individual, or by parents in the case of minors. Not to mention the fact that this broad brush could conceivably cover sites dealing with non-pornographic material such as breast cancer prevention and sexual health and education.

Judge Reed also noted that the provisions of the law are not as effective for the intended purpose as content filters (though those aren’t perfect either). But that’s really beside the point. Government has no business dictating what your children should be able to view online. That is your job as a parent. If we abdicate that responsibility to government, we might as well surrender the whole job of parenting to the collective.

People will point to examples of very bad parenting to argue for government intervention to protect the children. Yes, some parents will allow their kids to view anything online — they probably don’t even know that it’s going on. I’m very sorry for children who are neglected by their parents, but we can’t let those cases justify swatting a fly with a sledgehammer. Someone needs to step up for those children specifically, without tying Nanny-state apron strings around the throats of the rest of the population.

Near the end of the adjudication Judge Reed wrote, “Perhaps we do the minors of this country harm if First Amendment protections, which they will with age inherit fully, are chipped away in the name of their protection.”

Perhaps?!

22 March 2007

PHP + OOP = a mess

Filed under: Geek @ 03:15

Hello, apotheonistas!

Sterling here. Not only is this the first-ever guest post on SOB, it’s also my first guest post anywhere. I’ll do my best to give you something approaching the apotheonic while Chad’s away. Not that I could imitate apotheon. But hopefully you’ll at least find enough interesting content here to keep your mouse off the kill switch for a few minutes.

Chad gave me the inspiration for my first topic in an IM chat before he left, characterizing the object syntax in PHP as

… a kludge for people who started with the wrong language to support current growth, but don’t have the resources to rewrite the software in another language.

That “kludge” has at least two aspects: (1) the language syntax itself, and (2) customary usage.

Syntax

PHP’s OOP syntax is better than Perl’s, and not quite as good as Java’s. Like Perlobj, OOP was added to PHP as an afterthought — and it shows. That gives it one advantage over Java, though, in that you aren’t forced to use classes for every piece of code you write (Ruby provides a much better solution to that problem, by making classes ubiquitous yet unobtrusive).

But that’s where PHP’s superiority over Java ends, IMHO. Oh, it does inheritance and encapsulation OK (even interfaces — which Ruby again outdoes via mixins). When it comes to polymorphism, though, Polly skipped PHPtown. Partly this has to do with the generally loose typing in PHP, but even though you can invoke stronger typing using Type Hinting, you can’t redeclare a function for different types of a parameter. That leaves you deciding between (a) using is_a (case on type, ugly), (b) creating different member functions to handle different parameter types (moving the case on type into the client code), or (c) trying to build any differences in manipulating the objects into methods of the classes being passed (gets clunky real fast).
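Option (a) tends to look something like this sketch. The Circle and Square classes and the describe() function are hypothetical names made up for illustration.

```php
<?php
// Hypothetical classes, used only to illustrate option (a).
class Circle
{
    public $radius = 1;
}

class Square
{
    public $side = 2;
}

// PHP will not accept two functions named describe() that differ only
// in parameter type, so one function has to switch on the type itself.
function describe($shape)
{
    if (is_a($shape, 'Circle')) {
        return 'circle of radius ' . $shape->radius;
    }
    if (is_a($shape, 'Square')) {
        return 'square of side ' . $shape->side;
    }
    return 'unknown shape';
}

echo describe(new Circle()), "\n"; // circle of radius 1
echo describe(new Square()), "\n"; // square of side 2
```

Every new shape means another branch in describe(), which is the ugliness the case on type brings with it.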

What passes for “overloading” in PHP seems to me the most bizarre feature of its OOP syntax. The magic methods (whenever you see something documented as “magic”, run) __get(), __set(), and __isset() essentially provide a mechanism for handling “properties” — i.e., a way to hide getter and setter methods under an ostensible variable reference. But they funnel all such access through a single channel, which forces you to (a) adopt a very uniform internal mechanism for access, or (b) do a case on name to decide how to handle each one. I don’t even want to think about using __call(). That’s just insane obfuscation along the lines of the COBOL ALTER statement, as far as I can tell.
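For a feel of the “case on name” route, here is a minimal sketch. The Temperature class and its two property names are hypothetical, chosen only because one of them has to be computed from the other.

```php
<?php
// Hypothetical class: reads and writes of inaccessible properties land in
// __get()/__set(), which then have to switch on the property name.
class Temperature
{
    private $celsius = 0.0;

    public function __get($name)
    {
        switch ($name) {
            case 'celsius':
                return $this->celsius;
            case 'fahrenheit':
                return $this->celsius * 9 / 5 + 32;
        }
        return null; // Unknown property name
    }

    public function __set($name, $value)
    {
        switch ($name) {
            case 'celsius':
                $this->celsius = (float) $value;
                break;
            case 'fahrenheit':
                $this->celsius = ((float) $value - 32) * 5 / 9;
                break;
        }
    }

    public function __isset($name)
    {
        return in_array($name, array('celsius', 'fahrenheit'), true);
    }
}

$t = new Temperature();
$t->fahrenheit = 212;   // Routed through __set()
echo $t->celsius, "\n"; // Routed through __get(); prints 100
```

Two properties already need two switch statements that must be kept in step, and nothing stops client code from assigning to a name neither of them handles.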

Usage

Most of the object-oriented PHP that I’ve seen really isn’t OOP at all. Usually, it employs a class merely to prevent naming conflicts among variable and function names. That’s what namespaces are for, duh. Oh, wait, PHP doesn’t have namespaces. So its authors create a singleton instance of a class, store its handle in a global variable, and reference member data and functions off that one global reference. It’s the baling wire and bubble gum version of namespaces, but it works.

But then, almost without fail, these PHP wizards find that requiring client code to access a global variable doesn’t sit well. So they wrap the singleton object within an API composed of straight-up functions. Which once again invades the global namespace and undoes half the benefit of using a class in the first place. Some good examples of this can be found in the WordPress core code — cache.php, for instance.
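The shape of that pattern, reduced to a sketch. The class and function names here are hypothetical and are not taken from WordPress or any other real code base.

```php
<?php
// Hypothetical cache class standing in for the pattern described above.
class SketchCache
{
    private $items = array();

    public function set($key, $value)
    {
        $this->items[$key] = $value;
    }

    public function get($key)
    {
        return isset($this->items[$key]) ? $this->items[$key] : false;
    }
}

// Step 1: one instance, parked in a global variable.
$GLOBALS['sketch_cache'] = new SketchCache();

// Step 2: plain functions that hide the global from client code,
// which puts two more names back into the global namespace.
function sketch_cache_set($key, $value)
{
    global $sketch_cache;
    $sketch_cache->set($key, $value);
}

function sketch_cache_get($key)
{
    global $sketch_cache;
    return $sketch_cache->get($key);
}

sketch_cache_set('greeting', 'hello');
echo sketch_cache_get('greeting'), "\n"; // hello
```

The class keeps its data private, but the wrapper functions need the same hand-picked prefixes that the class was supposed to make unnecessary.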

In short, when OOP gets added to PHP, you usually get POOHP.


All original content Copyright Chad Perrin: Distributed under the terms of the Open Works License