Simple Regex Tutorial - DOI Edition

The DOI, or Digital Object Identifier, is a unique string created according to a standardised scheme, which can be assigned to any object and subsequently used to identify or track it. The adjective 'digital' here qualifies 'identifier', not 'object' - one could theoretically pin a DOI on absolutely anything: digital, physical or ephemeral - but in practice the ambiguity resolves itself, as DOIs depend on digital search for much of their usefulness. Most people are likely to encounter DOIs as references to academic journal articles, but over the years they have been assigned to quite a few interesting entities; for example, the NASA Planetary Systems Table, a regularly-updated collection containing the parameters of known exoplanets and their stars, bears the DOI of 10.26133/NEA12.

Given their ubiquity, it's useful to be able to validate DOIs using regular expressions, which in turn makes them a useful starting point for a basic regex tutorial.

Summary

A DOI has two parts, separated by a forward slash: a prefix, which consists of a directory indicator (always 10) followed by a registrant code that may have period-delimited subdivisions; and a suffix, which is nearly unconstrained in theory, although in practice DOI registration agencies impose many established conventions.

Example: 10 (directory indicator) . 1111 (registrant) / febs.16395 (suffix)
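
The structure above can be sketched in a few lines of JavaScript. This is only an illustration, not a validator: splitDoi and its field names are hypothetical, and the sketch assumes that the first forward slash ends the prefix and the first period ends the directory indicator.

```javascript
// Hypothetical helper: splits a DOI into the parts described above.
// Assumption: the first "/" ends the prefix, and the first "." ends
// the directory indicator. No validation is attempted here.
function splitDoi(doi) {
  const slash = doi.indexOf("/");
  if (slash === -1) return null; // no prefix/suffix separator
  const prefix = doi.slice(0, slash);
  const dot = prefix.indexOf(".");
  if (dot === -1) return null; // no registrant code
  return {
    directoryIndicator: prefix.slice(0, dot),
    registrantCode: prefix.slice(dot + 1),
    suffix: doi.slice(slash + 1),
  };
}

console.log(splitDoi("10.1111/febs.16395"));
```

Applied to the example above, the three fields come out as 10, 1111 and febs.16395.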

According to a blog post by Andrew Gilmartin of CrossRef, a major non-profit DOI registration agency, most modern DOIs can be validated using a relatively simple regular expression:

/^10\.\d{4,9}\/[-._;()\/:A-Z0-9]+$/i

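However, a small subset of DOIs, in particular those with the telltale registrant code 1002, owned by John Wiley and Sons, would fail that regex because their suffixes contain characters outside its permitted set. This old biochemistry textbook is a good example: its monstrous DOI, 10.1002/(SICI)1099-0844(199912)17:4<290::AID-CBF849>3.0.CO;2-P, includes the angle brackets < and >, which the bracket expression does not allow.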
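
To see the difference in practice, the simple expression can be run against a few DOIs. The sketch below assumes a JavaScript environment; the names modernDoi and samples are illustrative only.

```javascript
// Gilmartin's pattern for most modern DOIs, as quoted above.
const modernDoi = /^10\.\d{4,9}\/[-._;()\/:A-Z0-9]+$/i;

const samples = [
  "10.1111/febs.16395",
  "10.26133/NEA12",
  "10.1002/(SICI)1099-0844(199912)17:4<290::AID-CBF849>3.0.CO;2-P",
];

for (const doi of samples) {
  // test() returns true when the whole string satisfies the pattern
  console.log(`${doi} -> ${modernDoi.test(doi)}`);
}
```

The first two DOIs pass, while the Wiley DOI is rejected.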

Gilmartin suggests a relaxed alternative of 10\.1002\/[^\s]+$ to catch the peskier Wiley and Sons DOIs. Combining the two alternatives, we get:

/^10\.1002\/[^\s]+|10\.\d{4,9}\/[-._;()\/:A-Z0-9]+$/i

We will use this regex to examine a few parts of a regular expression's anatomy.

Anchors
Quantifiers
Bracket Expressions
Character Classes
The OR Operator
Flags
Character Escapes

Regex Components

Anchors

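The unescaped caret and dollar sign (^ and $) denote the beginning and end of the string being evaluated for a match. Anchors provide context and boundaries for the rest of the expression. One caveat applies to our DOI regex: the OR operator splits the whole expression in two, so the caret anchors only the first alternative and the dollar sign only the second. A 1002 DOI must therefore begin the string, while an ordinary DOI must end it. To apply both anchors to both alternatives, the alternatives can be wrapped in a group: ^(?:10\.1002\/[^\s]+|10\.\d{4,9}\/[-._;()\/:A-Z0-9]+)$.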
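
A quick way to see exactly what each anchor constrains in the combined expression is to feed it strings with extra text before or after the DOI. In the JavaScript sketch below, the grouped variant is not part of the original expression; it is an assumption about how one might anchor both alternatives at both ends with a non-capturing group.

```javascript
// The tutorial's combined expression, unchanged.
const combined = /^10\.1002\/[^\s]+|10\.\d{4,9}\/[-._;()\/:A-Z0-9]+$/i;

// Assumption, not from the original: a non-capturing group makes
// ^ and $ apply to both alternatives.
const grouped = /^(?:10\.1002\/[^\s]+|10\.\d{4,9}\/[-._;()\/:A-Z0-9]+)$/i;

const wiley = "10.1002/(SICI)1099-0844(199912)17:4<290::AID-CBF849>3.0.CO;2-P";

const inputs = [
  "10.1111/febs.16395",     // bare ordinary DOI
  "see 10.1111/febs.16395", // text before an ordinary DOI
  `${wiley} and more`,      // text after a 1002 DOI
];

for (const s of inputs) {
  console.log(JSON.stringify(s), combined.test(s), grouped.test(s));
}
```

The combined expression accepts all three inputs, whereas the grouped variant accepts only the bare DOI.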

Quantifiers

Quantifiers determine how many times the preceding item should be matched. They directly follow the item they apply to. In our example of /^10\.1002\/[^\s]+|10\.\d{4,9}\/[-._;()\/:A-Z0-9]+$/i, the braces {4,9} immediately following \d direct the regex engine to look for no fewer than four and no more than nine digits after the initial "10." of the prefix. Meanwhile, the + quantifier immediately following each of the two bracket expressions simply means that one or more characters satisfying the bracket expression must be matched at that point.

Example: 10.1038/nbt1010-1049 matches the expression, but 10.103/nbt1010-1049 which has only three digits following 10. does not.
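
The bounds of {4,9} and the minimum imposed by + can be probed with synthetic inputs. The JavaScript sketch below is illustrative: the registrant codes made of repeated 1s are invented purely to vary the digit count, and doiPattern is just a local name for the tutorial's expression.

```javascript
const doiPattern = /^10\.1002\/[^\s]+|10\.\d{4,9}\/[-._;()\/:A-Z0-9]+$/i;

// Build DOIs whose registrant code has n digits, to probe {4,9}.
// These registrant codes are synthetic, not real registrants.
for (const n of [3, 4, 9, 10]) {
  const doi = `10.${"1".repeat(n)}/nbt1010-1049`;
  console.log(`${n} digits: ${doiPattern.test(doi)}`);
}

// The + quantifier needs at least one suffix character.
console.log(doiPattern.test("10.1038/"));  // empty suffix
console.log(doiPattern.test("10.1038/n")); // one-character suffix
```

Only the four-digit and nine-digit registrant codes pass, and the empty suffix is rejected while a single-character suffix is accepted.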

Bracket Expressions

Bracket expressions match any of the set of characters enclosed in square brackets. There are two bracket expressions in our DOI regex, one in each alternative: [^\s] and [-._;()\/:A-Z0-9]. The former will match anything that isn't whitespace (^ acts as a negation operator inside the bracket expression, while \s denotes whitespace, so we're matching 'not whitespace'). This is an exceedingly permissive condition, which we are trying to use as seldom as possible, hence the OR operator. The latter will match alphanumerics and several symbols commonly used in CrossRef suffixes.

Example: 10.1111/FEBS.16395 matches the expression, but 10.1111/FEBS.16395.<< does not, because < is not part of the bracket expression for ordinary DOIs. 10.1002/FEBS.16395.<< does match, because John Wiley and Sons 1002 DOIs are subject to the relaxed rule (see The OR Operator).
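
The two bracket expressions can also be tried in isolation, on suffixes alone. In the JavaScript sketch below, the surrounding ^ and $ and the names ordinarySuffix and relaxedSuffix are additions for the demonstration and are not part of the tutorial's regex.

```javascript
// The two bracket expressions from the DOI regex, each tested on its own
// against a whole string (the ^...$ wrapping is added for this demo).
const ordinarySuffix = /^[-._;()\/:A-Z0-9]+$/i;
const relaxedSuffix = /^[^\s]+$/;

const suffixes = ["FEBS.16395", "FEBS.16395.<<", "FEBS 16395"];

for (const s of suffixes) {
  console.log(JSON.stringify(s), ordinarySuffix.test(s), relaxedSuffix.test(s));
}
```

The stricter set rejects the angle brackets, while the negated set rejects only the suffix containing a space.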

Character Classes

Character classes denote known sets of related characters. In our regex, we use \s to match whitespace and \d to match the digit category, as well as the usual ranges of letters and numbers indicated by A-Z and 0-9 inside the bracket expression. The ranges validate a match to any character within them, in this case any letter and any digit, in addition to the other symbols inside the bracket expression.

Example: 10.1111/FEBS.16395 matches the expression, but 10.1A11/FEBS.16395 does not because only digits (\d) are allowed in the registrant section of the prefix.
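
The classes and ranges used in the DOI regex can be checked one character at a time. The JavaScript sketch below uses illustrative names (isDigit, isSpace, isLetterOrDigit) and wraps each class in ^ and $ so that it tests exactly one character.

```javascript
const isDigit = /^\d$/;                // in JavaScript, \d is the same as [0-9]
const isSpace = /^\s$/;                // space, tab and other whitespace
const isLetterOrDigit = /^[A-Z0-9]$/i; // ranges, made case-insensitive by i

for (const ch of ["7", "A", "q", " ", "\t", "<"]) {
  console.log(JSON.stringify(ch), {
    digit: isDigit.test(ch),
    space: isSpace.test(ch),
    letterOrDigit: isLetterOrDigit.test(ch),
  });
}
```

The character < belongs to none of the three, which is why it can only appear in a DOI matched through the negated [^\s] expression.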

The OR Operator

Because the suffix criteria in the alternative regex are so loose and open-ended, we do not want to use it for all DOIs, only for those with a 1002 registrant code that have a chance of failing the usual regex. This minimises the expression's overall chance of matching stray erroneous characters at the end, such as sentence-ending full stops. Hence the OR operator, denoted by the vertical pipe character: |. By interposing the OR operator between the two alternatives, we ensure that both the small subset of eccentric Wiley and Sons DOIs, caught by the first alternative, ^10\.1002\/[^\s]+, and all ordinary DOIs, caught by the second alternative, 10\.\d{4,9}\/[-._;()\/:A-Z0-9]+$, can satisfy the overall expression.

Example: 10.1002/(SICI)1099-0844(199912)17:4<290::AID-CBF849>3.0.CO;2-P will match the expression while 10.1003/(SICI)1099-0844(199912)17:4<290::AID-CBF849>3.0.CO;2-P will not.
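
To make the alternation visible, each side of the pipe can be written as its own expression and tried in turn. This JavaScript sketch is only an illustration: whichAlternative is a hypothetical helper, and checking the alternatives one after the other is an approximation of how the combined expression behaves.

```javascript
// Each side of the | in the combined expression, written separately.
const wileyAlt = /^10\.1002\/[^\s]+/i;
const ordinaryAlt = /10\.\d{4,9}\/[-._;()\/:A-Z0-9]+$/i;

// Hypothetical helper: reports which alternative accepts the string.
function whichAlternative(doi) {
  if (wileyAlt.test(doi)) return "wiley";
  if (ordinaryAlt.test(doi)) return "ordinary";
  return "none";
}

const tail = "/(SICI)1099-0844(199912)17:4<290::AID-CBF849>3.0.CO;2-P";

console.log(whichAlternative("10.1002" + tail));
console.log(whichAlternative("10.1003" + tail));
console.log(whichAlternative("10.1111/febs.16395"));
```

The three calls report wiley, none and ordinary respectively: the same suffix is accepted under the 1002 registrant code and rejected under 1003.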

Flags

Flags set global rules for how the regular expression is evaluated. They are placed just after the expression, following the final /. In our regex, the flag used is i, which makes matching case-insensitive, so the A-Z range also matches lowercase letters. This lets the expression match DOIs written in lowercase, as they frequently are.

Example: Both 10.1111/febs.16395 and 10.1111/FEBS.16395 match the expression.
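
The effect of the flag is easiest to see by comparing the expression with and without it. The JavaScript sketch below uses two illustrative names, withoutFlag and withFlag, for otherwise identical patterns.

```javascript
const withoutFlag = /^10\.1002\/[^\s]+|10\.\d{4,9}\/[-._;()\/:A-Z0-9]+$/;
const withFlag = /^10\.1002\/[^\s]+|10\.\d{4,9}\/[-._;()\/:A-Z0-9]+$/i;

// A regex object exposes its flags as a string.
console.log(withFlag.flags);

for (const doi of ["10.1111/FEBS.16395", "10.1111/febs.16395"]) {
  console.log(doi, withoutFlag.test(doi), withFlag.test(doi));
}
```

Without the flag, only the uppercase form passes, because the bracket expression lists A-Z but not a-z.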

Character Escapes

Because certain characters take on special syntactic roles in regex, they must be escaped if they are to be matched literally. This is done by inserting a backslash, \, in front of such characters. In our example regex, the escaped characters are periods, like so: \. and forward slashes, like so: \/. Were we not to escape these characters, they would change the meaning of the regex: the . would match almost any character (by default, anything other than a line break), while an unescaped forward slash outside a bracket expression would be read as the closing delimiter of the /.../ literal, cutting the expression short.

Example: With an unescaped first . in ^10.1002\/[^\s]+|10\.\d{4,9}\/[-._;()\/:A-Z0-9]+$ the invalid DOI 10Z1002/FEBS.16395 matches the expression because the period matches any character. Escaping the period by placing a \ in front of it fixes the problem.
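
The example above can be reproduced directly. The JavaScript sketch below also shows, as an aside specific to JavaScript, that building the pattern from a string with the RegExp constructor removes the need to escape forward slashes, at the cost of doubling every backslash; the names unescaped, escaped and fromString are illustrative.

```javascript
const unescaped = /^10.1002\/[^\s]+|10\.\d{4,9}\/[-._;()\/:A-Z0-9]+$/i;
const escaped = /^10\.1002\/[^\s]+|10\.\d{4,9}\/[-._;()\/:A-Z0-9]+$/i;

const invalid = "10Z1002/FEBS.16395";
console.log(unescaped.test(invalid)); // the bare . accepts the Z
console.log(escaped.test(invalid));   // \. insists on a literal period

// Built from a string there is no / delimiter, so slashes need no
// escape, but each backslash must be doubled inside the string literal.
const fromString = new RegExp(
  "^10\\.1002/[^\\s]+|10\\.\\d{4,9}/[-._;()/:A-Z0-9]+$",
  "i"
);
console.log(fromString.test("10.1111/febs.16395"));
```

The unescaped version accepts the invalid string, the escaped version rejects it, and the string-built pattern behaves like the escaped literal.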

Author

The author may be found at https://github.com/tadcos29
