Matching a URL with Regular Expression Tutorial

This guide was created to help both myself and you learn more about regular expressions and how to use them. A regular expression is a sequence of characters that defines a search pattern.

Summary

The regular expression, or regex for short, below is a search pattern for matching a URL. Because it is written between two slashes, it is called a regular expression literal. It is made up of anchors, grouping constructs, bracket expressions, and quantifiers. We will look at each component of the regular expression so that the whole pattern makes more sense.
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
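
To see the pattern in action, here is a minimal sketch that runs it against a few strings. The variable names and the sample inputs (including the placeholder domain example.com) are assumptions made up for illustration and are not part of the original regex.

```javascript
// The regex literal from this tutorial, stored in a variable.
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Hypothetical sample inputs, chosen only for illustration.
const samples = [
  "https://www.example.com/path/to/page",
  "example.com",
  "not a url",
];

// test() returns true when the whole string fits the pattern.
const results = samples.map((sample) => urlPattern.test(sample));

samples.forEach((sample, i) => {
  console.log(`${sample} -> ${results[i]}`);
});
```

The first two samples should be accepted and the third rejected, because it contains no period for the pattern to match.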

Anchors
Quantifiers
Grouping Constructs
Bracket Expressions
Character Classes
Flags
Character Escapes
Sources

Regex Components

Anchors

Anchors match a position, not a character. The ^ is the beginning anchor and the $ is the end anchor. Because our regex starts with ^ and ends with $, the entire string must fit the pattern between the two anchors.

If the multiline flag (m) is used, the anchors match at the beginning and end of each line in the string; otherwise they match only at the beginning and end of the whole string.
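
The difference the multiline flag makes can be sketched with a short two-line string. The string and variable names here are hypothetical and exist only for this illustration.

```javascript
// A hypothetical two-line input; \n is the line break.
const text = "first line\nsecond line";

// Without the m flag, ^ only matches at the very start of the string.
const startWithoutM = /^second/.test(text);

// With the m flag, ^ also matches right after a line break.
const startWithM = /^second/m.test(text);

// $ behaves the same way for line endings.
const endWithoutM = /first line$/.test(text);
const endWithM = /first line$/m.test(text);

console.log(startWithoutM, startWithM, endWithoutM, endWithM);
```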

Quantifiers

Quantifiers set how many times the preceding token must appear. Our regular expression uses a few different quantifiers. We use the question mark symbol ?, the asterisk symbol *, and the plus symbol +.

The numbers inside the curly brackets {} are also a quantifier. They set the minimum and maximum number of times the preceding bracket expression can repeat. Since our regex contains {2,6}, it matches between 2 and 6 characters from that bracket expression. Note that the quantifier is written without a space; {2, 6} would not be treated as a quantifier.

The question mark symbol ? matches the preceding token 0 or 1 times, making it optional. When a ? is placed directly after another quantifier (for example +? or *?), it instead makes that quantifier lazy, meaning it will match as few characters as possible. Our regex only uses ? in the optional sense.

The asterisk symbol * matches 0 or more instances of the preceding token.

The plus symbol + matches 1 or more instances of the preceding token.
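
Each quantifier can be tried on its own with a small pattern. The patterns and inputs below are simplified, hypothetical examples rather than pieces copied from the URL regex.

```javascript
// ? makes the preceding "s" optional.
const optional = [/^https?$/.test("http"), /^https?$/.test("https")];

// * allows zero or more "b" characters.
const zeroOrMore = [/^ab*$/.test("a"), /^ab*$/.test("abbb")];

// + requires at least one "b".
const oneOrMore = [/^ab+$/.test("a"), /^ab+$/.test("ab")];

// {2,6} requires between 2 and 6 lowercase letters.
const ranged = [
  /^[a-z]{2,6}$/.test("c"),
  /^[a-z]{2,6}$/.test("com"),
  /^[a-z]{2,6}$/.test("toolong"),
];

// A ? placed after another quantifier makes it lazy.
const greedyMatch = "aaa".match(/a+/)[0];
const lazyMatch = "aaa".match(/a+?/)[0];

console.log(optional, zeroOrMore, oneOrMore, ranged, greedyMatch, lazyMatch);
```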

Grouping Constructs

As regular expressions get more complex, we group parts of the pattern with parentheses (). These () form what are known as subexpressions or grouping constructs. Our regex contains four of them.

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

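In our first grouping construct "(https?:\/\/)" we are looking for the 'http://' or 'https://' portion of a URL. The ? after the s makes the s optional, and the ? after the closing parenthesis makes the whole group optional.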

Another confusing construct is the last one:

"([\/\w \.-]*)*"

When broken down, the bracket expression in this construct matches any single character that is a forward slash /, a word character (the character class '\w'), a space, a period, or a hyphen. It is followed by the * quantifier, which matches 0 or more instances of the preceding token. The whole group is then followed by a second asterisk *, meaning the group itself can repeat 0 or more times.
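
One way to see what each grouping construct captures is to call match() and read the groups from the result. This is a sketch only: the names protocol, host, tld, and path are labels assumed here for readability, and the URL is a made-up placeholder.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// A hypothetical URL used only for illustration.
const match = "https://www.example.com/docs/index.html".match(urlPattern);

// Index 0 is the full match; indexes 1 to 4 are the four groups.
const [fullMatch, protocol, host, tld, path] = match;

console.log(protocol); // "https://"
console.log(host);     // "www.example"
console.log(tld);      // "com"
console.log(path);     // "/docs/index.html"
```

Note that the second group stops at "www.example" because the pattern needs a literal period before the third group can match "com".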

Bracket Expressions

Search patterns inside square brackets [] are known as bracket expressions. In our URL-matching example we use bracket expressions to search for alphanumeric characters, in part by using character classes, which we will go over below. The bracket expression [\da-z\.-] matches any one character that is a digit 0-9, a lowercase letter a-z, a period, or a hyphen.
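
As a rough illustration, the bracket expression from the second group can be tested by itself. The anchors and the + quantifier are added here so the whole string is checked, and the inputs are hypothetical.

```javascript
// The bracket expression from the second group, repeated one or more times.
const hostChars = /^[\da-z\.-]+$/;

// Lowercase letters, digits, periods, and hyphens are all allowed.
const allowed = hostChars.test("my-site.2024");

// An uppercase letter or an underscore is outside the set.
const rejected = hostChars.test("My_Site");

console.log(allowed, rejected);
```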

Character Classes

Character classes match any one character from a predefined set, such as digits or word characters. Our regular expression uses the character class "\d" in the second grouping construct. "\d" matches any digit character, so it is the same as searching for [0-9]. The other character class in our regular expression is "\w", in the fourth grouping construct. This character class matches any word character, meaning alphanumeric characters and the underscore, so it is the same as searching for [A-Za-z0-9_].

One other character class symbol to know is the period/dot (.) which matches any character except line breaks.
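
The three character classes above can be compared side by side. The sample string and variable names below are invented for this sketch.

```javascript
// A hypothetical input containing letters, digits, and an underscore.
const sample = "Room 42, user_7";

// \d with the g flag collects every digit.
const digits = sample.match(/\d/g);

// \w+ collects runs of word characters (letters, digits, underscore).
const words = sample.match(/\w+/g);

// The dot matches any character except a line break.
const dotMatchesLetter = /^a.c$/.test("abc");
const dotMatchesNewline = /^a.c$/.test("a\nc");

console.log(digits, words, dotMatchesLetter, dotMatchesNewline);
```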

Flags

Although we don't have flags at the end of our regular expression, you may still encounter them or want to use them when validating input with regex. Flags change how the search is performed. You can add more than one, in any order, and they are written after the closing slash as part of the regex.

g - global search - looks for all possible matches
i - case-insensitive - ignores case of the expression
m - multiline input - specifies that multiline input string should be treated as such. ^ and $ anchors match at the start or end of any line within the string instead of start and end of the whole string.

You would add a flag to the end of the regular expression as follows:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/g
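
Here is a small sketch of how the i, g, and m flags change the behavior of the URL pattern. Building the flagged copies with new RegExp(pattern.source, flags) is just one convenient approach chosen for this example, and the inputs are hypothetical.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// i: a case-insensitive copy built from the same pattern source.
const caseInsensitive = new RegExp(urlPattern.source, "i");
const upper = "HTTPS://EXAMPLE.COM";
const strictResult = urlPattern.test(upper);
const relaxedResult = caseInsensitive.test(upper);

// g and m together: find every line that is a URL in a multi-line string.
const lines = "example.com\nexample.org";
const wholeString = urlPattern.test(lines);
const perLine = lines.match(new RegExp(urlPattern.source, "gm"));

console.log(strictResult, relaxedResult, wholeString, perLine);
```

Without any flags the uppercase URL and the two-line string are both rejected, since the pattern only lists lowercase letters and does not allow line breaks.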

Character Escapes

Character escapes use the backslash \ . Placing a backslash before a special character tells the search pattern to treat that character literally. For instance, "\." tells the system to match an actual period rather than treating it as the special character period (.), which would match any character except line breaks.

In our first grouping construct '(https?:\/\/)' we are looking for two forward slashes. Because a forward slash also marks the end of a regular expression literal, we have to escape each one by putting a backslash in front of it.
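
The effect of escaping can be checked directly. The short patterns below are simplified, hypothetical examples; the last two lines also show, as an aside, that a pattern built from a string with the RegExp constructor needs no escape for / but does need doubled backslashes.

```javascript
// An unescaped dot matches any character, so "abc" passes.
const unescapedDot = /^a.c$/.test("abc");

// An escaped dot only matches a literal period.
const escapedDotOnLetter = /^a\.c$/.test("abc");
const escapedDotOnPeriod = /^a\.c$/.test("a.c");

// In a regex literal, each forward slash must be escaped.
const literalSlashes = /^https?:\/\/$/.test("https://");

// In a string passed to RegExp, / is not special, but \ must be doubled.
const fromStringSlashes = new RegExp("^https?://$").test("http://");
const fromStringDot = new RegExp("^a\\.c$").test("a.c");

console.log(unescapedDot, escapedDotOnLetter, escapedDotOnPeriod);
console.log(literalSlashes, fromStringSlashes, fromStringDot);
```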

Sources

https://regexr.com/
https://coding-boot-camp.github.io/full-stack/computer-science/regex-tutorial
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions/Character_classes
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#using_the_global_search_flag_with_exec

Author

Derek Meduri
[email protected]
https://github.com/derekmeduri

