Regular Expressions in perl
For those not familiar with perl patterns, aka "regular expressions",
here is a brief synopsis. To fully understand how things match, you
also need to know that each line of the index files contains the tune
title twice: once in a "canonical" upper-case form with all non-letters
dropped, and again in the full original form. Between these are the URL
and the X: index. Patterns can take advantage of this order. Here is how
patterns work:
-
.*
-
The most useful pattern element is .*, which matches anything
at all (including nothing). So early.*morn will match any line in which
early is followed somewhere later by morn, in any capitalization. Since the
title is in each line twice, this will also match titles with morn
before early.
Most of the time, this is the only pattern element you will need.
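As a minimal sketch of why early.*morn matches titles in either word order, the following uses two invented index lines laid out as described above (canonical title, URL, X: index, full title). The titles, URLs, field separators and variable names are assumptions for illustration only, and the /i flag stands in for the service's case-insensitive matching.

```perl
use strict;
use warnings;

# Hypothetical index lines: canonical title, URL, X: index, full title.
# The real index file layout may differ in its details.
my $line_early_first = "EARLYONEMORNING http://example.org/a.abc 12 Early One Morning";
my $line_morn_first  = "MORNINGCOMESEARLY http://example.org/b.abc 3 Morning Comes Early";

# .* matches any run of characters (including none), so the pattern
# succeeds when "early" is followed anywhere later on the line by "morn".
my $early_first = ($line_early_first =~ /early.*morn/i) ? 1 : 0;

# Here "EARLY" at the end of the canonical title is followed later by
# "Morn" in the full title, so the same pattern matches this line too.
my $morn_first = ($line_morn_first =~ /early.*morn/i) ? 1 : 0;

print "Early One Morning:   $early_first\n";
print "Morning Comes Early: $morn_first\n";
```

Both lines match, which is the effect of the title appearing twice on each line.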
-
Letters, digits and spaces
-
These represent themselves, as literal characters. The index files
contain only spaces, not tabs. Note again that we ignore capitalization.
-
Metacharacters
-
These don't represent themselves, but stand for some special match.
Examples are:
- .
-
represents any single character.
Thus the pattern de.il would match strings such as
devil,
de'il,
de il,
that is, any occurrence
of de and il separated by exactly one character.
- [...]
-
matches any single character from a list.
Thus [abcd] matches any single character a,
b, c or d. As a special feature, -
between two characters means to match the entire range, so
[A-Z] will match any single upper-case letter, and
[0-9] will match any single digit. You can include
] in the list by putting it first, so [][]
will match either of the bracket characters. The same trick works for
-. Alternatively, either may be preceded by \, so [\-\]]
will match a hyphen or a right bracket. The
character \ is special inside [...], and
is described below.
- *
-
means any number (zero or more) of the preceding item.
Thus ab*c will match ac, abc,
abbc, and so on. [A-Za-z]* will match a
string of zero or more letters.
- +
-
means one or more of the preceding item.
Thus ab+c will match abc,
abbc, and so on,
but it will not match ac.
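The metacharacters above can be tried out with a short script. This is only a sketch: the helper name hits and the test strings are made up for illustration, and the /i flag is assumed to mirror the service's habit of ignoring capitalization.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $pattern matches anywhere in $text
# (ignoring case), otherwise 0.
sub hits {
    my ($pattern, $text) = @_;
    return ($text =~ /$pattern/i) ? 1 : 0;
}

my @checks = (
    [ 'de.il',   "devil"    ],  # . stands for exactly one character
    [ 'de.il',   "de'il"    ],
    [ 'de.il',   "deil"     ],  # nothing between de and il, so no match
    [ '[0-9]',   "Reel 7"   ],  # any single digit
    [ '[\-\]]',  "jig-reel" ],  # a hyphen or a right bracket
    [ 'ab*c',    "ac"       ],  # zero b's are allowed by *
    [ 'ab+c',    "ac"       ],  # + needs at least one b, so no match
    [ 'ab+c',    "abbc"     ],
);

for my $check (@checks) {
    my ($pattern, $text) = @$check;
    printf "%-8s %-10s %d\n", $pattern, $text, hits($pattern, $text);
}
```

Each output row shows the pattern, the text, and 1 for a match or 0 for no match.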
-
Escaped symbols
-
If preceded by \ (back-slash), letters have special meaning,
and extend the list of special sorts of matches. Here are some of the
more important escape sequences:
- \s
- matches any non-printing ("white space") character, such as space,
tab, and the CR and LF line separators. Since the ABC index files
contain spaces but not tabs, this is of limited use.
- \w
- matches an "alphanumeric" character: a letter, a digit,
or the underscore _ (included because it is common in computer identifiers).
It is shorthand for [A-Za-z0-9_].
- \b
- matches a "word boundary". That is, it matches only if there is one
of the \w characters on one side and not on the other.
So \blow will match low or lowly, but not below.
- \$
- matches a single $. Not too useful here.
- \.
- matches a single dot.
- \\
- matches a single \ (backslash).
In general, \ before a non-alphanumeric character cancels any
special meaning of that character, and causes an exact match. You should
not use \ before a letter or digit unless you know the special
meaning of that sequence, because the result will usually not match sensibly.
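A short sketch of two of these escapes, \b and \., follows. The titles and strings are invented for illustration, and /i is again assumed to mirror the service's case-insensitive matching.

```perl
use strict;
use warnings;

# Made-up titles for illustration.
my @titles = ('Low Down', 'Lowly Worm', 'Below the Hill');

# \b demands a word boundary just before "low", which rules out "Below".
my @boundary = grep { /\blow/i } @titles;

# Without \b, "low" may match anywhere, including inside "Below".
my @anywhere = grep { /low/i } @titles;

# \. matches only a literal dot, while a bare . matches any one character.
my $escaped_on_dot = ('stanford.edu' =~ /stanford\.edu/) ? 1 : 0;
my $escaped_on_x   = ('stanfordxedu' =~ /stanford\.edu/) ? 1 : 0;
my $bare_on_x      = ('stanfordxedu' =~ /stanford.edu/)  ? 1 : 0;

print "with boundary:    ", join(', ', @boundary), "\n";
print "without boundary: ", join(', ', @anywhere), "\n";
print "escaped dot on 'stanford.edu': $escaped_on_dot\n";
print "escaped dot on 'stanfordxedu': $escaped_on_x\n";
print "bare dot on 'stanfordxedu':    $bare_on_x\n";
```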
-
Groupings
-
The perl pattern match allows use of parentheses to surround a chunk of the
pattern, marking it for later use. This is of limited use with this tune
matching service, but there is one situation where it is useful: When
combined with the | symbol, which means "or", you can give alternatives.
Thus the pattern Charl(ie|ey|es) will match Charlie, Charley,
or Charles. This pattern could also be written Charl(ie|e(y|s)).
Or you could just use charl.*.
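To compare the three spellings of the pattern, here is a small sketch run against a made-up list of names. It suggests that the two grouped forms accept the same names, while charl.* is looser and also lets other names through.

```perl
use strict;
use warnings;

# Made-up names for illustration; "Charlotte" is there to show the
# difference between the grouped patterns and the looser charl.* form.
my @names = qw(Charlie Charley Charles Charlotte);

my @flat   = grep { /charl(ie|ey|es)/i }   @names;  # three alternatives
my @nested = grep { /charl(ie|e(y|s))/i }  @names;  # same set, nested groups
my @loose  = grep { /charl.*/i }           @names;  # anything starting charl

print "flat:   @flat\n";
print "nested: @nested\n";
print "loose:  @loose\n";
```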
-
Examples
- jenn(y|ie).*charl(ie|ey|es)
- This will find all the "Jenny's Welcome To Charlie" tunes in their various
variant spellings. Actually, any title with both names (in either order)
will be shown, due to the repetition of the title in the index files.
- stanford\.edu.*de[v']*il
- This takes advantage of the fact that the tune indexes have the URL before
the tune's full title, and looks for entries on a Stanford University machine
that have various forms of "devil" in their names.
The \. is used to say "match only a dot here",
cancelling its usual "any character" meaning.
The de[v']*il part matches such spellings as "deil", "de'il" or "devil"
(or "devvil" or "dev'vil" or ...).
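The two example patterns can be checked against a few invented index lines. This is a sketch under stated assumptions: the line layout (canonical title, URL, X: index, full title) follows the description at the top of this page, but the separators, URLs, X: numbers and variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical index lines; none of these URLs is a real tune collection.
my @index = (
    "JENNYSWELCOMETOCHARLIE http://example.org/a.abc 4 Jenny's Welcome to Charlie",
    "JENNIESWELCOMETOCHARLEY http://example.org/b.abc 9 Jennie's Welcome to Charley",
    "THEDEILAMANGTHETAILORS http://www.stanford.edu/hypothetical/tunes.abc 2 The De'il Amang the Tailors",
    "DEVILSDREAM http://example.org/c.abc 1 Devil's Dream",
);

# Both spellings of each name are accepted.
my @jenny = grep { /jenn(y|ie).*charl(ie|ey|es)/i } @index;

# Only lines whose URL contains stanford.edu, followed later on the line
# by some form of "devil" in the full title.
my @stanford_devil = grep { /stanford\.edu.*de[v']*il/i } @index;

print "jenny/charlie lines: ", scalar(@jenny), "\n";
print "$_\n" for @jenny;
print "stanford devil lines: ", scalar(@stanford_devil), "\n";
print "$_\n" for @stanford_devil;
```

The "Devil's Dream" line is skipped by the second pattern because its URL is not on stanford.edu, even though the title itself would match de[v']*il.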
For more details, or to learn about perl (which is the main language behind the Web),
visit O'Reilly's perl web site or the
Perl Institute's web site.
They have full manuals online, plus the perl sources,
and executables for many common computer systems, all available free.
There are also several well-written books on the language, which aren't free,
but are a good investment for any programmer.
(I expect that few if any musicians will get this far unless they are
also computer programmers. ;-)