Kotlin Regex Pattern Matching

Matching Strings using regular expressions (REGEX) is a difficult topic. Regex strings are often difficult to understand and debug. They often require extensive testing to make sure that the regex is matching what it is supposed to match.

Kotlin goes out of its way to avoid making developers use regex. For example, the split() method of String does not require a regex (unlike its Java counterpart). Doing so reduces bugs and helps keep the code more readable in general. When we need to use a regex, Kotlin has an explicit Regex type.

One advantage of having a regex type is that code is immediately more readable.

val regex = """\d{5}""".toRegex()

Notice a few things about this String. First, we use the triple quoted, or raw, string to define the regular expression. This helps us avoid bugs caused by improper escaping of the regex string. Also, the string has a toRegex() method that converts the String to a Regex object.

The Regex object comes packed with its own methods that are used for pattern matching.

regex.containsMatchIn("My string 00000")
regex.findAll("00000, 000121, 23213")

Of course there are many other methods found on the Regex object, but see the Kotlin documentation for more details: http://kotlinlang.org/api/latest/jvm/stdlib/kotlin.text/-regex/index.html

Regex Tables

Below are some common regex symbols, meta symbols, and quantifiers as presented in Oracle Certified Professional Java SE 7 Programmer Exams 1Z0-804 and 1Z0-805 A Comprehensive OCPJP 7 Certification Guide by Ganesh and Sharma.

Common Symbols

Matches either x or y
Symbol Meaning
^expr Matches expr at beginning of the line
expr$ Matches expr at end of line
. Matches any single character (exception the newline character)
[xyz] Matches either x, y, or z
[p-z] Matches either any character from p to z or any digit from 1 to 9
[^p-z] ‘^’ as the first character negates the pattern. This will match anything outside of the range p-z
xy Matches x followed by y
x|y

Common Meta Symbols

\d Matches digits ([0-9])
\D Matches non-digits
\w Matches word characters
\W Matches non-word characters
\s Matches whitespaces [\t\r\f\n]
\S Matches non-whitespaces
\b Matches word boundary when outside of a bracket. Matches backslash when placed in a bracket
\B Matches non-word boundary
\A Matches beginning of string
\Z Matches end of String

Common Quantifiers

expr? Matches 0 or 1 occurrence of expr (expr{0,1})
expr* Matches 0 or more occurrences of expr (expr{0,})
expr+ Matches 1 or more occurrences of expr (expr{1,})
expr{x, y} Matches between x and y occurrences of expr
expr{x, } Matches x or more occurrences of expr

Putting it Together

Here is an example program that uses Regex in Kotlin.

fun main(args : Array<String>){
    val symbols = mapOf(
            "^expr" to "Matches expr at beginning of the line",
            "expr$" to "Matches expr at end of line",
            "." to "Matches any single character (exception the newline character)",
            "[xyz]" to "Matches either x, y, or z",
            "[p-z]" to "Specifies a range. Matches any character from p to z",
            "[p-z1-9]" to "Matches either any character from p to z or any digit from 1 to 9",
            "[^p-z]" to "'^' as the first character negates the pattern. This will match anything outside of the range p-z",
            "xy" to "Matches x followed by y",
            "x|y" to "Matches either x or y")

    val metaSymbols = mapOf(
            "\\d" to "Matches digits ([0-9])",
            "\\D" to "Matches non-digits",
            "\\w" to "Matches word characters",
            "\\W" to "Matches non-word characters",
            "\\s" to "Matches whitespaces [\\t\\r\\f\\n]",
            "\\S" to "Matches non-whitespaces",
            "\\b" to "Matches word boundary when outside of a bracket. Matches backslash when placed in a bracket",
            "\\B" to "Matches non-word boundary",
            "\\A" to "Matches beginning of string",
            "\\Z" to "Matches end of String")


    val quantifiers = mapOf(
            "expr?" to "Matches 0 or 1 occurrence of expr (expr{0,1})",
            "expr*" to "Matches 0 or more occurrences of expr (expr{0,})",
            "expr+" to "Matches 1 or more occurrences of expr (expr{1,})",
            "expr{x, y}" to "Matches between x and y occurrences of expr",
            "expr{x, }" to "Matches x or more occurrences of expr")

    val format = "%-10s\t%s"
    val func = {entry : Map.Entry<String, String> -> println(format.format(entry.key, entry.value)) }

    println("Symbols")
    symbols.entries.forEach(func)

    println("\nMeta Symbols")
    metaSymbols.entries.forEach(func)

    println("\nQuantifiers")
    quantifiers.entries.forEach(func)

    //Create a regex object
    println("\nTesting regex: ^Matches")
    val regex = "^Matches".toRegex()
    symbols.entries.forEach({it ->
        //The Regex Type has a Number of Pattern Matching Methods
        val matchResult = regex.containsMatchIn(it.value)
        println("$matchResult => ${it.value}")
    })
}

Output

Symbols
^expr     	Matches expr at beginning of the line
expr$     	Matches expr at end of line
.         	Matches any single character (exception the newline character)
[xyz]     	Matches either x, y, or z
[p-z]     	Specifies a range. Matches any character from p to z
[p-z1-9]  	Matches either any character from p to z or any digit from 1 to 9
[^p-z]    	'^' as the first character negates the pattern. This will match anything outside of the range p-z
xy        	Matches x followed by y
x|y       	Matches either x or y

Meta Symbols
\d        	Matches digits ([0-9])
\D        	Matches non-digits
\w        	Matches word characters
\W        	Matches non-word characters
\s        	Matches whitespaces [\t\r\f\n]
\S        	Matches non-whitespaces
\b        	Matches word boundary when outside of a bracket. Matches backslash when placed in a bracket
\B        	Matches non-word boundary
\A        	Matches beginning of string
\Z        	Matches end of String

Quantifiers
expr?     	Matches 0 or 1 occurrence of expr (expr{0,1})
expr*     	Matches 0 or more occurrences of expr (expr{0,})
expr+     	Matches 1 or more occurrences of expr (expr{1,})
expr{x, y}	Matches between x and y occurrences of expr
expr{x, } 	Matches x or more occurrences of expr

Testing regex: ^Matches
Disconnected from the target VM, address: '127.0.0.1:61983', transport: 'socket'
true => Matches expr at beginning of the line
true => Matches expr at end of line
true => Matches any single character (exception the newline character)
true => Matches either x, y, or z
false => Specifies a range. Matches any character from p to z
true => Matches either any character from p to z or any digit from 1 to 9
false => '^' as the first character negates the pattern. This will match anything outside of the range p-z
true => Matches x followed by y
true => Matches either x or y

Rerences

Ganesh, S G., and Tushar Sharma. Oracle Certified Professional Java SE 7 Programmer Exams 1Z0-804 and 1Z0-805 A Comprehensive OCPJP 7 Certification Guide. Apress, 2013.

Kotlin String Splitting

Most programming tasks require string splitting. For example, CSV files often separate data based on the comma character, which requires developers to split each line based on the comma in order to extract data. Extracting domain names from a web address is another common use case for String splitting. For example, we might have the address https://stonesoupprogramming.com and we wish to separate the https:// portion of the string. We can split the string into a list where the first part contains http:// and the second index contains stonesoupprogramming.com.

In Kotlin, we use the split() method defined in the String class. It comes in two flavors. One flavor takes the character to split the string on, and the other flavor takes a Regex. Both versions of the split method return a list that contains all potions of the String.

Non-Regex Splitting

The first version of split() takes a varargs parameter of delimiters, an optional boolean argument to ignoreCase and an optional limit argument that restricts how many times the split happens.

val str = "I smell fear on you"
val parts = str.split(" ")
val partsTwo = str.split("I", "fear", "you")
val partsThree = str.split("I", true)
val partsFour = str.split(delimiters = " ", limit = 2)

All versions of split return a list. It’s worth keeping in mind that the returned list will not contain any of the delimiters passed to the delimiters argument in split(). Normally, that isn’t a problem. For example, would you really want the ‘,’ character for all fields in a CSV file?

Regex Version

Most programming languages treat regular expressions, REGEX, as a String. Doing so often leads to unexpected bugs. Consider Java’s String.split() method.

String myString = "Green. Eggs. Ham.";
String [] parts = myString.split(".");

You may think that parts holds {“Green”, “Eggs”, “Ham”}. It doesn’t. The period character is treated as a regex expression that matches to any character. It’s a very common mistake.

Thankfully, Kotlin treats regular expressions as its own type. When we want to use a Regex in Kotlin, we need to create a Regex object. The Kotlin String class has a toRegex() function that performs the conversion from String to Regex.

val str = "Green. Eggs. Ham"
val partsNonRegex = str.split(".") //No Regex. This will split on the period character
val partsRegex = str.split(".".toRegex()) //Now using REGEX matching

Putting it together

As always, we will conclude with an example program that demonstrates the topic. Many of my students are given assignments where they need to track the number of unique words in a String. We will use String splitting and maps to accomplish the goal.

fun main(args : Array<String>){
    val paragraph = """
        |I am Sam.
        |Sam I am.

        |That Sam-I-am!
        |That Sam-I-am!
        |I do not like
        |That Sam-I-am!

        |Do you like
        |Green eggs and ham?

        |I do not like them,
        |Sam-I-Am
        |I do not like
        |Green eggs and ham.
        """.trimMargin()

    //Remove all end line characters and then split the string on the space character
    val parts = paragraph.replace('\n', ' ').split(" ")
    
    //Create an empty mutable map
    val uniqueWords = mutableMapOf<String, Int>()
    
    //Populate the map
    parts.forEach( { it -> uniqueWords[it] = uniqueWords.getOrDefault(it, 0) + 1 })
    
    //Print each word with it's count value
    println(uniqueWords)
}

Here is the output when run.

{I=5, am=1, Sam.=1, Sam=1, am.=1, =3, That=3, Sam-I-am!=3, do=3, not=3, like=4, Do=1, you=1, Green=2, eggs=2, and=2, ham?=1, them,=1, Sam-I-Am=1, ham.=1}

Kotlin Koans—Part 6

Insert Values into String

Kotlin upgrades some of Java’s String capabilities. One of the first things I liked was the ability to insert variables into the String

val a = 1
val b = 2
str = "Here is a String with values a=$a and b=$b)

We could of course have done this with String.format in Java

int a = 1;
int b = 2;
String str = "Here is a String with values a=%d and b=%d".format(a, b);

I think most people agree that the Kotlin approach is more concise.

Multiline Strings

Kotlin supports the “”” for mutliline strings without any need to escape charaters.

str = """
Here is a multiline String
C:\folder\file.txt

"""

The Kotlin Koans tutorial suggested that this was also useful for Regex expressions.

Task

The task for this portion of the tutorial was simple enough. We have a variable named month that we insert into a regex expression.

val month = "(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)"

fun task5(): String{
    return """\d{2}\ $month \d{4}"""
}

This a use case for the triple quote Strings. In Java, we would have needed to escape all of the backslashes in the regex expression. I know that I am not the first developer who has gotten burned by typing \ when I should have typed \\, so the triple quote String is a nice feature.

You can click here to see Part 5.