Kotlin String Splitting

Many programming tasks involve string splitting. For example, CSV files separate fields with the comma character, so developers split each line on the comma to extract the data. Extracting the domain name from a web address is another common use case. For example, we might have the address https://stonesoupprogramming.com and wish to separate the https scheme from the rest of the string. We can split the string on :// into a list where the first element contains https and the second contains stonesoupprogramming.com.

In Kotlin, we use the split() function, which is available on every String. It comes in two flavors. One flavor takes one or more delimiters to split the string on, and the other takes a Regex. Both versions return a list that contains all portions of the String.
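
Here is a minimal sketch of the two use cases mentioned earlier, the CSV line and the web address. The helper names csvFields and schemeAndHost are made up for this example, and the CSV helper assumes that no field contains a quoted comma.

```kotlin
// Hypothetical helper: split one CSV line into its fields.
// Naive on purpose: it assumes no field contains a quoted comma.
fun csvFields(line: String): List<String> = line.split(",")

// Hypothetical helper: separate the scheme from the rest of a web address.
// limit = 2 keeps everything after the first "://" in one piece.
fun schemeAndHost(url: String): List<String> = url.split("://", limit = 2)

fun main() {
    println(csvFields("Sam,Green eggs,ham"))                    // [Sam, Green eggs, ham]
    println(schemeAndHost("https://stonesoupprogramming.com"))  // [https, stonesoupprogramming.com]
}
```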

Non-Regex Splitting

The first version of split() takes a varargs parameter of delimiters, an optional boolean ignoreCase argument, and an optional limit argument that caps how many substrings are returned. Because the delimiters are varargs, ignoreCase and limit have to be passed by name.

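val str = "I smell fear on you"
val parts = str.split(" ") // ["I", "smell", "fear", "on", "you"]
val partsTwo = str.split("I", "fear", "you") // ["", " smell ", " on ", ""]
val partsThree = str.split("I", ignoreCase = true) // ["", " smell fear on you"]
val partsFour = str.split(" ", limit = 2) // ["I", "smell fear on you"]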

All versions of split return a list. It’s worth keeping in mind that the returned list will not contain any of the delimiters passed to the delimiters argument in split(). Normally, that isn’t a problem. For example, would you really want the ‘,’ character for all fields in a CSV file?
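
The short sketch below shows what dropping the delimiters looks like in practice. The sample line and the helper name nonEmptyFields are made up for this example. Note that two delimiters in a row, or a delimiter at the end of the string, leave empty strings in the list.

```kotlin
// Hypothetical helper: split on commas and discard the empty entries.
fun nonEmptyFields(line: String): List<String> =
    line.split(",").filter { it.isNotEmpty() }

fun main() {
    val line = "Sam,,ham,"

    // The commas are gone, but the gaps between them remain as empty strings.
    val fields = line.split(",")
    println(fields.size)            // 4
    println(fields)                 // [Sam, , ham, ]

    // Filter the empty entries out when they are not wanted.
    println(nonEmptyFields(line))   // [Sam, ham]

    // Char delimiters behave the same way as String delimiters.
    println("a-b_c".split('-', '_'))   // [a, b, c]
}
```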

Regex Version

Many programming languages pass regular expressions (regex) around as plain Strings. Doing so often leads to unexpected bugs. Consider Java’s String.split() method.

String myString = "Green. Eggs. Ham.";
String [] parts = myString.split(".");

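You may think that parts holds {“Green”, “Eggs”, “Ham”}. It doesn’t. Java treats the argument as a regular expression, and in a regular expression the period matches any character. Every character in the string becomes a delimiter, so parts ends up empty. It’s a very common mistake.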

Thankfully, Kotlin gives regular expressions their own type. When we want to split on a regex in Kotlin, we need to pass a Regex object. Kotlin provides a toRegex() function on String that performs the conversion from String to Regex.

val str = "Green. Eggs. Ham"
val partsNonRegex = str.split(".") // No regex: splits on the literal period character
val partsRegex = str.split(".".toRegex()) // Regex: "." matches any character, so every character acts as a delimiter
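
When the goal is to split on a literal period with a Regex, the period has to be escaped. The sketch below shows a few ways this might be done. The helper name sentences is made up for this example.

```kotlin
// Hypothetical helper: split on a literal period plus any whitespace after it.
// The raw string ("""...""") avoids doubling the backslashes.
fun sentences(text: String): List<String> = text.split(Regex("""\.\s*"""))

fun main() {
    val str = "Green. Eggs. Ham"

    // Escaped period: the regex matches a literal "." only, so the spaces stay.
    println(str.split("\\.".toRegex()))            // [Green,  Eggs,  Ham]

    // The period and the whitespace after it are both consumed.
    println(sentences(str))                        // [Green, Eggs, Ham]

    // Regex.escape() quotes a literal string so that it is safe to use as a pattern.
    println(str.split(Regex(Regex.escape(". "))))  // [Green, Eggs, Ham]
}
```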

Putting it together

As always, we will conclude with an example program that demonstrates the topic. Many of my students are given assignments where they need to count how often each word appears in a String. We will use String splitting and a map to accomplish the goal.

fun main(args : Array<String>){
    val paragraph = """
        |I am Sam.
        |Sam I am.

        |That Sam-I-am!
        |That Sam-I-am!
        |I do not like
        |That Sam-I-am!

        |Do you like
        |Green eggs and ham?

        |I do not like them,
        |Sam-I-Am
        |I do not like
        |Green eggs and ham.
        """.trimMargin()

    // Replace every newline with a space, then split the string on the space character
    val parts = paragraph.replace('\n', ' ').split(" ")

    // Create an empty mutable map from word to count
    val uniqueWords = mutableMapOf<String, Int>()

    // Populate the map: add one to the count stored for each word
    parts.forEach { uniqueWords[it] = uniqueWords.getOrDefault(it, 0) + 1 }

    // Print each word with its count
    println(uniqueWords)
}

Here is the output when run.

{I=5, am=1, Sam.=1, Sam=1, am.=1, =3, That=3, Sam-I-am!=3, do=3, not=3, like=4, Do=1, you=1, Green=2, eggs=2, and=2, ham?=1, them,=1, Sam-I-Am=1, ham.=1}
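
Notice that the output counts Sam and Sam. as different words, and that the blank lines show up as an empty-string entry. One possible refinement, sketched below, is to lowercase the text and split on a Regex so that punctuation and whitespace all act as delimiters. The function name wordCounts is made up for this example, and lowercase() assumes Kotlin 1.5 or newer.

```kotlin
// Hypothetical helper: count words case-insensitively, ignoring punctuation and blank lines.
fun wordCounts(text: String): Map<String, Int> =
    text.lowercase()                      // assumes Kotlin 1.5+
        .split(Regex("""[^a-z-]+"""))     // any run of non-letter, non-hyphen characters is a delimiter
        .filter { it.isNotEmpty() }       // a delimiter at the start or end leaves an empty string
        .groupingBy { it }
        .eachCount()

fun main() {
    println(wordCounts("I am Sam.\nSam I am.\n\nThat Sam-I-am!"))
    // {i=2, am=2, sam=2, that=1, sam-i-am=1}
}
```

Keeping the hyphen inside the character class is a choice made here so that Sam-I-am stays a single word; remove it to count the three parts separately.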