Kotlin Inheritance

Inheritance is a core part of Object Orientated programming that provides a powerful code reuse tool by grouping properties and behaviors into common classes (called base classes) and having unique properties and behaviors placed in specific classes that grow out from the base class (called child classes). The child class receives all properties and behavior from its parent class, but it also contains properties and behaviors that are unique to itself. Inheritance allows for specialization of software components when a component has a specific need, but also allows for generalization when using items common to the parent.

Let’s consider an example that is more specific. Suppose we have a Vehicle class that models some sort of vehicle that we can drive. We start by creating a class.

open class Vehicle(
        private val make : String,
        private val model : String,
        private val year : Int) {

    fun start() = println("Starting up the motor")

    fun stop() = println("Turning off the engine")

    fun park() = println("Parking " + toString())

    open fun drive() = println("Driving " + toString())

    open fun reverse() = println("Reversing " + toString())

    override fun toString(): String {
        return "${year} ${make} ${model}"
    }
}

We know that all Vehicles have a make, model, and year. Users of a Vehicle can start and turn off the motor. They can also park, drive, or put the Vehicle in reverse. So far so good. However, later on, we need a vehicle that can tow a camper.

We could add a tow() method to Vehicle, but would that really make sense? What if our Vehicle is a Fiat? Would we really tow a camper with a Fiat? What we need is a Truck. Thanks to Inheritance, we don’t need to write Truck from scratch. We can simply create a specialized version of a Vehicle instead.

open class Truck(make: String,
            model: String,
            year: Int,
            private val towCapacity : Int) : Vehicle(make, model, year) {
    fun tow () = println("${toString()} is towing ${this.towCapacity} lbs")
}

This code creates a Truck class based off of Vehicle. As such, the Truck still has a make, model, and year. It can also start(), stop(), drive(), park(), and reverse() just like any other vehicle. However, it can also tow things and has a towCapacity property. What we have essentially done is reused all of the code from Vehicle and just changed what was needed so that we have a new Vehicle like object that also tows things.

Of course, later on, our needs change again and we decide to go camping in the mountains. It may snow in the mountains, so in addition to being able to tow things, we may want four wheel drive. Once again, not all Trucks have four wheel drive, so we don’t want to add a four wheel drive into Truck. As a matter of fact, four wheel drive isn’t even a specific behavior. What it really does is it enhances the already existing behaviors drive and reverse.

Let’s create another child class, based off of Truck, and specialize the driving behavior.

class FourWheelDrive(make: String, model: String, year: Int, towCapacity: Int) :
        Truck(make, model, year, towCapacity) {

    var fourByFour = false

    override fun drive() {
        if(fourByFour){
            println("Driving ${toString()} in four wheel drive")
        } else {
            super.drive()
        }
    }

    override fun reverse() {
        if(fourByFour){
            println("Reversing ${toString()} in four wheel drive")
        } else {
            super.drive()
        }
    }
}

In this class, we are modifing the already existant behaviors of driving and going in reverse. If the truck has four wheel drived turned on, the output of the program reflects this fact. One the other hand, if four wheel drive is turned off, the methods call super.drive(), which means use the behavior defined in Truck (which bubbles up to the original behavior in Vehicle). Thus, FourWheelDrive has specialized behaviors that were originally found in Vehicle. This is known as overriding behaviors.

Now let’s do a demonstration

fun main(args : Array<String>){
    val car = Vehicle("Fiat", "500", 2012)
    val truck = Truck("Chevy", "Silverado", 2017, 8000)
    val fourWheelDrive = FourWheelDrive("Dodge", "Ram", 2017, 8000)

    //drive() comes from Vehicle
    car.drive()
    println()

    //There is no drive() in Truck, but it 
    //inherited the behavior from Vehicle
    truck.drive()
    println()

    //FourWheelDrive override drive() to customize it
    fourWheelDrive.drive()
    println()

    println("Turn on four wheel drive")
    fourWheelDrive.fourByFour = true
    fourWheelDrive.drive()
}

Output

Driving 2012 Fiat 500

Driving 2017 Chevy Silverado

Driving 2017 Dodge Ram

Turn on four wheel drive
Driving 2017 Dodge Ram in four wheel drive

Three vehicles are made at the beginning of main: car, truck, and fourWheelDrive. They are all of type Vehicle, but Truck is a specialized case of Vehicle and fourWheelDrive is a specialized case of Truck. As such, all three objects have a drive() method which we use in the example program. When fourWheelDrive turns on fourByFour and then invokes drive, the console prints out that it is driving in four wheel drive.

Sealed Class

Although Kotlin supports inheritance, it’s use is discouraged. In order to use inheritance in Kotlin, the ‘open’ keyword needs to be added in front of the ‘class’ keyword first. If the ‘open’ keyword is ommitted, the class is considered to be final and the compiler will not allow the class to be used as a parent class. Likewise, all functions in a class also have to be marked as ‘open’ (see drive and reverse in vehicle), otherwise, overriding behaviors is not permitted.

Why would Kotlin choose to do this while other languages encourage inheritence? After studying issues found with inheritance, it became clear that many developers wrote classes without considering that a class may be extended later on. When changes where made in the parent class, the child classes could potentially break as well. This ended up creating a situation called “fragile base classes”.

Kotlin designers decided that by making developers mark classes as open, it would encourage developers to think about the needs of child classes when working on a base class. Kotlin also has powerful delegation mechanisms that encourage developers to use composition and delegation as code reuse mechanisms rather than inheritance.

Example program

Here is the entire source code used in this post

//Has to be marked as open to allow inheritance
open class Vehicle(
        private val make : String,
        private val model : String,
        private val year : Int) {

    fun start() = println("Starting up the motor")

    fun stop() = println("Turning off the engine")

    fun park() = println("Parking " + toString())

    //Has to be marked as open to allow overriding
    open fun drive() = println("Driving " + toString())

    //Has to be marked as open to allow overriding
    open fun reverse() = println("Reversing " + toString())

    override fun toString(): String {
        return "${year} ${make} ${model}"
    }
}

//Has to be marked as open for inheritance
open class Truck(make: String,
            model: String,
            year: Int,
            private val towCapacity : Int) : Vehicle(make, model, year) {
    fun tow () = println("${toString()} is towing ${this.towCapacity} lbs")
}

//This class is not open and therefore cannot be inherited from
class FourWheelDrive(make: String, model: String, year: Int, towCapacity: Int) :
        Truck(make, model, year, towCapacity) {

    var fourByFour = false

    //The override keyword signals to the compiler that we are overriding
    //the drive() method
    override fun drive() {
        if(fourByFour){
            println("Driving ${toString()} in four wheel drive")
        } else {
            super.drive()
        }
    }

    //The override keyword signals to the compiler that we are overriding
    //the reverse() method
    override fun reverse() {
        if(fourByFour){
            println("Reversing ${toString()} in four wheel drive")
        } else {
            super.drive()
        }
    }
}

fun main(args : Array<String>){
    val car = Vehicle("Fiat", "500", 2012)
    val truck = Truck("Chevy", "Silverado", 2017, 8000)
    val fourWheelDrive = FourWheelDrive("Dodge", "Ram", 2017, 8000)

    //drive() comes from Vehicle
    car.drive()
    println()

    //There is no drive() in Truck, but it
    //inherited the behavior from Vehicle
    truck.drive()
    println()

    //FourWheelDrive override drive() to customize it
    fourWheelDrive.drive()
    println()

    println("Turn on four wheel drive")
    fourWheelDrive.fourByFour = true
    fourWheelDrive.drive()
}
Advertisements

Kotlin Encapsulation and Procedural Programming

Software developers use the term Encapsulation to refer to grouping related data and behavior into a single unit, usually called a class. The class can be seen as the polar opposite of procedural based programming where data and behavior are treated as two distinct concerns. It should be noted that OOP and procedural programming have their distinct advantages and one should not be thought of as better as the other. Kotlin supports both styles of programming and it’s not uncommon to see a mix of both procedural and OOP programming.

Procedural Programming Example

Let’s with an example of procedural programming. In this example, we are working with a rectangle object. Its data is stored in a map (a data structure that supports key-value pairs) and then we have functions that consume the data.

val rectangle = mutableMapOf("Width" to 10, "Height" to 10, "Color" to "Red")

fun calcArea(shape : Map<String, Any>) : Int {
    return shape["Height"] as Int * shape["Width"] as Int
}

fun toString(shape : Map<String, Any>) : String {
    return "Width = ${shape["Width"]}, Height = ${shape["Height"]}, Color = ${shape["Color"]}, Area = ${calcArea(shape)}"
}

So we begin with the rectangle object that holds some properties of our rectangle: width, height, and color. Two functions follow the creation of the rectangle. They are calcArea and toString. Notice that these are global functions that accept any Map. This is dangerous because we can’t guarantee that the map will have “Width”, “Height”, or “Color” keys. Another issue is our loss of type safety. Since we need to store both Integers and Strings in the rectangle map, our value has to be of type Any, which is the base type in Kotlin.

OOP

Here is the same problem solved with an OOP approach.

class Rectangle(
    var width : Int,
    var height : Int,
    var color : String){

    fun calcArea() = this.width * this.height

    override fun toString() =
            "Width = ${this.width}, Height = ${this.height}, Color = ${this.color}, Area = ${calcArea()}"
}

The OOP solution demonstrates encapsulation because the data and the behavior associated with the data are grouped into a single entity called a class. The data associated with a class are often referred to as “properties” while the behaviors defined in the class are usually called “methods”. The calcArea() and toString() methods are always guaranteed to work because all objects based on the Rectangle class always have width, height, and color. We also do not lose our type safety because we are free to declare each property as a distinct variable within the class along with it’s type.

When the calcArea() and toString() methods are used, the word ‘this’ refers to the object that is calling these methods. You will notice that unlike the procedural program above, there is no Rectangle parameter supplied to calcArea() or toString(). Instead, the ‘this’ keyword is updated to refer to the object that is currently in use.

Tips on how to choose between Procedural and OOP

It should be noted that many software projects mix procedural and OOP programming. It’s also worth mentioning that almost anything that can be done with OOP can most likely be accomplished with procedural program and vice versa. However, some problems can be more easily solved when using procedural rather than OOP and other problems are better solved with OOP.

Procedural

  • Pure functional program: When we work in terms of pure mathematical functions where a function accepts certain inputs and returns certain outputs without side effects
  • Multi-threading: Procedural programming can help solve many challenges found in multi-threading environments. The integrity of mutable data is always concern in multi-threading, so functional programming works well provided the functions are pure functions that do not change data
  • Input and Output: In many cases, using a class to persist or retrieve an object from a data store is overkill. The same is true for printing to standard IO. Java has been heavily criticized for using the System.out.println() to write text to the console. Kotlin simplified this to println()

OOP and Encapsulation

  • GUI Toolkits: Objects representing buttons, windows, web pages, etc are very well modeled as classes
  • Grouping state or behavior: We often find that entities in software have properties or methods that are commonly held by other similar entities. For example, all road vehicles have wheels and move. Trucks are a specialized vehicle that has a box. Four-wheel drive trucks are specialized trucks that have four-wheel drive. We can use OOP to group all of the items common to all vehicles in a Vehicle class. All items common to Trucks can go in a Truck class, and finally, all items used only in four-wheel drive trucks can go in FourByFourTruck
  • Modularization: OOP allows developers to modularize code into smaller and reusable software components. Since the units of code are small pieces in a system, the code is usually easier to maintain.

Putting it together

Below is a working program that demonstrates both procedural programming and OOP.

package ch1

/**
 * This is a shape object without OOP. Notice how the data is separated from the behavior that works on the
 * data. The data is stored in a Map object, which uses key-value pairs. Then we have separate functions that
 * manipulate the data.
 */
val rectangle = mutableMapOf("Width" to 10, "Height" to 10, "Color" to "Red")

fun calcArea(shape : Map<String, Any>) : Int {
    //How can we guarantee that this map object has "Height" and "Width" property?
    return shape["Height"] as Int * shape["Width"] as Int
}

fun toString(shape : Map<String, Any>) : String {
    return "Width = ${shape["Width"]}, Height = ${shape["Height"]}, Color = ${shape["Color"]}, Area = ${calcArea(shape)}"
}

/**
 * This is a class that represents a Rectangle. You will immediately notice it has less code that the
 * non-OOP implementation. That's because the state (width, height, and color) are grouped together with
 * the behavior. Kotlin takes this a step further by providing us with behavior that lets us change
 * width, height, and color. We only need to add calcArea(), which we can guarantee will always work because
 * we know that there will always be width and height. Likewise, we know our toString() will never fail us
 * for the same reason!
 */
class Rectangle(
    var width : Int,
    var height : Int,
    var color : String){

    fun calcArea() = this.width * this.height

    override fun toString() =
            "Width = ${this.width}, Height = ${this.height}, Color = ${this.color}, Area = ${calcArea()}"
}

fun main(args : Array<String>){
    println("Using procedural programming")
    println(toString(rectangle))

    println("Changing width")
    rectangle["Width"] = 15
    println(toString(rectangle))

    println("Changing height")
    rectangle["Height"] = 80
    println(toString(rectangle))

    println("Changing color")
    rectangle["Color"] = "Blue"
    println(toString(rectangle))

    println("\n*****************************\n")
    println("Now using OOP")

    val square = Rectangle(10, 10, "Red")
    println(square)

    println("Changing height")
    square.height = 90
    println(square)

    println("Changing width")
    square.width = 40
    println(square)

    println("Changing color")
    square.color = "Blue"
    println(square)
}

Here is the output

Using procedural programming
Width = 10, Height = 10, Color = Red, Area = 100
Changing width
Width = 15, Height = 10, Color = Red, Area = 150
Changing height
Width = 15, Height = 80, Color = Red, Area = 1200
Changing color
Width = 15, Height = 80, Color = Blue, Area = 1200

*****************************

Now using OOP
Width = 10, Height = 10, Color = Red, Area = 100
Changing height
Width = 10, Height = 90, Color = Red, Area = 900
Changing width
Width = 40, Height = 90, Color = Red, Area = 3600
Changing color
Width = 40, Height = 90, Color = Blue, Area = 3600

OOP Abstraction

Abstraction is one of the major components of OOP. When we abstract, we are hiding the internal working details of something from its user. The user only cares about the controls that operate an object, but how the object acts on the controls are of no concern to the user.

A common everyday abstraction that people use daily can be found in a smart phone’s operating system. When a user wishes to make a phone call, they do not worry about how the phone makes a call. All the user cares about is using the keypad to dail a phone number and then pressing the call button. The details of connecting to the cell phone tower and then routing the phone call through the phone network are of no concern to the user. Those details have been abstracted.

Kotlin provides a variety of ways to provide abstraction. In the example below, I used the interface feature to model a Vehicle

interface Vehicle {
    fun park()

    fun drive()

    fun reverse()

    fun start()

    fun shutDown()
}

This code defines an abstraction point for all Vehicles. It guarantees that all classes that implement Vehicle have the following behaviors: park, drive, reverse, start, and shutDown. However, what we do not have is details as to how the Vehicle drives, parks, etc. As a matter of fact, the function bodies of all of the methods inside of vehicle are left empty (they are called abstract methods).

We may wish to take our vehicle for a drive. When we drive our vehicle, we are only really concerned with what the vehicle can do. We don’t care how it parks or goes in reverse. Let’s see this example in terms of code.

fun takeForDrive(v : Vehicle){
    with(v){
        //How we start is abstracted. We only care that the vehicle starts, but
        //we don't care about how it starts.
        start()

        //Likewise, we only care that it goes in reverse(). How it goes in reverse
        //is irrelevant here.
        reverse()

        //And so on...
        drive()
        park()
        shutDown()
    }
}

Notice how the takeForDrive function calls all five of our behaviors on the supplied Vehicle object. It doesn’t even know what kind of a vehicle it is driving. The Vehicle could be a car, Truck, airplane, boat, etc. None of that matters to the takeForDrive function. The details are hidden behind the Vehicle interface (in other words, abstracted).

One of the reasons abstraction is so important is that it promotes code reusability and maintainability. For example, now that we have this takeForDrive function, we can use any object that implements Vehicle. So for example, we can create a Truck class that implements Vehicle.

class Truck : Vehicle {
    override fun park() = println("Truck is parking")

    override fun drive() = println("Truck is driving")

    override fun reverse() = println("Truck is in reverse")

    override fun start() = println("Truck is starting")

    override fun shutDown() = println("Truck is shutting down")
}

and now we can take the Truck for a drive.

val truck = Truck()
takeForDrive(truck)

The price of gas may spike later one and we may choose to drive something that is more efficient. As long as our new mode of transportation implements the Vehicle interface, we can take it for a drive. Here is a car class that impelements Vehicle.

class Car : Vehicle{
    override fun park() = println("Car is parking")

    override fun drive() = println("Car is driving")

    override fun reverse() = println("Car is in reverse")

    override fun start() = println("Car is starting")

    override fun shutDown() = println("Car is shutting down")
}

Just like with truck, we can drive the car.

val car = Car()
takeForDrive(car)

Since Vehicle provides an abstraction point, any code that accepts Vehicle as a parameter can use Truck or Car. The function takeForDrive can be said to be loosely coupled to Truck and Car because it indirectly accepts Trucks or Cars using the Vehicle interface. This makes the takeForDrive function highly reusable to other components that may need to get developed in the future.

Example Program

Here is a fully working Kotlin program that ties everything together.

package ch1

//This defines our public interface for all vehicles
interface Vehicle {
    fun park()

    fun drive()

    fun reverse()

    fun start()

    fun shutDown()
}

//Our Truck class provides an implementation of Vehicle
class Truck : Vehicle {
    override fun park() = println("Truck is parking")

    override fun drive() = println("Truck is driving")

    override fun reverse() = println("Truck is in reverse")

    override fun start() = println("Truck is starting")

    override fun shutDown() = println("Truck is shutting down")
}

//Car provides an alternative implementation of Vehicle
class Car : Vehicle{
    override fun park() = println("Car is parking")

    override fun drive() = println("Car is driving")

    override fun reverse() = println("Car is in reverse")

    override fun start() = println("Car is starting")

    override fun shutDown() = println("Car is shutting down")
}

/**
 * This function demonstrates Abstraction. Notice how it accepts a Vehicle object but
 * makes no distinction if it's a Truck or a Car. The details of how the vehicle parks,
 * drives, reverses, starts, or shuts down are abstracted from this function. In the end, we are
 * only concerned with what the Vehicle object does, not how it does it.
 */
fun takeForDrive(v : Vehicle){
    with(v){
        //How we start is abstracted. We only care that the vehicle starts, but
        //we don't care about how it starts.
        start()

        //Likewise, we only care that it goes in reverse(). How it goes in reverse
        //is irrelevant here.
        reverse()

        //And so on...
        drive()
        park()
        shutDown()
    }
}

fun main(args : Array<String>){
    //Create a new Truck and take it for a drive. It works because Truck
    //implements the Vehicle Interface which abstracts the truck's details from
    //the takeForDrive function
    takeForDrive(Truck())

    //Likewise, we can also take a car for a drive. The car class also implements
    //Vehicle so takeForDrive can also use cars.
    takeForDrive(Car())
}

When run, the program prints

Truck is starting
Truck is in reverse
Truck is driving
Truck is parking
Truck is shutting down
Car is starting
Car is in reverse
Car is driving
Car is parking
Car is shutting down

Kotlin and OOP

Like many JVM languages such as Java, Scala, Groovy, etc, Kotlin supports OOP (Object Orientated Programming). OOP allows developers to create reusable and self-contained software modules known as classes where data and behavior are grouped together and contained within the said class. Such packaging allows developers to think in terms of components when solving a software problem and can improve code reuse and maintainability.

There are often four terminologies that are discussed when explaining OOP. The first term is encapsulation. Encapsulation refers to combining a programs data with the behaviors that operate on the said data. This is different than procedural based programming that treats data and behavior as two seperate concerns. However, encapsulation goes further than just simply grouping behavior and data. It also means that we protect our data inside of the class by only allowing the class itself to use the data. Other users of the class may only work on class data through the class’s public interface.

This takes us into the next concept of OOP, Abstraction. A well designed and encapsulated class functions as a black box. We may use the class, but we may only use it through it’s public interface. The details of how the class works internally are taken away from or Abstracted, from the clients of the class. A car is commonly used as an example of abstraction. We can drive the car using the steering wheel and the foot pedals, but we do not get into the internals of the car and fire the fuel injection at the right time. The car takes care of the details of making it move. We only operate it through its public interface. The details of how a car works are abstracted from us.

OOP promotes code reuse through inheritance. The basic idea is that we can use one class as a template for a more specialized version of a class. For example, we may have a class that represents a Truck. As time went on, we realized that we needed a four wheel drive truck. Rather than writing an entirely new class, we simply create a four wheel drive truck from the truck class. The four wheel drive truck inherits all of the computer code from the truck class, and the developer only needs to focus on code that makes it a four wheel drive truck. Such code reuse not only saves on typing, but it also helps to reduce debugging since developers are free to leverage already tested computer code.

Related to inheritence is polymorphism. Polymorphism is a word that means many-forms. For developers, this means that one object may act as if it were another object. Take the truck example above as an example. Since a four wheel drive truck inherited from truck, the four wheel drive truck may be used whenever the computer code expects a truck. Polymorphism goes a set further in allowing the program to act different depending on the context in which certain portions of computer code are used.

Koltin is a full fleged OOP language (although it does support other programming styles also). The language brings all of the OOP concepts discussed above to the fore-front by allowing us to write classes, abstract their interfaces, extend classes, and even use them in different situations depending on context. Let’s begin by looking at a very basic example of how to write and create a class in Kotlin.

package ch1

class Circle(
        //Define data that gets associated with the class
        private val xPos : Int = 20,
        private val yPos : Int = 20,
        private val radius : Int = 10){

    //Define behavior that uses the data
    override fun toString() : String =
            "center = ($xPos, $yPos) and radius = $radius"
}

fun main(args: Array<String>){
    val c = Circle() //Create a new circle
    val d = Circle(10, 10, 20)
    
    println( c.toString() ) //Call the toString() function on c
    println( d.toString() ) //Call the toString() function on d
}

In the above program, we have a very basic example of a Kotlin class called Circle. The code inside of lines 3-12 tell the Kotlin compiler how to construct objects of Type Circle. The circle has three properties (data): xPos, yPos, and radius. It also has a function that uses the data: toString().

In the bottom half of the program, the main method creates two new circle objects (c and d). The circle c has the default values of 20, 20, and 10 for xPos, yPos, and radius because we used the no parenthesis constructor (). Lines 5-7 in the circle class tell the program to simply use 20, 20, and 10 as default values in this case. Circle d has different valeus for xPos, yPos, and radius because we supplied 10, 10, 20 to the constructor. Thus we have an example of polymorphism in this program because two different constructors were used depending on the program’s context.

When we print on lines 18 and 19, we get two different outputs. When we call c.toString(), we get the String “center = (20, 20) and radius = 10” printed to the console. Calling toString() on d results in “center = (10, 10) and radius = 20”. This works because both c and d are distinct objects in memory and each have there own values for xPos, yPos, and radius. The toString() function acts on each distinct object, and thus, the output of toString() reflects the state of each Circle object.

Python Split and Join file

The book Programming Python: Powerful Object-Oriented Programming has an example program that shows how to split and join files. Many utilities exist for such an operation but the program offers a good working example of how to read from and write to binary files in Python3. The code below is an adaptation from the book with my own comments added.

Code

import os


def split(source, dest_folder, write_size):
    # Make a destination folder if it doesn't exist yet
    if not os.path.exists(dest_folder):
        os.mkdir(dest_folder)
    else:
        # Otherwise clean out all files in the destination folder
        for file in os.listdir(dest_folder):
            os.remove(os.path.join(dest_folder, file))

    partnum = 0

    # Open the source file in binary mode
    input_file = open(source, 'rb')

    while True:
        # Read a portion of the input file
        chunk = input_file.read(write_size)

        # End the loop if we have hit EOF
        if not chunk:
            break

        # Increment partnum
        partnum += 1

        # Create a new file name
        filename = os.path.join(dest_folder, ('part%004' % partnum))

        # Create a destination file
        dest_file = open(filename, 'wb')

        # Write to this portion of the destination file
        dest_file.write(chunk)

        # Explicitly close 
        dest_file.close()
    
    # Explicitly close
    input_file.close()
    
    # Return the number of files created by the split
    return partnum


def join(source_dir, dest_file, read_size):
    # Create a new destination file
    output_file = open(dest_file, 'wb')
    
    # Get a list of the file parts
    parts = os.listdir(source_dir)
    
    # Sort them by name (remember that the order num is part of the file name)
    parts.sort()

    # Go through each portion one by one
    for file in parts:
        
        # Assemble the full path to the file
        path = os.path.join(source_dir, file)
        
        # Open the part
        input_file = open(path, 'rb')
        
        while True:
            # Read all bytes of the part
            bytes = input_file.read(read_size)
            
            # Break out of loop if we are at end of file
            if not bytes:
                break
                
            # Write the bytes to the output file
            output_file.write(bytes)
            
        # Close the input file
        input_file.close()
        
    # Close the output file
    output_file.close()

Explanation

split

The code snippet shows to sample functions that either split a file into parts or join those parts back together into one file. The split function begins by taking three parameters. The first parameter, source, is the file that we wish to split. The second parameter, dest_folder, is a folder that stores the output files created by the split operation. The final parameter, write_size, is the size of the file parts in bytes.

Split starts by checking if dest_folder exists or not. If the folder does not exist, we call os.mkdir to create a new folder on the file system. Otherwise, we obtain a list of all files in the folder by calling os.listdir and then remove all of them by calling os.remove. When calling os.remove, we use os.path.join to create a full path to the target file that’s getting deleted.

Once the destination folder has been prepared, the function continues by performing the actually split operation. A partnum variable is created on line 13 that tracks the number of file parts created by the split operation. The source file is opened on line 16 in binary mode. Binary mode is used in this case because we could be dealing with audio or video files and not just text files.

The split function enters an infinite loop on line 18. On line 20, we read a number of bytes, specified by write_size, from the source file and store them in the chunk variable. On line 23, we test if chunk actually recieved any bytes from the read operation. If chunk did not read any bytes, then we have hit end of file (EOF) and we break out of the loop. Otherwise, we increment partnum by one and begin to write the file part.

Line 30 creates the name and destination for the file part by using os.path.join, the dest_folder, and a string template that accepts the current part number. The destination file is created on line 33 with a call to open (also in binary mode) and then on line 36, we write chunk to the file. Line 39 has an explicit call to closing the file. While we normally wait for files to close in garabage collection, this function opens a lot of files so ideally we should close them in oder to make sure we don’t exceed the number of file handles the underlying OS allows. The function ends by closing the input_file and returning the number of part files created.

join

The join function does the reverse job of the split function. It begins by accepting a source_dir, a destination file, and the size of the part files. The output_file is created on line 50 (opened in binary mode) and then on line 53, we use os.listdir to get a list of all parts.

Since our part files contain a number that identifies the parts, we can store all parts in a list and call sort() on it. Then it’s just a matter of looping through all of the parts and assembling them into a single file. The for loop starts on line 59. On line 62, we use os.path.join to create a full path to the part file and then we can open the part file on line 65.

The program enters an infinite join loop on line 67. Inside of the while loop, we read a part of the input_file and return the bytes read. If bytes is empty, we have it end of file so we can test for this on line 72 and use break to end the while loop if we have hit end of file. Otherwise, we can write to the output file on line 76.

When we have finished reading our part file, we again close it explicitly on line 79. When all parts of have been read we close the output_file. The output_file contains the bytes of the original file that was split in the first places

Thoughts

The code contained in this post isn’t ideal for production but is instead meant to be a learning tool. In this code, we cover reading and writing to binary files and functions of the os module. There are areas we could improve this code. For example, split destroys the contents of the destination folder, but ideally, it should instead throw an exception back to the caller and let the caller delete all files in a folder instead.

We also don’t test if our input files are really files and if our folders are really folders. That is certainly an area for improvement. Another thing that could be improved upon is using an enumeration for the size of the file parts. Right now, write_size in split and read_size in join are specified in bytes, but that isn’t clear to clients of these functions.

References

Lutz, Mark. Programming Python. Beijing, OReilly, 2013.

Find Python Source Files in Home Directory

Truthfully, most users aren’t very interested in finding the largest and smallest Python source files in their home directory, but doing so does provide for an exercise in walking the file tree and using tools from the os module. The program in this post is a modified example taken from Programming Python: Powerful Object-Oriented Programming where the user’s home directory is scanned for all Python source files. The console outputs the two smallest files (in bytes) and the two largest files.

Code

import os
import pprint
from pathlib import Path

trace = False

# Get the user's home directory in a platform neutral fashion
dirname = str(Path.home())

# Store the results of all python files found
# in home directory
allsizes = []

# Walk the file tree
for (current_folder, sub_folders, files) in os.walk(dirname):
    if trace:
        print(current_folder)

    # Loop through all files in current_folder
    for filename in files:

        # Test if it's a python source file
        if filename.endswith('.py'):
            if trace:
                print('...', filename)

            # Assemble the full file python using os.path.join
            fullname = os.path.join(current_folder, filename)

            # Get the size of the file on disk
            fullsize = os.path.getsize(fullname)

            # Store the result
            allsizes.append((fullsize, fullname))

# Sort the files by size
allsizes.sort()

# Print the 2 smallest files
pprint.pprint(allsizes[:2])

# Print the 2 largest files
pprint.pprint(allsizes[-2:])

Sample Output

[(0,
  '/Users/stonesoup/.local/share/heroku/client/node_modules/node-gyp/gyp/pylib/gyp/generator/__init__.py'),
 (0,
  '/Users/stonesoup/.p2/pool/plugins/org.python.pydev.jython_5.4.0.201611281236/Lib/email/mime/__init__.py')]
[(219552,
  '/Users/stonesoup/.p2/pool/plugins/org.python.pydev.jython_5.4.0.201611281236/Lib/decimal.py'),
 (349239,
  '/Users/stonesoup/Library/Caches/PyCharmCE2017.1/python_stubs/348993582/numpy/random/mtrand.py')]

Explanation

The program starts with a trace flag that’s set to false. When set to True, the program will print detailed information about what is happening in the program. On line 8, we grab the user’s home directory using Path.home(). This is a platform nuetral way of finding a user’s home directory. Notice that we do have to cast this value to a String for our purposes. Finally we create an empty allsizes list that holds our results.

Starting on line 15, we use the os.walk function and pass in the user’s home directory. It’s a common pattern to combine os.walk with a for loop so that we can traverse an entire directory tree. Each iteration os.walk returns a tuple that contains the current_folder, sub_folders, and files in the current folder. We are interested in the files.

Starting on line 20, the program enters a nested for each loop that examines each file individually. On line 23, we test if the file ends with ‘.py’ to see if it’s a Python source file. Should the test return True, we continue by using os.path.join to assemble the full path to the file. The os.path.join function takes into account the underlying operating system’s path separator, so on Unix like systems, we get / while Windows systems get \ as a path separator. The file’s size is computed on line 31 using os.path.getsize. Once we have the size and the file path, we can add the result to allsizes for later use.

The program has finished scanning the user’s home folder once the program reaches line 37. At this point, we can sort our results from smallest to largest by using the sort() method on allsizes. Line 40 prints the two smallest files (using pretty print for better formatting) and line 43 prints the two largest files.

References

Lutz, Mark. Programming Python. Beijing, OReilly, 2013.

Python Multiprocessing Producer Consumer Pattern

Python3 has a multiprocessing module that provides an API that’s similar to the one found in the threading module. The main selling point behind multiprocessing over threading is that multiprocessing allows tasks to run in a truly concurrent fashion by spanning multiple CPU cores while threading is still limited by the global interpreter lock (GIL). The Process class found in multiprocessing works internally by spawning new processes and providing classes that allow for data sharing between processes.

Since multiprocessing uses processes rather than threads, child processes do not share their memory with the parent process. That means we have to rely on low-level objects such as pipes to allow the processes to communicate with each other. The multiprocessing module provides high level classes similar to the ones found in threading that allow for sharing data between processes. This example demonstrates the producer consumer pattern using processes and the Queue class sharing data.

Code

import time
import os
import random
from multiprocessing import Process, Queue, Lock


# Producer function that places data on the Queue
def producer(queue, lock, names):
    # Synchronize access to the console
    with lock:
        print('Starting producer => {}'.format(os.getpid()))
        
    # Place our names on the Queue
    for name in names:
        time.sleep(random.randint(0, 10))
        queue.put(name)

    # Synchronize access to the console
    with lock:
        print('Producer {} exiting...'.format(os.getpid()))


# The consumer function takes data off of the Queue
def consumer(queue, lock):
    # Synchronize access to the console
    with lock:
        print('Starting consumer => {}'.format(os.getpid()))
    
    # Run indefinitely
    while True:
        time.sleep(random.randint(0, 10))
        
        # If the queue is empty, queue.get() will block until the queue has data
        name = queue.get()

        # Synchronize access to the console
        with lock:
            print('{} got {}'.format(os.getpid(), name))


if __name__ == '__main__':
    
    # Some lists with our favorite characters
    names = [['Master Shake', 'Meatwad', 'Frylock', 'Carl'],
             ['Early', 'Rusty', 'Sheriff', 'Granny', 'Lil'],
             ['Rick', 'Morty', 'Jerry', 'Summer', 'Beth']]

    # Create the Queue object
    queue = Queue()
    
    # Create a lock object to synchronize resource access
    lock = Lock()

    producers = []
    consumers = []

    for n in names:
        # Create our producer processes by passing the producer function and it's arguments
        producers.append(Process(target=producer, args=(queue, lock, n)))

    # Create consumer processes
    for i in range(len(names) * 2):
        p = Process(target=consumer, args=(queue, lock))
        
        # This is critical! The consumer function has an infinite loop
        # Which means it will never exit unless we set daemon to true
        p.daemon = True
        consumers.append(p)

    # Start the producers and consumer
    # The Python VM will launch new independent processes for each Process object
    for p in producers:
        p.start()

    for c in consumers:
        c.start()

    # Like threading, we have a join() method that synchronizes our program
    for p in producers:
        p.join()

    print('Parent process exiting...')

Explanation

The program demonstrates the producer and consumer pattern. We have two functions that run in their own independent processes. The producer function places supplied names on the Queue. The consumer function monitors the Queue and removes names from it as they become available.

The producer function takes three objects: a Queue, a Lock, and a List of names. It start with acquiring a lock on the console. The console is still a shared resource so we need to make sure only one Process writes to the console at a time or they will write over the top of one another. After acquiring a lock on the console, the function prints out its process id (PID).

The producer function enters a for each loop on lines 14-16. It sleeps between 0-10 seconds on line 15 to simulate a delay in processing and then it places a name on the Queue on line 16. When the for each loop is complete, the function aquires another console lock and then notifies the user it is exiting. At this point, the process ends.

The consumer function runs in it’s own process as well. It takes the Queue and the Lock as it’s parameters and then acquires a lock on the console to notify the user it is starting. The consumer prints out it’s PID also. Next the consumer enters an infinte loop on lines 30-38. It similuates sleeping on line 31 and then makes a call the queue.get() on line 34. If the queue has data, the get() method returns that data immediately and the consumer prints the data on line 38. Otherwise, get() blocks execution until data is available.

Line 41 is the entry point to the programing, using the if __name__ == ‘__main__’ test. We begin on 44 by making a list of names. The Queue object is created on line 49 and the Lock() object is made on line 52. Then on lines 57-59, we enter a for-each loop and create our producer Process objects. We use the target parameter to point the Process at the producer function and then pass in a tuple for the arguments that the function is called with.

Creating the consumers processes has one extra that that isn’t needed when creating the Producers. Lines 62-68 creates the consumer processes, but on line 67, set the daemon property to True. This is needed because the consumer function uses and infinite loop and those processes will never terminate unless they are marked as daemon processes.

Once are processes are created, we start them by calling start() on each Process object (lines 72-76). Like threads, Processes also have a join() method that can be used to synchronize a program. Our consumer processes never return, so calling join() on them would cause the program to hang, but our producer processes do return so we use join() on line 80 to cause the parent process to wait for the producer processes to exit.

Resources

Lutz, Mark. Programming Python. Beijing, OReilly, 2013.

Programming Python: Powerful Object-Oriented Programming