Python Split and Join file

The book Programming Python: Powerful Object-Oriented Programming has an example program that shows how to split and join files. Many utilities exist for such an operation but the program offers a good working example of how to read from and write to binary files in Python3. The code below is an adaptation from the book with my own comments added.

Code

import os


def split(source, dest_folder, write_size):
    # Make a destination folder if it doesn't exist yet
    if not os.path.exists(dest_folder):
        os.mkdir(dest_folder)
    else:
        # Otherwise clean out all files in the destination folder
        for file in os.listdir(dest_folder):
            os.remove(os.path.join(dest_folder, file))

    partnum = 0

    # Open the source file in binary mode
    input_file = open(source, 'rb')

    while True:
        # Read a portion of the input file
        chunk = input_file.read(write_size)

        # End the loop if we have hit EOF
        if not chunk:
            break

        # Increment partnum
        partnum += 1

        # Create a new file name
        filename = os.path.join(dest_folder, ('part%004' % partnum))

        # Create a destination file
        dest_file = open(filename, 'wb')

        # Write to this portion of the destination file
        dest_file.write(chunk)

        # Explicitly close 
        dest_file.close()
    
    # Explicitly close
    input_file.close()
    
    # Return the number of files created by the split
    return partnum


def join(source_dir, dest_file, read_size):
    # Create a new destination file
    output_file = open(dest_file, 'wb')
    
    # Get a list of the file parts
    parts = os.listdir(source_dir)
    
    # Sort them by name (remember that the order num is part of the file name)
    parts.sort()

    # Go through each portion one by one
    for file in parts:
        
        # Assemble the full path to the file
        path = os.path.join(source_dir, file)
        
        # Open the part
        input_file = open(path, 'rb')
        
        while True:
            # Read all bytes of the part
            bytes = input_file.read(read_size)
            
            # Break out of loop if we are at end of file
            if not bytes:
                break
                
            # Write the bytes to the output file
            output_file.write(bytes)
            
        # Close the input file
        input_file.close()
        
    # Close the output file
    output_file.close()

Explanation

split

The code snippet shows to sample functions that either split a file into parts or join those parts back together into one file. The split function begins by taking three parameters. The first parameter, source, is the file that we wish to split. The second parameter, dest_folder, is a folder that stores the output files created by the split operation. The final parameter, write_size, is the size of the file parts in bytes.

Split starts by checking if dest_folder exists or not. If the folder does not exist, we call os.mkdir to create a new folder on the file system. Otherwise, we obtain a list of all files in the folder by calling os.listdir and then remove all of them by calling os.remove. When calling os.remove, we use os.path.join to create a full path to the target file that’s getting deleted.

Once the destination folder has been prepared, the function continues by performing the actually split operation. A partnum variable is created on line 13 that tracks the number of file parts created by the split operation. The source file is opened on line 16 in binary mode. Binary mode is used in this case because we could be dealing with audio or video files and not just text files.

The split function enters an infinite loop on line 18. On line 20, we read a number of bytes, specified by write_size, from the source file and store them in the chunk variable. On line 23, we test if chunk actually recieved any bytes from the read operation. If chunk did not read any bytes, then we have hit end of file (EOF) and we break out of the loop. Otherwise, we increment partnum by one and begin to write the file part.

Line 30 creates the name and destination for the file part by using os.path.join, the dest_folder, and a string template that accepts the current part number. The destination file is created on line 33 with a call to open (also in binary mode) and then on line 36, we write chunk to the file. Line 39 has an explicit call to closing the file. While we normally wait for files to close in garabage collection, this function opens a lot of files so ideally we should close them in oder to make sure we don’t exceed the number of file handles the underlying OS allows. The function ends by closing the input_file and returning the number of part files created.

join

The join function does the reverse job of the split function. It begins by accepting a source_dir, a destination file, and the size of the part files. The output_file is created on line 50 (opened in binary mode) and then on line 53, we use os.listdir to get a list of all parts.

Since our part files contain a number that identifies the parts, we can store all parts in a list and call sort() on it. Then it’s just a matter of looping through all of the parts and assembling them into a single file. The for loop starts on line 59. On line 62, we use os.path.join to create a full path to the part file and then we can open the part file on line 65.

The program enters an infinite join loop on line 67. Inside of the while loop, we read a part of the input_file and return the bytes read. If bytes is empty, we have it end of file so we can test for this on line 72 and use break to end the while loop if we have hit end of file. Otherwise, we can write to the output file on line 76.

When we have finished reading our part file, we again close it explicitly on line 79. When all parts of have been read we close the output_file. The output_file contains the bytes of the original file that was split in the first places

Thoughts

The code contained in this post isn’t ideal for production but is instead meant to be a learning tool. In this code, we cover reading and writing to binary files and functions of the os module. There are areas we could improve this code. For example, split destroys the contents of the destination folder, but ideally, it should instead throw an exception back to the caller and let the caller delete all files in a folder instead.

We also don’t test if our input files are really files and if our folders are really folders. That is certainly an area for improvement. Another thing that could be improved upon is using an enumeration for the size of the file parts. Right now, write_size in split and read_size in join are specified in bytes, but that isn’t clear to clients of these functions.

References

Lutz, Mark. Programming Python. Beijing, OReilly, 2013.

Advertisements

Find Python Source Files in Home Directory

Truthfully, most users aren’t very interested in finding the largest and smallest Python source files in their home directory, but doing so does provide for an exercise in walking the file tree and using tools from the os module. The program in this post is a modified example taken from Programming Python: Powerful Object-Oriented Programming where the user’s home directory is scanned for all Python source files. The console outputs the two smallest files (in bytes) and the two largest files.

Code

import os
import pprint
from pathlib import Path

trace = False

# Get the user's home directory in a platform neutral fashion
dirname = str(Path.home())

# Store the results of all python files found
# in home directory
allsizes = []

# Walk the file tree
for (current_folder, sub_folders, files) in os.walk(dirname):
    if trace:
        print(current_folder)

    # Loop through all files in current_folder
    for filename in files:

        # Test if it's a python source file
        if filename.endswith('.py'):
            if trace:
                print('...', filename)

            # Assemble the full file python using os.path.join
            fullname = os.path.join(current_folder, filename)

            # Get the size of the file on disk
            fullsize = os.path.getsize(fullname)

            # Store the result
            allsizes.append((fullsize, fullname))

# Sort the files by size
allsizes.sort()

# Print the 2 smallest files
pprint.pprint(allsizes[:2])

# Print the 2 largest files
pprint.pprint(allsizes[-2:])

Sample Output

[(0,
  '/Users/stonesoup/.local/share/heroku/client/node_modules/node-gyp/gyp/pylib/gyp/generator/__init__.py'),
 (0,
  '/Users/stonesoup/.p2/pool/plugins/org.python.pydev.jython_5.4.0.201611281236/Lib/email/mime/__init__.py')]
[(219552,
  '/Users/stonesoup/.p2/pool/plugins/org.python.pydev.jython_5.4.0.201611281236/Lib/decimal.py'),
 (349239,
  '/Users/stonesoup/Library/Caches/PyCharmCE2017.1/python_stubs/348993582/numpy/random/mtrand.py')]

Explanation

The program starts with a trace flag that’s set to false. When set to True, the program will print detailed information about what is happening in the program. On line 8, we grab the user’s home directory using Path.home(). This is a platform nuetral way of finding a user’s home directory. Notice that we do have to cast this value to a String for our purposes. Finally we create an empty allsizes list that holds our results.

Starting on line 15, we use the os.walk function and pass in the user’s home directory. It’s a common pattern to combine os.walk with a for loop so that we can traverse an entire directory tree. Each iteration os.walk returns a tuple that contains the current_folder, sub_folders, and files in the current folder. We are interested in the files.

Starting on line 20, the program enters a nested for each loop that examines each file individually. On line 23, we test if the file ends with ‘.py’ to see if it’s a Python source file. Should the test return True, we continue by using os.path.join to assemble the full path to the file. The os.path.join function takes into account the underlying operating system’s path separator, so on Unix like systems, we get / while Windows systems get \ as a path separator. The file’s size is computed on line 31 using os.path.getsize. Once we have the size and the file path, we can add the result to allsizes for later use.

The program has finished scanning the user’s home folder once the program reaches line 37. At this point, we can sort our results from smallest to largest by using the sort() method on allsizes. Line 40 prints the two smallest files (using pretty print for better formatting) and line 43 prints the two largest files.

References

Lutz, Mark. Programming Python. Beijing, OReilly, 2013.

Python Signals

Python has a signal module that is used to respond to signals generated by the operating system. Signals are a very low-level form of interprocess communication, but there are some cases where a program may wish to respond to a signal. For example, it may be useful to watch for program signals when writing developer toolkits.

This post demonstrates how to respond to an alarm signal. The example is borrowed from Programming Python: Powerful Object-Oriented Programming. I added my own comments to help explain the workings of the program.

Code

import sys, signal, time


# Function that returns the time
def now():
    return time.asctime()


# Function that handles the signal
def onSignal(signum, stackframe):
    print('Got alarm', signum, 'at', now())


while True:
    print('Setting at', now())
    
    # This tells the program to respond to the alarm signal
    # by calling the onSignal function
    signal.signal(signal.SIGALRM, onSignal)
    
    # Raise SIGALRM (Note this can be done by other processes also)
    signal.alarm(5)
    
    signal.pause()

Explanation

The code defines an onSignal function that works as a handler to operating system signals found on lines 10-11. All it does is prints text to the console. On line 19, we register onSignal as a handler for the SIGALRM os signal. Line 22 shows how to raise an os signal, which then invokes onSignal. Note that we don’t have to have our programs actually raise signals. We can also simply listen for other os signals raised by other programs (for example, the kill signal which is raised by executing killall in a unix shell).

References

Lutz, Mark. Programming Python. Beijing, OReilly, 2013.

Python Basic Pipes

Python provides two main avenues of parallel processing. One avenue is to use multithreading where a program itself multitasks, while the other approach is to have a program relaunch itself as a separate program in a new process. One approach is not necessarily better than the other approach but instead, should be throught of as tools for different use cases. Threads have low overhead and share a program’s memory space, which allows for easy communication between threads. Processes operate as if we launched a new copy of the program from our operating system and allow programs to spread themselves out over an operating system or even a network.

However, processes do not share a global memory space, which means they need a way to communicate with one another. One approach to interprocess communication (IPC) is to use pipes. This post shows an example of IPC using pipes taken from Programming Python: Powerful Object-Oriented Programming. I have added my own comments to the code for clarity.

Code

import os, time


# Function called by child processes
def child(pipeout):
    zzz = 0
    while True:
        time.sleep(zzz)

        # We have to encode our string to binary to use
        # with pipes
        msg = ('Spam {}'.format(zzz)).encode()

        # Send the data back to the parent process
        os.write(pipeout, msg)
        zzz = (zzz + 1) % 5


def parent():
    # Creates our pipes. The pipeout gets passed to the child
    # process while parent keeps pipein
    pipein, pipeout = os.pipe()

    if os.fork() == 0:
        # We are now in the child process so call child and supply
        # it with pipeout so that it can send information back to
        # the parent.
        child(pipeout)
    else:
        # This is the parent process
        while True:
            # Read data from the child process
            # This call blocks until there is data
            line = os.read(pipein, 32)

            # Print to the console
            print('Parent {} got [{}] as {}'.format(os.getpid(), line, time.time()))


if __name__ == '__main__':
    parent()

Explanation

We have two functions in the program named child() and parent(). The child() function is intended to run in child processes while parent() contains the main program. Parent() is defined on lines 19-37. The function begins by calling os.pipe() on line 22 which returns a tuple containing two ends of a single pipe. Pipes are unidirectional and thus pipein is used by the parent to read data that comes from the child process. The child process uses pipeout to send data to the parent.

The program forks into two different processes on line 24. The program is in the child process when os.fork() returns zero. Line 28 calls the child() function and passes pipeout to the child function so that the child process can send data back to the parent. The child process enters an infinite loop on line 7. On line 12, a msg variable is created that contains a String variable. Pipe send binary data, so we have to call encode() on the String to convert it to a binary string. Then on line 15, we send the msg varaiable back to the parent using os.write and supplying pipeout and msg to that function.

The parent process continues on line 31. It attempts to read data from the child process on line 34 using os.read. Notice that os.read requires a pipein variable and the size of binary data to read (32 bytes in this program). If the pipe contains data, os.read returns immedialy and stores the value in the line variable. Otherwise, os.read blocks the program until the pipe has data. The parent process prints the data on line 37.

References

Lutz, Mark. Programming Python. Beijing, OReilly, 2013.

Python Fork() Exit Status

Most computer programs return an exit code to the operating system’s shell. Shell scripting tools can use the exit status of a program to indicate if the program exited normally or abnormally. In either case, the shell script can react depending on the outcome of the child process.

Python programs can use os.fork() to create child processes. Since the child process is a new instance of the program, it can be useful in some cases to inspect if the child process exited normally or not. For example, a GUI program may spawn a child process and notify the user if the operation completed successfully or not.

This post shows an example program taken from Programming Python: Powerful Object-Oriented Programming that demonstrates how a parent process can inspect a child process’ exit code. I added comments to help explain the workings of the program.

Code

import os

exitstat = 0


# Function that is executed after os.fork() that runs in a new process
def child():
    global exitstat
    exitstat += 1
    print('Hello from child', os.getpid(), exitstat)
    
    # End this process using os._exit() and pass a status code back to the shell
    os._exit(exitstat)


# This is the parent process code
def parent():
    while True:
        # Fork this program into a child process
        newpid = os.fork()
        
        # newpid is 0 if we are in the child process
        if newpid == 0:
            # Call child()
            child()
            
        # otherwise, we are still in the parent process
        else:
            # os.wait() returns the pid and status and status code
            # On unix systems, status code is stored in status and has to
            # be bit-shifted
            pid, status = os.wait()
            print('Parent got', pid, status, (status >> 8))
            if input() == 'q':
                break


if __name__ == '__main__':
    parent()

Explanation

This program is pretty basic. We have two functions, parent() and child(). When the program starts on line 38, it calls parent() to enter the parent() function. The parent() function enters and infinite loop that forks this program on line 20. The result of os.fork() is stored in the newpid variable.

Our program is executing in the child process when newpid is zero. If that case, we call our child() function. The child() function prints a message to the console on line 10 and then exits by calling os._exit() on line 13. We pass the exitstat variable to os._exit() whose value becomes the exit code for this process.

The parent process continues in the meantime. On line 32, we use os.wait() to return the pid and status of the child process. The status variable also containes the exitstat value passed to os._exit() in the child process, but to get this code, we have to perform a bit shift operation by eight bits. The following line prints the pid, status, and the child process’ exit code to the console. When the user presses ‘q’, the parent process ends.

References

Lutz, Mark. Programming Python. Beijing, OReilly, 2013.

Use Python to Launch Another Program

Many Python programs operate as wrappers to other programs. For example, there are a wide variety of Python programs that work as a GUI wrapper for CLI programs. This tutorial uses an example from Programming Python: Powerful Object-Oriented Programming to demonstrate how a Python program can be used to launch another program.

Code

Parent Process

This code represents a parent process that is capable of launching a child process. I added my own comments to help further explain the code.

import os

# Track the number of spawned children
spawned_children = 0

# Enter an infinite loop
while True:
    
    # Increment the number of children
    spawned_children += 1
    
    # Now launch a new process
    # If we are still in the parent process, pid is non-zero
    # otherwise, pid is zero and indicates that we are in the
    # child process
    pid = os.fork()
    
    # Test if we are in the child process
    if pid == 0:
        
        # Now, since we are in the new process, we use
        # os.execlp to launch a child script.
        os.execlp('python', 'python', 'child.py', str(spawned_children))
        assert False, 'error starting program'
    else:
        # If we are here, then we are still in the parent process
        print('Child is', pid)
        if input() == 'q': break

Child Process

This is a simple script that simply represents a sample child process.

import os, sys

# Parent passes the child process number
# as a command line argument
print('Hello from child', os.getpid(), sys.argv[1])

Explanation

We have two core concepts used in this program. The first concept is forking the parent process into a child process. The other concept is actually launching a new program.

Forking

Forking is the process of launching a copy of the running program. The program that launched a copy of itself is called the parent process, while the copy of the program is called the child process. In theory, a parent and its children can spawn as many children as the underlying operating system allows.

The example program spawns a child process on line 16 and assigns the result to the variable called pid. When os.fork() returns, pid is zero if we are running in the child process or non-zero if we are still in the parent. The program uses an if – else statement (lines 19 and 25 respectively) to perform different operations depending if we are in the parent or child process.

The parent process executes lines 27 and 28. It simply prints out the child’s process id and then asks the user if they wish to continue the program. The user can enter ‘q’ to exit the parent process.

The child process executes code on lines 23 and 24. Line 23 uses os.execlp to start the companion child script and passes the number of spawned children as a command line argument. The child script prints its pid and the number of child processes and then exits.

Launching Programs

The Python os module provides a variety of ways to launch other programs via a Python script. This example makes use of os.execlp which looks for an executable on the operating system’s path. The first argument is the name of the program followed by any number of command line arguments.

The os.execlp is not supposed to return to the caller because if successful, the program is replaced by the new program. This is why it’s necessary to call os.fork() prior to using os.execlp (unless you wish your program to simply end with the new process). When the example script calls os.execlp, the child process created by os.fork is replaced by the new program launched by os.execlp.

Line 24 handles the case when os.execlp fails. Should os.execlp fail, control is returned back to the caller. In most cases, a develop should notify the user that launching the new program has failed. The example script accomplishes this by using an assert False statement with an error message.

References

Lutz, Mark. Programming Python. Beijing, OReilly, 2013.

Python Basic Forking

Many programs need to execute tasks simultaneously and Python provides us with a few different mechanisms for concurrent programming. One of those mechanisms is called forking, where a call is made to the underlying operating system to create a working copy of a program that’s already running. The program that created the new process is called the parent process, while the processes that are created by the parent are called the child process.

This post shows the most basic form of creating processes in Python and helps serve as a foundation to understanding forking. The example is derived from Programming Python: Powerful Object-Oriented Programming, and I added my own comments to help better explain the program.

import os


# This is a function called by the child process
def child():
    # Use os.getpid() to get the pid for this process
    print('Hello from child', os.getpid())

    # force the child process to exit right away
    # or the child process will return to the infinite loop in parent()
    os._exit(0)


def parent():
    while True:
        # Attempt to fork this program into a new process. When forking is complete
        # newpid will be non-zero for the original process, but it wil be
        # 0 in the child process
        newpid = os.fork()

        # The program now goes in two different directions at the same time
        # When newpid is 0, we call child() and the child process exits
        if newpid == 0:  # Test if this is a child process
            child()
        else:
            # If are here, then we are still in the parent process
            # We print the pid of the parent and the child process (newpid)
            print('Hello from parent', os.getpid(), newpid)
        if input() == 'q':
            break


if __name__ == '__main__':
    parent()

When run on my machine, the program shows the following output

Hello from parent 87800 87802
Hello from child 87802
k
Hello from parent 87800 87803
Hello from child 87803
k
Hello from parent 87800 87804
Hello from child 87804
k
Hello from parent 87800 87805
Hello from child 87805
q

Explanation

The hardest part to grasp about this program is that when os.fork() is called on line 19, the program actually launches a copy of itself. The operating system creates the new process and that new process gets a copy of all variables in memory and execution of the new and old programs continue after line 19. (Note: The OS may not exactly copy the parent process, but functionally speaking, the child process can be considered to be a copy of the parent).

The os.fork() function returns a number called a PID (process ID). We can test the pid to see if we are running in the parent or child process. When we are in the child process, the value returned by os.fork() is zero. So on line 23, we test for 0 and if newpid is zero, we call the child() function.

The alternative case is that we are still running in the parent process (bearing in mind, that the child process is also running at this point in time as well). If we are still in the parent process os.fork() returns a non-zero value. In that case, we use the else block to print the parent and child PID.

The parent process continues to loop until the user enters q to quit. Each time the loop iterates, a new child process is created by the parent. The parent prints its own PID (using os.getpid()) and the pid of the child on line 28.

The child process also uses os.getpid() to get its own PID. It prints its own PID on line 7 and then on line 11, we use os._exit(0) to force the child process to shut down. This is a critical step for this program! If we were to omit the call to os._exit on line 11, the child process would return to the parent function and enter the same infinite loop the parent is using.

Conclusion

This is the most basic example of creating child processes using Python. Keep in mind that processes do not share memory (unlike threads). In real world programs, processes often need to sycnchronize data from one process to another process using tools such as network sockets, databases, or files. When a child process is spawned it gets a copy of the memory of the parent process, but then functions as an independent program.

Source

Lutz, Mark. Programming Python. Beijing, OReilly, 2013