Python Split and Join file

The book Programming Python: Powerful Object-Oriented Programming has an example program that shows how to split and join files. Many utilities exist for such an operation but the program offers a good working example of how to read from and write to binary files in Python3. The code below is an adaptation from the book with my own comments added.

Code

import os


def split(source, dest_folder, write_size):
    # Make a destination folder if it doesn't exist yet
    if not os.path.exists(dest_folder):
        os.mkdir(dest_folder)
    else:
        # Otherwise clean out all files in the destination folder
        for file in os.listdir(dest_folder):
            os.remove(os.path.join(dest_folder, file))

    partnum = 0

    # Open the source file in binary mode
    input_file = open(source, 'rb')

    while True:
        # Read a portion of the input file
        chunk = input_file.read(write_size)

        # End the loop if we have hit EOF
        if not chunk:
            break

        # Increment partnum
        partnum += 1

        # Create a new file name
        filename = os.path.join(dest_folder, ('part%004' % partnum))

        # Create a destination file
        dest_file = open(filename, 'wb')

        # Write to this portion of the destination file
        dest_file.write(chunk)

        # Explicitly close 
        dest_file.close()
    
    # Explicitly close
    input_file.close()
    
    # Return the number of files created by the split
    return partnum


def join(source_dir, dest_file, read_size):
    # Create a new destination file
    output_file = open(dest_file, 'wb')
    
    # Get a list of the file parts
    parts = os.listdir(source_dir)
    
    # Sort them by name (remember that the order num is part of the file name)
    parts.sort()

    # Go through each portion one by one
    for file in parts:
        
        # Assemble the full path to the file
        path = os.path.join(source_dir, file)
        
        # Open the part
        input_file = open(path, 'rb')
        
        while True:
            # Read all bytes of the part
            bytes = input_file.read(read_size)
            
            # Break out of loop if we are at end of file
            if not bytes:
                break
                
            # Write the bytes to the output file
            output_file.write(bytes)
            
        # Close the input file
        input_file.close()
        
    # Close the output file
    output_file.close()

Explanation

split

The code snippet shows to sample functions that either split a file into parts or join those parts back together into one file. The split function begins by taking three parameters. The first parameter, source, is the file that we wish to split. The second parameter, dest_folder, is a folder that stores the output files created by the split operation. The final parameter, write_size, is the size of the file parts in bytes.

Split starts by checking if dest_folder exists or not. If the folder does not exist, we call os.mkdir to create a new folder on the file system. Otherwise, we obtain a list of all files in the folder by calling os.listdir and then remove all of them by calling os.remove. When calling os.remove, we use os.path.join to create a full path to the target file that’s getting deleted.

Once the destination folder has been prepared, the function continues by performing the actually split operation. A partnum variable is created on line 13 that tracks the number of file parts created by the split operation. The source file is opened on line 16 in binary mode. Binary mode is used in this case because we could be dealing with audio or video files and not just text files.

The split function enters an infinite loop on line 18. On line 20, we read a number of bytes, specified by write_size, from the source file and store them in the chunk variable. On line 23, we test if chunk actually recieved any bytes from the read operation. If chunk did not read any bytes, then we have hit end of file (EOF) and we break out of the loop. Otherwise, we increment partnum by one and begin to write the file part.

Line 30 creates the name and destination for the file part by using os.path.join, the dest_folder, and a string template that accepts the current part number. The destination file is created on line 33 with a call to open (also in binary mode) and then on line 36, we write chunk to the file. Line 39 has an explicit call to closing the file. While we normally wait for files to close in garabage collection, this function opens a lot of files so ideally we should close them in oder to make sure we don’t exceed the number of file handles the underlying OS allows. The function ends by closing the input_file and returning the number of part files created.

join

The join function does the reverse job of the split function. It begins by accepting a source_dir, a destination file, and the size of the part files. The output_file is created on line 50 (opened in binary mode) and then on line 53, we use os.listdir to get a list of all parts.

Since our part files contain a number that identifies the parts, we can store all parts in a list and call sort() on it. Then it’s just a matter of looping through all of the parts and assembling them into a single file. The for loop starts on line 59. On line 62, we use os.path.join to create a full path to the part file and then we can open the part file on line 65.

The program enters an infinite join loop on line 67. Inside of the while loop, we read a part of the input_file and return the bytes read. If bytes is empty, we have it end of file so we can test for this on line 72 and use break to end the while loop if we have hit end of file. Otherwise, we can write to the output file on line 76.

When we have finished reading our part file, we again close it explicitly on line 79. When all parts of have been read we close the output_file. The output_file contains the bytes of the original file that was split in the first places

Thoughts

The code contained in this post isn’t ideal for production but is instead meant to be a learning tool. In this code, we cover reading and writing to binary files and functions of the os module. There are areas we could improve this code. For example, split destroys the contents of the destination folder, but ideally, it should instead throw an exception back to the caller and let the caller delete all files in a folder instead.

We also don’t test if our input files are really files and if our folders are really folders. That is certainly an area for improvement. Another thing that could be improved upon is using an enumeration for the size of the file parts. Right now, write_size in split and read_size in join are specified in bytes, but that isn’t clear to clients of these functions.

References

Lutz, Mark. Programming Python. Beijing, OReilly, 2013.

Leave a comment