Find Python Source Files in Home Directory

Truthfully, most users aren’t very interested in finding the largest and smallest Python source files in their home directory, but doing so does provide for an exercise in walking the file tree and using tools from the os module. The program in this post is a modified example taken from Programming Python: Powerful Object-Oriented Programming where the user’s home directory is scanned for all Python source files. The console outputs the two smallest files (in bytes) and the two largest files.

Code

import os
import pprint
from pathlib import Path

trace = False

# Get the user's home directory in a platform neutral fashion
dirname = str(Path.home())

# Store the results of all python files found
# in home directory
allsizes = []

# Walk the file tree
for (current_folder, sub_folders, files) in os.walk(dirname):
    if trace:
        print(current_folder)

    # Loop through all files in current_folder
    for filename in files:

        # Test if it's a python source file
        if filename.endswith('.py'):
            if trace:
                print('...', filename)

            # Assemble the full file path using os.path.join
            fullname = os.path.join(current_folder, filename)

            # Get the size of the file on disk
            fullsize = os.path.getsize(fullname)

            # Store the result
            allsizes.append((fullsize, fullname))

# Sort the files by size
allsizes.sort()

# Print the 2 smallest files
pprint.pprint(allsizes[:2])

# Print the 2 largest files
pprint.pprint(allsizes[-2:])

Sample Output

[(0,
  '/Users/stonesoup/.local/share/heroku/client/node_modules/node-gyp/gyp/pylib/gyp/generator/__init__.py'),
 (0,
  '/Users/stonesoup/.p2/pool/plugins/org.python.pydev.jython_5.4.0.201611281236/Lib/email/mime/__init__.py')]
[(219552,
  '/Users/stonesoup/.p2/pool/plugins/org.python.pydev.jython_5.4.0.201611281236/Lib/decimal.py'),
 (349239,
  '/Users/stonesoup/Library/Caches/PyCharmCE2017.1/python_stubs/348993582/numpy/random/mtrand.py')]

Explanation

The program starts with a trace flag that’s set to False. When the flag is set to True, the program prints each folder it visits and each Python file it finds. On line 8, we grab the user’s home directory using Path.home(). This is a platform-neutral way of finding a user’s home directory. Notice that we convert the returned Path object to a string with str() for our purposes. Finally, we create an empty allsizes list that holds our results.
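Here is a minimal sketch of the home directory lookup on its own. The helper name home_as_text is made up for this example. As far as I know, os.walk also accepts path-like objects directly on Python 3.6 and later, so the string conversion is mostly a matter of preference.

```python
import os
from pathlib import Path


def home_as_text():
    """Return the current user's home directory as a plain string.

    home_as_text is a hypothetical helper name used only for this sketch.
    """
    home = Path.home()      # a pathlib.Path object, not a str
    return os.fspath(home)  # gives the same text as str(home) for a Path


if __name__ == '__main__':
    print(home_as_text())
```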

Starting on line 15, we use the os.walk function and pass in the user’s home directory. It’s a common pattern to combine os.walk with a for loop so that we can traverse an entire directory tree. On each iteration, os.walk yields a tuple that contains the current_folder, a list of sub_folders, and a list of files in the current folder. We are interested in the files.
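To see what those tuples look like, here is a small sketch that builds a throwaway directory tree and records what os.walk yields for it. The helper names walk_snapshot and build_demo_tree, and the file names inside the tree, are invented for the illustration. The names are sorted only so the result is predictable, since os.walk makes no promise about order.

```python
import os
import tempfile


def build_demo_tree(root):
    """Create a tiny tree: root/a.py, root/pkg/b.py, root/pkg/notes.txt."""
    os.makedirs(os.path.join(root, 'pkg'))
    for name in ('a.py', os.path.join('pkg', 'b.py'),
                 os.path.join('pkg', 'notes.txt')):
        with open(os.path.join(root, name), 'w') as handle:
            handle.write('# demo\n')


def walk_snapshot(root):
    """Collect every tuple os.walk yields, with folders made relative to root."""
    snapshot = []
    for current_folder, sub_folders, files in os.walk(root):
        relative = os.path.relpath(current_folder, root)
        snapshot.append((relative, sorted(sub_folders), sorted(files)))
    return sorted(snapshot)


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_tree(tmp)
        for entry in walk_snapshot(tmp):
            print(entry)
```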

Starting on line 20, the program enters a nested for loop that examines each file individually. On line 23, we test if the file name ends with ‘.py’ to see if it’s a Python source file. Should the test return True, we continue by using os.path.join to assemble the full path to the file. The os.path.join function takes into account the underlying operating system’s path separator, so on Unix-like systems we get / while Windows systems get \ as a path separator. The file’s size is looked up on line 31 using os.path.getsize. Once we have the size and the file path, we can add the result to allsizes for later use.
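One thing the listing does not guard against is os.path.getsize raising OSError, which can happen for a broken symbolic link or a file we are not allowed to read. The sketch below wraps the same filter, join, and getsize steps in a function and skips such files. The function name python_file_sizes and the choice to skip silently are my own assumptions, not part of the original example.

```python
import os


def python_file_sizes(root):
    """Return a list of (size_in_bytes, full_path) for .py files under root.

    python_file_sizes is a hypothetical name; skipping unreadable files
    is an assumption made for this sketch.
    """
    results = []
    for current_folder, _sub_folders, files in os.walk(root):
        for filename in files:
            if not filename.endswith('.py'):
                continue
            fullname = os.path.join(current_folder, filename)
            try:
                size = os.path.getsize(fullname)
            except OSError:
                # Broken symlink, permission problem, file vanished, etc.
                continue
            results.append((size, fullname))
    return results
```

Calling python_file_sizes(str(Path.home())) should then produce the same kind of list that the script stores in allsizes.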

The program has finished scanning the user’s home folder once the program reaches line 37. At this point, we can sort our results from smallest to largest by using the sort() method on allsizes. Line 40 prints the two smallest files (using pretty print for better formatting) and line 43 prints the two largest files.
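Sorting the whole list works fine here. If you only ever need a handful of entries from a very long list, the standard library’s heapq module offers an alternative, sketched below. This is a swapped-in technique rather than what the original example does, and the function name smallest_and_largest is made up. Note that heapq.nlargest returns the biggest item first, which is the reverse of the order you get from allsizes[-2:].

```python
import heapq


def smallest_and_largest(allsizes, count=2):
    """Pick the count smallest and count largest (size, path) tuples.

    smallest_and_largest is a hypothetical helper for this sketch.
    """
    smallest = heapq.nsmallest(count, allsizes)  # ascending order
    largest = heapq.nlargest(count, allsizes)    # descending order
    return smallest, largest


if __name__ == '__main__':
    sizes = [(30, 'c.py'), (0, 'a.py'), (10, 'b.py'), (500, 'd.py')]
    print(smallest_and_largest(sizes))
```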

References

Lutz, Mark. Programming Python. Beijing, O’Reilly, 2013.

Python os.walk

It’s very typical for a program to have to walk a file tree. In Recursion Example — Walking a file tree, I demonstrated how to use recursion to traverse a file system. Although it’s totally possible to walk through a file system in that fashion, it’s less than ideal because Python provides os.walk for this purpose.

The following script is a modified example borrowed from Programming Python: Powerful Object-Oriented Programming that demonstrates how to traverse a file system using os.walk.

import os
import sys


def lister(root):
    # os.walk returns a tuple with the current_folder, a list of sub_folders,
    # and a list of files in the current_folder
    for (current_folder, sub_folders, files) in os.walk(root):
        print('[' + current_folder + ']')
        for sub_folder in sub_folders:

            # Unix uses / as path separators, while Windows uses \
            # If we use os.path.join, we don't need to worry about which
            # path separator to use since os.path.join tracks that for us.
            path = os.path.join(current_folder, sub_folder)
            print('\t' + path)

        for file in files:
            path = os.path.join(current_folder, file)
            print('\t' + path)


if __name__ == '__main__':
    lister(sys.argv[1])

When run, this code prints out all of the files and directories starting at the specified root folder.

Explanation

os.walk

The os.walk function does the work of traversing a file system. The function generates a tuple with three fields. The first field is the current directory that os.walk is processing. The second field is a list of sub folders found in the current folder and the last field is a list of files found in the current folder.

Combining os.walk with a for loop is a very common technique (shown on line 8). The loop continues to iterate until os.walk finishes walking through the file system. The tuple declared in the for loop is updated on each iteration of the loop, providing developers with all of the information needed to process the contents of the directory.
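The for loop hides the fact that os.walk produces its tuples one at a time, on demand. A minimal sketch of pulling just the first tuple by hand with next() is shown below. The function name first_walk_entry is invented for the illustration.

```python
import os


def first_walk_entry(root):
    """Return only the first (current_folder, sub_folders, files) tuple.

    first_walk_entry is a hypothetical helper for this sketch.
    """
    walker = os.walk(root)  # no tuple has been produced yet
    return next(walker)     # ask for the first tuple: root itself


if __name__ == '__main__':
    current_folder, sub_folders, files = first_walk_entry(os.getcwd())
    print(current_folder, len(sub_folders), len(files))
```

With the default settings, the first tuple always describes the root folder that was passed in, and the sub folders are visited afterwards.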

os.path.join

Line 15 shows an example of using os.path.join to assemble a full path to a target folder or file. It’s important to use os.path.join to assemble file paths because Unix-like systems use ‘/’ to separate the parts of a path, while Windows systems use ‘\’. Tracking the path separator by hand would be tedious work since it requires making a determination about which operating system is running the script. That’s far from ideal, so Python provides os.path.join to take care of such work. As long as os.path.join is used, the assembled file paths will use the proper path separator for the operating system.
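To see both behaviors on a single machine, the sketch below calls the posixpath and ntpath modules directly. These are the standard library modules that back os.path on Unix-like systems and on Windows respectively. The path parts are made up for the example.

```python
import ntpath
import os
import posixpath

# Made-up path parts, used only for illustration
parts = ('projects', 'demo', 'script.py')

unix_style = posixpath.join(*parts)   # always joins with /
windows_style = ntpath.join(*parts)   # always joins with \
native_style = os.path.join(*parts)   # whichever fits the running system

if __name__ == '__main__':
    print(unix_style)
    print(windows_style)
    print(native_style)
```

In everyday scripts you would only call os.path.join. The other two modules are imported here just to make the difference visible.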

References

Lutz, Mark. Programming Python. Beijing, O’Reilly, 2013.

Walk a Filetree in Python

Python has a powerful os.walk function that lets a script walk through a file system in an efficient fashion. In this example, taken from Programming Python: Powerful Object-Oriented Programming, we will walk a file tree and remove any p-code (.pyc) files that are present in it.

Code

Here is the code, with my comments added.

import os, sys

# Do we only want to find the files, without removing them?
findonly = False

# Either use the CWD or a directory specified by command line arguments
rootdir = os.getcwd() if len(sys.argv) == 1 else sys.argv[1]

# Keep track of the found and removed files
found = removed = 0

# Walk through the file tree
for (thisDirLevel, subsHere, filesHere) in os.walk(rootdir):

    # Go through each file in the directory
    for filename in filesHere:

        # Check if it ends with .pyc
        if filename.endswith('.pyc'):

            # Assemble the full file name
            fullname = os.path.join(thisDirLevel, filename)
            print('=>', fullname)

            # Attempt to remove the file if asked to do so
            if not findonly:
                try:
                    # Attempt to delete the file
                    os.remove(fullname)

                    # Increment the removed count
                    removed += 1
                except Exception:
                    # Handle the error
                    exc_type, inst = sys.exc_info()[:2]

                    # Report that this file can't be removed
                    print('*'*4, 'Failed:', filename, exc_type, inst)
            found += 1

# Output the total number of files found and removed
print('Found', found, 'files removed:', removed)

Detailed Explanation

This script functions in a findonly or remove mode. So the first variable we create on line 4 is a flag that decides if we are only looking for p-code files or if we are finding and removing such files. Next we create a rootdir variable that is either the current working directory or a directory supplied by a command line argument. We create two variables on line 10, found and removed, which track how many files we have found and removed.

We get into the meat of the program on line 13 when we enter into a loop that iterates over os.walk. The os.walk function takes a directory path to start at and then goes through every single subdirectory in that file tree. It’s the standard way to walk a file tree in Python. On each iteration, the function yields a tuple that includes the directory that os.walk is currently examining, a list of the subdirectory names in that directory, and a list of the file names in that directory.

We create a nested loop on line 16 so that we can look at each file in the directory individually. On line 19, we check if the file ends with the .pyc extension. If it does, we use os.path.join to assemble a full file path in a platform agnostic fashion and then print out the full file path to the console.

If we are deleting files, we use os.remove on line 29 to attempt to delete a file. It’s critical that we wrap this in a try block because we may not have permission to delete the file. If deleting the file is successful, we increment the removed count. If it fails, the program execution will jump to line 35 and we report the error. The loop body ends on line 39 and then repeats for the next file.

When the program is finished, we report how many files we found and removed.
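If you would like to reuse the logic from another script, one option is to wrap it in a function that returns the two counts instead of printing a summary. The sketch below is my own rearrangement rather than the book’s code. The function name remove_pyc is made up, and catching only OSError is an assumption about which failures are worth reporting.

```python
import os


def remove_pyc(rootdir, findonly=False):
    """Find .pyc files under rootdir and delete them unless findonly is True.

    Returns a (found, removed) tuple. remove_pyc is a hypothetical name
    used only for this sketch.
    """
    found = removed = 0
    for this_dir, _subs_here, files_here in os.walk(rootdir):
        for filename in files_here:
            if not filename.endswith('.pyc'):
                continue
            found += 1
            if findonly:
                continue
            fullname = os.path.join(this_dir, filename)
            try:
                os.remove(fullname)
                removed += 1
            except OSError as error:
                # Typically a permissions problem; report it and move on
                print('Failed:', fullname, error)
    return found, removed
```

Running it once with findonly=True is a cheap way to check what would be deleted before doing it for real.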

Recursion Example — Walking a file tree

Many developers use Python as a platform independent scripting language to perform file system operations. Sometimes it’s necessary to walk through a file system. Here is one way to navigate a file system recursively. (Of course, Python has libraries that do this!)

import os

def walk_fs(start_dir):
    # Get a list of everything in start_dir
    contents = os.listdir(start_dir)

    # This stores the output
    output = []

    # Loop through every item in contents
    for f in contents:
        # Use os.path.join to reassemble the path
        f_path = os.path.join(start_dir, f)

        # Check if f_path is a directory (or folder)
        if os.path.isdir(f_path):
            # Make recursive call to walk_fs
            output = output + walk_fs(f_path)
        else:
            # Add the file to output
            output.append(f_path)

    # Return a list of files in the directory
    return output

if __name__ == '__main__':
    try:
        result = walk_fs(input('Enter starting folder => '))
        for r in result:
            print(r)
    except FileNotFoundError:
        print('Not a valid folder! Try again!')

The key to this is to begin by using os.listdir, which returns a list with the name of every item in a directory. Then we can loop through each item in contents. As we loop through contents, we need to reassemble the full path because f is only the name of the file or directory. We use os.path.join because it will insert either / (Unix-like systems) or \ (Windows) between each part of the path.

The next statement checks if f_path is a file or directory. The os.path.isdir function returns True if the item is a directory, False otherwise. If f_path is a folder, we can make a recursive call to walk_fs starting with f_path. It will return a list of files that we can concatenate to output.

If f_path is a file, we just add it to output. When we have finished iterating through contents, we can return output. The output list will hold all of the files in start_dir and its subdirectories.
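For comparison, here is a rough equivalent built on os.walk, one of the library tools mentioned at the top of this post. The function name walk_fs_with_os_walk is made up for the sketch. Two differences are worth knowing about: the files may come back in a different order than the recursive version produces, and with its default settings os.walk quietly yields nothing for a folder that does not exist instead of raising FileNotFoundError.

```python
import os


def walk_fs_with_os_walk(start_dir):
    """Return the full path of every file under start_dir, using os.walk.

    walk_fs_with_os_walk is a hypothetical name used only for this sketch.
    """
    output = []
    for current_folder, _sub_folders, files in os.walk(start_dir):
        for name in files:
            output.append(os.path.join(current_folder, name))
    return output


if __name__ == '__main__':
    for path in walk_fs_with_os_walk(os.getcwd()):
        print(path)
```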