Finding those lost data files

It’s been a long time since I’ve had the bandwidth to write up a code snippet here. This morning I had not quite enough time between Zoom meetings to tackle something more involved, so here goes!

In this case I needed to find ~200 sequence (fasta) files for a student in my lab. They were split across several sequencing runs, and for various logistical reasons it was getting a bit tedious to find the location of each sequence file. To solve the problem I wrote a short Python script to wrap the Linux locate command and copy all the files to a new directory where they could be exported.

First, I created a text file “files2find.txt” with text uniquely matching each file that I needed to find. One of the great things about locate is that it doesn’t need to match the full file name.

head files2find.txt
151117_PAL_Sterivex_1
151126_PAL_Sterivex_2
151202_PAL_Sterivex_3
151213_PAL_Sterivex_4
151225_PAL_Sterivex_5
151230_PAL_Sterivex_6
160106_PAL_Sterivex_7
160118_PAL_Sterivex_9
160120_PAL_Sterivex_10
160128_PAL_Sterivex_11
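
To make the partial matching concrete, here is a minimal sketch in plain Python of the substring behavior that locate shows for a pattern without wildcards. The function name substring_matches and the example paths are hypothetical, made up for illustration; this is not how locate is implemented internally.

```python
def substring_matches(pattern, paths):
    """Return the paths that contain pattern anywhere in them."""
    return [p for p in paths if pattern in p]

## Hypothetical paths, for illustration only.
example_paths = [
    '/data/run_a/151117_PAL_Sterivex_1.exp.fasta',
    '/data/run_b/160120_PAL_Sterivex_10.exp.fasta',
    '/backup/run_a/151117_PAL_Sterivex_1.exp.fasta',
]

## The dated search text matches both copies of sample 1, and nothing else.
hits = substring_matches('151117_PAL_Sterivex_1', example_paths)
print(hits)
```

Note that a shorter search text such as 'PAL_Sterivex_1' would also match the Sterivex_10 path in this example, which is why each line in files2find.txt needs to be unique to the file you want.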

Then the wrapper:

import subprocess
import shutil

with open('files2find.txt') as file_in:
    for line in file_in:
        line = line.rstrip()

        ## Here we use the subprocess module to run the locate command, capturing
        ## standard out.  Passing the command as a list avoids invoking a shell,
        ## so search text containing spaces or special characters is handled safely.

        temp = subprocess.run(['locate', line],
                              stdout = subprocess.PIPE)

        ## The stdout attribute of temp holds standard out as bytes.  Decode it
        ## and split on newlines to get one path per list item.

        locations = temp.stdout.decode().split('\n')

        ## Thank you internet for this one-liner (a list comprehension); Python
        ## one-liners always throw me for a loop (no pun intended). Here we search all
        ## items in the locations list for a specific string that identifies files that
        ## we actually want.  In this case our final analysis files contain "exp.fasta".
        ## Of course if you're certain of the full file name you could just use locate
        ## on that and omit this step.

        fastas = [i for i in locations if 'exp.fasta' in i] 
        
        path = '/path/to/where/you/want/files/'
        
        found = set()

        ## Use the shutil library to copy found files to a new directory "path".
        ## Copied files are added to the set "found" to avoid being copied more than
        ## once, if they exist in multiple locations on your computer.
        
        for fasta in fastas:
            file_name = fasta.split('/')[-1]
            if file_name not in found:
                shutil.copyfile(fasta, path + file_name) 
                found.add(file_name)

        ## In the event that no files are found report that here.
                
        if len(fastas) == 0:
            print(line, 'not found')
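
One caveat: locate searches a database that is refreshed periodically (by updatedb), so very recent files may not show up, and some systems don't have locate installed at all. As a possible alternative, here is a minimal sketch that walks a directory tree with os.walk instead. It swaps in a different search technique rather than wrapping locate, and it will likely be slower on a large file system. The function name find_fastas and its parameters are my own naming for this example.

```python
import os

def find_fastas(search_root, patterns, suffix='exp.fasta'):
    """Map each pattern to the matching file paths found under search_root."""
    hits = {pattern: [] for pattern in patterns}
    for dirpath, dirnames, filenames in os.walk(search_root):
        for file_name in filenames:
            full_path = os.path.join(dirpath, file_name)

            ## Same filter as the script above: keep only paths that
            ## contain the identifying string.
            if suffix not in full_path:
                continue

            ## Like locate, match the search text anywhere in the path.
            for pattern in patterns:
                if pattern in full_path:
                    hits[pattern].append(full_path)
    return hits
```

Called as find_fastas('/path/to/search', patterns), where patterns is the list of lines read from files2find.txt, any pattern left with an empty list corresponds to the "not found" case above.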