Paperless Archive Makeover — ruby and unix to the rescue

I recently had to clean out my digital paperless archive, a task made much more difficult by encrypted PDFs and a bunch of lint in my directory. We have been paperless since 2001 and have many gigs of files in a hierarchical file structure that was becoming difficult to use and maintain. (For example: where exactly does that government credit card statement go? Under work or credit cards?) After making symlinks to get around this problem, I started to notice that as I moved from Linux, to Mac, to Windows, and back to Linux, I was losing the ability to find my files. On top of that, many of my PDFs were not only scattered in deep directory trees but also secured in different ways. Some files I couldn’t open at all, thanks to a period in my life when I thought all PDFs should be password protected. Other files I could open and print, but not modify in any way. I had to unlock (decrypt) all of these files because I am making my paperless storage full-text searchable (using Adobe Acrobat Pro’s batch OCR and its PDF optimizer to downsample images). To then find the files I need, I am using Yep on the Mac and Knowledge Tree on my file server for instant tag-based and full-text access to my documents.

My home file system is divided into two main areas: library and personal. Library contains everything that others have created for the masses and I have generally purchased: applications burned to image, music, online books, etc (> 100 gigs). Personal is reserved for files created for or by me or my family. I generally do this to prioritize my backup strategy where personal files get much more care. Inside the personal directory are directories for photographs, movies, sound recordings, my working directories (one for each person) and my digital archive, the focus of this article.

As stated above, my digital archive needed cleaning. After a good scrub with fslint to remove duplicates and cruft, the first thing I had to take care of was a whole collection of non-PDF files that had accumulated over the years as I struggled to determine what exactly should go in my paperless archive, my online filing cabinet. I have decided that working files can be in whatever format they need to be, but my paperless archive is for static documents in PDF only. With this decision, files in a format like Excel or Matlab should be moved to my personal working directory, and other static files like GIFs and TIFFs should be converted to PDF. Now all archived documents are searchable with one method and the filing cabinet metaphor works much better (apologies to postmodern abstractions, I am still old fashioned in my metaphors). To do all this, I found the unix find command and some ruby scripting very useful. Before making any modifications, I made backups to my off-site storage, my RAID array, and my local desktop. Armed with a good backup, I used find to tell me what and where all the non-PDFs were:

find . -type f -not -iregex ".*\.pdf$"

which was roughly a gig of tiffs, strange files, and saved web-pages (what was I thinking). These were easy enough to move out of the way with:

mkdir not_pdf
# skip not_pdf itself so find does not re-copy files it has just copied
find . -type f -not -path "./not_pdf/*" -not -iregex ".*\.pdf$" -exec cp --parents {} not_pdf/ \;

Note the use of --parents, which recreates each file's directory path under not_pdf so that files with the same name don't overwrite each other in one big directory. After carefully verifying that I had copied all relevant non-PDF files to the not_pdf directory, and after moving that directory outside the top-level directory, I ran the (dangerous):

find . -type f -not -iregex ".*\.pdf$" -exec rm {} \;
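The "carefully verifying" step before a delete like that can also be scripted. Here is a minimal ruby sketch of one way to do it, not what I actually ran: it assumes the copy directory mirrors the archive's relative paths (as cp --parents produces) and lives outside the archive. The function name unverified_non_pdfs is made up for illustration.

```ruby
require 'find'
require 'digest'

# List every non-PDF regular file under archive_dir that does NOT have a
# byte-identical copy at the same relative path under backup_dir.
# Assumption: backup_dir is outside archive_dir and mirrors its layout.
def unverified_non_pdfs(archive_dir, backup_dir)
  missing = []
  Find.find(archive_dir) do |path|
    # Mirror `find -type f`: regular files only, symlinks skipped.
    next unless File.file?(path) && !File.symlink?(path)
    next if path =~ /\.pdf\z/i

    relative = path.sub(/\A#{Regexp.escape(archive_dir)}\/?/, '')
    copy = File.join(backup_dir, relative)

    # Compare contents, not just names, so a truncated copy is caught.
    same = File.file?(copy) &&
           Digest::SHA256.file(path).hexdigest == Digest::SHA256.file(copy).hexdigest
    missing << path unless same
  end
  missing.sort
end

# Usage: ruby verify_copy.rb /path/to/archive /path/to/not_pdf
if __FILE__ == $PROGRAM_NAME && ARGV.length == 2
  leftovers = unverified_non_pdfs(ARGV[0], ARGV[1])
  leftovers.each { |f| puts "no verified copy: #{f}" }
  puts "#{leftovers.size} non-PDF files without a verified copy"
end
```

If the list comes back empty, every non-PDF in the archive has an identical twin in the copy directory, which is about as much reassurance as you can get before an rm.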

Now, to tackle the challenge of encrypted PDFs. My strategy here is to move them from the archive to a stand-alone directory corresponding to their status (totally locked down, modifications restricted, etc.) because I am going to have different scripts and strategies to deal with each of these. The totally encrypted files will require pdfcrack, facilitated by a list of frequently used passwords. The restricted files are from my broker (seriously, why not let me own/modify my documents?) and I am going to have to brute-force crack them (preferred) or de-DRM them via printing, etc. This is a huge annoyance, and I hope financial firms will give us documents in the future that let us use them however we like. As it is now, I can’t merge them or search them. To do this binning of PDFs, I built the following ruby script around the unix executable pdfinfo. It needs refactoring, but it got the job done. The difficult part for me was capturing both stderr and stdout, since I needed to know when I couldn’t open a document.


#!/usr/bin/ruby

require 'find'

basedir = '/path/to/documents'
my_count = 0

Find.find(basedir) do |file|
  next unless file =~ /\.pdf$/i

  begin
    # 2>&1 folds stderr into stdout so pdfinfo's error messages are captured too.
    # Caveat: a filename containing " or $ will break this shell quoting.
    result = %x[pdfinfo "#{file}" 2>&1]
  rescue
    puts "trouble with #{file}"
    result = ''
  end

  if result =~ /Encrypted:\s+yes\s\(print:(\w+)\scopy:(\w+)\schange:(\w+)\saddNotes:(\w+)/
    can_print, can_copy, can_change, can_add_notes = $1, $2, $3, $4
    # The captures are the strings "yes"/"no", and both are truthy in ruby,
    # so compare against "yes" explicitly.
    if can_print == 'yes'
      puts "can print #{file}"
      `mv "#{file}" /my_archive/need_to_print/`
    else
      puts "can't print #{file}"
      `mv "#{file}" /my_archive/cant_print/`
    end
    my_count += 1
  elsif result =~ /Error: Incorrect password/i
    # pdfinfo could not open the file at all without a password
    puts "deal with no password #{file}"
    `mv "#{file}" /my_archive/need_password/`
    my_count += 1
  end
end

puts "#{my_count} password protected files"
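Since capturing stderr and stdout together was the fiddly part, here is a hedged sketch of another way to get the same result with ruby's standard Open3 library, which merges both streams without going through a shell (so odd filenames are not a quoting problem). The helper names classify_pdfinfo and pdfinfo_output are made up for illustration, and the regular expressions assume pdfinfo prints lines shaped like the ones the script above matches; check them against your own pdfinfo version.

```ruby
require 'open3'

# Assumed shape of the permissions line, e.g.
#   Encrypted:      yes (print:yes copy:no change:no addNotes:no)
PERMISSIONS = /^Encrypted:\s+yes\s+\(print:(\w+)\s+copy:(\w+)\s+change:(\w+)\s+addNotes:(\w+)/

# Map pdfinfo's combined stdout/stderr text to one of the bins used above.
# Returns :need_password, :need_to_print, :cant_print, or nil (not encrypted).
def classify_pdfinfo(output)
  return :need_password if output =~ /Incorrect password/i

  match = PERMISSIONS.match(output)
  return nil unless match

  match[1] == 'yes' ? :need_to_print : :cant_print
end

# Run pdfinfo on one file and return stdout and stderr merged into one string.
# Passing the path as a separate argument bypasses the shell entirely.
def pdfinfo_output(path)
  output, _status = Open3.capture2e('pdfinfo', path)
  output.scrub # replace invalid bytes so the regex matching cannot raise
end

# Usage: ruby classify.rb some.pdf
if __FILE__ == $PROGRAM_NAME && ARGV.length == 1
  puts classify_pdfinfo(pdfinfo_output(ARGV[0])).inspect
end
```

Keeping the parsing in its own function also means the binning logic can be checked against saved pdfinfo output without touching the real archive.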

How I deal with these will be the subject of a future post. Please let me know any pointers on any of this.
