Software Archeology Using Rsync

The most powerful tool in the knapsack of a software archeologist/maintainer is grep. Unfortunately, the signal-to-noise ratio of grep search results is often quite low. This happens when the project's source files are intermingled with other artifacts such as generated files, raw templates, and library or framework documentation and examples.

One trick for filtering out the noise is to define a shell script that uses Rsync to create or update a searchable shadow copy of the working folder, and then to search that copy instead of the original. In case you’re not familiar with Rsync, it is a tool for keeping two directory trees synchronized, whether they live on the same machine or on different ones. Rsync’s main claim to fame is that it’s fast because it only transmits the differences, but it is also quite powerful when it comes to specifying exactly which files and folders are to be synchronized and how. It’s this secondary feature that allows us to filter out the noise. There are two parts to this solution: the actual shell script, and a file that lists all of the inclusion and exclusion patterns. (This example uses CygWin, running on a Windows box.)

Here is the (entire) shell script (C:\work\cmd\searchcopy.sh):

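 #!/bin/bash
 # pushd/popd are bash builtins, so ask for bash rather than plain sh.
 pushd /cygdrive/c/work || exit 1
 mkdir -p /cygdrive/e/work_search
 # -v verbose, -r recurse, -u skip files that are newer in the shadow copy,
 # -t preserve modification times; the filter file lists what to leave out.
 rsync -vrut --filter='. /cygdrive/c/work/cmd/searchcopy_filelist.txt' alpha bravo charlie /cygdrive/e/work_search
 popd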
  • /cygdrive/c/work is your working folder (that’s CygWin speak for C:\work).
  • alpha, bravo, and charlie are the folder names of the projects that you are interested in.
  • /cygdrive/e/work_search is the name of the searchable shadow copy you want to create/update (over on your E: removable USB drive).
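
Once the shadow copy exists, the search itself is an ordinary recursive grep pointed at it. Here is a minimal sketch, assuming the shadow location used by the script above. The function name search_shadow and the SHADOW_DIR variable are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: search the shadow copy instead of the working folder.
# SHADOW_DIR defaults to the shadow location used by searchcopy.sh.
SHADOW_DIR="${SHADOW_DIR:-/cygdrive/e/work_search}"

search_shadow() {
    # -r: recurse, -n: show line numbers, -I: skip binary files
    grep -rnI -- "$1" "$SHADOW_DIR"
}
```

After sourcing this, something like search_shadow 'some text' prints each match as file:line:text. Any extra grep options you rely on would need to be added to the function.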

Here is (an abbreviated version of) the filter file (C:\work\cmd\searchcopy_filelist.txt), to give you an idea:

 - .svn/
 - bin/
 - build*/
 - deployment/
 - lib/
 - log/
 - .#*
 - *.[ehjstw]ar
 - *.[Bb][Aa][Kk]
 - *.doc
 - *.[Ee][Xx][Ee]
 - *.gif
 - *.httpunit
 - *.ico
 - *.jasper
 - *.jpg
 - *.library
 - *.log
 - *.[Oo][Ll][Dd]
 - *.pdf
 - *.[Zz][Ii][Pp]

In this case, they are all exclusions (leading minus sign). Thus, everything in the alpha, bravo, and charlie folders will be copied, except files or subfolders matching these patterns.
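
A filter file can also carry inclusion rules (leading plus sign). Rsync checks the rules from top to bottom and the first match wins, so an inclusion placed ahead of a broader exclusion rescues specific files. The sketch below is a self-contained illustration rather than part of the script above: it builds a throwaway tree under a temporary directory (all file names are made up) and keeps one log file while dropping the rest.

```shell
#!/bin/bash
# Throwaway demo tree; every name here is hypothetical.
demo="$(mktemp -d)"
mkdir -p "$demo/work/alpha" "$demo/shadow"
touch "$demo/work/alpha/Main.java" \
      "$demo/work/alpha/debug.log" \
      "$demo/work/alpha/important.log"

# First matching rule wins: keep important.log, drop every other *.log.
cat > "$demo/filter.txt" <<'EOF'
+ important.log
- *.log
EOF

if command -v rsync >/dev/null 2>&1; then
    (cd "$demo/work" && rsync -rt --filter=". $demo/filter.txt" alpha "$demo/shadow")
    # List what reached the shadow copy.
    (cd "$demo/shadow" && find . -type f | sort)
else
    echo "rsync is not installed; nothing copied" >&2
fi
```

The listing should show Main.java and important.log under alpha, but not debug.log. Swapping the two rules would exclude important.log as well, because the *.log exclusion would then match first.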

Tips for using Rsync:

  • Don’t waste time with the --include and --exclude switches. They are merely dumbed-down versions of the --filter switch, so just use --filter right from the start.
  • Avoid the --cvs-exclude switch if you can, and pay close attention to what it ignores if you can’t. For example, it ignores any file or folder named “core”, and it ignores *.script files, both of which burned me when I tried using it on a certain Tapestry application.
  • Most implementations of Rsync are case sensitive, including CygWin’s! So if filenames might exist with multiple casings, you either have to repeat the pattern for each casing or use the square bracket notation:
     - *.EXE
     - *.Exe
     - *.exe

    or

     - *.[Ee][Xx][Ee]
  • Pay close attention to the man pages that describe other aspects of the pattern matching algorithm. For example, leading and trailing slashes each have special significance.
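
To make the point about slashes concrete, here is a small sketch under made-up names. A leading slash anchors the pattern to the top of the transfer, and a trailing slash restricts it to directories. The demo builds a throwaway tree in a temporary directory, so nothing in your real working folder is touched.

```shell
#!/bin/bash
# Throwaway demo tree; every name here is hypothetical.
demo="$(mktemp -d)"
mkdir -p "$demo/work/alpha/log" "$demo/work/alpha/notes" \
         "$demo/work/bravo/log" "$demo/shadow"
touch "$demo/work/alpha/log/run.txt" \
      "$demo/work/alpha/notes/log" \
      "$demo/work/bravo/log/run.txt"

# Leading slash: match only at the top of the transfer (alpha/log).
# Trailing slash: match directories only, so a file named "log" survives.
cat > "$demo/filter.txt" <<'EOF'
- /alpha/log/
EOF

if command -v rsync >/dev/null 2>&1; then
    (cd "$demo/work" && rsync -rt --filter=". $demo/filter.txt" alpha bravo "$demo/shadow")
    # List what reached the shadow copy.
    (cd "$demo/shadow" && find . -type f | sort)
else
    echo "rsync is not installed; nothing copied" >&2
fi
```

Only the alpha/log folder is left out. The log folder under bravo and the plain file named log under alpha/notes are both copied. Writing the rule as "- log/" instead would drop every folder named log, as in the filter file shown earlier.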



© 2006-2007 Maxim Software Corp.  All rights reserved.