How to implement fast file search on a computer

One of the fundamental problems in computing is finding which file resides where in a filesystem, whether on a home computer full of music or at a company like *search-engine whose files are distributed across a datacentre, which is in turn one of many datacentres around the globe.

The second fundamental problem is of course sorting (not really second, since it is closely intertwined with searching), but that is out of scope for this article.

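Every file on your computer has:

  • an inode
  • an i-number (the number that identifies the inode)
  • symlinks (optional)
  • hard links (optional)

The inode holds metadata about the state of the file, e.g. its access time (atime), modification time (mtime) and change time (ctime). The change time records the last change to the inode itself and is often mistaken for a creation time.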
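To make the list above concrete, here is a minimal sketch that creates a file, a hard link and a symlink in a scratch directory and prints their i-numbers. It assumes GNU coreutils (the -c format option of stat is GNU-specific), and the file names are made up for the demo.

```shell
# Scratch directory so the demo does not touch real files.
demo=$(mktemp -d)
echo "hello" > "$demo/song.txt"
ln "$demo/song.txt" "$demo/hard.txt"      # hard link: a second name for the same inode
ln -s "$demo/song.txt" "$demo/soft.txt"   # symlink: its own inode that stores a path

# -i prints the i-number in front of each name.
ls -i "$demo"

# GNU stat: %i i-number, %h hard-link count, %x/%y/%z access/modify/change times.
stat -c '%i %h %x %y %z %n' "$demo/song.txt"

# stat does not follow symlinks unless -L is given, so soft.txt reports its own inode.
inode_orig=$(stat -c '%i' "$demo/song.txt")
inode_hard=$(stat -c '%i' "$demo/hard.txt")
inode_soft=$(stat -c '%i' "$demo/soft.txt")
link_count=$(stat -c '%h' "$demo/song.txt")
```

The hard link shares the i-number of the original, while the symlink gets one of its own.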

The best Linux utility for finding files by such attributes is find.

Example 1:

find . -name '*.mp3' -ctime -14 -print

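The above line finds mp3 files whose inode changed within the last two weeks. The '-ctime' option tests the inode change time, which is updated whenever the file's content or metadata (owner, permissions, link count) changes. It is not a creation time, so use '-mtime' if you care about when the content was last modified. You can also use the case-insensitive '-iname' in place of '-name' to match '.MP3' files too, i.e. music files with an uppercase extension.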
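The difference between these tests is easier to see on throwaway files. The sketch below assumes GNU find and GNU touch (for the -d '30 days ago' syntax); the file names are invented for the demo.

```shell
demo=$(mktemp -d)
touch "$demo/new.mp3" "$demo/LOUD.MP3" "$demo/notes.txt"

# Backdate the modification time; the inode change time is still "now",
# because setting the timestamps is itself a change to the inode.
touch -d '30 days ago' "$demo/old.mp3"

by_name=$(find "$demo" -name '*.mp3' | wc -l)               # case-sensitive match
by_iname=$(find "$demo" -iname '*.mp3' | wc -l)             # also matches LOUD.MP3
by_mtime=$(find "$demo" -iname '*.mp3' -mtime -14 | wc -l)  # skips the backdated file
by_ctime=$(find "$demo" -iname '*.mp3' -ctime -14 | wc -l)  # still sees the backdated file

echo "name=$by_name iname=$by_iname mtime=$by_mtime ctime=$by_ctime"
```

Here old.mp3 passes the -ctime test even though its content looks a month old, which is why -ctime should not be read as "created".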

Example 2:

find . -maxdepth 1 -type d -printf  '%i %a %p \n'

The above command searches only the current directory, NOT recursively (-maxdepth 1), and prints the i-number, last access time and name of each directory.

This can be very handy if you have a Samba share at your place of work, or an NFS directory on *nix systems where multiple users have read/write access to a single resource, and you want to check which directory was accessed last. Note that the output tells you when, not who; adding %u to the format prints each directory's owner.
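As a rough sketch of that use case, the directories can be sorted by access time so the most recently touched one comes last. This assumes GNU find (for -printf and its %A@ and %u directives); the directory names are made up.

```shell
demo=$(mktemp -d)
mkdir -p "$demo/projects/deep" "$demo/reports"

# %A@ = access time in seconds since the epoch (sorts numerically),
# %u  = owner's user name, %p = path.
# -mindepth 1 leaves out the starting directory itself.
listing=$(find "$demo" -mindepth 1 -maxdepth 1 -type d -printf '%A@ %u %p\n' | sort -n)
printf '%s\n' "$listing"

# The most recently accessed directory is the last line.
printf '%s\n' "$listing" | tail -n 1

dir_count=$(printf '%s\n' "$listing" | wc -l)
```

Because of -maxdepth 1, the nested directory "deep" never shows up in the listing.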

The 'find' command can be very slow compared with locate, which answers within the blink of an eye because it queries a prebuilt database instead of walking the filesystem. Remember to run the updatedb command before running locate. The haste of locate is also its weakness: results are only as fresh as the last updatedb run, and it matches on names only, without find's many test parameters.

By now you are wondering what the third magical command is. Well, it's still the archaic find, on steroids. That, my friend, is the joy of Unix: combining single-purpose programs in bizarre yet intuitive ways to create programs previously unthought of.

find . -print0 | xargs -0 ls -id > .findcache.new
mv -f .findcache.new .findcache

Type the above lines into a file called .findcache.sh and add it as a cron job to execute daily or twice a day (in the morning and evening), depending on your needs.

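The -print0 in the first line tells find to terminate each filename with a NUL character instead of a newline, so filenames containing whitespace survive the pipe to xargs. In simple terms, xargs reads those names from its stdin and passes them as arguments to the given command, in batches that run sequentially rather than in parallel (which doesn't work really well with 'dd').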

xargs needs the matching -0 option to split its input on NUL characters. The command 'ls -id' prefixes every entry with its i-number (the -d lists directories themselves rather than their contents) before the output is written to '.findcache.new'.
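A small experiment shows why the -print0 / -0 pair matters. This is only a sketch on a scratch directory with one invented filename containing a space.

```shell
demo=$(mktemp -d)
touch "$demo/my song.mp3"

# Newline-separated names: xargs splits "my song.mp3" at the space, so ls is
# asked about two files that do not exist and the pipeline fails.
plain_ok=yes
find "$demo" -name '*.mp3' | xargs ls -id >/dev/null 2>&1 || plain_ok=no

# NUL-separated names: the filename reaches ls as a single argument.
safe_ok=yes
find "$demo" -name '*.mp3' -print0 | xargs -0 ls -id >/dev/null 2>&1 || safe_ok=no

echo "plain: $plain_ok, with -print0/-0: $safe_ok"
```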

The name cache is misleading since it's not an actual cache but a form of database index.

Next, the mv command renames the freshly built .findcache.new over the old .findcache, replacing it in its entirety on every cron run. Writing to a separate file first means a search never sees a half-written index.
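One possible variation, not the original script, is to replace the old index only when the indexing pipeline succeeded. The sketch below wraps the two commands in a function; the name build_findcache, the crontab schedule and the $HOME/Music path are all assumptions made for illustration.

```shell
#!/bin/sh
# build_findcache DIR: hypothetical helper that indexes DIR into DIR/.findcache.
build_findcache() {
    (
        # Subshell, so the caller's working directory is left alone.
        cd "${1:-.}" || exit 1
        # && makes sure the old index is replaced only if find | xargs succeeded.
        find . -print0 | xargs -0 ls -id > .findcache.new &&
            mv -f .findcache.new .findcache
    )
}

# Example crontab entry for 08:00 and 20:00 (the path is an assumption):
# 0 8,20 * * * cd "$HOME/Music" && sh .findcache.sh

# Try it on a scratch directory.
demo=$(mktemp -d)
touch "$demo/my song.mp3"
build_findcache "$demo"
cat "$demo/.findcache"
```

If the pipeline fails, the previous .findcache stays in place and the partial .findcache.new is left behind for inspection.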

After the indexing is complete, here are a few more commands to try.

Example 3:

grep -i 'artist' .findcache

This grep finds all files (you guessed it) with artist in their name, presumably music files. The preceding '-i' makes the match case insensitive.

Example 4:

awk 'BEGIN { IGNORECASE = 1 } /artist/ && /\.mp3$/ { print $0 }' .findcache

Using an awk script allows more precise search patterns than the previous example. This is more beneficial since it will actually return mp3 files rather than any file with "artist" in its name. Note that IGNORECASE is a gawk extension; other awk implementations silently ignore it.
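To compare the two searches side by side, here is a sketch that runs them against a small hand-written index. The four sample entries are invented, and the awk line uses tolower() instead of IGNORECASE so that it should also work with awk implementations other than gawk.

```shell
# A tiny stand-in for .findcache: i-number, then path.
index=$(mktemp)
cat > "$index" <<'EOF'
101 ./music/Artist - Song One.mp3
102 ./music/artist - song two.MP3
103 ./docs/artist-bio.txt
104 ./music/other band - tune.mp3
EOF

# grep -i matches every entry that mentions "artist", including the .txt file.
grep -i 'artist' "$index"
grep_hits=$(grep -ci 'artist' "$index")

# awk keeps only entries that mention "artist" AND end in .mp3, in any case.
awk 'tolower($0) ~ /artist/ && tolower($0) ~ /\.mp3$/' "$index"
awk_hits=$(awk 'tolower($0) ~ /artist/ && tolower($0) ~ /\.mp3$/' "$index" | wc -l)

echo "grep: $grep_hits entries, awk: $awk_hits entries"
```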

References

  1. The awk user manual https://www.gnu.org/software/gawk/manual/gawk.html
  2. The find user manual https://www.gnu.org/software/findutils/manual/html_mono/find.html
  3. The grep user manual https://www.gnu.org/software/grep/manual/grep.html