How to implement fast file search on a computer
One of the fundamental problems in computing is finding which file resides where in a filesystem, whether on a home computer holding a music collection or at a search-engine company whose files are spread across a datacenter, which is in turn one of many datacenters around the globe.
The second problem is, of course, sorting (not really second, since it is closely intertwined with searching), but that is out of scope for this article.
Every file on your computer has:
- inode
- i-number
- symlink (optional)
- hardlink (optional)
The inode holds metadata about the state of the file, i.e. its access time (atime), its inode change time (ctime, often mistaken for a creation time) and its modification time (mtime).
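To see these fields for yourself, here is a minimal sketch that prints the i-number and the three timestamps of a throwaway file. It assumes GNU coreutils stat (the -c format option is not available on BSD stat), and demo.txt is just an illustrative name.

```shell
#!/bin/sh
# Sketch: inspect the i-number and the three inode timestamps of a file.
# Assumes GNU coreutils stat; demo.txt is a throwaway file in a temp dir.
workdir=$(mktemp -d)
demo="$workdir/demo.txt"
echo "hello" > "$demo"

# %i = i-number, %x = access time, %y = modification time, %z = inode change time
stat -c 'inode=%i atime=%x mtime=%y ctime=%z' "$demo"

# ls -i prints the same i-number in front of the file name
inode_from_stat=$(stat -c '%i' "$demo")
inode_from_ls=$(ls -i "$demo" | awk '{print $1}')
echo "stat says $inode_from_stat, ls says $inode_from_ls"

rm -rf "$workdir"
```

Both commands read the same inode, so the two numbers agree.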
The best Linux utility for finding files by such attributes is find.
Example 1:
find . -name '*.mp3' -ctime -14 -print
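The line above finds mp3 files whose inode changed within the last two weeks. The '-ctime' option tests the inode change time, which is updated whenever the file's contents or metadata (permissions, owner, name) change. It is not a creation time, although the two coincide for files that were never touched after being created. Use '-mtime' to test the content modification time instead. You can also use the case-insensitive “-iname” in place of “-name” to also match '.MP3' files, i.e. music files with an uppercase extension.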
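The difference between “-name” and “-iname” is easiest to see on a tiny test directory. The sketch below uses made-up file names in a temporary directory and assumes a find that supports -iname (GNU and BSD find both do).

```shell
#!/bin/sh
# Sketch: compare -name and -iname on three throwaway files.
workdir=$(mktemp -d)
touch "$workdir/song.mp3" "$workdir/LOUD.MP3" "$workdir/notes.txt"

# -name is case-sensitive, so only song.mp3 matches
lower_only=$(find "$workdir" -type f -name '*.mp3' | wc -l | tr -d ' ')

# -iname ignores case, so LOUD.MP3 matches as well
any_case=$(find "$workdir" -type f -iname '*.mp3' | wc -l | tr -d ' ')

# -mtime -14 keeps files whose contents were modified less than 14 days ago
recent=$(find "$workdir" -type f -iname '*.mp3' -mtime -14 | wc -l | tr -d ' ')

echo "-name: $lower_only  -iname: $any_case  modified recently: $recent"
rm -rf "$workdir"
```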
Example 2:
find . -maxdepth 1 -type d -printf '%i %a %p \n'
The command above searches the current directory only, NOT recursively (-maxdepth 1), and prints the i-number, last access time and path of each directory.
This can be very handy if you have a Samba share at your place of work, or an NFS directory on *nix systems where multiple users have read/write access to a single resource, and you want to check when each directory was last accessed. Note that the timestamps alone do not tell you which user was responsible.
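The 'find' command can be very slow compared with locate, which answers within the blink of an eye because it searches a prebuilt database instead of walking the filesystem. Remember to run the updatedb command before running locate, otherwise recently created files will not show up. The speed of locate is also its weakness, since its results are only as fresh as the database, and it lacks find's many test parameters.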
By now you are wondering what the third magical command is. Well, it's still the archaic find, on steroids. That, my friend, is the joy of Unix: combining single-purpose programs in bizarre yet intuitive ways to create programs nobody had thought of.
find . -print0 | xargs -0 ls -id > .findcache.new
mv -f .findcache.new .findcache
Put the two lines above in a file called .findcache.sh and add it as a cron job that runs daily or twice a day (in the morning and evening), depending on your needs.
The -print0 in the first line tells find to terminate every filename with a NUL character instead of a newline, so that names containing whitespace are handled correctly. This is useful when piping to xargs. In simple terms, xargs reads items from stdin and passes them as arguments to the given command, running it as few times as possible, sequentially rather than in parallel (it doesn't work really well with 'dd'). xargs has a matching -0 option to read NUL-terminated filenames. The command 'ls -id' prefixes every entry with its i-number before it is written to '.findcache.new'.
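To see why the -print0 and -0 pair matters, the following sketch runs the pipeline with and without it on a directory containing one file with a space in its name. The file names are made up for illustration, and the temporary directory path is assumed to contain no whitespace itself.

```shell
#!/bin/sh
# Sketch: whitespace in file names with and without -print0 / -0.
workdir=$(mktemp -d)
touch "$workdir/my song.mp3" "$workdir/plain.mp3"

# Without -print0, xargs splits "my song.mp3" at the space into two bogus
# names, ls fails on both (errors are hidden here) and the file is lost.
unsafe_count=$(find "$workdir" -type f | xargs ls -id 2>/dev/null | wc -l | tr -d ' ')

# With -print0 and -0, names are separated by NUL bytes and survive intact.
safe_count=$(find "$workdir" -type f -print0 | xargs -0 ls -id | wc -l | tr -d ' ')

echo "entries without -print0: $unsafe_count, with -print0: $safe_count"
rm -rf "$workdir"
```

The unsafe variant silently drops the file with the space in its name, which is exactly the kind of hole you do not want in an index.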
The name cache is misleading since it's not an actual cache but a form of database index.
Next, the mv command renames the fresh .findcache.new to .findcache, overwriting the old index in its entirety on every cron run.
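One possible hardening of the script, offered as a sketch rather than the definitive version: only replace the old index when the listing step succeeded, and keep the index files out of their own listing. The function name build_index, the exclusion of '.findcache*' and the cron times in the comment are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of a more defensive .findcache.sh. build_index is a hypothetical name.
#
# A crontab line to run the script at 08:00 and 20:00 might look like this
# (the times and the script location are assumptions):
#   0 8,20 * * * cd "$HOME" && sh ./.findcache.sh

build_index() {
    root=$1
    index="$root/.findcache"
    tmp="$index.new"
    # Write to a temporary file first so that searches never see a half-built
    # index. '! -name' keeps the index files themselves out of the listing.
    if (cd "$root" && find . ! -name '.findcache*' -print0 | xargs -0 ls -id) > "$tmp"; then
        mv -f "$tmp" "$index"   # replace the old index only on success
    else
        rm -f "$tmp"            # keep the previous index if ls/xargs failed
        return 1
    fi
}

# Demo on a throwaway directory
demo_root=$(mktemp -d)
touch "$demo_root/a.mp3" "$demo_root/b artist.mp3"
build_index "$demo_root"
cat "$demo_root/.findcache"
```

Because the exit status of a pipeline is that of its last command, this check catches failures in xargs and ls, not in find itself.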
Once the indexing is complete, here are a few more commands to try.
Example 3:
grep -i 'artist' .findcache
This grep finds all files (you guessed it) with artist in their name, presumably music files. The preceding "-i" makes the match case-insensitive.
Example 4:
awk 'BEGIN { IGNORECASE = 1 } /artist/ && /\.mp3$/ { print $0 }' .findcache
An awk script can express richer search patterns than the previous example. This one is more useful since it actually returns only mp3 files rather than any file with "artist" in its name. Note that IGNORECASE is specific to GNU awk (gawk).
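If gawk is not available, the same search can be sketched in portable awk by lowercasing each line instead of relying on IGNORECASE. The helper name search_index and the sample index lines below are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Sketch: case-insensitive search of the index without gawk's IGNORECASE.
# search_index is a hypothetical helper name.
# Usage: search_index INDEX_FILE WORD EXTENSION
search_index() {
    awk -v word="$2" -v ext="$3" '
        BEGIN { word = tolower(word); ext = "." tolower(ext) }
        {
            line = tolower($0)
            # index() is a literal substring test; substr() checks the line ending
            if (index(line, word) && substr(line, length(line) - length(ext) + 1) == ext)
                print
        }' "$1"
}

# Demo on a made-up index in the "i-number path" format written by ls -id
index_file=$(mktemp)
printf '%s\n' \
    '101 ./music/Artist - Song.MP3' \
    '102 ./docs/artist-bio.txt' \
    '103 ./music/other.mp3' > "$index_file"

matches=$(search_index "$index_file" artist mp3)
echo "$matches"
rm -f "$index_file"
```

Only the first sample line contains "artist" and ends in ".mp3" (in any case), so it is the only one printed.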
References
- The awk user manual https://www.gnu.org/software/gawk/manual/gawk.html
- The find user manual https://www.gnu.org/software/findutils/manual/html_mono/find.html
- The grep user manual https://www.gnu.org/software/grep/manual/grep.html