I have a deduplicated storage of some million files in a two-level hashed directory structure. The filesystem is an ext4 partition on a magnetic disk. The path of a file is computed by its MD5 hash like this:
e93ac67def11bbef905a7519efbe3aa7 -> e9/3a/e93ac67def11bbef905a7519efbe3aa7
When processing* a list of files sequentially (selected by metadata stored in a separate database), I can literally hear the noise produced by the seeks (“randomized” by the hashed directory layout as I assume).
My actual question is: Is there a (generic) way to process a potentially long list of potentially small files in a seek-optimized manner, given they are stored on an ext4 partition on a magnetic disk (implying the use of linux)?
Such optimization is of course only useful if there is a sufficient share of small files. So please don’t care too much about the size distribution of files. Without loss of generality, you may actually assume that there are only small files in each list.
As a potential solution, I was thinking of sorting the files by their physical disk locations or by other (heuristic) criteria that can be related to the total amount and length of the seek operations needed to process the entire list.
A note on file types and use cases for illustration (if need be)
The files are a deduplicated backup of several desktop machines. So any file you would typically find on a personal computer will be included on the partition. The processing however will affect only a subset of interest that is selected via the database.
Here are some use cases for illustration (list is not exhaustive):
- extract metadata from media files (ID3, EXIF etc.) (files may be large, but only some small parts of the files are read, so they become effectively smaller)
- compute smaller versions of all JPEG images to process them with a classifier
- reading portions of the storage for compression and/or encryption (e.g. put all files newer than X and smaller than Y in a tar archive)
- extract the headlines of all Word documents
- recompute all MD5 hashes to verify data integrity
While researching for this question, I learned of the
ioctl command (e.g. mentioned here) which may be worth a shot, because the files will not be moved around and the results may be stored along the metadata. But I suppose that will only work as sort criterion if the location of a file’s inode correlates somewhat with the location of the contents. Is that true for ext4?
*) i.e. opening each file and reading the head of the file (arbitrary number of bytes) or the entire file into memory.