Move Large Files With Ease

Gathering file listings and file attributes is a very expensive operation when storage is not local. Operating systems typically keep a cache of attribute data for files on local storage: the OS continuously indexes local files in the background and keeps the cache up to date with values that can be accessed quickly when needed. But in today’s world of network storage and object storage (on-prem or in the cloud), file attribute data may not be indexed the same way as local storage. This means that collecting the data required just to start a transfer of a very large file set can take a lot of time, perhaps several hours, before a single byte is moved. Even when the values are indexed, the cache is finite in size; usually only the most recently used (MRU) entries are retained.
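To see why the preparation step alone is so costly, here is a minimal Python sketch of the kind of up-front scan a traditional transfer tool performs (the numbers in the comment are illustrative assumptions, not FileCatalyst measurements):

```python
import os

def prescan(root):
    """Walk a tree and collect per-file attributes up front,
    the way a traditional transfer tool prepares a job."""
    total_files, total_bytes = 0, 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.stat(os.path.join(dirpath, name))  # one metadata lookup per file
            total_files += 1
            total_bytes += st.st_size
    return total_files, total_bytes

# On local storage these stat() calls usually hit the OS attribute cache and return
# in microseconds. On NFS/SMB or object storage each one may be a network round trip,
# so 10 million files at even 1 ms apiece is close to three hours of pure setup.
```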

What this all means is that for very large file sets, traditional file transfer tools may spend significant time preparing to transfer, even when the files are on local storage. An interrupted transfer compounds the problem if the tool has to start that preparation over from scratch in order to resume.

Even when a tool is able to handle the transfer of millions of files, it may struggle simply to navigate to the required directories because they are so large. Such tools can be great for browsing and performing ad-hoc transfers of small numbers of files or directories, but browsing and displaying millions of files in a UI is very difficult, because the tool needs to load the attributes of every single file in the listing. As described earlier, calculating file set sizes, attributes, and so on takes so long that the tool becomes sluggish or hangs outright.

Web-based uploaders typically do not allow the upload of entire directories, so users are forced to browse into massive directories that often cannot even be rendered by the web browser. Web-based tools usually provide an easy-to-use interface for browsing file sets with no third-party software to install, but they simply do not work well with very large file sets. The same reasons desktop tools are inefficient at browsing large file sets apply to most web-based tools as well.

OK, suppose you are able to browse and initiate a transfer of millions of files with your file transfer tool. What happens when there is a network interruption and you were 2 million files into a 10 million file transfer? Consideration has to be given to whether the transfer can be resumed at all, and how quickly. With such a large volume of files, there is a good chance that something will go wrong mid-transfer. You must ensure your file transfer tool can reliably determine what has and has not been transferred, or you will spend even more time re-scanning files and retransferring data that was already sent.

The solution to migrating a very large file set efficiently is streaming. Calculating the total size of all files, and with it an ETA for the transfer, must be set aside. When a file transfer tool streams the file listing, the transfer can start immediately even when there are millions of files, and because file attribute data is not all loaded in advance, memory usage remains low.
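As a rough illustration of the idea, here is a minimal Python sketch of a streamed migration (not FileCatalyst's actual implementation; `send_file` is a hypothetical stand-in for whatever moves the bytes):

```python
import os

def stream_files(root):
    """Lazily yield files one at a time instead of building a full listing first."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)

def migrate(root, send_file):
    # The first file starts moving as soon as it is discovered: no up-front scan,
    # no total size or ETA, and only one path held in memory at a time.
    for path in stream_files(root):
        send_file(path)
```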

What about interrupted transfers of millions of files? To overcome this, an efficient database is needed to track which files were transferred successfully, along with their last modification times. When a transfer job is restarted, each file is quickly compared against the database: anything already transferred and unchanged is skipped, and new data starts flowing almost immediately.
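A minimal sketch of that bookkeeping in Python with SQLite (the table and function names are illustrative assumptions, not FileCatalyst internals):

```python
import os
import sqlite3

db = sqlite3.connect("transfer_state.db")
db.execute("CREATE TABLE IF NOT EXISTS sent (path TEXT PRIMARY KEY, mtime REAL)")

def already_sent(path):
    """Skip a file only if it was sent before and has not been modified since."""
    row = db.execute("SELECT mtime FROM sent WHERE path = ?", (path,)).fetchone()
    return row is not None and row[0] == os.stat(path).st_mtime

def record_sent(path):
    """Call after a file completes so a restarted job can skip it next time."""
    db.execute("INSERT OR REPLACE INTO sent VALUES (?, ?)",
               (path, os.stat(path).st_mtime))
    db.commit()
```

On restart, the scan still walks the source, but each file needs only a cheap lookup and comparison rather than a retransfer, so previously completed work is skipped in seconds rather than hours.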

So why don’t all file transfer tools simply overcome these issues with streaming and database caching? The answer is that those tools still serve a purpose and do the job they were designed for very well: people want to browse their files, perform ad-hoc transfers, and see an ETA, percentage complete, and so on for each transfer.

If your use case calls for migrating millions of files, traditional file transfer tools and simple web-based tools are just not feasible. As of version 3.8, our FileCatalyst HotFolder application supports all of the capabilities described above and can be used to migrate millions of files from a source location to a destination. As a bonus, it also integrates with several leading cloud/object storage providers, making it an extremely powerful tool for cloud migration tasks.

Stay tuned for the public release of FileCatalyst Direct and FileCatalyst Central version 3.8 this spring! Until then, follow us on Twitter @filecatalyst and sign up for our monthly newsletter HERE