cygwin + bash + rsync
You could have set up cygwin with bash and rsync on the Windows machines as well. Rsync is better than cp when it comes to moving files between systems.
Recently I copied 60 million files from one Windows file server to another. Tools used to move files from one system to another are integrated into every operating system, and there are third party options too. The tasks they perform are so common that we tend to ignore their limits. Many systems administrators are guilty of not …
I have used rsync extensively; one of the things I commonly use it for is duplicating entire system installs from one system to another... I can do an initial copy while the source system is live to get 99% of the data, and then only need to copy the differences once I have shut down the source system. I have a busy mailserver which uses the maildir format (one file per message) and that had more than 60 million small files on it when I migrated it to new hardware.
You've tried this in cygwin on a Windows box, have you? And it works?
No one is disputing rsync works on Linux. The article is about how to copy that many files from a Windows box. If someone suggests doing it in cygwin, it's a more useful suggestion if they actually know whether it works, rather than leaving someone else to do their work for them.
In theory any of the other tools discussed in the article should also work. In practice they didn't.
It has options for 'synchronising' two folders, including file permissions.
Dump the text output into a file, then search for 'ERROR'; or better, pass it through FINDSTR to filter out anything but lines beginning with 'ERROR'.
Sure, it bombs out on folder/filenames longer than 256 characters, but those are an abomination and the users who made them (usually by using filenames suggested by MS Office apps) really need to be punished anyway.
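For the log-and-filter part, something along these lines works (paths, share and log names here are just examples):
robocopy D:\data \\newserver\data /MIR /COPYALL /R:1 /W:1 /NP /LOG:C:\logs\copy.log
findstr /C:"ERROR" C:\logs\copy.log
Note /MIR will also delete anything on the destination that isn't on the source; swap it for /E if you only want a straight one-way copy.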
Been pushing around a few million files this spring and summer...
robocopy also has the /create option which copies the files with 0 data. OK, a total operation takes 2+ passes, but it has several advantages:
1. 60 million files WILL cause the MFT to expand. If you are filling the disk with data as well as MFT then you can end up fragmenting the MFT, which can lead to performance problems. If the disk is only writing MFT (such as during a create), then the MFT can expand into adjacent space.
2. Since there is no data, the operation completes in a fraction of the time, so if you log the operation, you can see where any failures will occur, and fix them for the final copy.
For planned migrations, you can run the same job several times (only run the /create the first time), and therefore the final sync should take a lot less time :)
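Roughly like this (share and log paths are only placeholders):
:: pass 1 - directory tree and zero-length files only, so the MFT can grow contiguously
robocopy D:\data \\newserver\data /E /CREATE /LOG:C:\logs\pass1.log
:: pass 2 onwards - fill in the real data (repeat as needed before the final cutover)
robocopy D:\data \\newserver\data /E /COPYALL /LOG+:C:\logs\pass2.log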
I asked if "cygwin" had been considered, I did not say "Use cygwin! It's the wins! L0LZ!!11!" The author had looked at various tools and not mentioned "cygwin", so my question seems perfectly reasonable to me (others have asked the same question).
You seem to have taken umbrage at a few "cygwin" related posts; do you have an issue with this tool (I use it for some light-weight ssh and rsync work and really like it)? If you do, have you filed bugs, got involved?
Or do you know something about "cygwin" and large jobs? "Oh, you can't use cygwin for that because its job index will overflow, see bug-1234".
Or are you just some reactionary pillock who can't see through the Windows? "!Microsoft==Bad"
I know which conclusion I am drawing at the moment...
You're jumping to the wrong conclusions. The point is simply that before trying you'd expect half the tools tried in the article to work on Windows. However, they didn't. So it's helpful to make clear whether you know the solution you're suggesting works or not. It sounds like you don't. Also, you didn't ask whether cygwin had been considered. Re-read what you posted. It's nothing to do with Windows / Linux / OS of choice. I work mostly on BSD and Linux derivatives for a living. Oh, and I like cygwin.
And you do know the price of a tape drive, right? Company I worked for bought a pair of Ultrium-4 drives from IBM. The bloody thing itself costs a little over US$3000 a pop, and you need two. If you work in a company where accounts is a /b/tard, you'll know how painful it is to get them to approve the upgrade for a drive, let alone two drives.
+1 This has the most desirable side effect of stopping any bugger from trying to update the "wrong" file, or creating new ones while the copy is in progress.
p.s. Don't forget the second part of any professional data copying activity is to VERIFY that what you copied actually did turn out to be the same as what you copied from. Many a backup has turned out to be just a blank tape and an error message without this stage.
Clearly you've never worked in a decent sized enterprise.
There are LOTS of scenarios where you can't just move the disks or use backup hardware. Maybe not all the data on one disk is being moved. Maybe the data is on different SANs or NAS and can't be swapped or zoned. Maybe you can't attach the same type of tape to each server.
"So the best way to move 60 million files from one Windows server to another turns out to be: use Linux."
D'oh? On a serious note, you should also give rsync a go. And try also _storing_ the data on an OS that is not Windows; you might find that you are then able to do the things you need to do, like say copy 60m files, without resorting to booting virtual machines of another OS just to use that OS's utilities to manage your main OS. Just saying. Please don't flame me.
"Not easy once you try doing a file transfer via rsync through an ssh tunnel, like your suggesting, but the destination server isn't running an ssh server....let alone use / as a path convention."
Well, if the target system is running Linux, then turn on its sshd service!
If it's running Windoze ... well, borrow another PC, boot an appropriate Linux live CD, mount the MS Shared folder as CIFS, start the ssh daemon, and rsync through the temporary Linux system to the MS system.
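A minimal sketch of that, assuming a destination share and credentials along these lines (adjust to taste):
mount -t cifs //winserver/share /mnt/dst -o username=admin
rsync -rt --partial --progress user@sourcebox:/data/ /mnt/dst/
(-rt rather than -a, since Windows ownership won't survive the trip through a CIFS mount anyway.)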
Linux: the system that provides answers and encourages creativity.
Windoze: the system that erects obstacles and encourages stupidity.
Personally, I swear by Directory Opus, by GPsoftware.
I can attest to its incredibly reliable performance, error handling and insanely flexible advanced features.
Aside from being able to copy vast quantities of data, handle errors, log all actions, migrate NTFS properties, automatically unprotect restricted files and re-copy files if the source is modified, it also has built-in FTP, an advanced synchronisation feature (useful for mopping up failed files after you've fixed the problem that stopped them being copied), and a truly unparalleled batch renaming system which, among other things, can use Regular Expressions.
It also has tabbed browsing (you can save groups of tabs), duplicate file finding, built in Zip management, custom toolbar command creation, file/folder listing and printing....
Strangely, not a lot of sysadmins know about DOpus. I learnt of it during my Amiga days, what seems like a lifetime ago. I always have a copy installed on my workstation, and on at least one of my servers.
I like it! I'm often frustrated dealing with 10,000 small files in Windows, never mind 60m! On a desktop Windows 7 is painfully slow displaying 2000 photos in a single folder. It shows them straight away (detailed view, not thumbs) but then takes 20 secs to sort them by date modified! Aah!
But why do you have 60m files? Could you store that data in a better way? Could it be put into a database for example?
From the article:
I wanted to give several command-line tools a go as well. XCopy and Robocopy most likely would have been able to handle the file volume but - like Windows Explorer - they are bound by the fact that NTFS can store files with longer names and greater path than CMD can handle. I tried ever more complicated batch files, with various loops in them, in an attempt to deal with the path depth issues. I failed.
....which makes me very interested in what he was doing, and why robocopy wasn't an option
I've managed to use robocopy to create files I couldn't delete from windows before (because I'd gone from c:\ to d:\somefolder\someotherfolder and that pushed the bottom of the folders past the filepath limit)
"Richcopy can handle large quantities of files, but can multi-thread the copy, and so is several hours faster than using a Linux server as an intermediary"
"Richcopy is not so fast that there is time to defragment an NTFS partition with 60 million files on it before CP would have finished."
Wut?
Richcopy copies more than one file at a time.
If your destination is a Windows server, then doing this causes massive fragmentation. So the total time to finish the copy is "time to copy" + "time to defrag". The goal is not just to get files from server A to server B, but to get them there in a fashion that ensures that server B is ready for prime time.
rsync
I don't know what the state of rsync servers on windows is, but on unix systems it's the one true way for copying files over a network.
It's fast, handles failure well, doesn't get its knickers in a twist when doing large recursive copies, and will run over ssl with a bit of work. It can also give you plenty of feedback about progress if you need it.
The idea of copying that many files using something like a file manager or cp, no matter how good, just fills me with horror.
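For the record, the sort of invocation I mean (hostnames and paths are illustrative):
rsync -aH --partial --progress -e ssh /data/ user@newserver:/data/
-a keeps permissions, times and symlinks, -H preserves hard links, and --partial means an interrupted transfer can pick up where it left off on the next run.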
I've had some success with SyncToy for such tasks, but not with quite that number of files! Would be interested to see how it fares, actually.
Another way to use linux cp would be via Cygwin, which allows you to access main system drives within a unix-like shell.. also might be interesting to try..
Yarp: I would handle that.
Where would I put the .bkf though? The originating server doesn't have a spare 10TB, the destination doesn't have 10TB worth of buffer space, and writing your *.bkf to a network share is madness past about 2TB worth of bkf. (A network hiccough WILL occur, and you WILL lose that backup.)
That said, for most tasks, Windows Backup Services serve me just fine.
Well, maybe... Our Hero reckons that Richcopy will leave the destination disk with loads of fragmented files, which he'd then want to undo. So doing the job in Richcopy and then defragmenting the disk would take longer.
I am not sure that (1) there would be great fragmentation and (2) it would be a pressing issue. You could let users use the new volume in the fragmented state, it'll work, and do your defragmenting later.
Then again, if you back up filewise, you can simply restore from your backup to the new copy location... unless the process we're discussing IS your backup?
Far and away the fastest way to move 60 million files is to use an image copy. Of course you can't restructure the underlying disk partitions, but it avoids all those tens of millions of file directory operations. You can then use a file sync program, like rsync, to get the new and old back in full alignment (assuming that you aren't able to freeze the source file system in the meantime).
Depending on the total size, network bandwidth between the two servers and physical distance you can transfer the partition image using anything from a USB external drive to a network mount.
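If you go over the network, a rough sketch of the image transfer (device names are invented, and the source volume should be unmounted or at least quiesced first):
dd if=/dev/sdb1 bs=1M | gzip -1 | ssh user@newserver 'gunzip | dd of=/dev/sdc1 bs=1M'
Then an rsync pass afterwards to pick up anything that changed while the image was in flight.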
I want to thank you for taking some time to investigate this problem and propose a few solutions! I don't do this often, but I think that filing away a copy of Richcopy would be a good idea.
I have found it absolutely unbelievable that for all of these years (since Windows 95) Microsoft hasn't had their door beaten down with requests to make Explorer a lot more robust when copying or moving files. Well, maybe they have...who could know?
I would love to see the ability to bypass an error and move on added to Explorer at some point. Since we're living in a post-Vista era now and the Windows GUI shell is now Really Broken to my view, I'm not sure I care.
Now, if anyone knows what to do about the files created by Adobe Flush and even Internet Explorer in temporary areas that violate the long file name conventions and cannot therefore be whacked in any way I've tried thus far...I'm all ears!
Our problems have been moving files from a remote hosted server with provider A to a remote hosted server with provider B, with internet connections between the two, moving approx 20GB of data.
I find that most of the tools don't seem to deal with connection drops that well, or are just generally so slow as to make it impossible.
I think last time I ended up going with RARing the lot but splitting the RARs into <200MB chunks. This made transferring them simpler.
Never thought of using *nix tools.
Your problem is solved with RSync (as has been pointed out by many others). RSync is a delta-copying program, which makes successive copies faster/less bandwidth because it only copies changes in files. Great for WAN connections. Not only that, but it has a retry in event of connection loss. If all else fails, you can always restart the transfer and it will make sure all is in sync (in-line verification!).
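Something like this keeps hammering away until the WAN link cooperates (paths and host are placeholders):
until rsync -az --partial --timeout=60 /data/ user@remote:/data/; do
    echo "link dropped, retrying..."
    sleep 30
done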
Linux has its place in the world. It comes into play when you need to do something that your ACTUAL (usually Windows) servers can't.
> ctrl + c then ctrl + v?
In Explorer?
If you try it with 60 million files (probably around 6TB of data), Explorer will sit there for ages "Preparing to copy" as it calculates the total data size and how long it will take.
You wander away for lunch (it's going to take several hours to do the copy), come back to see an error "Could not copy 'report.doc'", helpfully NOT printing the source or destination directory. The copy has stopped, and cannot be continued.
Which report.doc file is it talking about? There are a few thousand files of the same name in assorted directories.
If you figure that out, there is no way to restart the operation without overwriting any existing files - there is no "Skip all" option on the overwrite confirm dialog box.
With most of the other copy tools, they will keep going with the remaining files when an error occurs. You can then review the log and copy the last files by hand. Some tools can check when copying that the destination is the same as the source and not copy - so you can just rerun the copy command again for the uncopied files.
I love some of the suggestions that appeared between composing and posting my last comment!
@Russel Howe
Please tell me you're joking!
Pull out the hard drive? What, one hard drive containing 60m files? Or even a server, and it has ONE hard drive?!? This isn't the 70s, Rus.
1) It's at minimum a RAID 5 array, quite possibly utilising the RAID controller built into the server's motherboard. The RAID controller is integral to keeping the data readable.
2) You don't just 'pull out a hard drive' on a server. In all likelihood, that machine is still live, and hosting a myriad of roles and services for the network.
@Zax
I'll give you 10/10 for optimism there. Xcopy would have failed just like the other commandline tools Trevor had tried. I've seen many a solution using insanely complex batch and kix scripts fail time and time again. The simple fact of the matter is that in a complex environment such as this, scripted systems invariably fail due to the unexpected and unforeseeable.
@Zaf
We have actually had to shift FROM a unix file server, TO a Windows system. Over 2 terabytes, we encountered severe limitations with the filesystem, namely that we were regularly exceeding the inode limitations. After several weeks of research, we discovered that this was a fundamental design flaw of the filesystem, which assumed that over that size the partition was going to be filled with files greater than 1GB, not tens of millions of 1KB files.
On top of that, Unix has a less advanced Kerberos implementation, meaning computer account permissions could not be applied, and the time-saving benefits of giving users access to Volume Shadow Copy dynamic restores meant we weren't forever rooting through tape backups for every user that accidentally overwrote their Word document.
"""We have actually had to shift FROM a unix file server, TO a windows system. Over 2 terrabytes, we encountered severe limitations with the filesystem. That being we were regularly exceeding the inode limitations. After several weeks of research, we discovered that this was a fundamental design flaw of the filesystem, which assumed over that size the partition was going to be filled with files greater that 1Gb, not tens of millions of 1Kb files."""
Good thing there's just one filesystem for all *nix systems, and none of them let you tune them for target file size. Oh wait, there are about 6 mature filesystems, and most of them can be tuned to easily cope with the load you describe. Reiser was always pretty good with huge numbers of small files, and I don't believe it even uses inodes.
And NTFS is nearly the worst 'modern' filesystem that there is, narrowly edging out HFS+.
When my #1 qualification is compression, I have one choice: NTFS. For some reason *nix folks not only refused to add compression to their filesystems and largely ignored pleas and projects to include it, they ignored and outright handwaved away both theoretical arguments and hard benchmarks proving that compression was almost always a net win on modern systems ten years ago. I know because I made and was involved in some of them. A decade later, with ZFS showing how dedup and compression trounce older filesystems, we finally get some pushes to include it in the mainstream. That kind of blindness in the pursuit of narrow perfection is what separates Linux development from those that need sales to survive - and this from a regular user of Linux.
1) Learn to use proper tools
2) Use a functional OS
tar -cf - * | ssh user@target-host "tar -C /target/path -xvf -"
If you want it compressed (if your interconnect is slower than your disk arrays), you can instead do:
tar -zcf - * | ssh user@target-host "tar -C /target/path -zxvf -"
The limits of this approach are only those of your disk space and file system capabilities. It will also preserve file ownership and other attributes. And it's a one-liner. The fact that you wrote a whole article ranting about a problem that takes a one-liner to solve is quite fascinating.
Worse, you don't even need *NIX to do this, you could have done it with cygwin!
@Gordan
What shell are you using? What shell is able to do globbing on millions of files?
Since you, with some arrogance, tell people to learn to use proper tools, at least do mention them...
Several times I had to convert scripts doing "scp *" to sftp due to command line arguments limitations of shells (and I don't mean just bash).
JMTC, jco
"""What shell are you using? What shell is able to do globbing on millions of files?"""
Well there are 2 situations:
a) You have millions of files in one directory.
b) They're in many directories.
If you run into a) just go up one level higher and tar that directory, no globbing required.
If you hit b) then you'll only glob a few directories, and it'll just work.
* globs don't recurse, tar takes care of that.
I've done similar: took ~400GB of 1-10KB files in a complex sea of subdirectories, packaged them into a tar file, sent that over nc, and wrote it directly to an LVM lv, then later reversed the process to copy them back.
Could ZFS have handled or helped with such a task? Does anybody use that, and could answer that please? Would it be available in the first place?
Inquiring minds want to know. This is an interesting question.
If such a filesystem is named Zettabyte File System, and it is entirely the responsibility of the FS to handle that kind of action, would it have helped if the entire set of files was stored in a ZFS environment? Or is it designed only to handle LARGE files, not necessarily MANY files?
I am googling Richcopy now, though.
If I had to move 60m files, I would have just unplugged the HDD with them and plugged it into the destination, I admit my ignorance.
I would have tried backup tapes too.
I would have 7-zipped 1m files at a time.
I would have set up an HTTP server on the host machine, an HTML or FTP page with the list of files, and used GetRight's copy-all-links feature to copy them to the other.
I would have done all of those combined, if it helped.
ZFS would have made this too easy and would have allowed the copy to run at almost network wire-speed (all block-level, you see). You could take a filesystem snapshot and send it over the wire (rsh, ssh) and import it at the other end. I have done this more times than I care to count with filesystems containing up to 30 million files. It also actually defragments the files in the process (currently pretty much the only way you can do that with ZFS).
Of course you would have had to copy all the files from the NTFS filesystem to ZFS first though. :facepalm:
I have set up a system using Solaris 10, samba+winbind to store in excess of 100 million files in a multi-user Active Directory environment on ZFS. All kerberized and shared over the network. It even supports NFSv4 ACLs (requires patching samba a bit though).
I will be moving this from one server to another next week with a single command. Nice.
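The "single command" is more or less the following (pool and dataset names are examples):
zfs snapshot tank/files@migrate
zfs send tank/files@migrate | ssh newserver zfs receive tank/files
For the final cutover you can take a second snapshot and send only the delta with zfs send -i.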
I can't believe that, having failed to make batch files work, the sysadmin who wrote this article didn't use a Windows scripting language instead of an old DOS one. The task is trivial in any version of VB or even VBA:
[Pseudocode]
For each [directory] in [file structure]
For each [file] in [directory]
Copy file to destination
Next
Next
Another few lines to record and/or handle exceptions, and job done.
Great article, but surely an odd conclusion. Why would I go to the trouble of setting up a Linux virtual machine, when you already said I can run Richcopy in single-threaded mode and get exactly the same result?
Of course, if my servers were running Linux in the first place, that would be a different story. Then I'd have to set up a Windows VM to run Richcopy (er, maybe not!)
Because the Linux VM was significantly faster than Richcopy in single-threaded mode.
Richcopy is faster in multi-threaded mode, but leaves the system fragmented. In single-threaded mode, using the Linux VM was about 20% faster. As to *why* the Linux VM would be that much faster...you got me there. (Maybe because it's not preserving permissions?) In my case I wasn't worried about permissions. I just needed the files moved. (We were resetting permissions on entire trees upon arrival anyways. Domain change and all that...)
I only tried the Linux VM on a lark: I happened to have a web server which had CIFS mounts pointing to both machines. I figured “what the heck, let’s see what it does.”
Imagine my surprise ½ hour later when I clocked how far it had gotten down the miserable “many small files” directory. Farther than Richcopy in the same time, I assure you.
Also: no, I wasn’t running the tests concurrently.
I raised the point only because your article implied (though admittedly did not explicitly state) that cp and Richcopy took the same amount of time:-
"You could restrict Richcopy to a single thread, but then it is no faster than cp."
If that were the case, I'm certain the majority of Windows users would rather install a single app to get the job done, rather than set up a Linux VM. Indeed, you'd have to have an awful lot of files to copy for it to be worth saving 20% of the time involved.
Thanks for the info, that puts things into perspective. I agree that I'd use a Linux VM for this task - certainly if it was something I'd be doing on a regular basis.
Your 2.5 minutes does assume some familiarity with Linux, VMs and the process in question, however. Your average Windows user (or indeed your average Windows sysadmin) would probably stick to Richcopy.
Total Commander
I would have liked to have known how this held up.
With the synchronization feature it would have taken a while, but I'm sure you'd end up with a perfect working sync, and really good reports for failures etc.
How can 50 comments miss the one tool that's capable of this and that's been running for years? (1993 - same time as WinRAR.)
Do IT people even read this site anymore? Let alone write for it.
Having piqued my interest, I wrote a quick script to create 60,000,000 zero-byte files to play with. Presently numbering 450,000, those pesky little critters are already occupying 1.7GB. I fear I have not the raw capacity (let alone the free space) on my laptop to complete this logical exercise. Now I need another script to delete them. Oh well, it killed the last 15 mins before 18:00.
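Something along these lines would do it, give or take (directory name is arbitrary, and it is slow):
# 60,000 directories of 1,000 empty files each = 60,000,000 files
for d in $(seq 1 60000); do
    mkdir -p testfs/$d
    (cd testfs/$d && touch $(seq 1 1000))
done
# and the clean-up "script"
rm -rf testfs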
Hark, is that a pint calling?
Did a search for Richcopy and got taken to the following Technet article.
http://technet.microsoft.com/en-us/magazine/2009.04.utilityspotlight.aspx
Tried downloading HoffmanUtilitySpotlight2009_04.exe but IE 9 beta blocked it because a virus was detected!
By the way, I have used EMCopy from EMC for my file migrations; similar to RoboCopy but works!
I found a bug in RoboCopy. It was quite a specific issue, but after speaking with the developer, who agreed it would require a major rewrite, I went with EMCopy, with which I have never had any issues.
directory by directory
If you have 60m files under a single directory then you have a management or application problem, as this will surely be causing file system performance issues.
I'd do it via hardware RAID doing the grunt work; might need to mod the partition table though.
My guess is it will take a long time to check - like generating and comparing file hashes.
Why do I immediately think this is a government department.....
We've got a server here we're trying to migrate to a SAN. Linux file system. Works fine so long as you query a file directly (by database reference) but if you ls a directory NFS throws all manners of errors.
I'm new here, trying to help out with a lot of things, and one of them is developing a script to parse the db and move the files and update the db records of their locations on the fly. file system was built when they expected to have a few hundred thousand files. ...there's millions, in flat folders. Huge nightmare. Can't be backed up.
People keep mentioning tar/rsync/etc. without thought, perhaps, that copying files requires preserving the metadata? If you copy all the data and filenames and 'everything', but the ownerships and permissions are wrong at the end, you haven't finished but simply wasted your time.
Using Samba at least preserves the user-visible metadata.
I've no experience with copying this amount of stuff, so just throw in a couple of points: while everyone's nominated rsync, there's a tool called unity which does similar but more flexibly (never used either).
I do use something called cfv which produces md5 hashes of a directory. I run it on a dir I'm about to back up then again when the copy's finished and diff them. Seems like good practice for you too. cfv's documentation is poor though.
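For anyone without cfv, a plain md5sum version of the same idea (source and destination mount points are placeholders):
(cd /src && find . -type f -print0 | xargs -0 md5sum | sort -k 2) > src.md5
(cd /dst && find . -type f -print0 | xargs -0 md5sum | sort -k 2) > dst.md5
diff src.md5 dst.md5
Anything that went missing or got corrupted in the copy shows up straight in the diff.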
To the main point, are you sure you've got 6x10^7 files? Are you sure Windows is reporting that correctly? As it seems unable to copy them, why trust it to report the count accurately? WTF have you got that many for? If they're not indexed in some way, how do you ever expect their contents to be discoverable? Is the directory structure also the index?
Re. your assumptions about fragmentation, IIRC (and I do mean IIRC) NTFS packs smaller files into a disk block to prevent too much wasted space, so fragmentation there isn't so much of a problem. You assume that a tool copying several files at a time must cause fragmentation; that's an assumption you can't make. IIRC you can reserve disk space up front in Windows which you can then fill. Doing so - if that's done by any tools - should prevent this (and would increase write speed significantly of course).
Also, size. If you are concerned about fragmentation then you must also be assuming that a considerable fraction of these files must be greater than a disk block, i.e. 4K default on NTFS. If each file was 4K then it's about 1/4 terabyte - ignoring internal file structures! Let's have it, what total size of files are you moving? What's the background to this event anyway?
cfv as I've set it up produces a hash per file while traversing the directory recursively. You then diff the (large) log file. Easy. I understand it has another mode but couldn't get it to work. Again, the docs are sucky.
cfv is here <http://cfv.sourceforge.net/>
Tool I called unity is actually unison: <http://en.wikipedia.org/wiki/Unison_%28file_synchronizer%29> and <http://www.cis.upenn.edu/~bcpierce/unison/>
btw, rsync on Windows+cygwin: could never get that to work.
I'd really like to know how & why you managed to produce 6E7 files & what their distribution of sizes are. Really, really.
ta
Xcopy won’t see anything with a path depth larger than 255 characters. Otherwise, it would trundle through 60M files just fine, I expect.
Robocopy theoretically /should/ see files with a path depth longer than 255 characters, but in my experience simply doesn't. It will see a file with a /name/ longer than 255 characters (or at least that's what the length of the file name looks like at first glance), however when I feed it /path depths/ larger than 255 it continually refuses to copy the file. I banged away at it for about half an hour before giving up and moving on to the next tool on my list. It should be noted that I tried only the command line version of robocopy. I did not give the GUI loader much of a go. (I figured if I was going to faff about with GUI tools, Richcopy > Robocopy anyways, so….)
Cygwin + Rsync was actually the very first thing I tried. <3 rsync. Sadly, rsync didn't seem to play nice with long path depths either. Wholly apart from that, it blew up somewhere around 6M files for reasons I can't discern. The *nix version doesn't seem to have that problem; only when running rsync + cygwin under Windows did I encounter it. The system in question was a Server 2003 R2 instance, fully patched as of August 20, 2010.
Also of random note: VMWare Server 2.0 has served me well in the past for many things. Need to toss a Linux VM on a system for a few days to perform a task just like this? Works a treat. There is a caveat to that plan however: VMWare Server 2.0 absolutely /abhors/ TCP offloading. They don't play well. Additionally, it can be the DOS settings:
http://support.microsoft.com/kb/898468
If you decided to load it up in order to move files around, you might consider this first. Otherwise your VM won’t talk to the host quite as well as would be required to pull this off.
A lot of file copy tools will probably fail because they don't understand NTFS links (e.g. soft links, hard links and Junctions), so blindly follow them, rather than ignoring them or replicating them on the target disk. If RichCopy or Linux don't provide configurable support for NTFS links, then your target will be a mess of redundant folders and files or missing vital links!
Beware, Vista and Windows 7 make extensive use of Junctions, in user profiles and common folders; so naive tools like xxcopy, and your favourite two pane file tool will fail!
Write a file copier in Java 7 (Beta) using the new java.nio.file functionality; it provides quite advanced filesystem specific support for file attributes, timestamps, links, and ACLs, and supports directory cursors; it even provides a FileVistor class to make recursive folder traversal easier, with support for error trapping.
I also like DirectoryOpus, and note that it can create the various link types, and can see them; however it doesn't appear to provide options not to follow them or replicate them on the target drive.
Well, the beard went a few decades ago, but...
Not in the least surprised that a simple, straightforward -nix tool turned out to do the job. Do we have to be reminded that Unix was a stable, multi-tasking, multi-user OS before Windows was born?
Trevor, I never did anything of this magnitude, and for all my big words, I have no idea if my instinctive turning to Linux (well, actually, I would have been working with Unix machines anyway, so the question wouldn't have arisen) would have been an instant solution, or taken just as long as several experiments with Windows software.
It's been far too long, so I can't tell you how I did copy relatively large numbers of files from one machine to another, but I seem to remember some strange combinations of dd, cpio, tar, maybe rcp, and it would be strange not to have a find command in a regular unix maintenance script! Combine with some sprinkling of `command substitution` and | pipelines to taste and enjoy!
Ahhh... Happy days :)
I swear, after all the fiddling, cp really required no special treatment. I used a webserver that had CIFS shares from the windows servers mounted locally. I sshed into the Linux VM and typed the following:
> cp /fs1-root /fs1-new-root/from-fs1-root -R
All files I needed to move were DFS “subfolders” of what to the webserver appeared to be fs1-root.
I then merely needed to cut/paste whole folders into their correct “new” positions on the destination server when I was done. Oh, and apply permissions: part of this move was to reset permissions on all files thanks to a domain migration. (That and some fairly lousy permissions management that had crept in over the years preceding the move.)
No need for dd, cpio, tar or anything else. cp just went hard and finished fine.
At the risk of sounding like yet another 'you should have tried X' commentard have you come across the GNU coreutils at all?
http://en.wikipedia.org/wiki/GNU_Core_Utilities
http://gnuwin32.sourceforge.net/packages/coreutils.htm
This gives you a bunch of useful 'nix functions including cp. It'd be interesting to see if the win32 binaries perform as well as the real deal in a vm. They have done so for me so far but my uses are pretty noddy in comparison to yours.
I've had to migrate a datastore of some tens of millions of 2KB-ish tga files; what I did was take a snapshot (EMC DMX TimeFinder mirror) and mount it on the new server. Now you do need a shedload of money for this, but it is quick!
I've also done it between DMX arrays using SRDF, even over long distance IP links, although this is slower...
This is an older tool but still a goodie...
Karen's Replicator
Btw, there is a way you can bypass the 256 character limit when transferring files:
\\?\<driveletter>:\<path> for local files or network drives
\\?\UNC\<server>\<share> for UNC paths (though I've never gotten this to work)
Here's an article in MSDN about this: http://msdn.microsoft.com/en-us/library/aa365247(VS.85).aspx
Most people over here were recommending cp over Cygwin; while Cygwin is a nice tool for home PCs, I have never understood why everyone flocks to Cygwin when MS has actually implemented a POSIX-compliant subsystem on Windows with better performance than Cygwin. That said:
I would recommend doing the cp trick with Services For Unix installed. If both servers have SFU, you can actually enable NFS shares on the destination, nfsmount on the source, and copy all the 60m files using cp, all the stuff will go through NFS instead of CIFS. I haven't checked the permissions, but this method will probably preserve the file ACLs as well!
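Roughly, and treat this as an untested sketch (server, share and drive letter are only examples):
mount \\newserver\data Z:            (on the source box, via the SFU/Windows NFS client)
cp -pR /dev/fs/D/data/. /dev/fs/Z/   (from the SFU Korn shell, where drives appear under /dev/fs)
As above, I haven't verified how the ACLs come out the other end.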
I wonder if part of the issue is that most apps will need to get the full directory listing before copying. With millions of files, that can eat up some RAM before copying even starts.
This makes me want to experiment with .NET 4's new System.IO.Directory.EnumerateFiles() method. The new function does not get a full directory listing; rather, it gets the *next* listing as you iterate over the collection it returns.
Using Parallel.ForEach to iterate over the collection makes for an easily coded multi-threaded file management app I would venture to guess...
No mention of the Microsoft File Server Migration Toolkit which is designed specifically to do this kind of migration. The tool also handles a lot of the other problems of changing file servers too so I would suggest giving it a go next time, or contact a consultant with some experience of MS toolsets :)
FSMT is fantastic...if the layout of your server on the new domain will be identical to how it was on the old one. In this case it wasn't. Users were completely different in naming scheme, groups were radically different, and the organisational hierarchy of the files was changed. What needed to happen was to get the files from A to B. After that, security would be changed, and files reorganised. So for this particular case, FSMT wasn't any more useful, (and was a bit slower than) other options.
cp works, yes, but it's not the best way of dealing with the problem. The old tried and trusted method is to use tar.
eg basic example
tar cf - * | ( cd /target; tar xfp -)
you could use eg find or a 'for i in' bash script to ensure you only dealt with certain file types or names too. I really like this as it means basic/duff/temp files can all be omitted from backups etc
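For example, to only pull across one file type (the .doc pattern here is just an example) you can feed find into GNU tar:
find . -type f -name '*.doc' -print0 | tar -cf - --null -T - | ( cd /target && tar xfp - )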
Surely you would do this in parts using something like the native zip functionality or a third party program like WinRAR to turn a large number of these files into a single archive.
The thought of transferring 60m files across a network connection makes me quail. Even the web servers that I look after top out at 7.5m files.
Wow, I can't believe you were migrating a live server and you were "copying" files?
DFS the drive, add the second server as a replication destination, then sync between the two servers. 1 - it's multithreaded, 2 - it syncs, therefore changes are mapped right up until you kill the server, 3 - it's free.
Even good old robocopy will need to be run multiple times to ensure you haven't killed anything. Fook knows what you would have done if you would have needed a restore in between.
Bill Gates is the richest man in the world. And best of all, he achieved this producing an operating system which people inexplicably buy even though it cannot do the everyday task of simply copying 60 million files from one place to another.
And everyone who chooses his software must be an idiot because if they ever need to copy 60 million files, they're screwed.
The finest brains on the planet wrote the Windows Copy Engine (the slothful malevolence behind CopyFileEx) and its brilliance was explained at some length here - http://blogs.technet.com/b/markrussinovich/archive/2008/02/04/2826167.aspx
But unfortunately, a few hundred commentators didn't seem to agree. Over the course of a year or so, they hatefully ignored the genius theory and concentrated on the woeful real world performance instead.
The GNU/Linux coders - clearly neophytes - didn't bother with a Copy Engine, and just coded cp as an open file, followed by a chunked read/write loop, followed by a close file, letting the kernel make sense of it all.
And that's what worked for 60 million files. Priceless.
@Notas Badoff, rsync does preserve permissions.
@bluesxman, can't you just "rm -R (whatever directory the 60,000,000 files are in)"?
So, older rsync versions did the whole traversing the entire tree, then copying, but newer ones are incremental, which I'm sure would be required for this many files. I love rsync, although the tar solutions would certainly work as well, that is what I used to use.
They could at least fix the Windows explorer so it actually didn't stop Windows from multi-tasking when it copies. That'd be a start.
Try copying a 30GB file in Windows and then try to continue working...
Not a chance. Always struck me as odd, that in 20 years Microsoft never got this to work.
I don't use Win 7 so I don't know if this pain-in-the-arse has been fixed.
But I would think if you set up samba in a rush, you are probably ignoring file permissions and ownership, and copying it under "administrator" rights.
Traversing the 60m by cp -pR is still a hard task. If I had the time, I'd probably reorganize the files into directories first and do the copy in several (or tens of) runs, because I don't think all 60m files are in use; they are surely static (that's why you can start copying!).
I've used Norton Ghost before to perform disk backups/transfers, not with 60m files but I can't see why it shouldn't work.
Regularly use it to make my own automated recovery disks by saving the partition* to an image file, which I then burn to DVD/CD(s) along with a batch script to boot from the DVD/CD.
*I use nLite/vLite to remove crap/add drivers/customise the installation before installing the OS.
Then, once it's booted up, I add essential software (codecs, AV, WinRar/WinZip, Firefox, etc...) before saving everything to the image file.
I don't know if anyone's mentioned it but...
pipe tar -c into nc then have nc listening on the other end, and piping into tar -x.
You can also add things if you like such as piping through a compression program, or encryption program. I've personally found this to be the fastest way to get things transferred.
That's not going to suit everyone's needs, it has its downfalls, but is useful to know.
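A concrete version (port and paths invented for the example; netcat flags vary a bit between flavours):
nc -l -p 9999 | tar -xpf - -C /target     # run this on the receiving box first
tar -cf - /data | nc newserver 9999       # then this on the sender
Stick a gzip/gunzip (or openssl enc) in the middle of each pipe if you want compression or encryption.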
What you need to keep in mind, no matter what you use, 60M files are probably going to take a long time to transfer, assuming they are large.
In such a case, you may want to think about something such as using an external drive, ensuring you have gigabit+ LAN, etc.
That worries me!
60M files in the cloud!
Do any of these tools check (verify) the files afterwards?
I urge a local backup copy, perhaps a PCI-X / PCI-e RAID card and some HDDs??
Very interesting article, looking forward to testing those tools, I copy files from dead Windows installations (Windows installs are always slowly dying!) but yeah not on that scale!
I will need to buy some more HDDs this/early next year to back up customers' data. WinXP can only use a single volume of 2TB, so a RAID 1 array of 2x2TB drives will suit :)
I might buy Windows 8 lol
and relax!
Backup over WAN shouldn't scare you at all! Let me give you a brief rundown:
At our central (physical) location we have two (logical) sites. (In total there are four physical and five logical sites.) One is the production/manufacturing setup for this physical location. The other is "head office." Each logical site has a pair of DFSR-twinned (http://www.theregister.co.uk/2010/09/27/sysadmin_dfsr_clustering/) file servers. This physical location also has a dedicated "backup server" that stands apart from the rest.
We don't bother backing up the manufacturing site-specific files, as they are only really relevant for a brief period (measured in days.) In addition, if the physical location were to burn down, those files cannot be relevant to us as the manufacturing apparatus to use them no longer exists.
Now, the entirety of the “head office” logical site’s files are backed up to the backup server every night. Selected files from the manufacturing sites’ networks (such as databases and configuration information) are also trucked over to the backup server. It munches and crunches and vomits out some highly-compressed fileage. These files are then plopped onto a share on the Head Office logical site’s file servers. This share is DFSR-replicated to the twinned file servers in all the other logical sites.
This ensures that the backup files containing all the critical company data are replicated “offsite” over the WAN. (Site-to-site VPN links and uncapped internets are <3.) It’s simple, doesn’t involve the “cloud” and works like a hot damn.
Missing details on connectivity between the two Windows servers and whether there is shared or replicated storage (vs. DAS in each system).
I've successfully moved, on several different occasions for several customers, over 60+ million files, including one system with 5000 roaming Windows profiles (Citrix environment) with a 100,000 file change count per day, using DoubleTake. (Competing products have choked doing this.) This product uses a Windows filter driver which is certified by Microsoft and is able to mirror and replicate simultaneously, locally or across slower links (e.g. WAN), and optionally use compression. This product does byte-level replication (not block or file level as many other solutions do). It is also a very fast solution. The only recommendation for ease of use, speed and verification is to break down the number of files into smaller quantities and replicate those.