alpha754293

data management suggestions for > 1 M files?

What do people use here to manage > 1 million files?

I'm trying to find an efficient method of cataloging and sorting that many files, and I was wondering if there's a way of doing it from the command line in Solaris as well.

So far, I've just been building a text file that contains a list of all of the files, but if I'm looking for a bunch of files with, say, a certain extension, the text index doesn't always work so well: when I grep for the pattern, it can return results that aren't the files I want but merely contain the pattern somewhere in the path.

Suggestions?

I greatly appreciate any and all help I can get. Thanks.

Windows Desktop Search, what else? Well, there is Google Desktop, but Enterprise Vault only integrates with WDS. :P

Probably not the answer you were looking for.

Is the only thing you want to do in "managing" them to be able to find them easily?

Well... finding them "easier" would be better than going through a plain-text file index where the index itself is approaching 140 MB.

Again, the index was generated simply by running:

$ find /data > /data/filelist.txt

Right now, it takes my system about 8 minutes to make that list (4.3 TiB used, 7.1 TiB capacity).

The original intent of the index was to be able to locate files faster: just search through the index, find a file's location, and then go there to access it.

(Windows search isn't the best, at least in XP, when going through that many files.)

But say I want to look for *.CATPart files, as an example. If I run:

$ cat filelist.txt | grep CATPart

it gives me anything and everything that has "CATPart" in it.

But if, say, there were a directory called "CATPart" that held a whole bunch of files that aren't *.CATPart, those would show up in the results list as well.
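
(I suppose anchoring the pattern to the end of each line would at least filter out those directory hits; something like this, assuming the index holds one full path per line:

$ grep '\.CATPart$' filelist.txt    # match only paths that actually end in .CATPart

but the index itself can still be out of date.)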

Otherwise, my only alternative right now is to run:

$ find /data -name '*.CATPart' -print

again in order to do the same thing, and I am trying to come up with a more efficient way of managing what I think is about 1.5 M files right now, of various types, sorts, sizes, etc.

I'm guessing there's got to be something out there that can do something like that and help me get better at managing it.

I've used Picasa to manage my pictures, and that seems to work "ok," except that you can't tell it where to put the database. With a file system this big, the last time I ran it the database ate up my 36 GB Raptor (OS drive) and Picasa stopped working properly (with the database at 24 GB).

Any hints/help/advice would be greatly appreciated as I can foresee that my data management demands/needs will only grow from this point forward.

Use egrep and regular expressions to pinpoint exactly what you are looking for.
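
For example, something like this against the filelist.txt index only matches paths that actually end in the extension (the .CATProduct alternative is included purely as an illustration):

$ egrep -i '\.(CATPart|CATProduct)$' filelist.txt    # case-insensitive; matches paths ending in either extension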

Are these static files, or is a program generating them? If it is a program, change that program, since it has no idea how to function efficiently.

In Windows you can use WhereIsIt; it works fast with large catalogs. For filesystem searches there's FileLocator Pro. I use both; they are both paid programs.

Once it's written to the server, it remains mostly static after that.

The contents of the files don't change much (if at all), and they're auto-generated only if I set the program up to do so (for my analysis/simulation programs, for example, you can set them up to write the data/result files after a certain amount of time).

Are those index-based, or are they more like a generalized version of Google's Picasa?

I guess the first question is what these files are and whether it makes sense to be creating them the way you are in the first place.

I would seriously look at hierarchical file formats like HDF. Go read up on it at http://www.hdfgroup.org and see if that makes more sense for what you are doing. We have been getting our users to convert their application output to it for the last year or two, since they had a similar issue with the number of files. They have been able to go from their simulations generating 20-40 output files per run to just 2-3 files (and in reality it could be just 1 file, but they are keeping it at 3 for legacy reasons).
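
The HDF5 distribution also ships command-line tools, so you can poke around inside a consolidated file without going through the generating application; a rough sketch (results.h5 is just a made-up example file name):

$ h5ls -r results.h5      # recursively list the groups and datasets in the file
$ h5dump -H results.h5    # print the metadata/headers without dumping the raw data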

The files range from really, really small (a few hundred bytes) up to 10+ GB. The average file size is 3.8 MB, but I don't have data on the actual statistical distribution.

Can HDF support multi-TB arrays? Will I require a separate RAID card for such multi-TB arrays?

I'm currently using ZFS because I can replace my existing server for about $700-1200 and get 6-8 TB out of it. With most other solutions that I've found so far, I can't do it for that price, and price is a huge concern for me.

I will have to look into that, though. I would have thought there'd be a more generalized version of Picasa that could handle all known/associated file types so that you could sort through them.

Right now I'm trying out the Xinorbis 4.10 beta to generate file statistics. It took 4 min 30 sec on my Pentium system to scan a non-system partition with 4,434 folders containing 90,562 assorted files.

I'm giving that a shot now to see how well it would do with 1.5 MILLION files.

If I wanted to look for more information, what should I really be searching for?

I've tried searching for enterprise data management (EDM), but a lot of what comes up talks about data models and the like rather than sorting/organizing 1.5 million assorted files.

For example, I can probably use Google's Picasa to handle the pictures and videos. But I also have a whole slew of other data types that it can't handle. So what do companies use for that? Do they use a regular, straight-up SQL database?

SQL Server is supposed to eventually serve as the underlying data engine driving Windows filesystems and network directories, but apparently Microsoft isn't there yet. As others have said, it's more effective in the long run to use a file OS or server appliance that already operates this way. Windows file managers are normally subject to NTFS capacity limits and constraints.

I only recommended Xinorbis because it's free and makes it easy to run complex filtered searches on the indexes it generates, not unlike what you do now. Its metadata and reports are mainly useful for profiling hierarchical distribution and file types. I use Total Commander 7.04 religiously for everyday tasks, but I doubt it can efficiently handle 1.5M files.
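
If you want to experiment with the straight-up database route you asked about, one free, low-tech option is to load your find output into SQLite and query it from there; a rough sketch, assuming sqlite3 is installed (catalog.db and the table layout are just made up for the example):

$ find /data -type f > /tmp/filelist.txt                    # files only, one full path per line
$ sqlite3 catalog.db "CREATE TABLE files(path TEXT);"
$ sqlite3 catalog.db ".import /tmp/filelist.txt files"      # default '|' separator; fine as long as no path contains '|'
$ sqlite3 catalog.db "SELECT COUNT(*) FROM files WHERE path LIKE '%.CATPart';"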

I ran Xinorbis overnight on my server, and it seems to be pretty good. I haven't gone through all of it yet to find out what it can and can't do or tell me about the data I've got, but it does work. It took 11.6 hours to process all 1.57 MILLION of my files.

(Windows search isn't the best, at least in XP, when going through that many files.)

Windows Search is pretty terrible for speed and resource usage. Google Desktop is significantly faster, but lacks the tight integration.

I realize it's not applicable to your situation, but if you have an NTFS drive connected to a Windows machine, then Everything Search Engine is bar none the fastest file search. It reads the NTFS metadata directly and creates an index with that, so it ignores file permissions and such. It is only for searching file names, and can't be used to search a file share (unless you are running the software on the remote PC).

And best of all, it's free. On one of our servers, with over 300,000 files, it shows results as I type, about as fast as I type. Creating an index of the entire array takes less than 10 seconds. Truly marvelous.
