Save costs by knowing your data

Classification tools help storage managers gain efficiency and find stored files faster

One size fits all used to be the standard approach to enterprise storage, with that one size being expensive, high-performance disk arrays. Now agency managers have storage choices that allow them to mix and match storage platforms by sacrificing some performance for lower costs.

But how do storage managers decide which data to store on which platform? New products that perform data classification can help managers sort through and understand what they have. That knowledge can help them boost the efficiency of their storage practices, assign data to the most cost-effective storage platform and meet legal requirements.

A variety of vendors now offers classification products, including start-ups and storage industry veterans, and the products they offer are equally diverse. Data classification sometimes is a feature embedded in a broader storage offering. Some products automatically classify files, and others prompt users to supply a label. A number of companies deliver software with classification capabilities and others bundle classification software with hardware.

The cost of data classification products can run into six figures. However, industry executives say data classification offers a compelling return on investment. They expect it to catch on in a year or two, given the growing volume of data in most organizations.  

“This is still a relatively nascent marketplace,” said Todd Oseth, chairman and chief executive officer of storage consultant SANZ.

Some companies include data classification technology in a broader suite of storage management software. Compellent, for example, has been offering data classification since 2005 as part of its storage-area network solution.

Compellent’s data classification feature was one of the biggest factors behind its selection by the South Carolina Office of the Attorney General, said John Loy, network engineer in the attorney general’s office.

Sorting things out
The Compellent SAN automatically classifies data based on how often it is accessed. Frequently accessed documents and images move to fast storage while seldom-used documents migrate to Serial Advanced Technology Attachment storage.

“I don’t have to spend all my time on a server managing what files are going to be progressed up and down the scheme,” Loy said. “We have a very small staff. Having a system that automatically does [data classification] on the block level, behind the scenes, was a major reason we picked it.”

Because of Compellent’s ability to classify blocks — blocks are the units of storage that SANs use — a large database file can be stored across more than one tier of storage. The newly written data blocks may be assigned to faster primary storage while the old data is stored on more affordable disks, company officials said.

Other storage management products perform data classification at the file level. That is true of CommVault’s Data Classification Enabler, part of the company’s new Simpana data management suite. The Data Classification Enabler includes an agent that runs on a client device, such as a file and print server, and classifies data as it is created. That feature works with the Windows file system.

Products from Compellent, CommVault and other vendors perform the classification task by collecting metadata. Metadata, which is basically a description of data, notes the type of data being stored, when it was created and when it was last accessed.
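As a rough illustration of what metadata collection involves, the sketch below reads the attributes the file system already keeps for a file. It is not drawn from any vendor's product: the function names and the record layout are hypothetical, and creation time is included only where the platform reports it.

```python
import os
import time
from pathlib import Path


def collect_metadata(path):
    """Return a small metadata record for one file (hypothetical layout)."""
    info = os.stat(path)
    return {
        "name": Path(path).name,
        # File extension used as a rough stand-in for the type of data.
        "type": Path(path).suffix.lower() or "(none)",
        "size_bytes": info.st_size,
        # Creation time is not available on every platform, so fall back to None.
        "created": getattr(info, "st_birthtime", None),
        "modified": info.st_mtime,  # last content change, seconds since the epoch
        "accessed": info.st_atime,  # last access, as reported by the file system
    }


def days_since_access(record, now=None):
    """Age of the last access in days, a typical input to a tiering policy."""
    now = time.time() if now is None else now
    return (now - record["accessed"]) / 86400.0
```

A classification tool would gather records like this for every file and then apply policies to them. Note that some file systems are configured not to update access times, which would limit how much the last field can be trusted.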

File-oriented CommVault uses an additional technique. It searches a file’s contents.

“We can enable keyword content indexing of all the words inside the files that are important,” said Kelly Polanski, director of product marketing at CommVault.

That form of indexing enables search and retrieval based on keywords and metadata. CommVault’s Simpana software can handle metadata and text.
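Keyword content indexing of this kind is commonly built on an inverted index, which maps each word to the files that contain it. The following is a minimal sketch of that general technique, not CommVault's implementation; the function names and the sample documents in the usage note are made up for illustration.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, *keywords):
    """Return the names of documents that contain every keyword."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*sets) if sets else set()
```

Given an index built from a handful of files, `search(index, "contract", "renewal")` would return only the files that contain both words, regardless of what their metadata says.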

EMC’s Infoscape, which the company introduced in 2006, classifies files based on metadata or file content. The result is more accurate classification, company officials said.

Matthew Coblentz, principal product manager for EMC’s Documentum unit, said classification based on file content is an area in which today’s data classification products truly surpass the abilities of traditional hierarchical storage management (HSM) products.

“Where HSM systems were solely predicated on the file system metadata that was available, now you can use classification products to crawl into the file and look for information inside,” Coblentz said.

Storage giants such as EMC aren’t the only companies interested in advancing data classification. Several new companies employ a combination of metadata and content indexing to sort things out. Products from vendors such as Arkivio, Kazeon, Njini and StoredIQ fall into the information classification and management (ICM) category. The companies in that group offer software suites or hardware appliances with built-in software.

Data classification products also differ in how they perform content indexing. Arkivio, which added Fast Search & Transfer’s InStream to its Auto-Stor software earlier this year, treats content indexing as a back-end process, said Buzz Walker, vice president of marketing and business development at Arkivio.

Walker said most products index on the front end before data moves to storage. Arkivio said its approach lets customers index files when computing cycles are available rather than consuming cycles during normal operating periods.

And there’s yet another way to measure the data classification market. John Merryman, principal consultant at GlassHouse Technologies, views the market as having two camps. In one part of the market are enterprise search tools, such as Infoscape and Kazeon’s appliances. In another are end-user tools.

In the latter group, products let users classify documents and e-mail messages as they create them. Other tools automatically classify data based on a predetermined policy.

Titus Labs is among the vendors representing the user-driven form of classification. The company focuses on Microsoft Office documents. Charlie Pulfer, vice president of product management at Titus Labs, said the company has customers among military and intelligence agencies.

The common thread that runs through the entire category of products is this: Most data classification wares handle unstructured data and semi-structured data such as e-mail.

“The market is focused mostly on unstructured data and messaging,” Merryman said. “A lot of that is driven by the main pain points in the market: everyone has a ton of e-mail and a huge file-server farm that is out of hand.”

Why classify?
Customers with storage growing pains may turn to data classification for a number of reasons. They may use tools to improve storage efficiency, manage tiered-storage environments or more readily locate electronic documents for legal discovery. For storage efficiency, managers can use data classification to speed the backup process.

CommVault’s metadata collection feature, for example, flags new files on a given file system. Administrators performing a backup session can skip the preliminary disk scan, which involves combing a file system for new files requiring backup. Polanski said a disk scan may take longer than the actual backup. Polanski recalled one case in which a customer’s disk scan took an hour and the backup was completed in three minutes.

Classification can also separate important data from unimportant data that does not warrant backing up.

Some storage experts say data classification could prove to be an important tool for operating tiered storage. Tiered storage typically has a primary storage layer consisting of fast but expensive disks for frequently used mission-critical data and a secondary layer of cheaper disk storage for infrequently accessed data. A third archival layer, often consisting of magnetic tape, might also exist. The task of managing the movement of data from tier to tier is referred to as information life-cycle management, or ILM.

“Tiering storage is a very cost-effective way of doing things,” Oseth said. “You are going to want some level of classification in order to tier it properly.”

In the South Carolina Attorney General’s office, the Compellent SAN moves data from primary to secondary storage after 12 days. Data accessed four times after moving to secondary storage is moved back to the top storage tier, Loy said.
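The two thresholds Loy describes can be expressed as a simple policy. The sketch below is only an illustration of that rule, not Compellent's algorithm: the `Block` class and function names are hypothetical, and because the article does not say how the 12 days are measured, the code assumes they count whole days spent on the primary tier.

```python
from dataclasses import dataclass

DEMOTE_AFTER_DAYS = 12      # from the article: data leaves primary storage after 12 days
PROMOTE_AFTER_ACCESSES = 4  # from the article: four accesses on secondary trigger promotion


@dataclass
class Block:
    tier: str = "primary"          # "primary" or "secondary"
    days_on_primary: int = 0       # assumption: age is counted in whole days on the tier
    accesses_on_secondary: int = 0


def record_access(block):
    """Count an access; promote a secondary block once it reaches the threshold."""
    if block.tier == "secondary":
        block.accesses_on_secondary += 1
        if block.accesses_on_secondary >= PROMOTE_AFTER_ACCESSES:
            block.tier = "primary"
            block.days_on_primary = 0
            block.accesses_on_secondary = 0


def end_of_day(block):
    """Age a primary block by one day; demote it once it reaches the limit."""
    if block.tier == "primary":
        block.days_on_primary += 1
        if block.days_on_primary >= DEMOTE_AFTER_DAYS:
            block.tier = "secondary"
            block.accesses_on_secondary = 0
```

Under these assumptions, a block sits on primary storage for 12 days, drops to the secondary tier, and returns to the top tier on its fourth access there.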

Electronic discovery, or e-discovery, is an emerging application that data classification vendors have targeted. Organizations charged with producing electronic documents for legal or regulatory compliance purposes may increasingly turn to classification tools to do so, some storage analysts say.

“A lot of people are being required to retrieve information in a very short period of time,” Oseth said. Such requirements necessitate having products with content indexing and search capabilities, he said. For example, searches based on keyword indexing can find files relevant to legal proceedings more readily than searches based on metadata.

“You have to have some level of knowledge of what is inside,” Oseth said. Such products, he said, “sniff the content.”

Buyers evaluating the range of available solutions may find that they need more than one tool, given the range of data types and use cases. Customers “would love to buy a product that does everything, but…we don’t see a single solution as a silver bullet,” Merryman said.

The good news is that data classification products are evolving, he added. “A couple of years ago, the [technology] was very speculative. But today it is very real.”

Financial considerations
Data classification’s promise of storage efficiency and easier data retrieval doesn’t come cheap when organizations have a large volume of data to be classified.

The ultimate price depends on the scope of the engagement, said John Merryman, principal consultant at GlassHouse Technologies. Some solutions have entry points of around $50,000, he said. Projects encompassing software, hardware and services could cost up to $150,000.

Such investments, however, can pay off. Data classification tools can help move data off expensive tier-one storage, a situation that may let organizations avoid, or at least delay, buying expensive disk drive arrays. Limited deployments of classification tools — installing products within a single department, for example — can generate returns.

But industry executives said an enterprisewide installation will yield greater returns. In that scenario, data classification helps organizations achieve the efficiencies of an information life cycle management strategy.

Another big payback is in e-discovery for legal requirements and regulatory compliance.

“The biggest business case is really going to be around search and discovery, once you get past the benefits of tiering,” Merryman said.

He said an organization that hires a third party to cull the data relevant to a case can pony up $2.5 million per terabyte. Tools that reduce the volume of data sent to a third-party firm can yield big savings, he added.

Steve d’Alencon, vice president of product marketing at Kazeon, said the return on investment for automated classification tools is massive, with manual or semi-automated discovery methods costing $1,800 to $5,000 per gigabyte.
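The two cost figures quoted above use different units, so a quick conversion helps compare them. The sketch below is simple arithmetic on the quoted numbers; the function names are hypothetical, and it assumes decimal terabytes (1,000 gigabytes each), which the article does not specify.

```python
GB_PER_TB = 1000  # assumption: decimal terabytes


def per_gb(cost_per_tb):
    """Convert a per-terabyte cost to a per-gigabyte cost."""
    return cost_per_tb / GB_PER_TB


def savings_from_culling(gigabytes, fraction_removed, cost_per_gb):
    """Third-party review cost avoided by removing a fraction of the data first."""
    return gigabytes * fraction_removed * cost_per_gb
```

On that assumption, Merryman's $2.5 million per terabyte works out to $2,500 per gigabyte, which falls inside the $1,800 to $5,000 range d’Alencon cites. The second function shows why culling matters: at that rate, every gigabyte kept out of third-party review avoids about $2,500 in cost.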

The potential impact of this still-young technology may become more noticeable in the government space in the coming months.

Matt Decker, manager of IT services at the National Nuclear Security Administration’s Kansas City Plant, has deployed Arkivio’s data classification solution as the primary mechanism for compliance and policy-based management. He said he expects to be able to report on significant changes resulting from the Arkivio installation in about six months.

Honeywell Federal Manufacturing and Technologies operates the Kansas City Plant.
 — John Moore
