Converting Legacy Data to the Web
As "unstoppable" as legacy-to-Web conversion is, make
sure you actually have a reason for the conversion and then be sure to
thoroughly plan the process.
By James R. Dukart
The viral growth of the Internet and e-commerce means just about any
organization of any size today has legacy data that might be appropriate for
the World Wide Web. But knowing that simple fact, and even having a handle
on what information would be best to put on the Web, is only the first and
may be the easiest step in any such data conversion. Much of the legacy
information that might be Web-enabled exists in pre-Internet formats such as
old contracts and sales agreements, electronic data stored in proprietary
database fields, books, annual reports, photographs, microfiche, and
photocopies stored in company libraries.
As with any document conversion project, moreover, converting that type
of data to Web-enabled content is a complex process featuring multiple
layers of analysis, not only about what content there is and will be after
conversion, but also about the structure of the organization and about how,
why, when, where, and by whom converted data will ultimately be used. As if
that were not enough, organizations considering the conversion of legacy
data to the Web must be aware of the numerous legal and business
consequences of converting to and using electronic data, keeping an eye on
everything from public disclosure and privacy laws to copyright and fair use
doctrines.
According to Stephen Poe, CEO of technology consultancy Nautilus
Solutions, there is a "huge market" for legacy-to-Web data conversion, one
that is "only limited by the return on investment or cost justification."
Most companies, Poe says, have at least some data they would like to
Web-enable. Far fewer, he observes, invest the appropriate time and effort to
make sure the data conversion they undertake actually meets its goals.
"One of the biggest issues is, What is my business rationale for doing
this?" Poe says. "What am I potentially going to use the data for? The
answers to those types of questions will determine not only how you get the
information to the Web, but how you will use it once it is there."
Conversion Strategies and Formats
Organizations can use any of a number of methods to Web-enable legacy data,
Poe notes, ranging from quick, relatively inexpensive, and fairly simple
processes to more complex approaches that allow much greater flexibility and
data use going forward.
"The first and easiest way is to rasterize and stick a picture out
there," Poe says, calling that "quick and cheap and usually not very
useful." Next, companies can scan text using OCR or ICR and then convert it
to a widely used format such as HTML (Hypertext Markup Language) or PDF
(Adobe's Portable Document Format). That, Poe says, can be more useful than
straight scanning, but offers its own trade-offs. "Getting accurate OCR can
be medium-difficult to very, very expensive," he comments, adding that even
slightly inaccurate data, in the right circumstances, can prove to be very
costly. "If the originals are in English, you can usually get some very
accurate OCR, but that is not always the case," he says. "Plus, to get 99.9%
accuracy can be very expensive, and if you are doing something like rate
sheets, getting a wrong number on a rate can cost you a lot of money. It is
not like missing a 'v' in some word processing text."
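A rough back-of-the-envelope sketch may help show why even 99.9% accuracy
leaves work to do. Only the 99.9% figure comes from Poe's remark; the page
count and the characters-per-page value are illustrative assumptions, as is
the simplification that every character is misread independently.

```python
def expected_ocr_errors(pages: int, chars_per_page: int, accuracy: float) -> float:
    """Expected number of misread characters, assuming independent per-character errors."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return pages * chars_per_page * (1.0 - accuracy)


# Hypothetical workload: 1,000 pages of roughly 2,000 characters each.
errors = expected_ocr_errors(pages=1000, chars_per_page=2000, accuracy=0.999)
print(f"Expected misread characters: {errors:.0f}")
```

Under these assumptions the estimate works out to about two misread
characters per page, any one of which could land on a rate or a price.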
Perhaps more importantly, OCR offers little ability to add intelligence
to documents or data, thus in many cases reducing or eliminating the
opportunity to do full-text searches, higher level indexing, and full
cross-referencing.
In defense of the PDF format in particular, Rick Bess, director of
product management for Adobe's Acrobat product group, notes that for the
right type of document, PDF offers several advantages. First, he says, PDF
allows for content locking and security, meaning that viewers of a document
cannot change its original nature. That plays well into the electronic
distribution of copyrighted works, maintenance manuals that are fairly
standard and do not change over time, as well as white papers and position
papers that companies want to publish electronically. "If you want to post
things in final form and preserve it, this is the way to go," Bess says.
Another PDF advantage, he argues, is nearly universal access to online
documents using Adobe reader software. Bess says there are more than 350
million Acrobat readers on the Internet today, and notes that companies such
as Cisco Systems, IBM, and Microsoft post tens of thousands of PDF documents
on their computers on a regular basis.
The next step up from PDF or HTML formatting is usually the Extensible
Markup Language (XML) for Web-enabled content. Mark Gross, president of Data
Conversion Laboratory, says nearly all the legacy data he sees converted to
the Web today is done using XML. A key advantage of XML, he says, is the
ability to "repurpose" data for use in multiple formats once
a conversion has been done. Through the use of granular tags that describe
to computers the type and nature of the data (e.g., this field is a date)
rather than what the actual data is (e.g., July 4, 1776), data that is
XML-tagged can be fully indexed, sorted, categorized, cross-referenced, and
separated from the whole to be used in many other formats or documents. A
key project that Data Conversion Laboratory is working on is Web-enabling
documents in the Library of Congress, and Gross says the company has already
converted over 1 million pages of documents, all with full-text search
capabilities based on dates, historical figures, or any of a number of other
search criteria contained within any page of text.
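The idea of tagging what a piece of data is, rather than only what it says,
can be sketched in a few lines. The element names, the iso attribute, and
the sample records below are hypothetical and are not drawn from the Library
of Congress project; the sketch uses Python's standard xml.etree.ElementTree
module.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged fragment; element and attribute names are illustrative only.
SAMPLE = """
<document>
  <event>
    <date iso="1776-07-04">July 4, 1776</date>
    <title>Declaration of Independence</title>
  </event>
  <event>
    <date iso="1787-09-17">September 17, 1787</date>
    <title>Constitution signed</title>
  </event>
</document>
""".strip()


def events_before(xml_text: str, cutoff_iso: str) -> list[str]:
    """Return titles of events whose tagged date falls before the cutoff.

    ISO 8601 dates (YYYY-MM-DD) sort chronologically as plain strings.
    """
    root = ET.fromstring(xml_text)
    titles = []
    for event in root.iter("event"):
        date = event.find("date")
        iso = date.get("iso") if date is not None else None
        if iso and iso < cutoff_iso:
            titles.append(event.findtext("title", default=""))
    return titles


print(events_before(SAMPLE, "1780-01-01"))
```

Because the date is marked as a date, software can filter and sort on it
without having to guess the meaning of the printed text "July 4, 1776".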
Frank Gilbane, editor of the Gilbane Report, a publication that focuses
on document management technologies, argues that the power and popularity
of XML are themselves leading to a surge in data conversion from legacy
systems to the Web. "The phenomenal acceptance of XML is encouraging
people to do more legacy conversion than they would otherwise," Gilbane
says, adding that "enterprises are not looking for just Web content
management. They are looking for enterprise content management. You have to
integrate it into all the other systems you have, and nobody would even
think of doing that without XML."
Pre-Conversion Considerations
Regardless of the format used, there are crucial considerations that
accompany any legacy-to-Web data conversion. One is the state of the legacy
data itself. "What kind of shape are your originals in?" Poe asks. "What
kind of variations do you have in format? You may think everything is 8.5 by
11 until you open the file and find out it is not. You will find old
photocopies on thermal paper that have faded. There may be a photograph on a
half-screen that has almost no contrast to scan. Contracts and letters may
have coffee stains or be torn in the corner."
P.G. Bartlett, vice president of marketing for Arbortext, says people are
also sometimes surprised to find their documents have no real structure
whatsoever, or at least no consistent structure from one document to the
next. "The number one problem is that people go into it saying we have rules
for how our authors are supposed to write, so therefore our information is
already very well structured," he says. "The rule is it is always worse than
you think. The tools do not enforce structure. The same author over time
will do things differently, and no two authors will do things the same all
the time."
Another critical consideration is the scale of a conversion. "People tend
to want to do it all," Gross says, with the usual result being overly
ambitious goals along with underestimated project timelines and costs, a
recipe for failure in any project. Commenting on the critical nature of
preparation in data conversion projects, Gross quotes Abraham Lincoln: "If I
had eight hours to chop down a tree, I'd spend six sharpening my ax."
Organizations also often underestimate the time and cost involved in
fixing errors from inaccurate data conversion, Gross adds. Using the example
of a 5,000-page project, Gross calculates that spending only five minutes
per page to fix errors would cost an organization some 25,000 minutes, or
about 417 hours, to get the data correct. At seven hours per day, that comes
to 59.6 days of full-time productivity lost.
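Gross's arithmetic is easy to reproduce and to rerun with other estimates.
The function and parameter names below are illustrative. Working from the
unrounded minutes gives about 59.5 days; the 59.6 figure follows from
dividing the rounded 417 hours by seven.

```python
def correction_cost(pages: int, minutes_per_page: float, hours_per_day: float) -> tuple[float, float]:
    """Return (hours, working days) needed to fix errors across a conversion."""
    total_minutes = pages * minutes_per_page
    hours = total_minutes / 60
    days = hours / hours_per_day
    return hours, days


# Figures from Gross's example: 5,000 pages, five minutes each, seven-hour days.
hours, days = correction_cost(pages=5000, minutes_per_page=5, hours_per_day=7)
print(f"{hours:.0f} hours, {days:.1f} working days")
```

Halving the per-page correction time halves the total, which is one way to
put a price on better conversion accuracy up front.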
Another challenge comes in deciding what data to convert and what to
leave behind, or at least to leave for a later conversion. Gilbane suggests
using analytic software to determine the types of content people use most
regularly as well as to establish patterns of use, which are then used to
select the most popular or most critical data for conversion. Bartlett calls
data conversion "a process of discovery," adding that "the discovery always
comes later, when it is more expensive to make a change." Arbortext convenes
groups of end-users before the conversion process, asking them to suggest
ways in which they think they may use the converted content or data.
Familiarity with the business goals of the organization doing the conversion
is also important. "You have to have someone who understands not only the
conversion process, but also what you want to accomplish," Bartlett says.
"You don't just go in there and say 'let's start converting our content and
see where it leads us.'"
Mark Ruport, president and CEO of Optika, adds two more challenges
companies face when converting legacy data to Web content. One is network
capacity. "You are dealing with latency issues relative to retrieval,"
Ruport says. "People are used to using the Web for HTML files, and it may
take longer to download your images. You may end up putting up something
that your customers are not satisfied with." Ruport's second point concerns
database tuning and administration. "You are opening up a lot more users to
access the information in different ways," he points out. "Someone in a
local division of accounting can now see it, so you may need more keys and
indexes. The way you set up your database becomes the critical factor in
delivering information to the desktop quickly. Until you get it up and
running, you may not understand whether you have the right infrastructure."
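Ruport's point about keys and indexes can be illustrated with a small
sketch using Python's built-in sqlite3 module. The table, column, and index
names are hypothetical, and a production document repository would likely
run on a different database engine; the sketch only shows how adding an
index changes the way a lookup by division is carried out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, division TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO documents (division, title) VALUES (?, ?)",
    [("accounting" if i % 2 else "sales", f"doc-{i}") for i in range(1000)],
)


def query_plan(connection: sqlite3.Connection, sql: str, params: tuple = ()) -> str:
    """Return SQLite's description of how it would execute the query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


QUERY = "SELECT title FROM documents WHERE division = ?"

plan_before = query_plan(conn, QUERY, ("accounting",))
conn.execute("CREATE INDEX idx_documents_division ON documents (division)")
plan_after = query_plan(conn, QUERY, ("accounting",))

print("before:", plan_before)
print("after: ", plan_after)
```

The exact plan wording varies by SQLite version, but the first plan should
describe a scan of the whole table and the second a search that uses the
new index.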
"Unstoppable" Momentum
Ruport says the main driver behind any conversion project is "the
intrinsic value of the information the customer wants to access," a factor
that remains critical regardless of the size, industry focus, geographic
location, or any other attribute of the organization doing the conversion.
Larger organizations tend to approach data conversion at a departmental
rather than enterprise-wide level, picking one or two key areas believed to
offer the fastest or largest ROI first, and using those projects as
benchmarks for further conversion efforts. Bess points to the Internal
Revenue Service as an example of an organization that has eagerly embraced
the Web-enablement of legacy information (tax filing forms, in this case),
something that has become so popular that Adobe sees a pronounced spike in
the number of requests for its Acrobat reader every tax season. Bartlett
warns that many organizations
still underestimate the scope of conversion projects, particularly not
stopping to consider that much of the legacy data that exists today is
relatively unstructured and will prove costly and difficult to convert to
uniform content accessible via the Web.
That said, Ruport sums up by saying that even with all the considerations
that must go into a successful legacy-to-Web conversion, the trend to
Web-enabling data is strong, growing, and "unstoppable."
"We have seen people have grandiose plans of putting stuff on the Web and
it was really overkill," he says. "They spent a lot of money putting up
stuff that was not needed. The first thing they want to do is make
everything accessible to everybody. Then they do a cost-benefit analysis and
quickly do triage: if it doesn't affect the bottom line, impact customer
satisfaction, have fast ROI, or help pay the bills faster, it may not have
to be done. The most important data rises to the top."
"The movement to the Web is unstoppable," he continues. "It is the best
medium out there. People will find faster ways to retrieve information, we
will have faster queries and more precise ways of focusing on what is
available, but every major movement will be headed towards the Web. It is
unstoppable."