
Converting Legacy Data To The Web

As "unstoppable" as legacy-to-Web conversion is, make sure you actually have a reason for the conversion and then be sure to thoroughly plan the process.
By James R. Dukart

The viral growth of the Internet and e-commerce means just about any organization of any size today has legacy data that might be appropriate for the World Wide Web. But knowing that simple fact, and even having a handle on what information would be best to put on the Web, is only the first and may be the easiest step in any such data conversion. Much of the legacy information that might be Web-enabled exists in pre-Internet formats such as old contracts and sales agreements, electronic data stored in proprietary database fields, books, annual reports, photographs, microfiche, and photocopies stored in company libraries.

As with any document conversion project, moreover, converting that type of data to Web-enabled content is a complex process featuring multiple layers of analysis, not only about what content there is and will be after conversion, but also about the structure of the organization and about how, why, when, where, and by whom converted data will ultimately be used. As if that were not enough, organizations considering the conversion of legacy data to the Web must be aware of the numerous legal and business consequences of converting to and using electronic data, keeping an eye on everything from public disclosure and privacy laws to copyright and fair use doctrines.

According to Stephen Poe, CEO of technology consultancy Nautilus Solutions, there is a "huge market" for legacy-to-Web data conversion, one that is "only limited by the return on investment or cost justification." Most companies, Poe says, have at least some data they would like to Web enable. Far fewer, he observes, invest the appropriate time and effort to make sure the data conversion they undertake ever meets its ultimate goals.

"One of the biggest issues is, What is my business rationale for doing this?" Poe says. "What am I potentially going to use the data for? The answers to those types of questions will determine not only how you get the information to the Web, but how you will use it once it is there."

Conversion Strategies and Formats

Organizations can use any of a number of different methods to Web-enable legacy data, Poe notes, ranging from quick, relatively inexpensive and somewhat simple processes to more complex formats that allow much greater flexibility and data use going forward.

"The first and easiest way is to rasterize and stick a picture out there," Poe says, calling that "quick and cheap and usually not very useful." Next, companies can scan text using OCR or ICR and then convert it to a widely used format such as HTML (Hypertext Markup Language) or PDF (Adobe's Portable Document Format). That, Poe says, can be more useful than straight scanning, but offers its own trade-offs. "Getting accurate OCR can be medium-difficult to very, very expensive," he comments, adding that even slightly inaccurate data, in the right circumstances, can prove to be very costly. "If the originals are in English, you can usually get some very accurate OCR, but that is not always the case," he says. "Plus to get 99.9% accuracy can be very expensive, and if you are doing something like rate sheets, getting a wrong number on a rate can cost you a lot of money. It is not like missing a 'v' in some word processing text."

Perhaps more importantly, OCR offers little ability to add intelligence to documents or data, thus in many cases reducing or eliminating the opportunity to do full-text searches, higher level indexing, and full cross-referencing.

In defense of the PDF format in particular, Rick Bess, director of product management for Adobe's Acrobat product group, notes that for the right type of document, PDF offers several advantages. First, he says, PDF allows for content locking and security, meaning that viewers of a document cannot change its original nature. That plays well into the electronic distribution of copyrighted works, maintenance manuals that are fairly standard and do not change over time, as well as white papers and position papers that companies want to publish electronically. "If you want to post things in final form and preserve it, this is the way to go," Bess says. Another PDF advantage, he argues, is nearly universal access to online documents using Adobe reader software. Bess says there are more than 350 million Acrobat readers on the Internet today, and notes that companies such as Cisco Systems, IBM, and Microsoft post tens of thousands of PDF documents on their computers on a regular basis.

The next step up from PDF or HTML formatting usually leads to use of the Extensible Markup Language (XML) for Web-enabled content. Mark Gross, president of Data Conversion Laboratory, says nearly all the legacy data he sees converted to the Web today is done using XML. A key advantage to XML, he says, is the ability to "repurpose" data for use in multiple formats once a conversion has been done. Through the use of granular tags that describe to computers the type and nature of the data (e.g., this field is a date) rather than what the actual data is (e.g., July 4, 1776), data that is XML-tagged can be fully indexed, sorted, categorized, cross-referenced, and separated from the whole to be used in many other formats or documents. A key project that Data Conversion Laboratory is working on is Web-enabling documents in the Library of Congress, and Gross says the company has already converted over 1 million pages of documents, all with full-text search capabilities based on dates, historical figures, or any of a number of other search criteria contained within any page of text.
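The tagging idea Gross describes can be illustrated with a minimal sketch. The element and attribute names below (`document`, `date`, `iso`, `person`) are hypothetical, chosen only for illustration; they are not the schema used by Data Conversion Laboratory or the Library of Congress. The point is that once a date is tagged as a date, software can filter on it or lift it out for reuse.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; tag and attribute names are assumptions.
SAMPLE = """
<collection>
  <document id="doc-1">
    <title>Declaration of Independence</title>
    <date iso="1776-07-04">July 4, 1776</date>
    <person>Thomas Jefferson</person>
  </document>
  <document id="doc-2">
    <title>Gettysburg Address</title>
    <date iso="1863-11-19">November 19, 1863</date>
    <person>Abraham Lincoln</person>
  </document>
</collection>
"""


def titles_dated_before(root, iso_cutoff):
    """Return titles of documents whose tagged date is earlier than iso_cutoff."""
    titles = []
    for doc in root.findall("document"):
        date = doc.find("date")
        iso = date.get("iso") if date is not None else None
        # ISO 8601 dates (YYYY-MM-DD) sort correctly as plain strings.
        if iso and iso < iso_cutoff:
            titles.append(doc.findtext("title"))
    return titles


root = ET.fromstring(SAMPLE)

# Search by the *kind* of data (a date), not by matching display text.
early = titles_dated_before(root, "1800-01-01")

# "Repurpose" the same tagged content: pull every named person into an index.
people_index = [person.text for person in root.iter("person")]
```

Here `early` holds only the 1776 document, and `people_index` is a separate list built from the same source, which is the kind of reuse a scanned image or plain OCR text would not support directly.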

Frank Gilbane, editor of the Gilbane Report, a publication that focuses on document management technologies, argues that the power and popularity of XML are themselves leading to a surge in data conversion from legacy systems to the Web. "The phenomenal acceptance of XML is encouraging people to do more legacy conversion than they would otherwise," Gilbane says, adding that "enterprises are not looking for just Web content management. They are looking for enterprise content management. You have to integrate it into all the other systems you have, and nobody would even think of doing that without XML."

Pre-Conversion Considerations

Regardless of the format used, there are crucial considerations that accompany any legacy-to-Web data conversion. One is the state of the legacy data itself. "What kind of shape are your originals in?" Poe asks. "What kind of variations do you have in format? You may think everything is 8.5 by 11 until you open the file and find out it is not. You will find old photocopies on thermal paper that have faded. There may be a photograph on a half-screen that has almost no contrast to scan. Contracts and letters may have coffee stains or be torn in the corner."

P.G. Bartlett, vice president of Marketing for Arbortext, says people are also sometimes surprised to find their documents have no real structure whatsoever, or at least no consistent structure from one document to the next. "The number one problem is that people go into it saying we have rules for how our authors are supposed to write, so therefore our information is already very well structured," he says. "The rule is it is always worse than you think. The tools do not enforce structure. The same author over time will do things differently, and no two authors will do things the same all the time."

Another critical consideration is the scale of a conversion. "People tend to want to do it all," Gross says, with the usual result being overly ambitious goals along with underestimated project timelines and costs, a recipe for failure in any project. Commenting on the critical nature of preparation in data conversion projects, Gross quotes Abraham Lincoln: "If I had eight hours to chop down a tree, I'd spend six sharpening my ax."

Organizations also often underestimate the time and cost involved in fixing errors from inaccurate data conversion, Gross adds. Using the example of a 5,000-page project, Gross calculates that spending only five minutes per page to fix errors would cost an organization some 25,000 minutes, or about 417 hours, to get the data correct. At seven hours per day, that comes to roughly 59.6 days of lost full-time productivity.
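Gross's estimate is simple enough to check, and to rerun with a different project's numbers. The short sketch below reproduces the arithmetic; the function name is made up for illustration, and the inputs are the ones quoted in the example.

```python
def correction_effort(pages, minutes_per_page, hours_per_day):
    """Return (total minutes, total hours, workdays) needed to fix errors."""
    total_minutes = pages * minutes_per_page
    total_hours = total_minutes / 60
    workdays = total_hours / hours_per_day
    return total_minutes, total_hours, workdays


# Figures from the example: 5,000 pages, 5 minutes each, 7-hour workdays.
minutes, hours, days = correction_effort(5000, 5, 7)
print(f"{minutes:,} minutes = {hours:.0f} hours = {days:.1f} workdays")
```

Carried through without intermediate rounding, the result is about 59.5 workdays; the 59.6 figure comes from dividing the rounded 417 hours by seven. Either way, the cleanup amounts to roughly 12 working weeks for one person.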

Another challenge comes in deciding what data to convert and what to leave behind, or at least to leave for a later conversion. Gilbane suggests using analytic software to determine the types of content people use most regularly as well as to establish patterns of use, which are then used to select the most popular or most critical data for conversion. Bartlett calls data conversion "a process of discovery," adding that "the discovery always comes later, when it is more expensive to make a change." Arbortext convenes groups of end-users before the conversion process, asking them to suggest ways in which they think they may use the converted content or data. Familiarity with the business goals of the organization doing the conversion is also important. "You have to have someone who understands not only the conversion process, but also what you want to accomplish," Bartlett says. "You don't just go in there and say 'let's start converting our content and see where it leads us.'"
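As a rough illustration of the usage analysis Gilbane suggests, the sketch below ranks content categories by how often they are requested and takes the top few as the first conversion batch. The log entries and category names are invented for the example; a real project would read them from whatever usage records the organization actually keeps.

```python
from collections import Counter

# Hypothetical access log: one entry per request, naming the content category.
access_log = [
    "rate-sheets", "contracts", "rate-sheets", "annual-reports",
    "rate-sheets", "contracts", "manuals",
]


def conversion_priority(log, top_n):
    """Rank content categories by request count, most requested first."""
    return [category for category, _ in Counter(log).most_common(top_n)]


# The two most-requested categories become the first conversion batch.
first_batch = conversion_priority(access_log, 2)
```

Frequency alone only captures "most popular." Content that is critical but rarely requested would still need to be added by the people who know the business goals.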

Mark Ruport, president and CEO of Optika, adds two more challenges companies face when converting legacy data to Web content. One is network capacity. "You are dealing with latency issues relative to retrieval," Ruport says. "People are used to using the Web for HTML files, and it may take longer to download your images. You may end up putting up something that your customers are not satisfied with." Ruport's second point concerns database tuning and administration. "You are opening up a lot more users to access the information in different ways," he points out. "Someone in a local division of accounting can now see it, so you may need more keys and indexes. The way you set up your database becomes the critical factor in delivering information to the desktop quickly. Until you get it up and running, you may not understand whether you have the right infrastructure."
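Ruport's point about keys and indexes can be seen in miniature with SQLite, which ships with Python. The table, column, and index names here are hypothetical, and a production document repository would be tuned with its own database's tools. The sketch only shows how adding an index changes the way a lookup by division is executed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, division TEXT, title TEXT)"
)
conn.executemany(
    "INSERT INTO documents (division, title) VALUES (?, ?)",
    [("accounting", f"Invoice {i}") for i in range(500)]
    + [("sales", f"Contract {i}") for i in range(500)],
)

query = "SELECT title FROM documents WHERE division = ?"


def query_plan(connection, sql, params):
    """Return SQLite's description of how it would execute the query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


# Without an index, the lookup has to scan every row in the table.
plan_before = query_plan(conn, query, ("accounting",))

# Add an index on the column that the new Web users filter by.
conn.execute("CREATE INDEX idx_documents_division ON documents (division)")
plan_after = query_plan(conn, query, ("accounting",))

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan text varies between SQLite versions, but the second plan should name `idx_documents_division`, meaning the engine can go straight to the matching rows instead of reading the whole table.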

"Unstoppable" Momentum

Ruport says the main driver behind any conversion project is "the intrinsic value of the information the customer wants to access," and that this factor remains critical regardless of the size, industry focus, geographic location, or any other attribute of the organization doing the conversion. Larger organizations tend to approach data conversion at a departmental rather than enterprise-wide level, first picking one or two key areas believed to offer the fastest or largest ROI and using those projects as benchmarks for further conversion efforts. Bess points to the Internal Revenue Service as an example of an organization that has eagerly embraced the Web-enablement of legacy information (tax filing forms, in this case), something that has become so popular that Adobe sees a pronounced spike in the number of requests for its Acrobat reader every tax season. Bartlett warns that many organizations still underestimate the scope of conversion projects, in particular failing to consider that much of the legacy data that exists today is relatively unstructured and will prove costly and difficult to convert to uniform content accessible via the Web.

That said, Ruport sums up by saying that even with all the considerations that must go into a successful legacy-to-Web conversion, the trend to Web-enabling data is strong, growing, and "unstoppable."

"We have seen people have grandiose plans of putting stuff on the Web and it was really overkill," he says. "They spent a lot of money putting up stuff that was not needed. The first thing they want to do is make everything accessible to everybody. Then they do a cost-benefit analysis and quickly do triage: if it doesn't affect the bottom line, impact customer satisfaction, have fast ROI, or help pay the bills faster, it may not have to be done. The most important data rises to the top."

"The movement to the Web is unstoppable," he continues. "It is the best medium out there. People will find faster ways to retrieve information, we will have faster queries and more precise ways of focusing on what is available, but every major movement will be headed towards the Web. It is unstoppable."