Thursday, December 21, 2006

The lowdown on Office Web Components

I recently mentioned the possibility of exploring MS Office Web Components as a replacement for server-side Excel automation. It turns out that for my particular problem, it didn't present a solution.

The problem workflow:
Client sends an Excel file as an email attachment. The spreadsheet contains tabular data. Depending on the particular data domain, it could be a phone directory, a report describing the status of their various projects, or whatever. The important thing to note is that they store this stuff in Excel, and that's what they give us as source material. Our Perl scripts then need to create HTML/PHP based on the input data, or possibly build SQL update commands to populate their web database. Web pages and SQL command files are then copied onto our local server for testing, and once everything is proofed, it gets sent by FTP to their live server.

The key problem:
Extracting data from Excel on a web server is "difficult."

Specifically, the programmer who first wrote this suite of automated scripts used Excel's Automation COM objects to do the data extraction. These COM objects actually control Excel, open/run it on the server, hide the application's GUI, open workbooks, create data ranges (eg. A1:E256) and then get that range as an array of values. Once it's done, it closes Excel. Or does it?

When running automation through a web server, IIS manages its own garbage collection. So when you close Excel and delete the reference to the application object, that should be the end of it - the process should terminate. But IIS doesn't actually delete the object until it collects garbage, so you end up with a zombie Excel process which must either be killed by fancy sysinternals programming, or just reboot the server at that point. If I'm around, I kill all the Excel zombies; if I'm not around, and nobody else knows how to kill, then someone has to reboot the server.

So how would you do Excel-like things on a server? MS took care of half of the problem with Web Components, which give a lot of the same functionality as the office programs. So for instance, you can web script the Spreadsheet component and use it to populate a sheet with data and dynamic formulae, and then export it to an Excel file or XML spreadsheet. But you can't do the opposite, ie. open an existing sheet to get the data from it, which is what I need.

So I moved on and discovered a new third option, but it required making a small change to the workflow. Rather than directly uploading an Excel file to the server for processing, the account executive who receives the Excel file from the client would first have to save as a delimited text file. This way, Perl on the server could just open it up and read it line-by-line. It would cut out a lot of the problems we've had in the past with having to wrap everything in UsedRange sub calls, which prevented the script from looping beyond the part of the sheet that had been filled in and off into infinity whenever the client forgot to place their END token at the end of the data. But it introduced more tedious work and could easily be scripted.

Introducing: Windows Script Host. You may have heard of it and wondered what it was. It's basically MS's answer to AppleScript. It allows you to use VBscript or JScript, along with whatever ActiveX/COM objects that you have access to locally, to build quick and easy scripts. If you need an interactive GUI, you can build the scripts into an HTML Application, which gets a .hta extension, is double-click runnable from Explorer, and runs inside whatever Internet Explorer monstrosity you happen to have. Then you can do the usual HTML form scripting operations, button onClick methods, etc. The script itself is pretty easy: create an Excel.Application instance, use its Workbooks.Open(filepath) method to return a handle to the Excel .xls file, then call its SaveAs(destination, filetype) method and pass it the proper type number for tabbed text. Now the AE can upload the text file rather than the Excel file.

Possible improvements: If processes that could be done just as easily in VBScript or JScript are currently done in Perl on the server, then the whole process could be done on local hosts. But there might be regular expression parsing or things like that, that remain better done in Perl, in which case it might be better to move the local script into a button onClick event in the web form page, with special security permissions on the local hosts allowing the script to start up Excel locally, do the conversion to text, save it with the original Excel file, and switch the form's file upload input to reflect the change so that the text file gets uploaded. That'd be pretty sweet.

Either way, my team has plenty of refactoring to do in the new year. If you find yourself in a similar situation, where you have a system that "works" but could certainly be done better, and you can't risk breaking the whole thing and starting from scratch, then check out Martin Fowler's excellent book Refactoring: Improving the Design of Existing Code

UPDATE: I forgot to mention that the Office Web Components are going by the wayside, which was another reason to not get into using them, even if they had been able to do the job.

Labels: , ,

Monday, December 18, 2006

Why Excel OLE Automation through the web is Bad

In the category of "sub-optimal things that I have to maintain," one of the ugly ones that rears its head from time to time is Excel automation. We have a client that sends us data tables in the format they're comfortable with, which happens to be MS Excel spreadsheets. We then have a legacy Perl script that uses a Win32::OLE connection to the Excel.Application COM object to open the Excel file, which first gets uploaded to the server through a plain old CGI multipart form attachment.

The issue at hand, as described in MS KB 257757, Considerations for server-side Automation of Office, is that running Excel server-side through IIS causes an instance of Excel to be spawned within the memory space of IIS, or something to that effect. Whatever is really going on at the low-level, the end effect is that an instance of Excel is left running on the server, even after you go through the normal procedure of calling the Quit() method and setting the application reference to nothing/null. And with the number of files that we process, we get a lot of zombies on our test server. Occasionally I get notified that an update process returned a blank page rather than a confirmation, and I check out the process monitor, and sure enough there are a bunch of Excels that aren't doing anything other than hogging resources and preventing Excel from actually running. Normally you can't even end those processes, since the IIS user spawned them, so you have to either reboot or use a fancy non-standard "kill" script.

If you ever find yourself in a similar situation, there are a few ways around it. I've found that running the script on the server via Scheduled Tasks allows Excel to quit. So you can set up a task that runs your script periodically, checking for the existence of the uploaded file. Your processing won't be instantaneous, but it'll be more reliable. There's also a possibility of using Office Web Components, which includes a COM object for a Spreadsheet with a similar object model to Excel Automation. I'm actually going to check that out to see if a) we have it already on the server, and b) if it can open Excel files on the server the same way that the Automation does, and if so, I'll convert our scripts to use that rather than Automation.

Tuesday, December 05, 2006

This Is What The Web Sounds Like

"CLiCk, Speak" is an extension for the Firefox web browser that provides a simple text-to-speech interface for web pages, intended for sighted users. If you don't already have Firefox, you're missing out! Click the link toward the bottom of the right-side column of this page to download Firefox, then download the extension, and listen to the web.

It'll give you a nice feel for some of the accessibility issues that you'll run into when designing sites, as you can listen to a simplified version of what a real screen reader would speak, minus the cues about what HTML element is being spoken. It's also handy as a productivity tool, freeing your eyes to do other things while you listen to web pages being read to you.

I haven't posted much lately because work has been crazy with a capital K. I've been promoted to Manager of Programming and placed in charge of the Web Production team, where I still handle dynamic web site and application programming in addition to project management, and now dynamic print job compilation. In the last few months I've learned more about PDFs than I ever intended, as well as gotten my first exposure to using Quark XPress for production work. I've had to hire a new programmer/designer, and we're in the midst of hiring for a HTML coder position.

Since I've picked up the lead on all of our web projects, I've become aware of many issues that I hope to write about when time permits. Among them:
  • Why you shouldn't perform OLE Automation tasks through web scripts (eg uploading Excel files to a CGI script that grabs data ranges and processes them)
  • Using dynamic evaluation functions as a quick and dirty substitute for the ActiveRecord pattern
  • Why performing multiple queries against Access databases in classic ASP through ADODB doesn't scale well, and how to improve performance as much as possible (see above)
  • How to discover hidden .Net potential
More about CLiCk, Speak

Labels: ,