The organization I work for uses a third-party service to manage its eBay listings. This service, like all programs written by man, is not perfect. From several outings with various employees who work on the large .Net code base, I have learned that development at this organization is largely concerned with unit tests and integration tests. The largest burden of testing, however, comes in the form of acceptance and user testing.
It has been my experience that designing large distributed pieces of software is a difficult task to get right; even more difficult is proper validation testing within distributed environments.
Ensuring a software stack is cooperating properly when multiple servers, processes, languages, and groups of programmers are involved is an undertaking in itself.
The proper approach to monitoring such an infrastructure requires certain design traits, most notably the ability to test real data against the live environment and to consistently verify reliable output.
This methodology also forces the capabilities of such a system to be well documented and well defined from start to finish, especially if you hope to test all customer-facing aspects of the system.
Last week we attempted to update pricing on about 138,000 listings. Our systems generated a report that was uploaded to the service in question, and a reply of 'success' was received shortly thereafter.
Over the weekend I received several emails from the CEO of our company complaining that the prices we had discussed changing had not changed. After opening a ticket with the company's support, we received this response a few days later:
Hi [name],

After engineering investigated, they saw that one of the services that updates price wasn't running. So any time a price was updated, the rest of our revise process wouldn't pick it up unless something else in the listing was changed, like an item specific or template change. I don't have an explanation beyond that or why, but the service was restarted, and if you only update prices going forward, the system will pick those up.

However, if you run into anything like this again, let me know and I'll send it back up to engineering to have them check out the services instead of just revise activity.

Sorry for the inconvenience here.

Let's see what we can infer about this organization's infrastructure and design from this email.
First, it's obvious that engineering is a completely separate department from high-level support; this wouldn't be concerning if it weren't for the fact that this "problem resolution" is missing a key element of any resolution: assurance that the problem will not occur again.
Second, the differencing algorithm relies on some sort of passive service that monitors for changes (my guess) and marks certain items for update. At some point the service became complex enough that there was a need to silo the data components, such as pricing, from each other. The infrastructure for detecting changes and responding to them is passive in nature; meaning, a failure is, by the very nature of the operation, silent.
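If my guess is right, the cheapest defense against this class of failure is to make the passive service announce that it is alive. Here's a rough sketch of what I mean, in Python; the file path, the threshold, and the function names are all mine, not the vendor's, and a real deployment would page someone rather than print:

    import time

    # Illustrative heartbeat monitor: the price-update service records a
    # timestamp after every cycle, and an independent process alarms when
    # that timestamp goes stale.

    HEARTBEAT_FILE = "/var/run/price_updater.heartbeat"
    MAX_SILENCE_SECONDS = 300  # alarm if no cycle completes within 5 minutes

    def record_heartbeat():
        """Called by the price-update service at the end of every cycle."""
        with open(HEARTBEAT_FILE, "w") as f:
            f.write(str(time.time()))

    def check_heartbeat():
        """Run by a separate monitor process on a schedule (e.g. cron)."""
        try:
            with open(HEARTBEAT_FILE) as f:
                last_beat = float(f.read())
        except (IOError, ValueError):
            alert("price updater has never reported a heartbeat")
            return
        silence = time.time() - last_beat
        if silence > MAX_SILENCE_SECONDS:
            alert("price updater silent for %d seconds" % silence)

    def alert(message):
        # Stand-in for a real notification channel (email, pager, dashboard).
        print("ALERT: " + message)

The mechanism doesn't matter; what matters is that a second, independent process exists whose only job is to notice when the first one stops doing its job.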
And I guess this is where we get to the part where I share my perception of this situation. It's my belief that this shouldn't really have happened in the first place, and what's ultimately made me write about it is the fact that it's now happened at least 4 times to us; and we're only a single customer!
What can be done to prevent this sort of thing? Better testing, better design, a different corporate structure? Sure, maybe some of those things, but mistakes will happen.
There has been a shift in the last 5 years in which unit tests have been touted as the holy grail of application robustness and conformity to specification. I don't doubt that this application conforms to specification; however, it has failed in one of the most basic ways possible. Worst of all, the customer, as is all too often the case, is the one who finds the glitch.
Unit tests, automatic builds, integration testing, and system tests are all great concepts, but I will argue not in that order. Because business requirements always outpace the speed of development, I will state very clearly here: make system tests your highest priority; make them robust, sophisticated, and capable of interacting with a live system. This may require designing the system in such a way that continuous monitoring and testing are possible, but this is essential in complicated, distributed environments. Often dev teams never get around to writing unit tests, or worse, they write them, the unit-test code base grows larger than the application's code base, and the tests fall out of date or are discarded. This is why I advocate thinking about and planning for system tests first and making them a requirement. If system tests are planned last, it's very possible you will simply never get to them.
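To make that concrete, here is a minimal sketch of what such a system test might have looked like for the pricing failure above. The `client` object is a hypothetical wrapper around the vendor's API, and the method names (`update_price`, `get_listing`) are placeholders for whatever the real interface exposes:

    import time

    def test_price_update_propagates(client, listing_id="TEST-LISTING-1"):
        """End-to-end check: revise a price through the public interface,
        then poll the live listing until the change is visible or we time
        out. Every stage of the revise pipeline must be alive to pass."""
        new_price = 19.99
        client.update_price(listing_id, new_price)

        deadline = time.time() + 600  # allow up to 10 minutes to propagate
        while time.time() < deadline:
            if client.get_listing(listing_id).price == new_price:
                return  # success: the change made it all the way through
            time.sleep(30)

        raise AssertionError("price update never reached the live listing; "
                             "some stage of the revise pipeline is down")

Run on a schedule against a dedicated test listing, this same check doubles as the continuous monitoring I described earlier; the vendor would have known about the dead service before any customer did.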
In one of my next blog posts I hope to outline this approach as applied to a distributed project I'm working on.
I'll detail the challenges and the careful attention given to the areas of the application that are inherently susceptible to silent failures. Event-driven systems like the one that will be discussed are often the most prone to deadlocks and silent failures.