For IT systems, failure is a constant companion. People probably assume that dealing with it 'must be exhausting' or 'must be a heavy burden', but it really isn't. Once I start working on a failure, the adrenaline flows and my brain wakes up. I don't know why, but I can dig into the cause intently even in the middle of the night. And when I am done, I feel very refreshed. It is unexpectedly healthy work. (Although I lose all motivation right after that.)
Moreover, when a team overcomes these "emergencies" together, its accumulated knowledge gets organized and teamwork across the organization is fostered. It is not bad from the viewpoint of technical training, either. (Of course, we could not stand it if failures occurred too frequently...)
(Please don't tell anyone...)
The following Business Process definition is a Workflow for managing emergency issues, which are called 'Trouble Tickets'. When a Trouble Ticket is opened, the people concerned are notified and the people in charge are assigned. Directors, and employees who have [Data Viewing permission], monitor the situation in real time. (And on the Enterprise Social Network there will be frequent chatter, such as 'What frequency do you mean by "intermittently"?' or 'Who exactly are "some of the users"?'...)
In addition, the final deliverable of this Business Process is the "Final Report". The Technical Team and the Web Team each create their own report.
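The life cycle described above (open a ticket, notify, assign, collect a report from each team) can be sketched roughly as follows. This is only an illustration, not how Questetra implements the Workflow: the class, the status names, and the rule that the ticket closes once both reports are in are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    # Hypothetical states; the real Workflow's steps may differ.
    OPEN = "open"
    ASSIGNED = "assigned"
    CLOSED = "closed"


# The two teams that each create a Final Report, as described in the article.
REPORTING_TEAMS = ("Technical Team", "Web Team")


@dataclass
class TroubleTicket:
    title: str
    status: Status = Status.OPEN
    notified: list = field(default_factory=list)
    assignees: list = field(default_factory=list)
    final_reports: dict = field(default_factory=dict)

    def notify(self, people):
        """Record who has been told that the ticket was opened."""
        self.notified.extend(people)

    def assign(self, people):
        """Assign the people in charge and mark the ticket as assigned."""
        self.assignees.extend(people)
        self.status = Status.ASSIGNED

    def submit_report(self, team, text):
        """Store one team's Final Report; close once every team has reported."""
        if team not in REPORTING_TEAMS:
            raise ValueError(f"unknown team: {team}")
        self.final_reports[team] = text
        # Assumption: the ticket is considered closed when both reports exist.
        if all(t in self.final_reports for t in REPORTING_TEAMS):
            self.status = Status.CLOSED


if __name__ == "__main__":
    ticket = TroubleTicket("Some users cannot log in intermittently")
    ticket.notify(["director", "support"])
    ticket.assign(["engineer-on-call"])
    ticket.submit_report("Technical Team", "Cause and countermeasure ...")
    ticket.submit_report("Web Team", "Announcement posted to users ...")
    print(ticket.status)
```

Keeping the two reports in a dictionary keyed by team makes it easy to see at a glance which team has not reported yet.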
[Failure Corresponding flow]
[Failure Corresponding flow; '1. Open Trouble Ticket' screen]
By the way, Questetra, Inc. provides a 'Cloud Computing Environment' (SaaS) to its customers, as a company classified under Package software services. (G3913 < G39. INFORMATION SERVICES < G. INFORMATION AND COMMUNICATIONS, by the Japan Standard Industrial Classification) In short, we are a 'Cloud service provider'.
Nowadays, users seem to take 'a computing environment that is available from anywhere, at any time' for granted. However, for today's Cloud service providers, 24/7 operation is very hard to achieve. Because a service counts as "running" only when a variety of software on many devices all interact with each other successfully, some failures are unavoidable. (Power supply, network, physical failure of a server machine...) Try as we might, we cannot prevent system failures from occurring several times a year. (SaaS Service-Level and Service Failure logs of Questetra.)
Nevertheless, cases in which we must rush "on-site" have disappeared now that virtualization technology has matured. Likewise, the work of recovering from human error has almost disappeared, because automation of configuration tasks has become established. Taking the long view over 5, 10, or 20 years, 24/7 operation at a reasonable cost will be realized someday.
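To get a feel for what "several times a year" means against a 24/7 target, yearly downtime can be converted into an availability figure. The sketch below is plain arithmetic; the function names and the sample numbers (three one-hour outages) are made up for illustration and are not taken from Questetra's service logs.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def availability(downtime_minutes_per_year):
    """Fraction of the year the service was up, given total downtime."""
    return 1 - downtime_minutes_per_year / MINUTES_PER_YEAR


def allowed_downtime_minutes(target):
    """Yearly downtime budget, in minutes, for an availability target (0..1)."""
    return (1 - target) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical example: three outages of 60 minutes each in one year.
    downtime = 3 * 60
    print(f"availability: {availability(downtime):.4%}")
    print(f"budget at 99.9%: {allowed_downtime_minutes(0.999):.1f} minutes")
```

Even a handful of short outages leaves availability well above 99.9%, yet still short of true 24/7.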
P.S.
Although this is a digression, exploring the cause of a failure begins with tenacious confirmation work. I cannot explain it in a way that is clear to everyone, but in the end it is the work of investigating the cause of failure in each separate element of an "IT system". However, there are a great many elements. Each of them falls into one of the following groups:
- A. Communication network group (including external system)
- B. Hardware group
- C. Software group
- D. Data group
Some failures, like the 'Leap second bug', are beyond what the skills and knowledge in our company can handle.
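The element-by-element confirmation work over groups A to D can be pictured as a simple checklist. The following is a minimal sketch under stated assumptions: the enum mirrors the four groups listed above, while `investigate` and the idea of one health check per group are hypothetical simplifications (in practice each group contains many elements).

```python
from enum import Enum


class ElementGroup(Enum):
    NETWORK = "A. Communication network group (including external system)"
    HARDWARE = "B. Hardware group"
    SOFTWARE = "C. Software group"
    DATA = "D. Data group"


def investigate(checks):
    """Return the groups that are still suspect, in A-to-D order.

    `checks` maps a group to a zero-argument function returning True when
    that group was confirmed healthy. A group without a check stays suspect,
    because it has not been confirmed yet.
    """
    suspects = []
    for group in ElementGroup:
        check = checks.get(group)
        if check is None or not check():
            suspects.append(group)
    return suspects


if __name__ == "__main__":
    # Hypothetical results: network and hardware confirmed, software failing,
    # data not yet examined.
    results = {
        ElementGroup.NETWORK: lambda: True,
        ElementGroup.HARDWARE: lambda: True,
        ElementGroup.SOFTWARE: lambda: False,
    }
    for group in investigate(results):
        print("still suspect:", group.value)
```

Treating an unchecked group as suspect reflects the "tenacious confirmation" mindset: nothing is ruled out until it has actually been confirmed.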
[Data Items List]
[Free Download]
- Business Template: Failure Corresponding flow
<<Related Articles>>
- Fault! Report It Through Workflow! (2011-10-03)
- Appropriate Concurrent Processing to Sharing Failure Information Quickly (2012-08-06)
- A Workflow That Helps Teamwork in a Time of Crisis (2010-11-06)