Incident State population through CIs/Services

ccottet · Post by **ccottet** » 09 Nov 2012, 20:18

Hi,

I'm currently evaluating the capabilities of OTRS & OTRS:ITSM to support my company in implementing ITIL processes throughout our IS/IT operations. Until now, I have been able to tailor the OTRS configuration to our needs without much trouble, but I'm now stuck on some conceptual issues with how incident state propagates between configuration items, and between configuration items & services.

Due to a lack of clear documentation on what the various link types mean and how OTRS processes these links to propagate incident state, I've been experimenting with different links between CIs & CIs as well as between CIs & Services, and I think I now understand how the system works. Unfortunately, the behaviours I identified do not seem to be in line with my expectations. Therefore, by explaining how I understand the system to work and how I expect it to work, I hope you'll be able to help me determine whether I'm misunderstanding the OTRS concepts, whether this can be addressed by tweaking the OTRS configuration, or whether I'm looking at an unsupported feature of OTRS.

In order to explain the incident state computations, I'll use some schematic representations with the following notations:
- CI & SE stand respectively for Configuration Item & Service
- Each type of link is identified with a two-letter abbreviation:
** AT: Alternative To
** CT: Connected To
** RT: Relevant To
** DO: Depends On/Required For
** IN: Includes/Part of
- Bi-directional links are represented like this <-XX-> where XX is the link type abbreviation
- Uni-directional links are represented like this --XX-> where XX stands for the link type
- R,Y,G are used to identify respective incident states Red (Incident), Yellow (Warning), Green (Operational)
- X => Y is used to identify an implication of a cause (X) to its consequence (Y)
- Implication causes are necessarily incident states and are represented using the incident state abbreviation (R, Y, G)
- Implication consequences can either be one of the incident state abbreviations (R, Y, G) or the special value -, which means that the cause (on the left side of the implication) does not have any impact on the consequence (on the right side of the implication)
- In case of incident state conflicts, the highest level takes precedence over the others. For instance, if due to some of its links a service's incident state should be set to both Red and Yellow, then Red takes precedence.
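
As a rough illustration, the precedence rule from the last bullet can be written as a tiny Python helper. The names `SEVERITY` and `combine` are made up for this sketch and are not part of OTRS; the ordering Green < Yellow < Red is taken from the notation above.

```python
# Assumed severity order, taken from the notation above: Green < Yellow < Red.
SEVERITY = {"G": 0, "Y": 1, "R": 2}


def combine(states):
    """Return the highest-severity state among the candidates.

    "-" entries mean "no impact" and are ignored; with nothing left,
    the result defaults to Green.
    """
    candidates = [s for s in states if s != "-"]
    return max(candidates, key=SEVERITY.__getitem__, default="G")


print(combine(["Y", "R", "G"]))  # R
print(combine(["-", "Y"]))       # Y
```

So a service that receives Red from one link and Yellow from another ends up Red, and a link with no impact never lowers a state.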

Here is an example of a complete computation schema:

Code: Select all

CI <-DO-- SE
R    =>    R
Y    =>    Y
G    =>    G

The above schema reads as follows. It shows the incident state impacts for a Service which depends on a Configuration Item:
- If the configuration item is in state Incident (Red), then the Service should be in state Incident as well
- If the configuration item is in state Warning (Yellow), then the Service should be in state Warning at least (it might be in state Incident if some other link forces it)
- If the configuration item is in state Operational (Green), then the Service should be in state Operational unless some other link forces it into a Warning or Incident state.

Basic Propagation Method (as I understand it):

Code: Select all

CI <-DO-- SE      CI --DO-> SE      CI <-AT-> SE      CI <-RT-> SE
R    =>    R      R    =>    R      R    =>    -      R    =>    -
Y    =>    Y      Y    =>    Y      Y    =>    -      Y    =>    -
G    =>    G      G    =>    G      G    =>    -      G    =>    -

SE <-DO-- CI      SE --DO-> CI      SE <-AT-> CI      SE <-RT-> CI
R    =>    -      R    =>    -      R    =>    -      R    =>    -
Y    =>    -      Y    =>    -      Y    =>    -      Y    =>    -
G    =>    -      G    =>    -      G    =>    -      G    =>    -

CI <-DO-- CI      CI --DO-> CI      CI --IN-> CI      CI <-IN-- CI
R    =>    Y      R    =>    Y      R    =>    -      R    =>    -
Y    =>    Y      Y    =>    Y      Y    =>    -      Y    =>    -
G    =>    G      G    =>    G      G    =>    -      G    =>    -

CI <-AT-> CI      CI <-CT-> CI      CI <-RT-> CI
R    =>    -      R    =>    -      R    =>    -
Y    =>    -      Y    =>    -      Y    =>    -
G    =>    -      G    =>    -      G    =>    -

In summary, only the Depends On/Required For links matter (which is somewhat logical, as this is how ITSM::Core::IncidentLinkType is configured), and their direction does not matter; only the kinds of objects being linked change the behavior.
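
The tables above can be condensed into a single lookup function. This is only a model of the behaviour described in this post, not OTRS code; `basic_effect` and its parameter names are invented for the illustration.

```python
def basic_effect(link_type, source_kind, target_kind, source_state):
    """Effect of source_state on the linked object under the observed rules.

    link_type:   "DO", "IN", "AT", "CT" or "RT" (direction is ignored)
    source_kind: "CI" or "SE" (the object whose state is the cause)
    target_kind: "CI" or "SE" (the object on the consequence side)
    Returns "R", "Y", "G", or "-" when there is no impact.
    """
    if link_type != "DO":
        return "-"            # only Depends On/Required For links propagate
    if source_kind == "SE":
        return "-"            # a Service never affects the objects linked to it
    if target_kind == "SE":
        return source_state   # CI to Service: the state is passed on unchanged
    # CI to CI: an Incident is downgraded to a Warning, other states pass through
    return "Y" if source_state == "R" else source_state


print(basic_effect("DO", "CI", "SE", "R"))  # R
print(basic_effect("DO", "CI", "CI", "R"))  # Y
print(basic_effect("IN", "CI", "CI", "R"))  # -
```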

Basic Propagation Method Analysis:

If the system really works this way and can only work this way, then I have an issue with it. For instance, let's say I have 3 services (SE1, SE2, SE3) delivered respectively by applications A, B & C (CI1, CI2 & CI3). Applications A & B are hosted on 2 distinct virtual servers (CI4 & CI5). The virtual servers are part of an ESX infrastructure (CI6). Application C is hosted on a physical server (CI7). The ESX infrastructure & the physical server are both accessible through our LAN (CI8).

The logical way I would represent this in the CMDB is:

Code: Select all

(ESX  ) CI6 --DO-> CI8 (LAN  )
(Ph Se) CI7 --DO-> CI8 (LAN  )
(ESX  ) CI6 --IN-> CI4 (VM  1)
(ESX  ) CI6 --IN-> CI5 (VM  2)

(Ser 1) SE1 --DO-> CI1 (App A)
(App A) CI1 --DO-> CI4 (VM  1)

(Ser 2) SE2 --DO-> CI2 (App B) 
(App B) CI2 --DO-> CI5 (VM  2)

(Ser 3) SE3 --DO-> CI3 (App C) 
(App C) CI3 --DO-> CI7 (Ph Se)

Now, let's say the service desk receives an incident report about service 1. The service desk guy quickly looks into the incident and sees that application A seems to be down. He then sets the CI1 state to Incident, links it to the ticket and passes the incident ticket on to the application maintenance team. As a result of the service desk action, SE1 and CI1 are now Red and CI4 is Yellow. All the rest is green. So far, I would say I'm rather OK with how it goes.

Then, the application maintenance guy looks into the incident and sees that the application can't run because the server (CI4) is down. He switches the CI4 state to Red, links it to the ticket and moves the ticket to the datacenter operations team. Looking at the various states now, SE1, CI1 and CI4 are Red, all the rest is green. I can live with that, although I would be quite happy to see at this point that the root cause of the incident MIGHT be the ESX infrastructure, as the virtual server is part of it. In my opinion CI6 should therefore be yellow, but it's green today.

Now, the operations team gets the ticket and sees that the cause of the incident is actually a defect in the RAM modules of the ESX infrastructure. They therefore set the CI6 state to Red and link it to the ticket. With this update, SE1, CI1, CI4 and CI6 are Red but all the rest is green. Here I can't be OK with that: if the ESX infrastructure is down, then all virtual servers including CI5 should at least be in Warning (or even better in my view, they should be Red), and therefore, through the CI5->CI2->SE2 link chain, the Service 2 state should be impacted. That's not the case today, as I have used an Includes/Part of link type for the ESX infrastructure.

Let's say I now change my link configuration in the following way:

Code: Select all

(ESX  ) CI6 <-DO-- CI4 (VM  1)
(ESX  ) CI6 <-DO-- CI5 (VM  2)

When the service desk registers the incident at 1st support level, they switch the CI1 state to Red. As a result, everything else is now Yellow. Indeed, we get complex propagation chains such as this one: CI1 --DO-> CI4 --DO-> CI6 --DO-> CI8 <-DO-- CI7 <-DO-- CI3 <-DO-- SE3. I don't see how this information could be relevant for anybody working on incident resolution. Indeed, what are the chances that an Application A failure is due to a LAN issue? Not that high, actually!
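
To make the chain effect concrete, here is a small Python simulation of the basic method as described in this post. It is a sketch under assumptions, not OTRS code: `propagate`, `with_in_links` and `with_do_links` are hypothetical names, Depends On links are treated as undirected, Includes links are simply left out because they do not propagate, and Yellow is assumed to spread transitively from CI to CI (which is what the chain above suggests).

```python
from collections import deque

SEVERITY = {"G": 0, "Y": 1, "R": 2}


def propagate(do_links, manual):
    """Compute all states from DO links (direction ignored) and manually set CI states."""
    neighbours = {}
    for a, b in do_links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    state = {node: "G" for node in neighbours}
    state.update(manual)

    # CI to CI: every CI reachable over DO links from a non-green CI becomes at least Yellow.
    queue = deque(ci for ci, s in manual.items() if s != "G")
    seen = set(queue)
    while queue:
        for nxt in neighbours.get(queue.popleft(), ()):
            if nxt.startswith("CI") and nxt not in seen:
                seen.add(nxt)
                if SEVERITY[state[nxt]] < SEVERITY["Y"]:
                    state[nxt] = "Y"
                queue.append(nxt)

    # CI to Service: a Service takes the worst state among the CIs it is linked to.
    for node in neighbours:
        if node.startswith("SE"):
            linked = [state[n] for n in neighbours[node] if n.startswith("CI")]
            state[node] = max(linked, key=SEVERITY.__getitem__, default="G")
    return state


# DO links shared by both configurations of the example CMDB.
common = [("CI6", "CI8"), ("CI7", "CI8"),
          ("SE1", "CI1"), ("CI1", "CI4"),
          ("SE2", "CI2"), ("CI2", "CI5"),
          ("SE3", "CI3"), ("CI3", "CI7")]
with_in_links = common                                   # VMs attached to ESX via IN (ignored)
with_do_links = common + [("CI4", "CI6"), ("CI5", "CI6")]  # VMs depend on ESX

for name, links in (("IN", with_in_links), ("DO", with_do_links)):
    result = propagate(links, {"CI1": "R"})
    print(name, sorted(result.items()))
```

With the Includes configuration only CI4 turns Yellow next to the Red CI1 and SE1; with the Depends On configuration the Yellow state reaches every other CI and, through them, SE2 and SE3.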

Based on this initial analysis and on the evaluation of other potential use-case scenarios, we have tried to identify what the ideal incident state propagation method would be for us.

Target Propagation Method:

Code: Select all

CI <-DO-- SE      CI --DO-> SE      CI <-AT-> SE      CI <-RT-> SE
R    =>    R      R    =>    -      R    =>    -      R    =>    Y
Y    =>    Y      Y    =>    -      Y    =>    -      Y    =>    Y
G    =>    G      G    =>    -      G    =>    -      G    =>    G

SE <-DO-- CI      SE --DO-> CI      SE <-AT-> CI      SE <-RT-> CI
R    =>    R      R    =>    -      R    =>    -      R    =>    Y
Y    =>    Y      Y    =>    -      Y    =>    -      Y    =>    Y
G    =>    G      G    =>    -      G    =>    -      G    =>    G

CI <-DO-- CI      CI --DO-> CI      CI --IN-> CI      CI <-IN-- CI
R    =>    R      R    =>    -      R    =>    R      R    =>    Y
Y    =>    Y      Y    =>    -      Y    =>    Y      Y    =>    Y
G    =>    G      G    =>    -      G    =>    G      G    =>    G

CI <-AT-> CI      CI <-CT-> CI      CI <-RT-> CI
R    =>    -      R    =>    Y      R    =>    Y
Y    =>    -      Y    =>    Y      Y    =>    Y
G    =>    -      G    =>    G      G    =>    G

Target Propagation Method Analysis:
Let's go back to the initial configuration (with the virtual servers being included in the ESX infrastructure rather than depending on it).

When the service desk tags the application as no longer operational, only service 1 & application A are Red, all the rest is green.

When the application maintenance team spots that the issue is related to the virtual server instance, they move it to state Incident. As a result, the ESX platform gets a Warning state, which propagates all the way up to Service 2 through virtual server 2 and application B. However, the LAN, the physical server, application C and Service 3 remain untouched. That's exactly the behavior I'm expecting.

When the operations team tags the ESX platform Red, all virtual servers turn Red, and this state is propagated up to Service 1 and Service 2, while the whole Service 3 branch down to the LAN remains perfectly fine.
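
The three steps above can be replayed with a small Python model of the target tables. Again, this is a sketch of the wished-for behaviour and not anything OTRS provides: `directed_effects`, `propagate_target` and the `(left, type, right)` link encoding are hypothetical, with "DO" meaning "left depends on right" and "IN" meaning "left includes right". Calculated states are assumed to propagate further, which is what the Service 2 example requires.

```python
SEVERITY = {"G": 0, "Y": 1, "R": 2}


def worst(a, b):
    """Higher-severity state of the two."""
    return a if SEVERITY[a] >= SEVERITY[b] else b


def capped(state, cap):
    """State limited to at most `cap` (e.g. Red capped at Yellow gives Yellow)."""
    return state if SEVERITY[state] <= SEVERITY[cap] else cap


def directed_effects(links):
    """Expand links into (cause, consequence, cap) triples per the target tables."""
    effects = []
    for left, kind, right in links:
        if kind == "DO":
            effects.append((right, left, "R"))  # dependency hits its dependent in full
        elif kind == "IN":
            effects.append((left, right, "R"))  # container to part: full state
            effects.append((right, left, "Y"))  # part to container: Warning at most
        elif kind in ("CT", "RT"):
            effects.append((left, right, "Y"))  # both directions, Warning at most
            effects.append((right, left, "Y"))
        # "AT" links propagate nothing
    return effects


def propagate_target(links, manual):
    """Apply the effects repeatedly until no state changes any more."""
    state = {n: "G" for left, _, right in links for n in (left, right)}
    state.update(manual)
    effects = directed_effects(links)
    changed = True
    while changed:
        changed = False
        for cause, consequence, cap in effects:
            new = worst(state[consequence], capped(state[cause], cap))
            if new != state[consequence]:
                state[consequence] = new
                changed = True
    return state


links = [("CI6", "DO", "CI8"), ("CI7", "DO", "CI8"),
         ("CI6", "IN", "CI4"), ("CI6", "IN", "CI5"),
         ("SE1", "DO", "CI1"), ("CI1", "DO", "CI4"),
         ("SE2", "DO", "CI2"), ("CI2", "DO", "CI5"),
         ("SE3", "DO", "CI3"), ("CI3", "DO", "CI7")]

step1 = propagate_target(links, {"CI1": "R"})
step2 = propagate_target(links, {"CI1": "R", "CI4": "R"})
step3 = propagate_target(links, {"CI1": "R", "CI4": "R", "CI6": "R"})
for step in (step1, step2, step3):
    print(sorted(step.items()))
```

The loop terminates because states only ever get worse and there are only three levels. In all three steps the LAN, the physical server, application C and Service 3 stay Green.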

Well, that was quite a long post, but I think such a complex topic deserves it. So, would anybody know whether what I've written here is completely misguided, or whether it is feasible in OTRS, either out of the box (with some configuration tweaking) or with some code changes as well?

Thanks in advance to any member of the community brave enough to read this post to the end and reply to it.

Best regards,

Cyril

richieri · Post by **richieri** » 28 Nov 2013, 14:39

Thanks for this post!

mehdibelabbas · Post by **mehdibelabbas** » 10 Jan 2014, 12:44

Thanks ccot for your work. I will try to understand your post. I'm currently trying to understand OTRS links and how they behave once I add relationships to my CIs. I don't have a clear idea about it yet.

Post by **crythias** » 10 Jan 2014, 15:18

It would have helped to begin with clearly summarizing a question that can be answered that would address an issue. It's really not fair to read a tome and see no question at the beginning and no question at the end (Essentially, "After reading War & Peace, do you see my problem?").
