All time cloud
I will stop writing about operational failures only; this blog may go on to cover "hot" topics within IT operations. It will still stay focused on operations, as I am an ops guy.
There is a good reason why I stopped writing these last weeks: I have been wondering where cloud computing is heading!
We do quite a lot of different cloud projects, and right now it seems that either there is no room left to deal with clouds or, on the other hand, there is still a lack of experience out there on all sides of the business. This is just a short draft of my ongoing thoughts; discussion welcome!
Cloud topics on the business side:
Why do we still believe that it is as easy as written in the prospectus? Haven't we learned from all the functionality promised to us before? Yes, there is high potential to get things done and delivered in a smarter and more cost-sensitive way, but at what risk and cost? And what does operations look like afterwards?
Cloud topics on the IT side:
Clouds are nothing we can pass by. Clouds have to be worked with; IT has to understand the pros and cons of clouds and how to live with them for the next decade. Clouds are neither friends nor enemies; they are a new way of delivering services to customers, more service-based than ever before. Clouds are not VMware, and they are not Xen or KVM; clouds are a business case, so IT has to understand the business and its methodology, otherwise it will deliver virtualization. Not bad at all, but only a few percent of cloud power.
As IT, I would strongly recommend not putting too much pressure on compliance, legal and data security. Several organisations are already covering those topics, and it is at most a question of weeks or months until they are fully resolved. Secondly, there are already SAS 70-ready solutions out there, and other standards are met too; if you cover that topic it is fine, because it is a risk, but nothing more. Using compliance as IT's argument against the cloud will mark you as the one merely securing your own desk …
Cloud topics on the operations side:
Clouds mean no longer being the prime operations partner. To be honest, when you think about all the ever-growing complexity, clouds can even help you reduce YOUR complexity and get things done. Yes, the number of systems will potentially go down, a lot will be delivered out of the cloud, and partly you will act as a cloud offering yourself. BUT this is good news: you can transform yourself from a 24/7 operator into a platform architect handling tons and tons of different systems without dealing with the day-to-day problems; those are handled within the cloud by others 🙂
On Demand FTEs
You definitely know that story: you come in on a smooth Monday morning, and the first thing you hear about is a lack of resources. But why? How could it happen that you ran out of resources over the weekend? Private accidents? One of your technicians fell in love with a girl and will not come back? Or is it just the simple fact that project no. 30 has kicked off? (and indeed 10 of them are prio 1 :-))
So how do you plan for scenarios like that? Whenever you build a new resource (or hire one, maybe the better word), keep in mind to build up internally only those resources you will need in the long term. So ask the following:
- Is the resource for day-to-day operations?
- Is the FTE still needed after the successful close of project X and the deprovisioning of the current platform?
- How do you make sure the knowledge has been taken over once the external FTE is gone?
- Is it easier/faster to get that special skill/know-how via a service, or are we really talking about an FTE?
- Is it in or out of budget, and how do you handle that?
- How fast will you get that person? Is it within the demand time?
- Can you upgrade internally?
I want to point out one aspect, the knowledge transfer: whenever you bring in an external, you should be able to manage him more or less the same way you manage your employees. An external is only as good as the direction, empathy and loyalty he or she gets, even if you need him/her for a longer time.
Second point: keep in mind that the external will leave the company after day X. So prepare your organization to take over the know-how and transfer the experience within an appropriate time. Otherwise you will not be able to successfully deprovision the external party. A lack of expertise and know-how after the external has left will fall directly back on you as the internally responsible person.
I tend to say: put a minimum of 50 % of all your externals onto the existing tasks, and support your internal FTEs in getting up to speed on the new technology. After the switch, only 50 % of your current staff has to be "upgraded". This can be done by your already transformed employees, so both operations and the new fancy stuff are well organized and understood by your own people.
I know that it is quite hard to bring in externals under pressure, on the one hand getting them up to speed on the old (to-be-replaced) services, and on the other hand using the same employees who train/support the externals to bring a new project up and running. Nevertheless, you will reach the goal. Maybe not the first time, maybe not even the second or third time, but your employees will learn to live with permanent transformation, and if they understand the message, the next project will survive, and knowledge transfer as a core rule for bringing in "on demand FTEs" will be accepted.
Ops is only in focus if it fails
You know that story: you work as well as you can, and you will never get that funky stuff until your apps get public attention. So how do you mostly get attention? By failure, as nearly all major operations projects only affect people (in a bad way) when something goes wrong. (A user will never see any change in his day-to-day life from getting a new IP address, but what happens if exactly that IP isn't working …? Just a silly change of an address with potentially fatal attraction.)
So is it good or bad to only get money on demand? How does the business think? Definitely, you, as an IT leader, have plenty of options to intervene, but mainly two:
- accept and push your budget
- start building awareness via BCM or quality improvement programmes
The first one is the more reactive: you ask for more money, and you don't get it. If anything fails, you always have the option to push back and say something like "I asked you for more money to mitigate those risks, but I did not get it …" So your direct head is responsible for your personal fiasco …
The second is much more aggressive. You try to offer solutions for potential scenarios, you interact directly with all related departments, and they will start putting pressure on the CEO to get money to mitigate/diminish their business risks (you hear: it is no longer an IT risk, it has changed into a business risk). You drive the BCM initiative and get the money via the involved departments. Everybody is happy, but:
- BCM never stops; don't forget to reserve money/resources in your budget for the upcoming years
- after a BCM initiative is before a BCM initiative; risks and the business change, and you should discuss those changes in an open-minded environment
- a failure after such an initiative could potentially cost you your job, so make sure that you deliver what you promised!
There will always be some risks you can never mitigate that easily, but most of them should fit into a BCM initiative.
So how to start? Talk to the departments, understand their demands, get their OK for BCM and start your BCM lobbying parties … After a while, give a good presentation at C-level and explain how, when, and what happens if BCM does not start. Sounds easy, is easy, but it will need you and the awareness of all involved departments. It is not a fight of IT versus the board; it is the "we make it better" party!
But what to do if it still fails? An open dialogue, fresh information about your plan to react, and a fast closing project after you have installed some workarounds are the only way to win back respect and trust.
Threaten Business
I still wonder why so many IT people argue that they talk to the business, involve the business and respond to the business, but if you ask them how and how often, you get no concrete answer.
I recently got a query from a colleague asking me why IT isn't interested in getting the business directly involved via a project steering board. The IT directors are part of an internal change project aiming at better response times, lean IT and massive cost reduction. Those are all business-related values, and if the business is neither involved in nor partnering on the project, nobody will ever know what happened to it!
The road to a good mutual understanding is long and stony. Both sides have to learn to talk to each other and gain an understanding of each other's situation, attitudes and behaviour. The goal is to merge the business culture and the IT culture into one corporate culture. Sounds easy? It isn't: potentially tons of arguments will stress, disrupt or even destroy that idea, and plenty of (personal) interests will have to be aligned with the business idea itself. IT has to understand that the business should be a partner and sponsor; the business has to understand that it needs a strong IT.
So stop living in separate worlds within the same building!
Deliver and own services
Most technicians seem to have an incredible understanding of service delivery. For them it means that they own and control the whole delivery chain, starting with each stored bit and byte and moving on to the databases, the apps, the network, the associated (and hopefully existing) security, the frontend, the user training and so on … and if possible, please forget documentation, we know what we do 🙂
But the world changes; even now we are taking another step forward into a truly service-oriented, clouded environment, and the more you think about clouds, the more you have to dematerialize service delivery. It is not a bunch of servers connected to a bunch of network devices and secured by a bunch of security appliances that creates the service; the service is much more, and the goal of modern IT should not be to deliver hardware-related stuff to non-IT staff. For them, IT does not matter (btw, thanks for the great book, Nicholas Carr); they just want to use it. And non-IT people think differently: they think, as we tend to say, emotionally rather than rationally about IT; either it works appropriately and the service desk is OK, or it is not sufficiently delivered. And they think in terms of economics.
An IT service delivered to non-IT people should be competitive in terms of service and pricing, and it should be interoperable and portable. As we know, lots of offerings out there try to do so, and a deeper look into them reveals incredible stuff …
And what happens now? All the cloud offerings (IaaS, PaaS, SaaS, internal and external, private, enterprise or public), in short the clouds, deliver new, innovative services with much more speed, power, resources and economies of scale.
Why should I continue maintaining my own hardware, software … if I am doing commodity stuff? It will be more expensive, more to integrate, more to maintain … so my resources are secured, but nobody knows for how long.
Right now, there are only a few real reasons not to go to a cloud:
- cloud-to-cloud data exchange still lacks truly interoperable and usable security
- the right size: once you have reached a size where you benefit from the top discounts yourself, the price gap will close
- real time: if you need real time, you will (for now) have to build it yourself
- compliance: potentially, especially in the financial industry, you will not be allowed to move your user data out of the country where it is processed
So the goal or mission of the IT of the (near) future will be to aggregate the service delivery that is spread across the world. IT will take care of
- interoperability
- portability
- first-level support
- a combined service catalogue
- economics
If so, what should you change today? Yes, maybe ITIL, but ITIL should be no more than it is: a good practice. Use what is usable for you, but think about your service definition and how to get it deep enough into the organization that you know which IT is needed and why. Acting as an account manager and understanding your own company as the key customer would potentially help. Leave IT behind, think in solutions and services, and deliver them on time and with appropriate objectives (nobody asks who delivers, so it may look like you even if it comes from anywhere else ….)
So please start thinking about the tremendous change that will happen in the near future, and don't repeat the standard ops take of the last 20 years: you can't deliver and own all services in an ever more complex IT world.
SLA Mania
Even if processes do not exist, or exist only in fragments, people start writing weird SLAs with implicit and explicit definitions of services, without any service catalogue or portfolio, or even without any understanding of the difference between infrastructure, an app and a service …
What happens then? They start forming opinions like "Hey, we have an SLA framework …" or "Hey, we have a service catalogue …". Asking them how services are defined, built, maintained, monitored and reported results in answers like "this is still done on a manual, on-demand basis …" or "… we are working on that topic …" or "hey dude, we still have a long way to go, keep on waiting …"
To be clear, thinking about SLAs is quite important for the organization, and defining a structured approach (aka an SLA framework) is a really good way to go, but:
- if your service support is still foggy, a major part of the SLA will be foggy too
- if your ownerships are not defined, a major part of the SLA will be undefined
- if you do not have any process defining what an offered service is and why, you will end up defining either the customer's wishlist or a very technical perspective
We could list 100 or more additional complaints, but that is not our goal. The goal should be to answer the question: what is a better, more appropriate way?
I would suggest the following:
- Know your Service Support processes
- Get a service responsible/representative in
- Know what you will be able to monitor and report
- Define your service levels internally; know what you are able to deliver (aka OLA)
- If possible, show numbers; economic values usually help when talking to your (even internal) customer
- Only define achievable goals/metrics
- Only define countable goals/metrics; at best they should be monitored and reported automatically by systems/machines (see the sketch after this list)
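To make "countable" concrete, here is a minimal sketch of what an automatically computed SLA metric could look like. The names and the 99.5 % target (`Sample`, `availability`, `sla_report`) are illustrative assumptions, not part of any real SLA framework; it only assumes that health-check results are already collected as timestamped up/down samples.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    timestamp: int   # Unix epoch seconds of the health check
    up: bool         # result of the check: service reachable or not

def availability(samples: list[Sample]) -> float:
    """Fraction of checks in which the service was up."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if s.up) / len(samples)

def sla_report(samples: list[Sample], target: float = 0.995) -> str:
    """Render a one-line SLA report: measured value vs. agreed target."""
    measured = availability(samples)
    status = "MET" if measured >= target else "BREACHED"
    return f"availability {measured:.3%} (target {target:.1%}) -> {status}"

# Example: three checks, one failure
checks = [Sample(1, True), Sample(2, False), Sample(3, True)]
print(sla_report(checks))   # availability 66.667% (target 99.5%) -> BREACHED
```

A metric like this can be produced by a machine after every monitoring run; that is exactly what makes it countable and reportable without manual effort.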
Start by defining the most valuable (business) processes for your customer, and always think in terms of your customer and his potential end-to-end view. And, maybe the essential point, communicate the SLA to all engineers who are needed to run the service. Defining is step 1, running is step 2, and running is king!
You can use frameworks like ITIL, COBIT and others to understand what you should define within an SLA, and what the overall process can look like (how to arrive at a service portfolio, a catalogue and the associated SLAs), but potentially this will be far too much as a first step, and business pressure will not allow you to run a one-year project just for a bunch of SLAs.
So keep in mind: go as defined as possible, know what you can deliver and how to measure it, agree internally first, know your support processes and escalations, and only then go to your customer. Make sure that your customer is familiar with those processes too, and start defining the mutual understanding of the service (the SLA). In the end, communicate the results to your IT department and start implementing. And don't forget to monitor not only the values but the SLA itself too … (is it still valid, useful, anything to define better …)
All the best with your SLAs!
Test is evil
It does not matter whether we talk about testing software, hardware, configuration, all of them, or processes. You will either lack the time, the resources, the money or the will to do it. It is interesting to see that the more people talk about why they cannot test, the more time they waste talking about it.
And QA/testing is also a task for the management team. If you don't live QA and testing, why should your employees do so? If you don't give them time to do all the housekeeping and keep the code clean and tidy, why should they use their extra time to do it?
Two things on testing. First: do it whenever you change anything relevant for you, your IT, your department, your business! Second: if you test, think about how and why. It makes less sense to test the resilience of a known working cluster than to test the new code and how the business processes are built into it.
This leads me to the last point, the acceptance criteria. When you start a project, think about what the desired goal should look like. I am not talking about the GUI, I am talking about the functionality. At the end, test whether the functionality is met or not. It is very often really hard to find the right numbers and values for acceptance criteria, but the longer you run your QA, the better you will get. A small sketch of what such an executable criterion could look like follows below.
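Purely as an illustration: acceptance criteria work best when they are written down as executable checks. The names and thresholds below (`place_order`, the 2-second response limit) are made up for the example; your real criteria come out of the project goals, not from this sketch.

```python
import time

# Hypothetical system under test: replace with your real entry point.
def place_order(items: list[str]) -> dict:
    return {"status": "confirmed", "items": items}

def test_order_is_confirmed():
    """Functional criterion: an order for valid items must be confirmed."""
    result = place_order(["book"])
    assert result["status"] == "confirmed"

def test_order_is_fast_enough():
    """Non-functional criterion: the call must return within 2 seconds."""
    start = time.monotonic()
    place_order(["book"])
    assert time.monotonic() - start < 2.0

if __name__ == "__main__":
    test_order_is_confirmed()
    test_order_is_fast_enough()
    print("all acceptance criteria met")
```

Checks like these can run after every change, which is exactly what turns acceptance criteria from a document into a gate.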
QA consists of much more: an adequate test environment (near-live), systems and routines, unit testing, functional testing, integration testing, load testing and so on. But if you don't start, you will never get to the point of thinking about which test you need, when, and why.
Continual improvement is the key to successful QA!
Seniority
Yes, you know you will need them. You will need those crazy guys talking about systems, services, processes and tasks you have potentially never heard of before. And yes, they will let you know who the know-how owner is. And yes, you will need them because:
- your know-how can only be scaled up if you have people who have already transformed an idea into a working business.
- your know-how can only be pushed to the next level if you have people in place who have already seen the next level.
- there are compliance demands out there, and you need people who know how to work with them.
- and you will need technical, organizational and even cultural guidance.
This should never be an anthem to old employees who merely worked hard in the past. Seniority means that those people have seen different company cultures and different levels of expertise, and they know how to transform. Potentially 3 years are enough; maybe it takes 20 or more, and maybe they never reach seniority level. It really depends on the person: their personality, attitude, behaviour and culture (the famous ABC).
You will not need hundreds of seniors, maybe one or two, but you will have to enable them to transfer their knowledge to the teams and to you.
So keep in mind: even if you are the founder, funder, owner or whatever CxO, you will need to have an ear for them, their ideas and their expertise. If you can enable them, they will enable your organization based on the values and objectives you have shown them to be important for your business model.
Not having seniors means reinventing the wheel one more time and losing time, resources, nerves and money. Young rebels are good; focused, your rebels make the business.
Our own datacenter is the best
Potentially not, but there is still a parallel reality out there telling IT operators or facility managers that owning their own datacenter is much cheaper, cooler, greener, leaner and whatever else, even if they just want to run 50 servers.
To be honest, it could make sense, depending on your location, some geographic factors and your growth. But even Google did not start by building their own datacenters. They rented first, and now, step by step, they are migrating to their own ones, because they have reached a size at which owning becomes interesting!
And if you think about your own datacenter, think about:
- who operates the facility?
- who takes care of the UPS, the diesel generators …?
- who takes care of getting all the licensing done?
- who cleans the filters?
- who is responsible for the CCU & friends?
- who is the cabling expert?
- who is the power expert?
- who is responsible for rack planning and provisioning …?
And all of that in a twenty-four-seven environment …? Do you really want to be the facility guy PLUS the ops leader? Can you combine those two roles and run them in a professional manner? Will you, as an IT ops leader, get headcount for an electrical expert? Tons of questions, and nearly all of them, especially if you think about the associated risks, should lead to the decision NOT to run your own DC before you have reached critical size.
The next interesting topic, which always comes up: if we do not yet have the critical size, why don't we do some shared hosting too? Because you do not have the skillset to do this? Because you are an internal service unit and are not set up to offer market prices on the external market? Because your SLAs are not that strong? Not really; potentially no. The main answer is: because it is not your business!
So when do you reach the critical size? It does not depend on the number of systems; it depends on strategic and economic questions:
- is running your own datacenters a potential USP, and can you run them cheaper than, or on par with, the market?
- is the sum of the volume discounts smaller than the savings from your own DC? (the economic value; see the sketch below)
- is extra flexibility needed? (take care: weigh flexibility against price, and flexibility against coolness)
The how and when will vary, and potentially we will have to work on a new definition of critical size with regard to cloud computing, the price models and the new datacenters (generation IV), which should reduce costs too.
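To illustrate the economic question only, here is a minimal break-even sketch. Every number and name in it (`own_dc_monthly_cost`, `rented_monthly_cost`, the fixed costs, the discount tiers) is a made-up assumption; plug in your own figures before drawing any conclusion.

```python
def own_dc_monthly_cost(servers: int) -> float:
    """Rough own-DC model: fixed facility cost plus per-server cost.
    All figures are illustrative only."""
    fixed = 40_000.0        # facility, staff, UPS/diesel maintenance, licences
    per_server = 120.0      # power, cooling and rack share per server
    return fixed + servers * per_server

def rented_monthly_cost(servers: int) -> float:
    """Rough rental model with volume discounts at illustrative tiers."""
    list_price = 300.0
    if servers >= 500:
        discount = 0.40
    elif servers >= 100:
        discount = 0.25
    else:
        discount = 0.0
    return servers * list_price * (1 - discount)

# Find the (illustrative) critical size where owning becomes cheaper.
for n in (50, 100, 250, 500, 1000):
    own, rent = own_dc_monthly_cost(n), rented_monthly_cost(n)
    cheaper = "own DC" if own < rent else "renting"
    print(f"{n:5d} servers: own {own:9.0f} vs rent {rent:9.0f} -> {cheaper}")
```

With these made-up numbers, owning only wins near 1000 servers; the point is not the figures but that the critical size falls out of a calculation, not out of coolness.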
Borders for prod. environment
A good developer will potentially never be a good operator, and vice versa. But there is a grey area of work in between, mostly in the transition from the development to the production environment: technically speaking, the test, validation and pre/near-live environments.
The question is: who is responsible for those environments, why, and at what level?
Traditionally, development feels capable of, or responsible for, that equipment, arguing that it was their effort that got it running. I would bring in another party/role named Quality Assurance (QA). They should run at minimum the test environment and the associated tests (functional, integration, scenarios, plans, procedures, acceptance criteria ….)
Next is the load test, which is the gate to the production environment. As it is the last border, it is top priority for operations to be responsible for that system and that test. If it passes, it goes live. Who runs the live system? Ops, so they must do their load test job too. Otherwise software will go live that they never had a chance to put an eye on before.
And, don't forget, being responsible for it does not mean doing the whole job; the load test itself can potentially be run by QA, advised by operations. But establishing the near-live environment, evaluating the results and signing off should always be done by operations: not QA, not dev.
Another reason why ops should do so: if you run your load test on a near-live environment, you get a bunch of numbers showing the overall requests per system as a benchmark for capacity planning of the live environment. As we know, those numbers change release by release (more functions coming in, complexity decreasing/increasing …). So not being responsible for those tests, and letting dev handle them, means not being proactive in terms of environment capacity. As a result, you will put reactive pressure on the live environment all the time, binding resources and not fulfilling the goal of a good operations framework. A small sketch of such a release-by-release benchmark follows below.
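As a sketch of that benchmark idea: compare the load-test throughput of each release against the previous one and flag capacity-relevant changes. The data layout and the 10 % threshold are illustrative assumptions, not a fixed rule.

```python
# Illustrative load-test results: peak requests/second per system, per release.
results = {
    "1.0": {"web": 1200, "api": 800, "db": 400},
    "1.1": {"web": 1150, "api": 950, "db": 430},
    "1.2": {"web": 1100, "api": 700, "db": 455},
}

def capacity_deltas(old: dict, new: dict, threshold: float = 0.10):
    """Yield systems whose sustained throughput changed by more than `threshold`."""
    for system, old_rps in old.items():
        new_rps = new.get(system, 0)
        change = (new_rps - old_rps) / old_rps
        if abs(change) > threshold:
            yield system, change

releases = sorted(results)
for prev, curr in zip(releases, releases[1:]):
    for system, change in capacity_deltas(results[prev], results[curr]):
        print(f"{prev} -> {curr}: {system} throughput changed {change:+.0%}, "
              f"re-check the live capacity plan")
```

Tracked like this, ops sees before go-live that (in this invented data) the api tier lost a quarter of its headroom between 1.1 and 1.2, and can adjust live capacity proactively instead of reactively.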
