Ensuring High Availability in our OTT Video Platform

September 14, 2013

Web service failure is an unavoidable part of the Internet world. Any Internet-based platform that is not designed to withstand failures will eventually fail. As an OTT platform provider, this applies to us as well. So when it came to designing our platform, the architects included availability as a key design goal.

While building robust services with sound development practices and adding redundancy for each service was a clear engineering directive, feedback on service availability and performance was deemed equally important.

We had a choice between adopting an existing third-party service-monitoring framework and building our own. A quick analysis indicated that the flexibility gained by building our own infrastructure was worth the effort. We separated the problem into two sub-tasks: 1) monitoring the servers and 2) monitoring the individual services.

The server monitoring framework monitors the raw parameters of each server in our infrastructure. It gives us fine-grained information about server parameters such as CPU, memory, I/O, and processes, and it feeds a rule engine that generates alerts for our operators.
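To make the rule-engine idea concrete, here is a minimal sketch in Python. It is not our production code: the metric names (cpu_percent, mem_free_mb, io_wait_percent), the thresholds, and the host name are all assumptions chosen for illustration. Each rule is simply a named predicate over one sample of server parameters, and the engine returns one alert for every rule that matches.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# One sample of raw server parameters. The metric names are hypothetical.
Sample = Dict[str, float]


@dataclass
class Rule:
    name: str
    predicate: Callable[[Sample], bool]  # True when the rule should fire
    message: str


def evaluate(rules: List[Rule], host: str, sample: Sample) -> List[str]:
    """Return one alert string per rule whose predicate matches the sample."""
    alerts = []
    for rule in rules:
        if rule.predicate(sample):
            alerts.append(f"[{host}] {rule.name}: {rule.message}")
    return alerts


# Illustrative thresholds only; real values depend on the workload.
# A missing metric falls back to a default that never triggers the rule.
RULES = [
    Rule("high_cpu",
         lambda s: s.get("cpu_percent", 0.0) > 90.0,
         "CPU above 90%"),
    Rule("low_memory",
         lambda s: s.get("mem_free_mb", float("inf")) < 256.0,
         "free memory below 256 MB"),
    Rule("io_wait",
         lambda s: s.get("io_wait_percent", 0.0) > 30.0,
         "I/O wait above 30%"),
]

if __name__ == "__main__":
    sample = {"cpu_percent": 95.0, "mem_free_mb": 1024.0, "io_wait_percent": 5.0}
    for alert in evaluate(RULES, "web-01", sample):
        print(alert)
```

Keeping rules as data rather than hard-coded checks means operators can add or tune a rule without touching the collection code.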

Service monitoring involves checking the health of each individual service. Thanks to its plugin-based architecture, this system can monitor arbitrary types of services (HTTP, FTP, custom protocols). Our services are themselves instrumented with custom monitor responses, which let us collect fine-grained performance data from them. So we are not only able to detect unavailable or erroneous services; the system can also generate early warnings based on reported service parameters. Our early experience indicates that even simple early warnings, such as increased service latency, are extremely useful in ensuring high availability.
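The plugin idea and the latency early warning can be sketched as follows. Again, this is an illustration under stated assumptions rather than our actual implementation: the names ServiceCheck, CheckResult, register, and classify are hypothetical, the "fake" plugin stands in for a real HTTP or FTP probe so the example needs no network, and the rule "warn when latency exceeds twice the recent average" is just one simple choice of early-warning criterion.

```python
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Type


@dataclass
class CheckResult:
    ok: bool            # did the service answer correctly?
    latency_ms: float   # how long the probe took
    detail: str = ""


class ServiceCheck(ABC):
    """Base class for protocol plugins (hypothetical interface)."""

    @abstractmethod
    def probe(self, target: str) -> CheckResult:
        ...


# Registry mapping a protocol name to its plugin class.
PLUGINS: Dict[str, Type[ServiceCheck]] = {}


def register(protocol: str):
    """Class decorator that adds a plugin to the registry."""
    def decorator(cls: Type[ServiceCheck]) -> Type[ServiceCheck]:
        PLUGINS[protocol] = cls
        return cls
    return decorator


@register("fake")
class FakeCheck(ServiceCheck):
    """Stand-in plugin; a real one would issue an HTTP/FTP/custom request."""

    def probe(self, target: str) -> CheckResult:
        start = time.perf_counter()
        # A real plugin would perform its protocol-specific request here.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        return CheckResult(ok=True, latency_ms=elapsed_ms, detail=f"probed {target}")


def classify(history_ms: List[float], latest: CheckResult,
             warn_factor: float = 2.0) -> str:
    """Return "DOWN", "WARN", or "OK" for the latest probe.

    "WARN" is the early warning: the service still answers, but its latency
    exceeds warn_factor times the average of the recent history.
    """
    if not latest.ok:
        return "DOWN"
    if history_ms:
        baseline = sum(history_ms) / len(history_ms)
        if latest.latency_ms > warn_factor * baseline:
            return "WARN"
    return "OK"


if __name__ == "__main__":
    result = PLUGINS["fake"]().probe("catalog-service")
    print(classify([100.0, 110.0, 90.0], result))
```

Supporting a new protocol then amounts to writing one more class with a probe method and registering it; the classification logic stays the same for every plugin.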

With the impressive availability improvement enabled by our monitoring and early-warning system, every new server or service launched on our platform now comes with an embedded monitoring component. Stay tuned for more insights!