Thursday, June 14, 2007

Quantifying Security using Metrics

There is a great deal of valuable data to be gained from the penetration test element of an assessment. Knowing whether or not your perimeter, for example, is secure, and validating that knowledge, is important. Looking at the scope of the assessments I have recently been working on, I think most people are looking to gain more than just an initial validation of their existing security controls.

Any good penetration test or vulnerability assessment will deliver a set of results that includes a listing of vulnerabilities, assigns a risk value to each, and recommends remediation measures. Most of these assessments use the following calculations or variants thereof: Risk = Threat x Vulnerability, or Risk = (Threat x Vulnerability) x Impact, and Annual Loss Expectancy = Single Loss Expectancy x Annualized Rate of Occurrence. These calculations, while a great means of assessing risk, are usually applied to a risk management lifecycle that is circular and repeating. This, in my opinion, creates an environment in which risk is not being managed but is really just being identified, fixed, and re-identified. This creates a reactive rather than proactive environment. Also, these calculations rely on values that are hard, if not impossible, to define in all but the simplest of situations or scenarios.
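To make the formulas concrete, here is a minimal sketch of the two calculations above in Python. The input values (threat likelihood, vulnerability score, loss figures) are made up for illustration, which is exactly the problem described: in practice these numbers are hard to pin down.

```python
def risk(threat: float, vulnerability: float, impact: float = 1.0) -> float:
    """Risk = (Threat x Vulnerability) x Impact (Impact defaults to 1)."""
    return threat * vulnerability * impact

def annual_loss_expectancy(sle: float, aro: float) -> float:
    """ALE = Single Loss Expectancy x Annualized Rate of Occurrence."""
    return sle * aro

# Hypothetical example: a breach costing $50,000, expected roughly
# once every four years (ARO = 0.25).
print(annual_loss_expectancy(50_000, 0.25))  # 12500.0
```

The arithmetic is trivial; the hard part, as noted above, is defending the inputs.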

Let's assume that after installing an IDS/IPS at your perimeter, you detected hotbar/spybar spyware traffic, indicating that users on managed stations had local admin rights and were therefore able to install the malware. Using this information you were able to assess your current controls, determine why they had admin rights, and take steps to fix it. While the product proved its worth, this was a reactive response to an issue. Ideally, being able to define a set of metrics by which to measure the security of a host's configuration would allow you to better define, assess, and improve the security controls you have in place. By tracking this data over time you can move towards a more proactive environment.

Using the example of Host Configuration Management I would look to achieve the following:

A Benchmark score for Workstations/Laptops/Servers. - This allows you to standardize configurations and characterize the degree of lockdown applied to the OS.

The percentage of Workstations/Laptops/Servers using the standard build image. - This allows you to measure the conformance of systems in your environment to the standard build.

The percentage of systems in compliance with the standard configuration. - This shows how many systems conform to the standard build requirements regardless of how the system was built (manually, from an image, etc.).

The network services ratio. - This identifies potential ingress points on the hosts. Tracking unnecessary/vulnerable services that should be disabled allows you to determine the number or percentage of systems deviating from your standard build as well as potential ingress points or vulnerable systems. This data can also be applied to your patch management processes in order to prioritize which systems require immediate patching, etc...

The percentage of systems that are remotely managed. - This would identify the systems that can be administered remotely and are subject to patch management and anti-malware controls.

The percentage of critical systems actively being monitored. - This helps identify the extent of uptime and monitoring controls in place.

The number/percentage of systems logging events remotely. - This determines how many systems are forwarding security event data to a central log server.

The number/percentage of systems using an NTP server for time synchronization. - This and the previous two metrics are important for Incident Handling response. When an incident occurs and the event information needs to be accessed, having this data in a central location ensures access to and integrity of those events. Time synchronization is important when reconstructing a sequence of events in the correct order.

The response time to (re)configure a system in an emergency. - This tracks the response time to reconfigure a set of systems in the event of a zero-day attack or incident, ideally broken down by OS, department, and location.
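Most of the metrics above reduce to simple percentages over a host inventory. The sketch below shows one way to compute a few of them; the `Host` fields and the sample data are assumptions for illustration, not the schema of any real management tool.

```python
from dataclasses import dataclass

@dataclass
class Host:
    uses_standard_image: bool   # built from the standard build image
    compliant: bool             # meets standard configuration requirements
    remotely_managed: bool      # reachable by patch/AV management
    logs_remotely: bool         # forwards events to the central log server

def pct(hosts, predicate) -> float:
    """Percentage of hosts for which the predicate holds."""
    return 100.0 * sum(predicate(h) for h in hosts) / len(hosts)

# Made-up inventory of four hosts.
hosts = [
    Host(True,  True,  True,  True),
    Host(True,  False, True,  False),
    Host(False, False, False, False),
    Host(True,  True,  True,  True),
]

print(f"standard image:   {pct(hosts, lambda h: h.uses_standard_image):.0f}%")  # 75%
print(f"compliant:        {pct(hosts, lambda h: h.compliant):.0f}%")            # 50%
print(f"logging remotely: {pct(hosts, lambda h: h.logs_remotely):.0f}%")        # 50%
```

The value comes from tracking these numbers over time, not from any single snapshot.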

Having a set of metrics that are easy to gather, repeatable, expressible as a number or percentage, and relevant to your environment will help with analysis and allow you to become far more proactive.

Quantitative metrics like these can be applied to multiple areas of control, including the results of a penetration test or vulnerability assessment. Some metrics that would have immediate value would be:

Perimeter Security (Anti-virus/spam/malware, Firewalls, IDS/IPS) and Threats/Attacks (Events and Incidents).
Coverage and Control (Vuln/patch management, AV management, Host management). These determine effectiveness and success of your existing security program.
Availability and Reliability (Uptime, recovery, change control).
Application/Web application security.
Penetration Testing/Vulnerability assessments. These can provide valuable data but need to be defined by your environment: identifying and defining issues by department, looking at the difficulty of the exploit (remote, or requiring local access, etc.), and assessing the impact of the vulnerability in terms of your existing security controls (defense in depth).

These are all predominantly technical in nature, but the same methodology could be applied to assessing user awareness and compliance. Regardless of what you decide to have assessed, gaining valuable and repeatable metrics from the results should be the outcome.

A great read on Security Metrics, and where most of the above content is from, is Andrew Jaquith's book Security Metrics. It's an excellent read and is extremely relevant in today's maturing security environments.

dean de beer


NoticeBored said...

Hi Dean. I support the need for security metrics to help manage information security proactively (addressing risks before they materialize) as well as reacting to incidents (improving controls after the fact). However, I don't really understand how the metrics you suggested will help me be more proactive: what do they tell me to do? I'd also quarrel with your assertion that they are simple to measure - if you think through what you would actually need to do in practice to measure those metrics, and to re-measure them periodically, you'll soon see the costs and difficulties mounting.

That said, I'm interested in the metrics book you referenced. Can you tell us more about it, please?

Kind regards,

dean de beer said...

Hi Gary,

Thanks for the response. Good questions too.

I don’t disagree that metrics can be difficult to gather and measure, especially if you don’t have the existing controls or infrastructure in place. When I say that metrics should be simple to measure, perhaps a better word would be “gather”, although a good metric should be easy to measure too. Nor do I dispute that the gathering, correlation, and interpretation of good metrics take time and commitment, but as with any worthwhile endeavor, the initial cost to implement will be far outweighed by the improvement to the controls being measured. Today most security programs are maturing to the point of having this data available in some form, even if it is not actively being gathered. Ideally this data collection should be automated. The ease of gathering the data will affect how regularly it is used to generate a result, as will whether the creation of the metric can be automated or requires spreadsheet manipulation, interviews, policy review, etc.

Gathering of the raw data aside, I think the metrics I suggested for Host Configuration Management are not that difficult to collect. There are numerous PC management/tracking tools that can provide this sort of data, though of course this depends on the existing infrastructure and controls in place. In order to gather enough meaningful data, the coverage of the security control being measured needs to be defined: how effectively does a particular control cover the resources in question? The greater the coverage, the more data available for use. Also, what level of control does the organization have over the application and management of the security controls being applied?

Without good coverage and control an organization cannot be expected to generate valuable metrics, let alone generate or collect the data easily.

Since a lot of smaller enterprise environments might not have a standard image that they push out, but might instead rely on other controls in their defense-in-depth security program, let's look at Patch and Vulnerability Management and how metrics can be used to proactively improve your security posture. I'll try to relate this to a vulnerability assessment/pen test too, since that is what the blog originally discussed.

The use of some sort of patch management software is becoming commonplace (WSUS, etc.), and it is an easy place to gather data for analyzing the effectiveness of your patch management program. That, together with your vulnerability scanner, can provide additional data for correlation.

A good metric might be the following:
% of systems exploitable using a known vulnerability (that has a patch available) – this data could be gathered either from a comprehensive pen test or a vulnerability scan, and then compared against your existing patch management procedures and policies. The metric can be further refined by secondary variables such as device type (server, workstation, etc.) and/or department. By using this data and looking at how much patch coverage you achieve in one, two, or five days, you can begin to assess the effectiveness of your patching program and where improvements, if any, need to be made. This data could be graphed in many ways, such as a “You Are Here” graphic. That type of graph requires a baseline; in this case the baseline would be the overall installed patch level/coverage. The idea here is to improve the patching process to the point where Time to Patch < Time to Exploit (or time to release of an exploit).
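The coverage-over-time idea above can be sketched as a small calculation. The per-host patch delays below are invented; in practice they would come from a patch-management tool's export, and the baseline would be your real installed-patch coverage.

```python
# Days after the patch release at which each host was patched
# (None = still unpatched). Made-up data for illustration.
days_to_patch = [0, 1, 1, 2, 5, None, None]

def coverage_after(days: int) -> float:
    """Percentage of hosts patched within the given number of days."""
    patched = sum(1 for d in days_to_patch if d is not None and d <= days)
    return 100.0 * patched / len(days_to_patch)

print(f"coverage after 1 day:  {coverage_after(1):.0f}%")  # 43%
print(f"coverage after 5 days: {coverage_after(5):.0f}%")  # 71%
print(f"still exploitable:     {100 - coverage_after(5):.0f}%")  # 29%
```

Plotting `coverage_after` against days elapsed gives the "how fast do we patch" curve that the Time-to-Patch comparison depends on.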

I hope that helps somewhat. Let me know if I’m way off base here.

Here is a link to the book on Amazon and a link to Andrew’s website.