In the research team we've recently been discussing ways of measuring malware prevalence. It is not as simple as it may seem!

We started with the assumption that only dispersion across industries matters. If, say, Cisco and Orrick (a law firm headquartered in SF) get infected, we would consider the attack wider-ranging than if it affected, say, Boeing and Airbus (same industry, but locations far apart).

Having only one variable lets us abstract the problem. Imagine we have N balls that we place in 3 urns. When are the balls evenly spread out? Of course, when each urn has roughly N/3 balls. But how do we compare the case where the three urns hold N/6, N/3 and N/2 balls with one in which the spread is 0.4N, 0.4N and 0.2N?

Notice that a naive solution that looks only at the number of non-empty urns is clearly incorrect. The cases (N-2, 1, 1) and (N/3, N/3, N/3) would receive the same score, but for large N this is far from what our intuition tells us.
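To make the problem concrete, here is a minimal Python sketch of the naive score. The function name `non_empty_urns` and the sample counts are made up for illustration.

```python
def non_empty_urns(counts):
    """Naive prevalence score: how many urns contain at least one ball."""
    return sum(1 for c in counts if c > 0)


# Hypothetical example with N = 1000 balls in 3 urns.
concentrated = [998, 1, 1]      # almost everything in one urn
even = [334, 333, 333]          # roughly N/3 in each urn

# Both configurations get the same score, although they are very different.
print(non_empty_urns(concentrated), non_empty_urns(even))
```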

One way to think of it is to **test the hypothesis** that the configuration of the balls comes from a multinomial distribution with equal probabilities 1/k, where k is the number of urns. We could then use the chi-square statistic to test how far a given configuration is from that theoretical model. The statistic would then be our measure: the greater it is, the more concentrated the malware is.
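As a rough sketch of the chi-square idea, the snippet below computes Pearson's statistic, the sum of (observed - expected)^2 / expected, with an expected count of N/k in every urn. The function name and the example counts (N = 120) are assumptions chosen for illustration, and the code only computes the statistic, not a p-value.

```python
def chi_square_uniform(counts):
    """Pearson chi-square statistic against the 'evenly spread' model.

    counts: number of balls (attacks) per urn (industry).
    Returns 0 for a perfectly even spread; larger means more concentrated.
    """
    k = len(counts)
    n = sum(counts)
    if k == 0 or n == 0:
        raise ValueError("need at least one urn and one ball")
    expected = n / k  # expected count per urn under equal probabilities 1/k
    return sum((c - expected) ** 2 for c in counts) / expected


# The two configurations from the urn example, with N = 120.
print(chi_square_uniform([20, 40, 60]))   # N/6, N/3, N/2
print(chi_square_uniform([48, 48, 24]))   # 0.4N, 0.4N, 0.2N
```

One thing to keep in mind: for fixed proportions the statistic grows linearly with N, so comparing samples of different sizes would need some normalisation, for example dividing by N.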

Another is to use the so-called **Herfindahl index**, which anti-trust authorities use to measure the level of monopoly in an industry. In our example of three urns we would calculate the ratios u1/N, u2/N and u3/N, where ui is the number of balls in urn i. The score would then be the sum of the squares of these ratios. It ranges from 1/3 when the balls are spread evenly to 1 when they all sit in one urn.
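A minimal sketch of the Herfindahl index in Python, using the same two urn configurations as before. The function name and the counts (N = 120) are illustrative assumptions.

```python
def herfindahl(counts):
    """Herfindahl index: sum of squared shares u_i / N."""
    n = sum(counts)
    if n == 0:
        raise ValueError("need at least one ball")
    return sum((c / n) ** 2 for c in counts)


# Lower values mean a more even spread (minimum is 1/k for k urns).
print(herfindahl([20, 40, 60]))   # N/6, N/3, N/2
print(herfindahl([48, 48, 24]))   # 0.4N, 0.4N, 0.2N
print(herfindahl([40, 40, 40]))   # evenly spread
```

With this measure the (0.4N, 0.4N, 0.2N) spread scores lower, that is, less concentrated, than (N/6, N/3, N/2).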

Yet another way of approaching this problem would be to use the notion of **entropy**, which is supposed to reflect the "randomness" of a variable. High entropy means that a variable is very random. For example, if we toss a fair coin, each outcome has equal probability. This is as random as it can get! :) Now if the coin were biased and came up heads 90% of the time, the outcome of the toss would be "less random". In our case we can define our random variable to be "the industry in which an attack occurs". The probability of industry k being attacked is P_k = (# attacks in industry k) / (total # attacks). If an APT is prevalent, it would attack all industries in roughly equal proportion. That would correspond to our variable being very random and having high entropy.

The exact formula for the entropy of a discrete random variable (which is our case) is the sum of -P_i*log(P_i), where i is the industry index. In the form given on Wikipedia:

$$H(X) = -\sum_{i=1}^{n} P_i \log_b P_i$$

Here the sum runs over the n industries and b is the base of the logarithm.

This also gives us a nice scale, with entropy being 0 if all attacks are concentrated in one industry and log_b(n) if they are spread equally across all n industries (for logarithm base b).
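Here is a small Python sketch of the entropy measure, assuming we have the number of attacks per industry. The function name, the default base of 2 and the example counts are illustrative choices. Industries with no attacks are skipped, following the usual convention that 0*log(0) counts as 0.

```python
import math


def attack_entropy(counts, base=2):
    """Entropy of 'the industry in which an attack occurs'.

    counts: number of attacks per industry.
    Returns 0 if all attacks hit one industry and log_base(n) if the
    attacks are spread equally over n industries.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("need at least one attack")
    probs = [c / total for c in counts if c > 0]  # skip empty industries
    # -p*log(p) written as p*log(1/p)
    return sum(p * math.log(1 / p, base) for p in probs)


print(attack_entropy([120, 0, 0]))    # all attacks in one industry
print(attack_entropy([20, 40, 60]))   # N/6, N/3, N/2
print(attack_entropy([48, 48, 24]))   # 0.4N, 0.4N, 0.2N
print(attack_entropy([40, 40, 40]))   # evenly spread: log2(3)
```

Dividing the result by log_b(n) would rescale it to the range 0 to 1, which could make scores comparable when the number of industries differs.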

What measure do you think is right?