Devising ML Metrics

Mar 1, 2024
Dan Hendrycks
Thomas Woodside

Metrics drive the ML field. As such, if we want to influence the field or popularize new subfields, we must define the metrics that correlate with progress on the problems we care about. Formalizing these metrics into benchmarks will be crucial to capturing the attention of researchers and driving progress.

Building good benchmarks is difficult, in large part because benchmarks exhibit many of the properties that produce power law outcomes. First, a benchmark’s impact depends on many factors that combine multiplicatively. If even one facet of the benchmark is deficient (e.g. ease of use, cost to evaluate, connection to a real problem, tractability, resistance to gaming, or the feasibility of developing new methods that improve the state of the art), it may entirely prevent the benchmark from having impact. Second, benchmarks face strong preferential attachment dynamics: the most used benchmarks are the most likely to be used further. Finally, benchmarks are inherently located on the edge of chaos, since their design requires the researcher to concretize a nebulous notion into a single number.

Properties of good benchmarks

Clear Evaluation

Perhaps the most important quality of a benchmark is having clear evaluation. To do this, the following properties are useful:

  • Use a large enough dataset that it’s possible to reliably detect a 1% improvement. The more difficult it is to detect small amounts of progress, the more difficult it becomes to make iterative progress on a benchmark. In practice, this usually means having at least a few thousand test set examples.
  • Use a standard metric for the field. For instance, if a benchmark uses cross entropy loss rather than accuracy, it needs to have a very good reason for doing so; otherwise it will be seen as nonstandard and will be less likely to be used. There are good reasons for not using cross entropy for most tasks (e.g., “what does a cross entropy loss of 1.8 mean?”), and creating benchmarks requires recognizing that there is substantial implicit wisdom in precedents. Non-imitative tendencies (and, similarly, overconfidence in the reach of one’s own intellect) markedly increase the probability of an unsuccessful benchmark.
  • Include clear floors and ceilings for the benchmark. Many benchmarks use accuracy, mainly because accuracy has a clear floor and ceiling. If a metric does not have a clear ceiling, researchers will be wary of working on it for fear that performance has already reached the ceiling. Lastly, if applicable, it’s useful to include human performance as well as random performance, as this helps orient people trying to understand the benchmark. Contributions towards the benchmark need to be immediately communicable to those who aren’t familiar with it.
  • A benchmark should be objective and comparable across time. For this reason, it’s counterproductive to update a benchmark frequently. Benchmarks should be designed such that they do not need to be updated.
  • Benchmark performance should be described with as few numbers as possible. A benchmark might be divided into many subcategories that are qualitatively different, and some researchers instinctively resist averaging performance across them into an overall “average” metric. Resisting is a very bad idea, because having a single number makes it much more likely that the benchmark will actually get used. In some cases, researchers do not want to report eight different numbers and would rather report just one. A person’s short-term memory can’t store a dozen subtly different numbers.
  • It’s nice if it’s possible for an ML model to do better than expert performance on the benchmark. For instance, humans aren’t that good at quickly detecting novel biological phenomena (they must be aware of many thousands of species), so a future model’s performance can exceed human performance. This makes for a more interesting task.
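
To make the first point above concrete, here is a rough sketch of how test set size relates to the ability to detect a 1% improvement. It assumes accuracy is the metric, that test examples are independent, and that a normal approximation to the binomial is adequate. The function names are illustrative and not from any particular library.

```python
import math


def accuracy_standard_error(accuracy: float, n_examples: int) -> float:
    """Standard error of an accuracy estimate under a binomial model."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


def min_examples_for_margin(margin: float, accuracy: float = 0.5, z: float = 1.96) -> int:
    """Smallest test set size whose z-scaled standard error is at most `margin`.

    z = 1.96 corresponds to an approximate 95% confidence interval.
    accuracy = 0.5 is the worst case, since it maximizes the variance.
    """
    return math.ceil(z ** 2 * accuracy * (1.0 - accuracy) / margin ** 2)


# Assumed scenario: we want the interval half-width to be at most one point.
for acc in (0.5, 0.9):
    n = min_examples_for_margin(margin=0.01, accuracy=acc)
    print(f"accuracy={acc:.2f}: about {n} test examples for a +/-1% interval")
```

Under these assumptions the required size lands in the thousands (roughly 3,500 examples at 90% accuracy and roughly 9,600 at 50%), which is in line with the rule of thumb of a few thousand test examples. Comparing two models on the same examples with a paired test can need fewer.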

Minimal barriers to entry

Having too many barriers to entry can sink a benchmark. If researchers do not enjoy using a benchmark, this severely diminishes its usefulness, even if the benchmark pertains to something very important. We present some pointers for removing barriers to entry:

  • Avoid spanning many modalities at once. A benchmark combining RL, NLP, and CV is unlikely to be used, because few researchers have skills in all three of these areas. In some cases, multiple modalities make sense, but one should be wary of doing this unless necessary.
  • Make good software packages and codebases that people can easily use without much training. This means writing clean code and having good documentation. If researchers can’t understand how to use a benchmark, they might just give up.
  • In the original benchmark paper, demonstrate that new methods are capable of making progress on it. To do this, include a trivial baseline method, as well as a modified method that performs better. Doing this is also a great way to test out a benchmark and discover barriers to entry that might not be obvious without actually using the benchmark. Spend time eating your own cooking.
  • Design the measure so that current performance is low. If performance is already at 90%, researchers will assume there isn’t much of a problem that they can make progress on. Instead, zoom in on the structures that are most difficult for current systems, and remove parts of the benchmark that are already solved.
  • As mentioned in our last post, using human feedback in evaluations of a benchmark has serious limitations. First, it is quite expensive to collect, and for academics requires IRB approval. This creates major barriers to entry. More broadly, using human feedback indicates that the central problem has not been shaped into a useful microcosm. Relying on humans to decide performance indicates that there is not a well-formed idea of what performance actually is, which can sink a benchmark. Of course, humans usually need to help create the benchmark itself; but after its creation, more human input should not be needed.
  • Be careful about the software packages used. Sometimes widely-used packages are banned in some industry labs (e.g. for security reasons). Being aware of this is important, since it may restrict the audience of the benchmark.
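
As a minimal sketch of the trivial baselines mentioned above, the snippet below computes two reference points for a classification benchmark: uniform random guessing and always predicting the most common training label. The function names and the toy labels are hypothetical, and a real benchmark paper would report these alongside a modified method that performs better.

```python
from collections import Counter


def chance_accuracy(num_classes: int) -> float:
    """Expected accuracy of uniform random guessing over `num_classes` labels."""
    return 1.0 / num_classes


def majority_class_accuracy(train_labels, test_labels) -> float:
    """Accuracy of always predicting the most frequent training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for label in test_labels if label == majority_label)
    return correct / len(test_labels)


# Toy labels, made up purely for illustration.
train = ["cat", "cat", "dog", "bird"]
test = ["cat", "dog", "dog", "cat"]

print("random guessing:", chance_accuracy(num_classes=3))
print("majority class: ", majority_class_accuracy(train, test))
```

Running even these simple baselines through the released codebase is one way to eat your own cooking: if they are painful to set up, other researchers will likely hit the same barriers.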

The process of building benchmarks

The most difficult part of designing a benchmark is concretizing a nebulous idea into a metric. Doing this first requires thinking of an idea to be tested, which is difficult in itself, but thinking of how to concretize it is even more difficult. The aim is to create a microcosm that has good properties and is both useful for the broader problem and simple enough that it can be concretized and improved. The task requires foresight: where will the machine learning field go next? What is becoming possible that previously wasn’t?

Existing successful benchmarks are likely to have useful features, even if it’s not easily apparent why they’re useful (perhaps even to the creators of the benchmark). If you don’t know why a benchmark is evaluated in a certain way, do not assume there is no good reason. Successful benchmarks have stood the test of time, and so should be emulated to the extent possible. To give one example, the high number of categories used in ImageNet turned out to be important to ensure performance was less gameable, though this was not obvious until a decade after its original release.

Being ready to design a benchmark requires understanding the community that the benchmark is being designed for. This likely means doing research in the area, talking to others doing research, and absorbing what it’s like to use a benchmark. It’s not wise to try to swoop into a field without knowing anything about how researchers in the field approach their work. Set aside a large chunk of time to humble yourself to more effectively serve the community you aim to mobilize. An additional benefit is that listening to researchers’ intuitions on a given problem might give ideas for how to concretize it. One way of doing this is to collect their phrases for “what they mean” by a particular fuzzy concept, and put them into a word cloud. When designing a benchmark, ask: “what are they trying to get at? how does this benchmark idea address a large share of the words/phrases in the word cloud?”
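
The word cloud exercise can be approximated with a simple term count. The sketch below is one possible way to do it, not a prescribed method: the stopword list, the example phrases, and the set of terms a benchmark idea is said to address are all hypothetical placeholders.

```python
import re
from collections import Counter

# Hypothetical stopword list; extend it for real use.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "it", "that"}


def term_frequencies(phrases):
    """Count non-stopword terms across the phrases researchers used."""
    counts = Counter()
    for phrase in phrases:
        tokens = re.findall(r"[a-z]+", phrase.lower())
        counts.update(token for token in tokens if token not in STOPWORDS)
    return counts


def coverage(frequencies, addressed_terms):
    """Share of all counted term occurrences that the benchmark idea addresses."""
    total = sum(frequencies.values())
    if total == 0:
        return 0.0
    covered = sum(count for term, count in frequencies.items() if term in addressed_terms)
    return covered / total


# Made-up phrases standing in for what researchers say about a fuzzy concept.
phrases = [
    "robust to distribution shift",
    "robust to unusual inputs",
    "handles distribution shift in deployment",
]
frequencies = term_frequencies(phrases)
print(frequencies.most_common(3))
print(coverage(frequencies, addressed_terms={"robust", "distribution", "shift"}))
```

A low coverage score is only a prompt to revisit the idea; deciding which terms a benchmark actually addresses remains a judgment call.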

Designing benchmarks means constantly having nebulous goals for new benchmarks swirling around while researching. Most ideas will likely not be good, or won’t be ready for a benchmark, so pursuing the first benchmark idea one comes up with is not a good strategy. Having a number of benchmark ideas in mind is useful in case a new advancement makes an idea suddenly feasible. When building a benchmark, temper your expectations. From an outside view, it is quite difficult to design a good metric, or else it would be easy to write highly impactful papers.

The internet has a vast amount of data that can be collected. If it appears necessary to spend a large amount of money (millions of dollars) on a benchmark, it makes sense to spend more time scouring the internet (note that this is different for applications projects, such as self-driving cars). If there is nothing relevant on the internet, this may indicate that the idea might not be that good or the data you seek isn’t useful, as it has not been interesting enough to appear anywhere on the internet. Note that once a good idea is found, it is usually mostly grunt work to generate the dataset; the majority of the intellectual difficulty of benchmark design arises from finding an idea that will work well as a benchmark, not collecting the data. These two are often conflated, leading researchers to incorrectly assume they can easily create something useful to mobilize the community.

One useful heuristic for designing benchmarks is to try to include many sources of variability. For instance, including multiple types of adversarial attacks, multiple environments, different writers and so on is very useful because it allows for making progress in a number of dimensions at once. Beware of believing that there are more dimensions than there are: for instance, procedurally generated data may appear to have many dimensions (“we have infinitely many attacks!”), but if there is a simple-to-describe generating process for data, there are not many dimensions to it. Adding a random number generator to choose the coefficient for a particular piece of data only adds a single new dimension, not infinitely many. For this reason, data on the internet can be quite useful, since it is often generated with many unique generating processes that add additional structural complexity (rather than merely increasing entropy). Concretely, a mathematics benchmark with handwritten word problems has many more structures and sources of variability than algorithmically generated mathematics problems. Even though with the latter one can have infinitely many distinct problems, the underlying data generating process is much less complex. Aim for multiple sources of variability, ideas, or structures to make the optimized metric less gameable and more likely to capture more real-world properties.
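
The difference between many distinct examples and many distinct structures can be illustrated with a toy experiment. The sketch below masks out numbers and counts how many distinct templates remain. This masking is a crude, assumed proxy for structural variety, and both the generator and the handwritten examples are invented for illustration.

```python
import random
import re


def structural_template(text: str) -> str:
    """Replace every number with a placeholder so only the wording remains."""
    return re.sub(r"\d+(?:\.\d+)?", "<NUM>", text)


def count_distinct_structures(problems) -> int:
    """Number of distinct templates once numeric values are masked."""
    return len({structural_template(problem) for problem in problems})


def generate_procedural_problems(n: int, seed: int = 0):
    """One fixed template with random coefficients: many problems, one structure."""
    rng = random.Random(seed)
    return [
        f"Solve for x: {rng.randint(2, 99)}x + {rng.randint(1, 99)} = {rng.randint(100, 999)}"
        for _ in range(n)
    ]


# Invented stand-ins for human-written word problems.
handwritten = [
    "Maria has 12 apples and gives 5 to Tom. How many apples does she have left?",
    "A train travels 60 miles in 2 hours. What is its average speed?",
    "If a rectangle has width 3 and height 7, what is its area?",
]

procedural = generate_procedural_problems(1000)
print("procedural problems:", len(procedural),
      "distinct structures:", count_distinct_structures(procedural))
print("handwritten problems:", len(handwritten),
      "distinct structures:", count_distinct_structures(handwritten))
```

The thousand generated problems collapse to a single template, while each handwritten problem keeps its own. Real structural complexity is richer than this proxy captures, but the contrast mirrors the point about entropy versus structure.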

Lastly, after developing a benchmark, it’s necessary to maintain the area to make sure that researchers are doing things correctly. Any benchmark is of course susceptible to being gamed (some more than others): researchers can create methods that exploit some peculiarity of the benchmark rather than make progress towards the overall problem. Peer review can be a good safeguard against this, as reviewers might recognize that the approach is gaming the benchmark. However, sometimes reviewers don’t know any better, and that’s why it’s often necessary to reduce the effect of such gaming by producing high-quality research towards the benchmark. Of course, it’s understood that benchmarks might eventually be solved, but this doesn’t mean that the problem itself is solved. In such cases, new benchmarks are necessary to expose gaps in the original benchmark.

In summary, successful benchmarks are not simply driven by an army of dataset annotators; they are driven by intellectual humility, attentiveness to numerous (often implicit) factors, and original ideas.

Dan Hendrycks is the Executive Director of the Center for AI Safety. Dan contributed the GELU activation function, the main baseline for OOD detection, and benchmarks for robustness (ImageNet-C) and large language models (MMLU, MATH).

Thomas Woodside contributed to drafting this post in 2022 when he was CAIS’s first employee. He now works at an AI policy think tank in Washington, DC. His website is here.