State of What Art?

I will review a recently published paper (https://arxiv.org/abs/2401.00595) that addresses the challenges of evaluating and comparing large language models (LLMs). The paper argues that the common practice of assessing LLMs by scoring them on a collection of tasks is unreliable, because the ranking of the models changes when the task instructions are rephrased or otherwise modified. I'll discuss why this issue matters and propose alternative ways to evaluate LLMs grounded in their practical applications.
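To make the instability concrete, here is a minimal sketch of how one could probe it: score the same models on the same task under two paraphrased instructions and check whether the ranking flips. This is not the paper's protocol, just an illustration; `query_model`, the model names, and the example data are hypothetical placeholders.

```python
def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for a real LLM API call; returns canned answers so the
    sketch runs without network access (hypothetical behavior)."""
    # Pretend model_a handles the "Classify" phrasing well and model_b the
    # "Label" phrasing well, so the ranking depends on which instruction is used.
    good_at = {"model_a": "Classify", "model_b": "Label"}
    if prompt.startswith(good_at[model_name]):
        return "positive" if "Great" in prompt else "negative"  # correct answer
    return "positive"  # otherwise answers "positive" regardless of the review


def accuracy(model: str, instruction: str, examples: list[tuple[str, str]]) -> float:
    """Fraction of examples the model answers correctly under a given instruction."""
    correct = sum(
        query_model(model, f"{instruction}\n\n{text}") == gold
        for text, gold in examples
    )
    return correct / len(examples)


# Two semantically equivalent instructions for the same sentiment task.
instructions = [
    "Classify the sentiment of the review as positive or negative.",
    "Label the following review with its sentiment (positive/negative).",
]
examples = [
    ("Great movie, loved it.", "positive"),
    ("Dull and overlong.", "negative"),
]
models = ["model_a", "model_b"]

for instr in instructions:
    ranking = sorted(models, key=lambda m: accuracy(m, instr, examples), reverse=True)
    print(f"Instruction: {instr!r}\n  Ranking: {ranking}")
# If the printed rankings differ across the two instructions, the benchmark's
# verdict on "which model is better" depends on prompt wording, which is the
# kind of instability the paper points out.
```

In this toy setup the leader changes with the phrasing of the instruction, which is exactly the problem for leaderboard-style comparisons: the headline ranking encodes a particular prompt choice as much as the models' underlying ability.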