Programmatic and model-based evaluations
Tasks in CURIE are diverse and have ground-truth annotations in mixed and heterogeneous form, e.g., as JSON, LaTeX equations, YAML files, or free-form text. Evaluating free-form generation is challenging because answers are often descriptive, and even when a format is specified, as in most of our tasks, the response to each field can be of differing types. For example, materials grid points may sometimes be specified as “[p, q, r]” and at other times as “p × q × r”. Hence, in addition to the programmatic evaluation metrics, such as ROUGE-L, intersection-over-union (used for BIOGR), and identity ratio (used in PDB), we propose two model-based evaluation metrics.
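As a concrete illustration of the format heterogeneity above, the following minimal sketch shows how the two grid-point notations could be reduced to a common form before any programmatic comparison; the helper name and the parsing rule are illustrative assumptions, not part of the benchmark code.

```python
import re


def normalize_grid_points(text: str):
    """Hypothetical helper: parse grid points written either as
    "[p, q, r]" or as "p × q × r" into a canonical list of integers."""
    # Pull out the integers regardless of the surrounding notation.
    numbers = re.findall(r"\d+", text)
    return [int(n) for n in numbers] if numbers else None


# Both notations reduce to the same canonical value.
assert normalize_grid_points("[4, 4, 2]") == normalize_grid_points("4 × 4 × 2")
```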
(1) LMScore: Prompts an LLM asking how closely the predictions match the ground truth on a 3-point scale: “good” if the prediction has only a few minor errors, “okay” if there are many minor errors, and “bad” if there are major errors. We take the weighted average of the log-likelihood scores of the tokens to produce a final confidence (see the first sketch after this list).
(2) LLMSim: Is used for retrieval tasks where we ask the model to exhaustively extract many details, e.g., descriptors, properties, and values of materials from a research document, and to provide as output an unordered list of dictionaries or records. We use a chain-of-thought (CoT) prompt that asks the LLM to look at each ground-truth record and identify the predicted records that correctly match each field (key) and value of the ground truth. Once the ground-truth records are matched with predicted records, we can measure precision and recall for the retrieval task, and compute the mean average precision, recall, and F1 scores across all documents (see the second sketch below).
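A minimal sketch of LMScore under stated assumptions: the `judge` callable, the prompt wording, and the numeric weights assigned to the three ratings are hypothetical, and we assume the judge model exposes next-token log-probabilities for the candidate rating tokens.

```python
import math

# Assumed numeric weights for the 3-point scale ("good" / "okay" / "bad").
SCALE = {"good": 1.0, "okay": 0.5, "bad": 0.0}

# Illustrative prompt wording; the exact prompt used in CURIE is not shown here.
LMSCORE_PROMPT = (
    "Ground truth:\n{gt}\n\nPrediction:\n{pred}\n\n"
    "How closely does the prediction match the ground truth? "
    "Answer with exactly one word: good, okay, or bad."
)


def lmscore(gt: str, pred: str, judge) -> float:
    """`judge` is a hypothetical callable that returns a dict mapping
    candidate next tokens to their log-probabilities."""
    logprobs = judge(LMSCORE_PROMPT.format(gt=gt, pred=pred))
    # Keep only the three rating tokens and renormalize their probabilities.
    probs = {t: math.exp(lp) for t, lp in logprobs.items() if t in SCALE}
    total = sum(probs.values()) or 1.0
    # Probability-weighted average over the scale gives a confidence in [0, 1].
    return sum(SCALE[t] * p / total for t, p in probs.items())
```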
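Once the LLM judge has matched predicted records to ground-truth records, the LLMSim aggregation is ordinary precision/recall bookkeeping. The sketch below assumes the matching step yields, per document, the counts of matched, predicted, and ground-truth records, and it simply averages per-document scores; this is one reading of “mean average precision, recall and F1 across all documents”, not a verbatim implementation.

```python
from statistics import mean


def precision_recall_f1(num_matched: int, num_pred: int, num_gt: int):
    """Precision, recall, and F1 for one document, given how many predicted
    records the LLM judge matched to ground-truth records."""
    precision = num_matched / num_pred if num_pred else 0.0
    recall = num_matched / num_gt if num_gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def corpus_scores(per_doc_counts):
    """Average precision, recall, and F1 over all documents.
    `per_doc_counts` holds one (matched, predicted, ground-truth) tuple per document."""
    scores = [precision_recall_f1(*c) for c in per_doc_counts]
    return tuple(mean(s[i] for s in scores) for i in range(3))


# Example: two documents with (matched, predicted, ground-truth) record counts.
print(corpus_scores([(3, 4, 5), (2, 2, 3)]))
```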