Programmatic and model-based evaluations
Tasks in CURIE are diverse and have ground-truth annotations in mixed and heterogeneous form, e.g., as JSON, LaTeX equations, YAML files, or free-form text. Evaluating free-form generation is challenging because answers are often descriptive, and even when a format is specified, as in most of our tasks, the response to each field can be of differing types. For example, materials grid points may sometimes be specified as “[p, q, r]” and at other times as “p × q × r”. Hence, in addition to the programmatic evaluation metrics, such as ROUGE-L, intersection-over-union (used for BIOGR), and identity ratio (used in PDB), we propose two model-based evaluation metrics.
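As a concrete illustration of the format heterogeneity above, the following minimal sketch shows how the two grid-point notations could be reduced to a common form before any programmatic comparison; the helper name and the parsing rule are illustrative assumptions, not part of the benchmark code.

```python
import re


def normalize_grid_points(text: str):
    """Hypothetical helper: parse grid points written either as
    "[p, q, r]" or as "p × q × r" into a canonical list of integers."""
    # Pull out the integers regardless of the surrounding notation.
    numbers = re.findall(r"\d+", text)
    return [int(n) for n in numbers] if numbers else None


# Both notations reduce to the same canonical value.
assert normalize_grid_points("[4, 4, 2]") == normalize_grid_points("4 × 4 × 2")
```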
(1) LMScore: Prompts an LLM asking how closely the predictions match the ground truth on a 3-point scale: “good” if the prediction has only a few minor errors, “okay” if there are many minor errors, and “bad” if there are major errors. We take the weighted average of the log-likelihood scores of the tokens to produce a final confidence (see the first sketch after this list).
(2) LLMSim: Is used for retrieval tasks where we ask the model to exhaustively extract many details, e.g., descriptors, properties, and values of materials from a research document, and to provide as output an unordered list of dictionaries or records. We use a chain-of-thought (CoT) prompt that asks the LLM to look at each ground-truth record and identify the predicted records that correctly match each field (key) and value of the ground truth. Once the ground-truth records are matched with predicted records, we can measure precision and recall for the retrieval task, and compute the mean average precision, recall, and F1 scores across all documents (see the second sketch below).
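A minimal sketch of LMScore under stated assumptions: the `judge` callable, the prompt wording, and the numeric weights assigned to the three ratings are hypothetical, and we assume the judge model exposes next-token log-probabilities for the candidate rating tokens.

```python
import math

# Assumed numeric weights for the 3-point scale ("good" / "okay" / "bad").
SCALE = {"good": 1.0, "okay": 0.5, "bad": 0.0}

# Illustrative prompt wording; the exact prompt used in CURIE is not shown here.
LMSCORE_PROMPT = (
    "Ground truth:\n{gt}\n\nPrediction:\n{pred}\n\n"
    "How closely does the prediction match the ground truth? "
    "Answer with exactly one word: good, okay, or bad."
)


def lmscore(gt: str, pred: str, judge) -> float:
    """`judge` is a hypothetical callable that returns a dict mapping
    candidate next tokens to their log-probabilities."""
    logprobs = judge(LMSCORE_PROMPT.format(gt=gt, pred=pred))
    # Keep only the three rating tokens and renormalize their probabilities.
    probs = {t: math.exp(lp) for t, lp in logprobs.items() if t in SCALE}
    total = sum(probs.values()) or 1.0
    # Probability-weighted average over the scale gives a confidence in [0, 1].
    return sum(SCALE[t] * p / total for t, p in probs.items())
```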
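Once the LLM judge has matched predicted records to ground-truth records, the LLMSim aggregation is ordinary precision/recall bookkeeping. The sketch below assumes the matching step yields, per document, the counts of matched, predicted, and ground-truth records, and it simply averages per-document scores; this is one reading of “mean average precision, recall and F1 across all documents”, not a verbatim implementation.

```python
from statistics import mean


def precision_recall_f1(num_matched: int, num_pred: int, num_gt: int):
    """Precision, recall, and F1 for one document, given how many predicted
    records the LLM judge matched to ground-truth records."""
    precision = num_matched / num_pred if num_pred else 0.0
    recall = num_matched / num_gt if num_gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def corpus_scores(per_doc_counts):
    """Average precision, recall, and F1 over all documents.
    `per_doc_counts` holds one (matched, predicted, ground-truth) tuple per document."""
    scores = [precision_recall_f1(*c) for c in per_doc_counts]
    return tuple(mean(s[i] for s in scores) for i in range(3))


# Example: two documents with (matched, predicted, ground-truth) record counts.
print(corpus_scores([(3, 4, 5), (2, 2, 3)]))
```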