Adding the benchmark back into the training process doesn’t mean you get an LLM that can weed out irrelevant data. What you get is an LLM that can pass that metric, and you then have to design yet another metric with different semantic patterns to actually know whether it’s “eliminating red herrings”.