Scientists propose a new system to standardize automatic evaluation of NLP systems by adding humans to the loop.