Abstract: Modern language models (LMs) increasingly require two critical resources: computation and data. Data selection techniques can effectively reduce the amount of training ...
Abstract: In many real-world applications, sorting is a crucial operation. Sorting algorithms are methods for rearranging a collection of unsorted items into a desired order. A lot of ...
This project investigates token quality from a noisy-label perspective and proposes a generic token cleaning pipeline for SFT tasks. Our method filters out uninformative tokens while preserving those ...