core

This is the core module, which contains everything needed for the semantic cleaning.

Loading parameters

If you want to check the code you’ll need your Hugging Face tokens. You can provide them by logging in or by loading the tokens from a file.

My tokens are in a JSON file, which is loaded into a Parameters class.

## This is for colab integration - uncomment the lines below
# from google.colab import drive
# drive.mount('/content/drive')
import os

from reinautils import Parameters

if os.path.isfile('/content/drive/MyDrive/tokens.json'):
  params = Parameters().from_json('/content/drive/MyDrive/tokens.json')

Let’s do some imports

import torch
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

os.environ["TOKENIZERS_PARALLELISM"] = "True"

Define a function for data preprocessing


source

preprocess_data

 preprocess_data (dataset:datasets.arrow_dataset.Dataset,
                  splits:Union[str,List[str]]=None, schema:str='')

Preprocesses the dataset by merging selected keys into a formatted string.

Args:
  dataset: A HuggingFace Dataset.
  splits: The specific splits of the dataset to preprocess. Defaults to all splits.
  schema: A string defining how to format the merged string. It should contain keys from the dataset encapsulated in {}. Example: "<human>:{user} <bot>:{response}", where 'user' and 'response' are keys in the dataset.

Returns: The processed Dataset with an additional “_merged” field containing the formatted strings.
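
To make the schema idea concrete, here is a minimal sketch of the merging step (the merge_with_schema helper and its implementation are just an illustration, not the module’s actual code):

# Illustrative sketch: fill a schema string with values from each row
# and store the result in a "_merged" column.
from datasets import Dataset

def merge_with_schema(example, schema):
  # str.format_map looks up the keys named in the schema in the row dict
  return {"_merged": schema.format_map(example)}

toy = Dataset.from_dict({"user": ["What is a will?"], "chip2": ["A legal document."]})
toy = toy.map(merge_with_schema, fn_kwargs={"schema": "<human>:{user} <bot>:{chip2}"})
print(toy["_merged"][0])  # <human>:What is a will? <bot>:A legal document.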

Define a function to compute the embeddings


source

mean_pooling

 mean_pooling (model_output, attention_mask)

Mean Pooling - Take attention mask into account for correct averaging
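
For reference, attention-masked mean pooling is usually implemented along these lines (a sketch of the standard sentence-transformers recipe, not necessarily the exact body of the function above):

import torch

def mean_pooling_sketch(model_output, attention_mask):
  # model_output[0] holds the token-level embeddings: (batch, seq_len, hidden)
  token_embeddings = model_output[0]
  mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
  # Sum only the real (non-padding) tokens, then divide by the number of real tokens
  return (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)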


source

compute_embeddings

 compute_embeddings (data:datasets.arrow_dataset.Dataset,
                     embedding_model:torch.nn.modules.module.Module,
                     tokenizer, batch_size:int=8, num_workers:int=1,
                     dataset_feature:str='_merged')

Compute sentence embeddings using an embedding model.

Args:
  data: A HuggingFace Dataset containing the text to embed.
  embedding_model: A callable model that returns embeddings for input tokens.
  tokenizer: The tokenizer used to tokenize the dataset feature.
  batch_size: The number of samples per batch.
  num_workers: The number of worker processes for data loading.
  dataset_feature: The name of the feature to tokenize in the dataset.

Returns: A numpy array of embeddings for the input data.
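
As an illustration (not the module’s exact code), such a loop can be put together with a PyTorch DataLoader, the mean_pooling function above, and L2 normalization so that cosine similarity later reduces to a dot product:

import numpy as np
import torch
from torch.utils.data import DataLoader

def compute_embeddings_sketch(data, embedding_model, tokenizer,
                              batch_size=8, num_workers=1,
                              dataset_feature="_merged"):
  device = next(embedding_model.parameters()).device
  loader = DataLoader(data[dataset_feature], batch_size=batch_size,
                      num_workers=num_workers)
  chunks = []
  with torch.no_grad():
    for batch in loader:
      enc = tokenizer(list(batch), padding=True, truncation=True,
                      return_tensors="pt").to(device)
      out = embedding_model(**enc)
      emb = mean_pooling(out, enc["attention_mask"])  # defined above
      # Normalize so that downstream cosine similarity is a plain dot product
      emb = torch.nn.functional.normalize(emb, p=2, dim=1)
      chunks.append(emb.cpu().numpy())
  return np.concatenate(chunks)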

This function performs the deduplication


source

deduplicate_embeddings

 deduplicate_embeddings (embedded, embedded2=None, epsilon=0.01,
                         batch_size=20000)

Perform deduplication on the provided embeddings. If a second set of embeddings is provided, return the indices of embeddings in the second set that are duplicates of embeddings in the first set.

Args:
  embedded: A numpy array or PyTorch tensor holding the embeddings of the first set.
  embedded2: A numpy array or PyTorch tensor holding the embeddings of the second set (optional).
  epsilon: The maximum distance for two embeddings to be considered duplicates (using cosine similarity).
  batch_size: The size of the batches to process at a time.

Note: The embeddings must be L2 normalized.

Returns: If a second set of embeddings is provided, a tensor of indices of the second set that are duplicates of the first set. If a second set of embeddings is not provided, a tensor of indices that should be deleted due to duplication in the first set.
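
To illustrate the idea (a sketch only, not the module’s implementation), duplicates within a single L2-normalized set can be found with batched matrix products:

import torch

def find_duplicates_sketch(embedded, epsilon=0.01, batch_size=20000):
  # embedded: L2-normalized embeddings, shape (n, d)
  emb = torch.as_tensor(embedded)
  to_delete = set()
  for start in range(0, emb.shape[0], batch_size):
    block = emb[start:start + batch_size]
    # Cosine similarity is a plain dot product because rows are L2-normalized
    sim = block @ emb.T
    rows, cols = torch.nonzero(sim > 1 - epsilon, as_tuple=True)
    for r, c in zip((rows + start).tolist(), cols.tolist()):
      if c > r:  # keep the first occurrence, mark later duplicates
        to_delete.add(c)
  return torch.tensor(sorted(to_delete), dtype=torch.long)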

And in this function we combine everything


source

deduplicate_dataset

 deduplicate_dataset (dataset:datasets.arrow_dataset.Dataset,
                      model:torch.nn.modules.module.Module, tokenizer,
                      epsilon:float=0.01, model_batch_size:int=64,
                      deduplication_batch_size:int=20000,
                      num_workers:int=16, dataset_feature:str='')

Deduplicate data in a dataset based on the embeddings computed by a given model.

Args:
  dataset: Dataset to be deduplicated.
  model: Model used to compute the embeddings.
  tokenizer: Tokenizer for the model.
  epsilon: Threshold for cosine similarity to consider embeddings as duplicates.
  model_batch_size: Batch size for the model.
  deduplication_batch_size: Batch size for the deduplication process.
  num_workers: Number of worker processes for data loading.
  dataset_feature: Feature in the dataset to use for deduplication.

Returns: Deduplicated dataset.
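
Conceptually it just chains the pieces above. A rough sketch, assuming a single-split Dataset whose dataset_feature column already exists (e.g. the '_merged' column created by preprocess_data):

def deduplicate_dataset_sketch(dataset, model, tokenizer, epsilon=0.01,
                               model_batch_size=64,
                               deduplication_batch_size=20000,
                               num_workers=16, dataset_feature="_merged"):
  # Embed every example, find near-duplicate rows, then drop them
  embeddings = compute_embeddings(dataset, model, tokenizer,
                                  batch_size=model_batch_size,
                                  num_workers=num_workers,
                                  dataset_feature=dataset_feature)
  duplicates = set(deduplicate_embeddings(embeddings, epsilon=epsilon,
                                          batch_size=deduplication_batch_size).tolist())
  keep = [i for i in range(dataset.num_rows) if i not in duplicates]
  return dataset.select(keep)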

Now let’s test it all together


Load and preprocess the data

We will run the test using a dataset from Hugging Face: 0-hero/OIG-small-chip2

data = load_dataset("0-hero/OIG-small-chip2")
_ = preprocess_data(data,schema = "<human>:{user} <bot>:{chip2}")
data['train']['_merged'][0]
"<human>:I've heard that it's a good idea to have a will. What is a will?\n\n <bot>:A will is a legal document that specifies how your property should be distributed after you die. It can also specify who should care for any children or other dependents you may have. It's important to make sure that your will is valid and up-to-date, since the laws governing wills vary from state to state."

Load the tokenizer and model

As the model for the semantic embeddings we’ll use sentence-transformers/all-mpnet-base-v2

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2').to('cuda' if torch.cuda.is_available() else 'cpu')

Run De-duplication

# For testing we will skip this if we don't have a GPU
if torch.cuda.is_available():
  deduplicated = deduplicate_dataset(
      dataset = load_dataset("0-hero/OIG-small-chip2"),
      model = model,
      tokenizer = tokenizer,
      epsilon = 1e-2,
      model_batch_size = 64,
      deduplication_batch_size = 20000,
      num_workers = 16,
      dataset_feature = ''
  )
  print(f"cleaned:{(1-list(deduplicated.num_rows.values())[0]/list(data.num_rows.values())[0])*100:.2f}:%")
else:
  print("No cuda available. Skipped")
cleaned:100.00:%