## This is for colab integration - uncomment the lines below
# from google.colab import drive
# drive.mount('/content/drive')
core
Loading parameters
If you want to check the code you'll need your Hugging Face tokens. You can do it using login or by loading the tokens from a file.
My tokens are in a JSON file which is loaded into a Parameters class.
import os
from reinautils import Parameters

if os.path.isfile('/content/drive/MyDrive/tokens.json'):
    params = Parameters().from_json('/content/drive/MyDrive/tokens.json')
Let's do some imports
"TOKENIZERS_PARALLELISM"]="True" os.environ[
Define a function for data preprocessing
preprocess_data
preprocess_data (dataset:datasets.arrow_dataset.Dataset, splits:Union[str,List[str]]=None, schema:str='')
Preprocesses the dataset by merging selected keys into a formatted string.
Args:
  dataset: A HuggingFace Dataset.
  splits: The specific splits of the dataset to preprocess. Defaults to all splits.
  schema: A string defining how to format the merged string. It should contain keys from the dataset encapsulated in {}. Example: "<human>:{user} <bot>:{chip2}"
Returns: The processed Dataset with an additional “_merged” field containing the formatted strings.
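For intuition, the merge amounts to a str.format over each row, applied with Dataset.map. A minimal sketch with a hypothetical helper (not the library's actual implementation):

def merge_row(row, schema="<human>:{user} <bot>:{chip2}"):
    # Hypothetical helper: fill the {key} placeholders with the row's fields
    return {"_merged": schema.format(**row)}

# e.g. applied to one split: data["train"].map(merge_row)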
Define a function to compute the embeddings
mean_pooling
mean_pooling (model_output, attention_mask)
Mean Pooling - Take attention mask into account for correct averaging
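This is the standard sentence-transformers recipe. For reference, such a function typically looks like the sketch below (the implementation here may differ in details):

import torch

def mean_pooling(model_output, attention_mask):
    # model_output[0] holds the per-token embeddings: (batch, seq_len, hidden)
    token_embeddings = model_output[0]
    # Broadcast the attention mask so padding tokens are excluded from the average
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, dim=1) / torch.clamp(mask.sum(dim=1), min=1e-9)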
compute_embeddings
compute_embeddings (data:datasets.arrow_dataset.Dataset, embedding_model:torch.nn.modules.module.Module, tokenizer, batch_size:int=8, num_workers:int=1, dataset_feature:str='_merged')
Compute sentence embeddings using an embedding model.
Args:
  data: A list of dictionaries containing tokenized text.
  embedding_model: A callable model that returns embeddings for input tokens.
  batch_size: The number of samples per batch.
  num_workers: The number of worker processes for data loading.
  dataset_feature: The name of the feature to tokenize in the dataset.

Returns: A numpy array of embeddings for the input data.
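Conceptually the computation is likely: tokenize a batch, run the encoder, mean-pool, and L2-normalize (the deduplication step below requires L2-normalized embeddings). A simplified, hypothetical sketch that skips the DataLoader/num_workers machinery:

import torch
import torch.nn.functional as F

def embed_texts(texts, embedding_model, tokenizer, batch_size=8):
    # Hypothetical helper, simplified from compute_embeddings
    device = next(embedding_model.parameters()).device
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        tokens = tokenizer(batch, padding=True, truncation=True, return_tensors="pt").to(device)
        with torch.no_grad():
            output = embedding_model(**tokens)
        pooled = mean_pooling(output, tokens["attention_mask"])  # as defined above
        chunks.append(F.normalize(pooled, p=2, dim=1).cpu())
    return torch.cat(chunks).numpy()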
This function will do the deduplication
deduplicate_embeddings
deduplicate_embeddings (embedded, embedded2=None, epsilon=0.01, batch_size=20000)
Perform deduplication on the provided embeddings. If a second set of embeddings is provided, return the indices of embeddings in the second set that are duplicates of embeddings in the first set.
Args:
  embedded: A numpy array or PyTorch tensor holding the embeddings of the first set.
  embedded2: A numpy array or PyTorch tensor holding the embeddings of the second set (optional).
  epsilon: The maximum distance for two embeddings to be considered duplicates (using cosine similarity).
  batch_size: The size of the batches to process at a time.
Note: The embeddings must be L2 normalized.
Returns: If a second set of embeddings is provided, a tensor of indices of the second set that are duplicates of the first set. If a second set of embeddings is not provided, a tensor of indices that should be deleted due to duplication in the first set.
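For the single-set case, the idea can be sketched as follows: with L2-normalized embeddings, the dot product is the cosine similarity, so we compare rows block-by-block and mark every later row whose distance to an earlier kept row is below epsilon. A simplified sketch (hypothetical helper, not the actual batched implementation):

import torch

def dedup_indices(embedded, epsilon=0.01, batch_size=20000):
    emb = torch.as_tensor(embedded)                 # assumed L2-normalized, shape (n, d)
    n = emb.shape[0]
    to_delete = torch.zeros(n, dtype=torch.bool)
    for start in range(0, n, batch_size):
        block = emb[start:start + batch_size]
        sims = block @ emb.T                        # cosine similarities, shape (block, n)
        dup = (1.0 - sims) < epsilon                # pairs closer than epsilon
        for i in range(block.shape[0]):
            row = start + i
            if to_delete[row]:
                continue
            matches = torch.nonzero(dup[i]).flatten()
            to_delete[matches[matches > row]] = True  # keep the first occurrence, drop later ones
    return torch.nonzero(to_delete).flatten()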
And in this function we combine everything
deduplicate_dataset
deduplicate_dataset (dataset:datasets.arrow_dataset.Dataset, model:torch.nn.modules.module.Module, tokenizer, epsilon:float=0.01, model_batch_size:int=64, deduplication_batch_size:int=20000, num_workers:int=16, dataset_feature:str='')
Deduplicate data in a dataset based on the embeddings computed by a given model.
Args:
  dataset: Dataset to be deduplicated.
  model: Model to compute embeddings.
  epsilon: Threshold for cosine similarity to consider embeddings as duplicates.
  model_batch_size: Batch size for the model.
  deduplication_batch_size: Batch size for the deduplication process.
  num_workers: Number of worker processes for data loading.
  dataset_feature: Feature in the dataset to use for deduplication.
Returns: Deduplicated dataset.
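Conceptually this chains the pieces above: compute the embeddings for the chosen feature, find the indices of duplicate rows, and drop them. A rough outline (a sketch assuming the '_merged' feature and a single split, not the actual implementation):

def dedup_sketch(dataset, model, tokenizer, epsilon=0.01):
    embeddings = compute_embeddings(dataset, model, tokenizer, dataset_feature="_merged")
    drop = set(deduplicate_embeddings(embeddings, epsilon=epsilon).tolist())
    keep = [i for i in range(dataset.num_rows) if i not in drop]
    return dataset.select(keep)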
Now let’s test it all together
Load and preprocess the data
We will run the test on a dataset from Hugging Face: 0-hero/OIG-small-chip2
data = load_dataset("0-hero/OIG-small-chip2")
data = preprocess_data(data, schema = "<human>:{user} <bot>:{chip2}")
data['train']['_merged'][0]
"<human>:I've heard that it's a good idea to have a will. What is a will?\n\n <bot>:A will is a legal document that specifies how your property should be distributed after you die. It can also specify who should care for any children or other dependents you may have. It's important to make sure that your will is valid and up-to-date, since the laws governing wills vary from state to state."
Load the tokenizer and model
As the model for the semantic embeddings we'll use sentence-transformers/all-mpnet-base-v2
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2').to('cuda' if torch.cuda.is_available() else 'cpu')
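A quick sanity check of the embedding path on a few preprocessed rows could look like this (the expected shape is an assumption based on all-mpnet-base-v2's 768-dimensional hidden size):

sample = data['train'].select(range(4))
emb = compute_embeddings(sample, model, tokenizer, batch_size=2, dataset_feature='_merged')
print(emb.shape)  # expect something like (4, 768)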
Run De-duplication
# For testing we will skip this if we don't have a GPU
if torch.cuda.is_available():
    deduplicated = deduplicate_dataset(
        dataset = load_dataset("0-hero/OIG-small-chip2"),
        model = model,
        tokenizer = tokenizer,
        epsilon = 1e-2,
        model_batch_size = 64,
        deduplication_batch_size = 20000,
        num_workers = 16,
        dataset_feature = ''
    )
    print(f"cleaned: {(1 - list(deduplicated.num_rows.values())[0] / list(data.num_rows.values())[0]) * 100:.2f}%")
else:
    print("No cuda available. Skipped")
cleaned: 100.00%