# semantic-cleaning

Tools for semantic cleaning of a text dataset.

## Install

```sh
pip install semantic-cleaning
```

## How to use

```python
from datasets import load_dataset
import numpy as np
from transformers import AutoTokenizer, AutoModel
from semantic_cleaning import preprocess_data, compute_embeddings, deduplicate_embeddings, deduplicate_dataset
```

Process a dataset so that each record is merged into a single string (e.g. a question and its answer, or a comment and its response):
```python
data = load_dataset("0-hero/OIG-small-chip2")
data = preprocess_data(data, schema = ":{user} :{chip2}")
data['train']['_merged'][0]
```
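The `schema` string names dataset columns in curly braces. As a rough illustration only (this assumes the schema is filled in like a Python format string over each row's columns, which the README does not state explicitly), the merge step would behave roughly like:

```python
# Hypothetical illustration of the merge step, not the library's actual code.
# Assumption: ":{user} :{chip2}" is filled in from each row's columns.
row = {"user": "How do I bake bread?", "chip2": "Mix flour, water, yeast and salt, then knead and bake."}
merged = ":{user} :{chip2}".format(**row)
print(merged)
```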
Compute the embeddings for the sentences:
```python
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2').to('cuda')
embedding = compute_embeddings(data = data, embedding_model = model, tokenizer = tokenizer, batch_size = 64, num_workers = 16, dataset_feature = '_merged')
```
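A quick sanity check can confirm that one embedding was produced per merged example (a sketch; it assumes the return value exposes a `.shape` like a `torch.Tensor` or NumPy array, which is not specified here):

```python
# Expect something like [number_of_examples, hidden_size]
# (assumption about the return type of compute_embeddings).
print(embedding.shape)
```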
We can get the indices of all the duplicated lines with the following command:

```python
to_delete = deduplicate_embeddings(embedded = embedding, epsilon = 1e-2, batch_size = 20000)
```
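To actually drop those rows you can use the standard Hugging Face `datasets` `select` method (a sketch that assumes `to_delete` is an iterable of integer row indices into the split the embeddings were computed from):

```python
# Keep every row whose index was not flagged as a duplicate.
# Assumption: to_delete holds integer indices into data['train'].
to_drop = {int(i) for i in to_delete}
keep = [i for i in range(len(data['train'])) if i not in to_drop]
cleaned_train = data['train'].select(keep)
```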
You can also find duplicates between two datasets or splits like this:

```python
to_delete = deduplicate_embeddings(embedded = embedding, embedded2 = embedding2, epsilon = 1e-2, batch_size = 20000)
```
The full process can be run like this:

```python
deduplicated = deduplicate_dataset(
    dataset = data['train'],
    model = model,
    tokenizer = tokenizer,
    epsilon = 1e-2,
    model_batch_size = 64,
    deduplication_batch_size = 20000,
    num_workers = 16,
    dataset_feature = '_merged'
)
print(f"cleaned: {(1 - len(deduplicated) / len(data['train'])) * 100:.2f}%")
```
The resulting `deduplicated` dataset can then be pushed back to the Hub or saved to a local drive.
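For example, with the standard Hugging Face `datasets` methods (assuming `deduplicate_dataset` returns a regular `Dataset`; the path, repository id, and token below are placeholders):

```python
# Save the cleaned split to disk ...
deduplicated.save_to_disk("./OIG-small-chip2-deduplicated")

# ... or push it to the Hugging Face Hub.
deduplicated.push_to_hub("your-username/OIG-small-chip2-deduplicated", token="hf_xxx")
```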
## Command-Line Interface
The semantic cleaning module also includes a command-line interface that can be used to deduplicate datasets:
```sh
python semantic-cleaning.py \
  --model_path "sentence-transformers/all-mpnet-base-v2" \
  --tokenizer_path "sentence-transformers/all-mpnet-base-v2" \
  --dataset_path "0-hero/OIG-small-chip2" \
  --output_path "./deduplicated_imdb"
```
The following arguments are available:
- --dataset_path: Path to the dataset to be deduplicated.
- --model_path: The model checkpoint for embeddings. Should be a path or model id in the HuggingFace model hub.
- --tokenizer_path: The tokenizer to be used.
- --epsilon: Threshold for cosine similarity to consider embeddings as duplicates.
- --model_batch_size: Batch size for the model.
- --deduplication_batch_size: Batch size for the deduplication process.
- --num_workers: Number of worker processes for data loading.
- --dataset_feature: Feature in the dataset to be used for deduplication.
- --output_path: Path to save the deduplicated dataset. Can be a local path or a HuggingFace dataset repository.
- --hub_repo: Repository on the Hugging Face Hub to push the dataset to (see the example after this list).
- --hub_token: HuggingFace Hub token to push the dataset to the Hub. Required when hub_repo is provided.
- --device: Device to use for computations (e.g., 'cpu', 'cuda', 'cuda:1'). If not provided, it will use CUDA if available, otherwise CPU.
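For example, to deduplicate the dataset and also push the result to the Hub (the repository id and token are placeholders; the flags are the ones documented above):

```sh
python semantic-cleaning.py \
  --model_path "sentence-transformers/all-mpnet-base-v2" \
  --tokenizer_path "sentence-transformers/all-mpnet-base-v2" \
  --dataset_path "0-hero/OIG-small-chip2" \
  --output_path "./deduplicated" \
  --epsilon 1e-2 \
  --hub_repo "your-username/OIG-small-chip2-deduplicated" \
  --hub_token "hf_xxx"
```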
You can use the --help flag to get a description of all options:

```sh
python semantic-cleaning.py --help
```