Raw text data harvested from sources like the web contains a vast amount of noise alongside valuable information. Documents might be truncated, contain mostly navigation links, consist of machine-generated logs, include extensive boilerplate text, or simply be irrelevant to the target language or domain. Feeding such low-quality data directly into LLM pre-training can degrade performance, waste computational resources, and potentially introduce undesirable biases or safety concerns. Therefore, implementing effective quality filtering strategies is a significant first step in the preprocessing pipeline.
These strategies range from simple heuristics to more sophisticated model-based approaches. Often, a combination of methods provides the best results, forming a multi-stage filtering process.
Heuristics are rules of thumb derived from observations about typical low-quality text. They are often computationally inexpensive and effective at removing egregious examples of noise.
Document Length: Very short documents often lack meaningful content (e.g., just a title, a header/footer snippet, or an error message). Conversely, extremely long documents might indicate parsing errors, concatenated files, or non-prose content. Common practice involves setting minimum and maximum character or word count thresholds.
```python
# Example: Simple length filtering
MIN_CHARS = 100
MAX_CHARS = 100000

def filter_by_length(text):
    char_count = len(text)
    # Basic word count approximation
    word_count = len(text.split())
    if char_count < MIN_CHARS or char_count > MAX_CHARS:
        return False
    # Add word count checks if desired
    # if word_count < MIN_WORDS or word_count > MAX_WORDS:
    #     return False
    return True

document = "This is a sample document..."
if filter_by_length(document):
    # Process document
    pass
```
The specific thresholds (like `MIN_CHARS` and `MAX_CHARS`) are hyperparameters that often require tuning based on the dataset characteristics and downstream goals.
Symbol and Punctuation Ratios: Text dominated by non-alphanumeric characters often represents source code, configuration files, tables, or corrupted text. Calculating the ratio of alphabetic characters to total characters, or the ratio of specific symbols (like `#`, `{`, or `}`), can help identify such content.
```python
# Example: Filtering based on alphabetic character ratio
MIN_ALPHA_RATIO = 0.7

def filter_by_alpha_ratio(text):
    if not text:  # Handle empty strings
        return False
    alpha_chars = sum(c.isalpha() for c in text)
    total_chars = len(text)
    ratio = alpha_chars / total_chars
    return ratio >= MIN_ALPHA_RATIO

code_snippet = "if (x > 0) { y = x * 2; } // Example code"
prose = "This sentence contains mostly alphabetic characters."

print(f"Code snippet passes filter: {filter_by_alpha_ratio(code_snippet)}")
print(f"Prose passes filter: {filter_by_alpha_ratio(prose)}")

# Expected Output:
# Code snippet passes filter: False
# Prose passes filter: True
```
Presence of "Bad Words" or Boilerplate Phrases: Creating lists of specific words or phrases commonly found in low-quality web content (e.g., "javascript is required", "terms of use", error messages, placeholder text like "lorem ipsum") allows for direct filtering. Regular expressions can be used to match patterns associated with boilerplate, like long sequences of navigation links or copyright notices. Care must be taken not to make these lists overly aggressive, as they might inadvertently filter out valid content.
Repetitive Content: Documents containing excessive repetition (e.g., the same line or paragraph repeated many times) are usually low-quality. This can be detected by looking for repeated lines, paragraphs, or even long n-grams. Calculating the document's compression ratio (using standard algorithms like gzip) can also serve as a proxy; highly repetitive documents compress exceptionally well.
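The compression-ratio proxy can be applied as sketched below. The 0.3 cutoff is an assumed value used only for illustration and would need calibrating against real data; very short documents also inflate the ratio because of gzip's fixed overhead.

```python
import gzip

# Assumed cutoff for this sketch: highly repetitive text compresses well below this ratio
MIN_COMPRESSION_RATIO = 0.3

def filter_by_compression_ratio(text):
    raw = text.encode("utf-8")
    if not raw:
        return False
    compressed = gzip.compress(raw)
    ratio = len(compressed) / len(raw)
    # Very low ratios indicate heavy repetition (the text compresses "too well")
    return ratio >= MIN_COMPRESSION_RATIO

repetitive = "Click here to subscribe. " * 200
varied = "Compression ratios offer a cheap proxy for detecting repeated lines or paragraphs."

print(filter_by_compression_ratio(repetitive))  # False for highly repetitive text
print(filter_by_compression_ratio(varied))      # True for ordinary prose
```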
Beyond simple heuristics, statistical and model-based methods leverage properties of text or pre-trained models to assess quality.
Language Identification: For monolingual or specific multilingual models, identifying the language of each document is essential. Libraries like `fastText` or `pycld3` provide pre-trained language identification models. Documents not matching the target language(s), or those where the model assigns a low confidence score, can be filtered out.
```python
# import language_identifier_lib
# model = language_identifier_lib.load_model()

TARGET_LANG = 'en'
MIN_CONFIDENCE = 0.90

def filter_by_language(text):
    # Real implementation:
    # detected_lang, confidence = model.predict(text)
    # return detected_lang == TARGET_LANG and confidence >= MIN_CONFIDENCE
    # Dummy implementation for illustration:
    if "esto es español" in text:
        return False  # Simulating non-target language
    if "confidence low" in text:
        return False  # Simulating low confidence
    return True  # Assume English otherwise

spanish_text = "esto es español"
low_confidence_text = "mixed signals confidence low maybe"
english_text = "This is primarily English text."

print(f"Spanish text passes: {filter_by_language(spanish_text)}")
print(f"Low confidence text passes: {filter_by_language(low_confidence_text)}")
print(f"English text passes: {filter_by_language(english_text)}")

# Expected Output:
# Spanish text passes: False
# Low confidence text passes: False
# English text passes: True
```
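For a concrete version, a sketch using fastText's pre-trained `lid.176.bin` language identification model might look like the following; it assumes the `fasttext` package is installed and the model file has already been downloaded locally from the fastText website.

```python
import fasttext

# Assumes lid.176.bin has been downloaded locally from the fastText website
lang_model = fasttext.load_model("lid.176.bin")

TARGET_LANG = "en"
MIN_CONFIDENCE = 0.90

def filter_by_language_fasttext(text):
    # fastText expects a single line of text; predicted labels look like "__label__en"
    labels, probs = lang_model.predict(text.replace("\n", " "))
    detected_lang = labels[0].replace("__label__", "")
    return detected_lang == TARGET_LANG and probs[0] >= MIN_CONFIDENCE
```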
Perplexity Filtering: A language model trained on a high-quality reference corpus (e.g., Wikipedia) assigns lower perplexity (or higher probability) to text similar to its training data. We can use such a model to score documents from our raw dataset. Documents exceeding a certain perplexity threshold are likely dissimilar to the reference corpus and potentially lower quality. This requires access to a suitable quality-reference LM and can be computationally intensive.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small LM trained on reasonably clean text as the quality reference
model = AutoModelForCausalLM.from_pretrained("gpt2")  # Example reference model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()

PERPLEXITY_THRESHOLD = 50  # Example threshold; tune against the reference corpus

def calculate_perplexity(text, model, tokenizer):
    # Encode the text, compute the causal LM loss, and convert it to perplexity
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss
    return torch.exp(loss).item()

document = "This is well-formed text that should score reasonably low perplexity."
if calculate_perplexity(document, model, tokenizer) < PERPLEXITY_THRESHOLD:
    # Keep document
    pass
```
No single filtering strategy is perfect. Effective data cleaning typically involves chaining multiple filters together in a pipeline. A common approach starts with less computationally expensive heuristics (length, language ID) to quickly remove obvious noise, followed by more resource-intensive methods like perplexity or classifier-based filtering on the remaining data.
Figure: a typical multi-stage filtering pipeline for text data.
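As a minimal sketch of such a pipeline, the filters defined in the examples above can be chained so that cheaper checks run first and a document is dropped at the first failure. The ordering, and the choice to gate perplexity scoring behind the heuristics, are illustrative design decisions; the function names assume the earlier sketches are in scope.

```python
# Order filters from cheapest to most expensive; stop at the first failure
CHEAP_FILTERS = [
    filter_by_length,
    filter_by_alpha_ratio,
    filter_by_bad_phrases,
    filter_by_compression_ratio,
    filter_by_language,
]

def passes_quality_pipeline(text, model=None, tokenizer=None):
    for check in CHEAP_FILTERS:
        if not check(text):
            return False
    # Run the expensive perplexity filter only on documents that survive the heuristics
    if model is not None and tokenizer is not None:
        return calculate_perplexity(text, model, tokenizer) < PERPLEXITY_THRESHOLD
    return True

# Usage: keep only documents that pass every stage
kept_docs = [doc for doc in raw_documents if passes_quality_pipeline(doc)]
```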
Successfully applying these quality filtering strategies is fundamental to preparing the massive, yet often messy, datasets required for building high-performing large language models. It lays the groundwork for subsequent preprocessing steps like normalization and deduplication.