This section consolidates the text preprocessing steps into a reusable pipeline. A structured process lets you apply the same transformations consistently to training, validation, and new prediction data. The focus here is a pipeline for text data: transforming raw sentences into padded integer sequences suitable for an embedding layer followed by recurrent layers such as LSTMs or GRUs.

## Defining the Pipeline Goal

Our objective is a function or class that takes a list of raw text documents (e.g., sentences or paragraphs) and outputs a numerical tensor in which each document is represented as a sequence of integers, padded to a uniform length.

## Essential Tools

We'll primarily use tools available within common deep learning frameworks. TensorFlow/Keras provides the `tensorflow.keras.preprocessing.text.Tokenizer` class and the `tensorflow.keras.preprocessing.sequence.pad_sequences` function, which handle most of the heavy lifting. If you're using PyTorch, libraries like `torchtext` offer similar functionality. In this example, we'll use the TensorFlow/Keras utilities.

## Pipeline Steps

The core steps of our text preparation pipeline are:

1. **Fit tokenizer:** Analyze the corpus of text documents to build a vocabulary and map words to unique integer indices. This step should be done only on the training data, to avoid data leakage from validation or test sets.
2. **Integer encode:** Convert each text document into its corresponding sequence of integer indices based on the fitted vocabulary.
3. **Pad sequences:** Ensure all integer sequences have the same length by adding padding tokens (usually 0) either at the beginning (`'pre'`) or end (`'post'`) of shorter sequences. Longer sequences may be truncated.

Here's the flow as a diagram:

```dot
digraph G {
    rankdir=LR;
    node [shape=box, fontname="sans-serif", color="#495057",
          fillcolor="#e9ecef", style="filled,rounded"];
    edge [color="#868e96", fontname="sans-serif"];

    raw_text [label="List of Raw Text Strings"];
    tokenizer [label="Tokenizer\n(Fit on Training Data)", shape=Mrecord, fillcolor="#d0bfff"];
    integer_seqs [label="List of Integer Sequences\n(Variable Length)", fillcolor="#a5d8ff"];
    padding [label="Padding\n(Pad/Truncate)", shape=Mrecord, fillcolor="#96f2d7"];
    padded_tensor [label="Padded Integer Tensor\n(Batch, Max Length)", shape=cylinder, fillcolor="#ffec99"];

    raw_text -> tokenizer [label=" fit_on_texts "];
    tokenizer -> integer_seqs [label=" texts_to_sequences "];
    integer_seqs -> padding;
    padding -> padded_tensor [label=" pad_sequences "];
}
```

*The text data preparation pipeline transforms raw text into a fixed-size numerical tensor suitable for RNN input.*

## Implementing the Pipeline

Let's walk through the implementation using TensorFlow/Keras.

### 1. Sample Data

First, let's define some sample text data to process.

```python
# Sample corpus of text documents
corpus = [
    "Recurrent networks process sequences.",
    "LSTMs handle long dependencies.",
    "Padding makes sequences uniform.",
    "Embeddings represent words numerically."
]

# New data to be processed later using the *same* fitted pipeline
new_data = [
    "GRUs are another type of recurrent network.",
    "Uniform sequence length is needed for batching."
]
```

### 2. Tokenization and Vocabulary Building

We initialize a `Tokenizer` and fit it on our corpus. The `num_words` argument limits the vocabulary to the most frequent words.
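To see why this argument matters, here's a quick illustrative sketch using a hypothetical mini-corpus (`demo_corpus` and `demo_tokenizer` are throwaway names, separate from our main example). Without an OOV token, words that fall outside the top `num_words` are silently dropped during encoding:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Hypothetical mini-corpus to illustrate the effect of num_words
demo_corpus = ["the cat sat", "the cat ran", "the dog barked"]

# num_words=4 keeps only the 3 most frequent words (indices 1-3);
# deliberately no oov_token here
demo_tokenizer = Tokenizer(num_words=4)
demo_tokenizer.fit_on_texts(demo_corpus)

# word_index still contains every word seen during fitting...
print(demo_tokenizer.word_index)
# ...but rarer words simply vanish from the encoded sequences
print(demo_tokenizer.texts_to_sequences(demo_corpus))
```

Silently losing words is rarely what you want, which is where the `oov_token` argument comes in.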
Setting `oov_token` ensures that words encountered later, which were not in the fitting corpus (or fell outside the `num_words` limit), are mapped to a special "out-of-vocabulary" token instead.

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Configuration
vocab_size = 20    # Max number of words to keep
oov_tok = "<OOV>"  # Token for out-of-vocabulary words

# Initialize and fit the tokenizer
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(corpus)

# Display the learned word index (vocabulary)
word_index = tokenizer.word_index
print("Word Index:")
# Display a subset for brevity
print({k: word_index[k] for k in list(word_index)[:15]})
```

This fitting process builds the internal vocabulary: `word_index` holds the mapping from words to integers. Note that `num_words` controls the vocabulary size used during *encoding*, not the size of `word_index` itself, which always contains every word seen. The tokenizer reserves index 0 for padding and assigns the OOV token index 1.

### 3. Integer Encoding

Now we use the fitted tokenizer to convert the text sentences into sequences of integers.

```python
# Convert corpus to integer sequences
sequences = tokenizer.texts_to_sequences(corpus)

print("\nOriginal Corpus Integer Sequences:")
for seq in sequences:
    print(seq)
```

Each list of integers corresponds to a sentence in the corpus, using the indices learned in the previous step.

### 4. Padding Sequences

The sequences currently have different lengths, so we use `pad_sequences` to make them uniform. `maxlen` defines the target length; `padding='post'` adds padding (0s) at the end, while `truncating='post'` removes elements from the end of sequences that are too long. `'pre'` is also a common choice, especially for LSTMs/GRUs, since it places the real tokens at the end of the sequence, closest to the final hidden state.

```python
# Padding configuration
max_sequence_len = 10  # Define a maximum length
padding_type = 'post'
truncation_type = 'post'

# Pad the sequences
padded_sequences = pad_sequences(sequences,
                                 maxlen=max_sequence_len,
                                 padding=padding_type,
                                 truncating=truncation_type)

print("\nPadded Corpus Sequences (Tensor Shape: {}):".format(padded_sequences.shape))
print(padded_sequences)
```

The output `padded_sequences` is a NumPy array with shape `(number_of_samples, max_sequence_len)`, the format expected by an embedding layer in your RNN model. Index 0 is the default padding value.

### 5. Processing New Data

A significant advantage of this pipeline is that the same transformations can be applied to new, unseen data using the already fitted tokenizer.

```python
# Convert new data to integer sequences using the *same* tokenizer
new_sequences = tokenizer.texts_to_sequences(new_data)

print("\nNew Data Integer Sequences:")
for seq in new_sequences:
    print(seq)
# Notice the OOV token (index 1) for words not in the original fit

# Pad the new sequences using the *same* parameters
new_padded_sequences = pad_sequences(new_sequences,
                                     maxlen=max_sequence_len,
                                     padding=padding_type,
                                     truncating=truncation_type)

print("\nPadded New Data Sequences (Tensor Shape: {}):".format(new_padded_sequences.shape))
print(new_padded_sequences)
```

Notice how words like "grus" or "batching", which were not in the original corpus, are mapped to the `oov_token` index (1). This ensures consistent processing.
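If you want to sanity-check what the tokenizer did, its `sequences_to_texts` method decodes integer sequences back into words, with OOV indices coming back as the literal `<OOV>` string. A quick sketch:

```python
# Decode the new sequences back to text to inspect OOV replacements
decoded = tokenizer.sequences_to_texts(new_sequences)
for original, roundtrip in zip(new_data, decoded):
    print("original:", original)
    print("decoded: ", roundtrip)
```

Keep in mind the round trip is lossy: by default the tokenizer lowercases text and strips punctuation, so the decoded strings won't match the originals exactly.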
## Encapsulating the Pipeline

For better reusability, you can wrap these steps into a function or a class. Here's a functional approach:

```python
def create_text_pipeline(training_texts, max_vocab_size=10000, oov_token="<OOV>"):
    """Fits a tokenizer on training texts."""
    tokenizer = Tokenizer(num_words=max_vocab_size, oov_token=oov_token)
    tokenizer.fit_on_texts(training_texts)
    return tokenizer


def preprocess_texts(texts, tokenizer, max_len, padding='post', truncating='post'):
    """Applies tokenization and padding using a fitted tokenizer."""
    sequences = tokenizer.texts_to_sequences(texts)
    padded = pad_sequences(sequences, maxlen=max_len,
                           padding=padding, truncating=truncating)
    return padded


# --- Usage Example ---

# 1. Create pipeline (fit tokenizer) on training data
train_corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
pipeline_tokenizer = create_text_pipeline(train_corpus, max_vocab_size=100)
pipeline_maxlen = 10  # Set based on analysis or requirements

# 2. Process training data
train_padded = preprocess_texts(train_corpus, pipeline_tokenizer, pipeline_maxlen)
print("\n--- Pipeline Usage ---")
print("Processed Training Data Shape:", train_padded.shape)
# print(train_padded)  # Optionally print the array

# 3. Process new/validation/test data
validation_data = ["This is another document to process."]
validation_padded = preprocess_texts(validation_data, pipeline_tokenizer, pipeline_maxlen)
print("Processed Validation Data Shape:", validation_padded.shape)
print(validation_padded)
```

## Final Notes

- **Input to the model:** The padded tensor is now ready. The typical next step is an `Embedding` layer, which converts each integer into a dense vector representation. If you use the default padding value 0, configure the `Embedding` layer with `mask_zero=True` (in Keras) or ensure subsequent RNN layers handle masking appropriately, so the model learns to ignore the padded steps (see the sketch at the end of this section).
- **Vocabulary size and `maxlen`:** Choosing `vocab_size` and `max_sequence_len` involves trade-offs. Larger values capture more information but increase model size, memory usage, and computation. Analyze your data (e.g., the sequence length distribution) to make informed choices.
- **Padding/truncating strategy:** `'pre'` vs. `'post'` padding and truncating can affect performance, especially for tasks where recent information matters most. Experimentation may be needed.
- **Integration:** Apply this preprocessing consistently whenever you feed data to your model: during training, evaluation, and prediction. Frameworks like `tf.data` or PyTorch's `DataLoader` help integrate such preprocessing efficiently into your data loading workflow.

This practical pipeline provides a standard, repeatable way to prepare your text data, a fundamental step before training sequence models like LSTMs and GRUs.
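To close the loop on the masking note above, here is a minimal sketch of feeding the pipeline's output into a model whose embedding layer masks the padding index. The layer sizes are illustrative placeholders, not recommendations:

```python
import tensorflow as tf

# Minimal sketch: an embedding layer that masks padding index 0 so the
# LSTM skips padded timesteps. input_dim matches max_vocab_size above;
# output_dim and the layer widths are arbitrary placeholder choices.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=100, output_dim=16, mask_zero=True),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# The padded NumPy array from the pipeline can be passed in directly
outputs = model(train_padded)
print(outputs.shape)  # (num_samples, 1)
```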