Having explored the theory behind word embeddings like Word2Vec and GloVe, let's put these concepts into practice. In this section, we'll use the popular gensim library to train our own Word2Vec model on a small dataset, explore the resulting word vectors, and learn how to load and utilize powerful pre-trained embedding models. This hands-on experience will solidify your understanding of how distributional semantics translates into practical vector representations.

## Setting Up Your Environment

First, ensure you have the necessary libraries installed. We'll primarily use gensim for Word2Vec, nltk for sample data and basic tokenization, and scikit-learn and plotly for visualization.

```bash
pip install gensim nltk scikit-learn plotly
```

You might also need to download specific nltk resources if you haven't already:

```python
import nltk

# nltk.data.find raises a LookupError when a resource is missing
try:
    nltk.data.find('corpora/brown')
except LookupError:
    nltk.download('brown')

try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
```

## Preparing Text Data

Word2Vec learns from sequences of words (sentences or documents). Unlike TF-IDF, which often benefits from aggressive preprocessing like stop word removal and stemming, Word2Vec generally works better with less manipulation. Basic cleaning like lowercasing and tokenization is usually sufficient. The surrounding context words, including stop words, provide valuable information for learning embeddings.

Let's use the Brown Corpus from nltk as our sample dataset and perform minimal preprocessing:

```python
import nltk
from nltk.corpus import brown
import string

# Load sentences from the Brown Corpus
# Each sentence is already a list of words/tokens in this corpus
raw_sentences = brown.sents()

# Preprocess: lowercase and remove punctuation
processed_sentences = []
for sentence in raw_sentences:
    processed_sentence = [word.lower() for word in sentence
                          if word not in string.punctuation]
    # Ensure sentence is not empty after removing punctuation
    if processed_sentence:
        processed_sentences.append(processed_sentence)

print(f"Loaded and processed {len(processed_sentences)} sentences.")
# Example of a processed sentence
print("Example processed sentence:", processed_sentences[10])
```

This gives us a list of lists, where each inner list contains the tokens of a sentence. This is the format gensim expects.
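The Brown Corpus conveniently arrives pre-tokenized. If you are working with raw text instead, a minimal sketch of the same preprocessing using nltk's `sent_tokenize` and `word_tokenize` might look like the following (the sample string is made up purely for illustration and relies on the punkt resources downloaded above):

```python
from nltk.tokenize import sent_tokenize, word_tokenize
import string

# Hypothetical raw text, for illustration only
raw_text = "The king spoke to the queen. The dog chased the cat across the road."

tokenized_sentences = []
for sent in sent_tokenize(raw_text):                      # split into sentences
    tokens = [tok.lower() for tok in word_tokenize(sent)  # split into tokens, lowercase
              if tok not in string.punctuation]           # drop punctuation tokens
    if tokens:
        tokenized_sentences.append(tokens)

print(tokenized_sentences)
# e.g. [['the', 'king', 'spoke', 'to', 'the', 'queen'],
#       ['the', 'dog', 'chased', 'the', 'cat', 'across', 'the', 'road']]
```

A list of token lists like this can be passed to Word2Vec in exactly the same way as `processed_sentences` below.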
## Training a Word2Vec Model with Gensim

Now, let's train a Word2Vec model using gensim. We need to specify several hyperparameters:

- `sentences`: The input data (our `processed_sentences`).
- `vector_size`: The dimensionality of the word vectors (e.g., 100, 300). Higher dimensions can capture more complex relationships but require more data and computation.
- `window`: The maximum distance between the current and predicted word within a sentence.
- `min_count`: Ignores all words with a total frequency lower than this. Helps filter out rare words and typos.
- `workers`: Number of CPU cores to use for training (parallelization).
- `sg`: Training algorithm: `0` for CBOW (Continuous Bag-of-Words), `1` for Skip-gram. Skip-gram often works better for infrequent words, while CBOW is faster.
- `epochs`: Number of iterations (epochs) over the corpus.

```python
from gensim.models import Word2Vec
import multiprocessing  # To find the number of cores

# Define model parameters
vector_dim = 100        # Dimensionality of the embeddings
window_size = 5         # Context window size
min_word_count = 5      # Minimum word frequency
training_algorithm = 1  # 1 for Skip-gram, 0 for CBOW
num_workers = multiprocessing.cpu_count()  # Use all available cores
training_epochs = 10    # Number of training iterations

print("Training Word2Vec model...")
# Initialize and train the model
# Note: Training can take a few minutes depending on your data size and CPU
model = Word2Vec(sentences=processed_sentences,
                 vector_size=vector_dim,
                 window=window_size,
                 min_count=min_word_count,
                 sg=training_algorithm,
                 workers=num_workers,
                 epochs=training_epochs)

print("Model training complete.")

# You can save the trained model for later use
# model.save("brown_word2vec.model")
# To load: model = Word2Vec.load("brown_word2vec.model")
```

## Exploring the Learned Embeddings

Once the model is trained, we can investigate the learned representations. The `model.wv` attribute holds the vocabulary and vectors.

```python
# Access the vector for a specific word
try:
    vector_king = model.wv['king']
    print(f"Vector for 'king':\n {vector_king[:10]}...")  # Print first 10 dimensions
    print(f"Shape of 'king' vector: {vector_king.shape}")
except KeyError:
    print("'king' not in vocabulary (likely due to min_count or not present in corpus).")

# Find words most similar to a given word
try:
    similar_to_woman = model.wv.most_similar('woman', topn=5)
    print("\nWords most similar to 'woman':")
    for word, score in similar_to_woman:
        print(f"- {word}: {score:.4f}")
except KeyError:
    print("'woman' not in vocabulary.")

# Explore word analogies: king - man + woman = ?
try:
    analogy_result = model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
    print(f"\nAnalogy 'king' - 'man' + 'woman' ≈ {analogy_result[0][0]} "
          f"(Score: {analogy_result[0][1]:.4f})")
except KeyError as e:
    print(f"\nCould not perform analogy: {e.args[0]}")

# Check if a word is in the vocabulary
print(f"\nIs 'government' in vocabulary? {'government' in model.wv.key_to_index}")
print(f"Vocabulary size: {len(model.wv.key_to_index)}")
```

The results, especially for analogies, depend heavily on the size and nature of the training data and the chosen hyperparameters. Our model, trained only on the Brown Corpus, might not capture analogies as well as models trained on gigabytes of text.
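Beyond nearest neighbours and analogies, gensim's `KeyedVectors` interface offers a couple of other quick checks worth running. A small sketch (the word choices here are arbitrary, and the exact numbers will vary from run to run):

```python
# Cosine similarity between two specific words (if both survived min_count)
if 'man' in model.wv.key_to_index and 'woman' in model.wv.key_to_index:
    print(f"similarity('man', 'woman') = {model.wv.similarity('man', 'woman'):.4f}")

# Pick the word that least belongs with the others
candidates = ['breakfast', 'lunch', 'dinner', 'car']
if all(word in model.wv.key_to_index for word in candidates):
    print("Odd one out:", model.wv.doesnt_match(candidates))
```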
## Visualizing Embeddings with PCA/t-SNE

Word vectors live in a high-dimensional space (100 dimensions in our example). To visualize them, we need to reduce their dimensionality to 2D or 3D. Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) are common techniques for this; t-SNE is often preferred for visualizing local structure and clusters.

Let's visualize a subset of our learned vectors using PCA and Plotly.

```python
import numpy as np
from sklearn.decomposition import PCA
import plotly.graph_objects as go

# Select a subset of words for visualization
words_to_visualize = ['man', 'woman', 'king', 'queen', 'boy', 'girl', 'father', 'mother',
                      'son', 'daughter', 'uncle', 'aunt', 'dog', 'cat', 'animal', 'pet',
                      'house', 'home', 'car', 'road', 'city', 'country',
                      'love', 'hate', 'happy', 'sad']

# Get vectors for the selected words that are in the vocabulary
vectors = []
words = []
for word in words_to_visualize:
    if word in model.wv.key_to_index:
        vectors.append(model.wv[word])
        words.append(word)

if not vectors:
    print("None of the selected words for visualization are in the vocabulary.")
else:
    vectors = np.array(vectors)

    # Reduce dimensions using PCA
    pca = PCA(n_components=2)
    vectors_2d = pca.fit_transform(vectors)

    # Create interactive scatter plot with Plotly
    fig = go.Figure(data=go.Scatter(
        x=vectors_2d[:, 0],
        y=vectors_2d[:, 1],
        mode='markers+text',
        marker=dict(
            size=8,
            color='#228be6'  # Blue marker color
        ),
        text=words,
        textposition='top center'
    ))

    fig.update_layout(
        title='Word Embeddings Visualized using PCA (2D)',
        xaxis_title='PCA Component 1',
        yaxis_title='PCA Component 2',
        width=700,
        height=600,
        template='plotly_white'  # Use a clean template
    )

    # Display the plot (in environments like Jupyter)
    # fig.show()

    # Or generate the JSON for web embedding
    plotly_json = fig.to_json()
    print("\nPlotly JSON for visualization (first 500 chars):")
    print(plotly_json[:500] + "...")  # Print snippet of JSON

    # In a web context, you would embed this JSON with Plotly.js
    # in a fenced `plotly` block, as shown below.
```

```plotly
{"layout": {"title": {"text": "Word Embeddings Visualized using PCA (2D)"}, "xaxis": {"title": {"text": "PCA Component 1"}}, "yaxis": {"title": {"text": "PCA Component 2"}}, "width": 700, "height": 600, "template": "plotly_white"}, "data": [{"x": [-1.01611066, -0.63843226, 0.5325136, 1.3459572, -1.273785, -0.9712762, 0.08503094, -0.06407633, -0.41735646, -0.40381184, 0.28162795, 0.1939366, 0.9796771, 1.1803932, 1.1139715, 1.0847292, -0.15815337, -0.09363376, -0.23822996, -0.22942254, -0.73567724, -0.43333733, -0.17882924, 0.05811722, -0.46446952, -0.06559316], "y": [0.4892139, 0.3983181, 1.3179333, 1.5159775, 0.5114402, 0.29538316, 0.8561967, 0.8343815, 0.4325446, 0.60650885, 0.6452927, 0.9083193, -1.057401, -1.008875, -0.7273944, -0.72633445, -0.613686, -0.8512656, -1.032218, -0.9064242, -0.5827198, -0.5822094, -0.00944266, 0.13858478, 0.34471238, 0.2827387], "mode": "markers+text", "marker": {"size": 8, "color": "#228be6"}, "text": ["man", "woman", "king", "queen", "boy", "girl", "father", "mother", "son", "daughter", "uncle", "aunt", "dog", "cat", "animal", "pet", "house", "home", "car", "road", "city", "country", "love", "hate", "happy", "sad"], "textposition": "top center", "type": "scatter"}]}
```

PCA projection of word vectors trained on the Brown Corpus. Observe how related concepts like ('man', 'woman', 'boy', 'girl') or ('dog', 'cat', 'pet') tend to cluster together, demonstrating that the embeddings capture semantic relationships.
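Since t-SNE was mentioned above as an alternative, here is a minimal sketch of the same projection using scikit-learn's `TSNE`. The `perplexity=5` and `random_state=42` settings are illustrative assumptions; perplexity simply has to be smaller than the number of words being plotted.

```python
from sklearn.manifold import TSNE
import plotly.graph_objects as go

# t-SNE projection of the same word vectors (`vectors` and `words` from above)
tsne = TSNE(n_components=2, perplexity=5, init='pca', random_state=42)
vectors_tsne = tsne.fit_transform(vectors)

fig_tsne = go.Figure(data=go.Scatter(
    x=vectors_tsne[:, 0],
    y=vectors_tsne[:, 1],
    mode='markers+text',
    marker=dict(size=8, color='#fa5252'),  # arbitrary marker color
    text=words,
    textposition='top center'
))
fig_tsne.update_layout(title='Word Embeddings Visualized using t-SNE (2D)',
                       width=700, height=600, template='plotly_white')
# fig_tsne.show()
```

With only a couple of dozen points, the t-SNE layout is quite sensitive to perplexity and the random seed, so don't over-interpret the exact arrangement.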
## Using Pre-trained Word Embedding Models

Training high-quality embeddings requires significant computational resources and massive datasets. Often, it's more practical to use pre-trained embeddings released by research institutions. These models are trained on web-scale corpora (like Google News or Wikipedia) and capture rich semantic relationships. gensim provides convenient access to several popular pre-trained models.

Let's load a smaller GloVe model pre-trained on Wikipedia and Gigaword news text. Other options include larger GloVe models or Word2Vec models like word2vec-google-news-300.

```python
import gensim.downloader as api

# List available models (optional)
# print(list(api.info()['models'].keys()))

print("\nLoading pre-trained GloVe model (glove-wiki-gigaword-100)...")
# This will download the model if not present locally (can take time and disk space)
try:
    glove_model = api.load("glove-wiki-gigaword-100")  # 100-dimensional GloVe vectors
    print("Pre-trained GloVe model loaded.")

    # Now use it like our own model
    print("Vector shape:", glove_model['computer'].shape)

    print("\nWords similar to 'technology' (GloVe):")
    similar_tech = glove_model.most_similar('technology', topn=5)
    for word, score in similar_tech:
        print(f"- {word}: {score:.4f}")

    print("\nAnalogy 'king' - 'man' + 'woman' ≈ (GloVe):")
    analogy_result_glove = glove_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
    print(f"- {analogy_result_glove[0][0]} (Score: {analogy_result_glove[0][1]:.4f})")

except Exception as e:
    print(f"Failed to load pre-trained model. Error: {e}")
    print("Check your internet connection or try a different model.")
```

You'll likely observe that the pre-trained model provides more intuitive similarity results and performs better on analogy tasks, thanks to the extensive amount of data it was trained on.

This practical exercise demonstrated how to train your own Word2Vec model and, perhaps more importantly for many applications, how to load and utilize powerful pre-trained embeddings. These dense vector representations are fundamental building blocks for many advanced NLP tasks, including the sequence models we will introduce in the next chapter. They provide a way to feed semantic understanding into machine learning algorithms.
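To make that last point concrete, here is a minimal sketch of one common (if simple) way to turn embeddings into features for a downstream model: average the vectors of the words in a text. The helper function and sample sentence below are illustrative assumptions, not part of gensim's API.

```python
import numpy as np

def average_vector(tokens, keyed_vectors):
    """Average the embeddings of in-vocabulary tokens; return zeros if none are known."""
    known = [keyed_vectors[tok] for tok in tokens if tok in keyed_vectors.key_to_index]
    if not known:
        return np.zeros(keyed_vectors.vector_size)
    return np.mean(known, axis=0)

# Works with either our trained vectors (model.wv) or the pre-trained GloVe vectors
doc = "the queen addressed the government about new technology".split()
doc_vector = average_vector(doc, glove_model)
print(doc_vector.shape)  # (100,) -- a fixed-length feature vector for e.g. a scikit-learn classifier
```

Averaging discards word order entirely, but it is a strong, cheap baseline and a useful stepping stone toward the sequence models introduced in the next chapter.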