Once your machine learning model is serialized, the next task is to load it into your running FastAPI application so it can be used to make predictions. The way you load your model can significantly impact your application's startup time, memory usage, and the latency of the first prediction request. Let's examine common strategies for loading models within FastAPI.
The most straightforward approach is often to load the model when the application first starts. This means the model is ready in memory before the first request arrives, ensuring consistent prediction latency. The primary cost is a potentially slower application startup time, as loading the model becomes part of the initialization process.
A simple method is to load the model into a global variable in your main application file.
# main.py
import joblib
from fastapi import FastAPI, HTTPException

app = FastAPI()

# Load the model when the module is imported (at startup)
try:
    model = joblib.load("models/sentiment_model.pkl")
    # You might also load related objects like vectorizers
    # vectorizer = joblib.load("models/tfidf_vectorizer.pkl")
    print("Model loaded successfully at startup.")
except FileNotFoundError:
    print("Error: Model file not found. Ensure 'models/sentiment_model.pkl' exists.")
    model = None  # Handle the absence of the model gracefully
except Exception as e:
    print(f"Error loading model: {e}")
    model = None

@app.post("/predict")
async def predict_sentiment(text: str):
    if model is None:
        # Return an error if the model failed to load
        raise HTTPException(status_code=503, detail="Model is not available")
    # Assume preprocessing and prediction logic here
    # features = vectorizer.transform([text])
    # prediction = model.predict(features)
    # return {"text": text, "sentiment_prediction": prediction[0]}
    # Placeholder for demonstration
    return {"text": text, "sentiment_prediction": "positive"}  # Replace with actual logic

# Note: For simplicity, input/output validation with Pydantic is omitted here,
# but you should use it as covered in Chapter 2.
While simple, using global variables directly can sometimes make testing and managing application state more complex, especially in larger applications.
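For example, a unit test that wants to control the model has to reach into the module and replace the global directly. Here is a minimal sketch of what that looks like, assuming pytest and FastAPI's TestClient are available; the fake model and test names are illustrative, not part of any library.

# test_main.py (illustrative; assumes pytest and FastAPI's TestClient)
from fastapi.testclient import TestClient

import main  # importing this module runs the model-loading code above


class FakeModel:
    def predict(self, features):
        return ["positive"]


def test_predict_with_patched_model(monkeypatch):
    # Because the model lives in a module-level global, the test must reach
    # into the main module and swap the object out directly.
    monkeypatch.setattr(main, "model", FakeModel())
    client = TestClient(main.app)
    response = client.post("/predict", params={"text": "great product"})
    assert response.status_code == 200

Keeping this kind of patching manageable is one reason to move model loading into a dedicated place, as shown next.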
FastAPI provides lifespan events (or the older startup/shutdown events) that allow you to run code before the application starts accepting requests and after it finishes. This is a cleaner place to load resources such as ML models.
# main_lifespan.py
import joblib
from fastapi import FastAPI, HTTPException
from contextlib import asynccontextmanager

# Dictionary to hold application state, including the loaded model
app_state = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Code to run before the application starts
    print("Application startup: Loading model...")
    try:
        app_state["model"] = joblib.load("models/sentiment_model.pkl")
        # Load other artifacts if needed
        # app_state["vectorizer"] = joblib.load("models/tfidf_vectorizer.pkl")
        print("Model loaded successfully.")
    except FileNotFoundError:
        print("Error: Model file not found.")
        app_state["model"] = None
    except Exception as e:
        print(f"Error loading model during startup: {e}")
        app_state["model"] = None
    yield
    # Code to run when the application is shutting down
    print("Application shutdown: Cleaning up resources...")
    app_state.clear()

app = FastAPI(lifespan=lifespan)

@app.post("/predict")
async def predict_sentiment(text: str):
    model = app_state.get("model")
    if model is None:
        raise HTTPException(status_code=503, detail="Model is not available")
    # Placeholder for prediction logic
    # features = app_state["vectorizer"].transform([text])
    # prediction = model.predict(features)
    # return {"text": text, "sentiment_prediction": prediction[0]}
    return {"text": text, "sentiment_prediction": "positive"}  # Replace with actual logic
Using the lifespan context manager is the recommended approach for managing resources that need to be initialized at startup and cleaned up at shutdown. It keeps model loading logic separate from the main application definition and request handling.
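A closely related option, shown here as a possible variation rather than a requirement, is to store the loaded model on app.state (provided by Starlette, so no extra dependencies are needed) instead of a module-level dictionary, and read it back through request.app.state in the handler. A minimal sketch:

# main_app_state.py (variation on the lifespan example above)
import joblib
from fastapi import FastAPI, HTTPException, Request
from contextlib import asynccontextmanager

@asynccontextmanager
async def lifespan(app: FastAPI):
    try:
        app.state.model = joblib.load("models/sentiment_model.pkl")
    except Exception as e:
        print(f"Error loading model during startup: {e}")
        app.state.model = None
    yield
    app.state.model = None  # release the reference at shutdown

app = FastAPI(lifespan=lifespan)

@app.post("/predict")
async def predict_sentiment(text: str, request: Request):
    model = request.app.state.model
    if model is None:
        raise HTTPException(status_code=503, detail="Model is not available")
    return {"text": text, "sentiment_prediction": "positive"}  # Replace with actual logic

This keeps the loaded objects attached to the application instance itself rather than to the module, which can simplify testing when you create multiple app instances.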
Alternatively, you might load the model only when the first prediction request arrives. This approach, often called "lazy loading," results in a faster application startup because the model isn't loaded immediately. However, the first request that triggers the model loading will experience higher latency.
To avoid reloading the model on every subsequent request, you typically combine lazy loading with some form of caching. Python's functools.lru_cache is a convenient way to achieve this for functions that load resources.
# main_lazy.py
import joblib
from fastapi import FastAPI, HTTPException
from functools import lru_cache

app = FastAPI()

@lru_cache(maxsize=1)  # Cache the result of this function
def get_model():
    print("Attempting to load model (lazy)...")
    try:
        model = joblib.load("models/sentiment_model.pkl")
        print("Model loaded successfully.")
        return model
    except FileNotFoundError:
        print("Error: Model file not found during lazy load.")
        return None  # Indicate failure
    except Exception as e:
        print(f"Error lazy loading model: {e}")
        return None

@app.post("/predict")
async def predict_sentiment(text: str):
    model = get_model()  # Function call triggers loading only once
    if model is None:
        raise HTTPException(status_code=503, detail="Model could not be loaded")
    # Placeholder for prediction logic
    # features = ...  # Assuming a vectorizer is also loaded, perhaps via another cached function
    # prediction = model.predict(features)
    # return {"text": text, "sentiment_prediction": prediction[0]}
    return {"text": text, "sentiment_prediction": "positive"}  # Replace with actual logic
lru_cache(maxsize=1) ensures that get_model is executed only once. Subsequent calls return the cached result (the loaded model object, or None if loading failed) without re-executing the loading logic.
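One convenience of this approach is that lru_cache exposes a cache_clear() method, so you can force a reload, for example after replacing the model file on disk or between tests. A minimal sketch, reusing get_model from main_lazy.py above:

# Drop the cached result so the next call reloads the model from disk.
# cache_clear() is part of functools.lru_cache's standard interface.
get_model.cache_clear()
model = get_model()  # runs the loading logic again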
Lazy loading is beneficial when fast application startup matters more than the latency of the first prediction, or when the model may not be needed by every process that imports the application, such as tests or scripts that never call the prediction endpoint.
Comparison of model loading strategies: Loading at startup incurs delay initially but provides consistent request handling times. Lazy loading starts the application faster but adds latency to the first request that requires the model.
Regardless of the strategy, it's essential to handle potential errors during model loading gracefully. Common issues include the model file being missing, corrupted, or incompatible with the current library versions. Use try...except blocks around your loading code.
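Beyond catching exceptions, it can also help to confirm that the object you loaded is actually usable, for example by checking that it exposes the interface your endpoints rely on. A minimal sketch of such a loader helper; the function name and messages are illustrative, not part of any library:

# model_loading.py (illustrative helper)
import joblib

def load_model(path: str):
    """Load a serialized model and run a basic sanity check."""
    try:
        model = joblib.load(path)
    except FileNotFoundError:
        print(f"Model file not found: {path}")
        return None
    except Exception as e:  # covers corrupted files and version mismatches
        print(f"Failed to deserialize model: {e}")
        return None

    # The loaded object should at least expose the method we call later.
    if not hasattr(model, "predict"):
        print("Loaded object has no predict() method; refusing to use it.")
        return None
    return model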
If a model fails to load at startup, you might prevent the application from starting entirely, or log the error and have prediction endpoints return an appropriate error status code (such as 503 Service Unavailable). If using lazy loading, the endpoint that triggers the load should handle the failure, perhaps by logging the error and returning a 503 status.
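For the fail-fast option mentioned above, one way to get it is to re-raise from inside the lifespan function: an exception during startup stops the server before it ever accepts requests. A minimal sketch, assuming the same file layout as the earlier examples:

# main_failfast.py (sketch of aborting startup on load failure)
import joblib
from fastapi import FastAPI
from contextlib import asynccontextmanager

app_state = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    try:
        app_state["model"] = joblib.load("models/sentiment_model.pkl")
    except Exception as e:
        # Re-raising aborts startup, so the server never begins serving
        # traffic without a working model.
        raise RuntimeError(f"Could not load model: {e}") from e
    yield
    app_state.clear()

app = FastAPI(lifespan=lifespan)

Choose this behavior if running an instance that cannot predict is worse than running no instance at all.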
Machine learning models, especially deep learning models, can consume significant amounts of memory. Keep this in mind when choosing a loading strategy and deploying your application. Loading large models at startup increases the application's baseline memory footprint. Ensure your deployment environment has sufficient RAM to accommodate the model(s) you intend to serve.
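As a rough first check, the size of the serialized file gives a lower bound on the memory the model will need; many models expand further once deserialized. A small sketch using only the standard library:

import os

model_path = "models/sentiment_model.pkl"
# File size on disk is only a lower bound: the in-memory object is often
# larger once arrays and Python objects are reconstructed.
size_mb = os.path.getsize(model_path) / (1024 * 1024)
print(f"Serialized model size: {size_mb:.1f} MB")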
In the following sections, we will build upon these loading techniques as we create the actual prediction endpoints and explore how FastAPI's dependency injection system can further refine how models are provided to your request handlers.