In this article, you will learn how to use scikit-LLM’s text summarization feature to handle large volumes of text in machine learning pipelines.
Topics we will cover include:
- How to build a custom scikit-learn-compatible transformer that wraps a Hugging Face summarization model.
- How to integrate LLM-driven text summarization into a scikit-learn Pipeline for data preprocessing.
- How to chain summarization, TF-IDF vectorization, and a classifier into a single end-to-end pipeline.
Text Summarization with Scikit-LLM
Introduction
In a previous post, we introduced scikit-LLM, a library that bridges the gap between traditional machine learning models and modern large language models (LLMs). In particular, we showcased how to implement zero-shot and few-shot classification use cases with scikit-LLM.
Now, we attempt to answer the question: What if our downstream machine learning use case is hampered by massive amounts of text? To address this challenge, we will use summarizers: another powerful feature of this library that distills long texts into succinct summaries. Let’s see how by implementing a data preparation pipeline that incorporates this process!
Initial Setup
The first step is to make sure you have scikit-LLM installed — replace “pip” with “!pip” if you are working in a cloud notebook environment:
Note that by default, scikit-LLM uses OpenAI language models, which can be expensive to run repeatedly, or heavily limited in number of uses under a free OpenAI account. Alternatively, you can use free Hugging Face pre-trained models for summarization, like sshleifer/distilbart-cnn-12-6. In that case, make sure you also install Hugging Face’s Transformers library, so you can load Hugging Face models in your program.
```shell
pip install transformers==4.37.2
```
LLM-Driven Text Summarization Pipeline
The following class definition encompasses the logic to load a pre-trained model (fit()) and run inference with it, i.e. summarize input texts (transform()):
```python
from sklearn.base import BaseEstimator, TransformerMixin
from transformers import pipeline
import torch

class HuggingFaceSummarizer(BaseEstimator, TransformerMixin):
    def __init__(self, model_name="sshleifer/distilbart-cnn-12-6", max_length=40, min_length=10):
        self.model_name = model_name
        self.max_length = max_length
        self.min_length = min_length
        self.summarizer = None
        self.device = 0 if torch.cuda.is_available() else -1

    def fit(self, X, y=None):
        # The fit() method should just load a pre-trained model into memory.
        # device=0 targets the free GPU if you are using a Colab/Kaggle notebook.
        if self.summarizer is None:
            self.summarizer = pipeline("summarization", model=self.model_name, device=self.device)
        return self

    def transform(self, X):
        # Ensure the model is loaded, even if transform() is called before fit()
        if self.summarizer is None:
            self.summarizer = pipeline("summarization", model=self.model_name, device=self.device)
        # Process the texts and extract the summary strings
        results = self.summarizer(
            X,
            max_length=self.max_length,
            min_length=self.min_length,
            truncation=True
        )
        return [res['summary_text'] for res in results]
```
Importantly, the class we defined inherits from scikit-learn’s BaseEstimator and TransformerMixin base classes: a necessary step to turn it into a custom transformer that integrates smoothly with scikit-learn preprocessing and modeling tools.
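To see concretely what those base classes buy us, here is a minimal sketch with a toy transformer (a hypothetical `TruncateTexts` class, invented here purely for illustration): TransformerMixin supplies fit_transform() automatically, and BaseEstimator supplies get_params()/set_params(), which Pipeline and GridSearchCV rely on to clone and tune a step.

```python
from sklearn.base import BaseEstimator, TransformerMixin

# Toy transformer for illustration only: keeps the first n_words of each text
class TruncateTexts(BaseEstimator, TransformerMixin):
    def __init__(self, n_words=5):
        self.n_words = n_words

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return [" ".join(text.split()[:self.n_words]) for text in X]

t = TruncateTexts(n_words=3)
# fit_transform() comes for free from TransformerMixin
print(t.fit_transform(["one two three four five"]))  # ['one two three']
# get_params() comes for free from BaseEstimator
print(t.get_params())  # {'n_words': 3}
```

Our HuggingFaceSummarizer gets the same machinery for free, which is exactly what lets it slot into a scikit-learn Pipeline below.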
For simplicity, say we will only summarize two text reviews that are part of a larger dataset for text classification. The two “long” texts (features) and the reviews’ sentiments (labels) could look like:
```python
X_long_texts = [
    "I've been using this vacuum cleaner for about three weeks now. At first, I struggled with the attachments, and the manual wasn't very clear. However, once I figured out how the motorized brush works, it easily picked up all the pet hair on my rugs. Overall, it's a solid machine, though a bit heavy to carry up the stairs.",
    "The delivery was delayed by four days, which was incredibly frustrating because I needed it for a weekend trip. When the backpack finally arrived, the zipper snagged immediately. I tried to fix it, but the fabric feels cheap and flimsy. I will definitely be returning this and asking for a full refund.",
]

y_labels = ["positive", "negative"]
```
The real magic happens next. We define a pipeline that brings together our data preprocessing — namely, LLM-driven summarization — and the training of a classifier. In a real scenario, you will need far more than two training examples to build a proper classifier, of course, but the point here is to illustrate how text summarization can reduce the dimensionality of text data:
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# 1. Define the pipeline
# Naming the variable 'classification_pipeline' avoids a possible conflict
# with the transformers.pipeline function
classification_pipeline = Pipeline([
    ('summarizer', HuggingFaceSummarizer(max_length=30, min_length=10)),
    ('vectorizer', TfidfVectorizer()),  # Builds the numerical text representations needed for ML
    ('classifier', LogisticRegression())
])
```
Once the pipeline has been defined, here’s how to run it:
```python
# 2. Train the pipeline
# This downloads the model, summarizes the long texts (on the GPU, if available),
# vectorizes the short summaries, and trains a classifier.
classification_pipeline.fit(X_long_texts, y_labels)

print("Pipeline trained successfully on summarized reviews!")
```
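If you want to verify the pipeline wiring without downloading a summarization model, here is a minimal sketch that swaps the Hugging Face summarizer for a trivial first-sentence extractor (a hypothetical stand-in, not part of scikit-LLM), so it runs in seconds with scikit-learn alone:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Stand-in "summarizer": keep only the first sentence of each review
first_sentence = FunctionTransformer(lambda X: [text.split(".")[0] for text in X])

toy_pipeline = Pipeline([
    ("summarizer", first_sentence),
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression()),
])

X = ["Great product. Works as advertised.", "Terrible quality. Broke on day one."]
y = ["positive", "negative"]

toy_pipeline.fit(X, y)
print(toy_pipeline.predict(["Great product. Arrived late though."]))
```

The same contract holds in either case: each step’s output is fed directly into the next step, so the classifier only ever sees vectors built from the shortened texts.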
That’s all! Try adapting the code above to a real, labeled text dataset for binary sentiment classification, and see how it works in practice.
Before we wrap up, if you are curious about what the summarized texts look like, you can inspect them by calling the summarizer step directly, e.g. `classification_pipeline.named_steps["summarizer"].transform(X_long_texts)`, which produces:
```
[" Overall, it's a solid machine, though a bit heavy to carry up the stairs . At first, I struggled with the attachments,",
 ' The delivery was delayed by four days, which was incredibly frustrating . The zipper snagged immediately . The fabric feels cheap and flimsy .']
```
The summaries are, of course, far from the quality you would get from ChatGPT or Google Gemini; the model we used is a free, lightweight pre-trained model, after all. That said, choosing a more powerful model, such as facebook/bart-large-cnn, will certainly yield better results.
Summary
We bridged the gap between classic machine learning modeling and advanced text processing via pre-trained large language models, thanks to scikit-LLM: a library that leverages the best of both worlds.