In this article, you will learn how to use scikit-LLM’s text summarization feature to handle large volumes of text in machine learning pipelines.

Topics we will cover include:

  • How to build a custom scikit-learn-compatible transformer that wraps a Hugging Face summarization model.
  • How to integrate LLM-driven text summarization into a scikit-learn Pipeline for data preprocessing.
  • How to chain summarization, TF-IDF vectorization, and a classifier into a single end-to-end pipeline.

Text Summarization with Scikit-LLM
Image by Editor

Introduction

In a previous post, we introduced scikit-LLM, a library that bridges the gap between traditional machine learning models and modern large language models (LLMs). In particular, we showcased how to implement zero-shot and few-shot classification use cases with scikit-LLM.

Now, we attempt to answer the question: What if our downstream machine learning use case is hampered by massive amounts of text? To address this challenge, we will explore and use summarizers: another powerful feature of this library that distills long texts into succinct summaries. Let’s see how, by implementing a data preparation pipeline that incorporates this process!

Initial Setup

The first step is to make sure you have scikit-LLM installed — replace “pip” with “!pip” if you are working in a cloud notebook environment:

Note that, by default, scikit-LLM uses OpenAI language models, which can be expensive to run repeatedly and are subject to strict usage limits on a free OpenAI account. Alternatively, you can use free pre-trained Hugging Face summarization models, such as sshleifer/distilbart-cnn-12-6. In that case, make sure you also install Hugging Face’s Transformers library so that you can load Hugging Face models in your program.

LLM-Driven Text Summarization Pipeline

The following class definition encompasses the logic to load a pre-trained model (in fit()) and to run inference with it, i.e. to summarize input texts (in transform()):
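A minimal sketch of such a transformer is shown below. The class name `HFSummarizer` and the generation parameters are illustrative choices, not part of scikit-LLM’s API; the sketch assumes the Transformers library is installed and loads the model lazily inside `fit()`:

```python
from sklearn.base import BaseEstimator, TransformerMixin


class HFSummarizer(BaseEstimator, TransformerMixin):
    """Illustrative transformer wrapping a Hugging Face summarization model
    (a sketch, not part of scikit-LLM's own API)."""

    def __init__(self, model_name="sshleifer/distilbart-cnn-12-6",
                 max_length=60, min_length=10):
        self.model_name = model_name
        self.max_length = max_length
        self.min_length = min_length

    def fit(self, X, y=None):
        # Load the pre-trained model lazily, only when the pipeline is fitted.
        from transformers import pipeline as hf_pipeline
        self.summarizer_ = hf_pipeline("summarization", model=self.model_name)
        return self

    def transform(self, X):
        # Summarize each input text and return a list of summary strings.
        return [
            self.summarizer_(text, max_length=self.max_length,
                             min_length=self.min_length)[0]["summary_text"]
            for text in X
        ]
```

Storing the constructor arguments verbatim (rather than transforming them) keeps the class compatible with scikit-learn’s `get_params()`/`set_params()` machinery, which pipelines and grid search rely on.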

Importantly, the class we defined inherits from scikit-learn’s base estimator and transformer classes: a necessary step to ensure Hugging Face models integrate smoothly with scikit-learn preprocessing and modeling tools.

For simplicity, suppose we only want to summarize two text reviews from a larger text classification dataset. The two “long” texts (features) and the reviews’ sentiment labels could look like this:
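As a toy example (the review texts and labels below are made up for illustration):

```python
# Two long-ish product reviews (features) and their sentiment labels.
X = [
    "I was skeptical at first, but this espresso machine exceeded all of my "
    "expectations. The build quality is excellent, the milk frother works "
    "flawlessly, and every cup tastes like it came from a professional café. "
    "Customer support was also quick and helpful when I had a question.",
    "The laptop arrived with a scratched lid and a battery that barely lasts "
    "two hours. The keyboard feels mushy, the fan is constantly loud, and the "
    "screen flickers at low brightness. Returning it was a hassle as well, so "
    "overall this was a very disappointing purchase.",
]
y = ["positive", "negative"]
```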

The real magic happens next. We define a pipeline that brings together our data preprocessing — namely, LLM-driven summarization — and the training of a classifier. In a real scenario, you will need far more than two training examples to build a proper classifier, of course, but the point here is to illustrate how text summarization can reduce the dimensionality of text data:
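One way to assemble such a pipeline is sketched below: the summarization step feeds its output into TF-IDF vectorization, which in turn feeds a logistic regression classifier. The `HFSummarizer` class and the step names are illustrative, not scikit-LLM’s own API:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class HFSummarizer(BaseEstimator, TransformerMixin):
    """Illustrative transformer wrapping a Hugging Face summarization model."""

    def __init__(self, model_name="sshleifer/distilbart-cnn-12-6"):
        self.model_name = model_name

    def fit(self, X, y=None):
        # Load the pre-trained model only when the pipeline is fitted.
        from transformers import pipeline as hf_pipeline
        self.summarizer_ = hf_pipeline("summarization", model=self.model_name)
        return self

    def transform(self, X):
        return [self.summarizer_(t, max_length=60, min_length=10)[0]["summary_text"]
                for t in X]


# Summarize first, then vectorize the summaries, then classify.
pipe = Pipeline([
    ("summarizer", HFSummarizer()),
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
```

Because the summarizer emits a list of strings, it can be dropped in front of any scikit-learn text vectorizer without further glue code.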

Once the pipeline has been defined, here’s how to run it:
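Fitting and predicting then follow the usual scikit-learn pattern. The self-contained sketch below (illustrative class and variable names; the texts are made up) downloads the pre-trained model on first run, so expect the initial fit to take a while:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class HFSummarizer(BaseEstimator, TransformerMixin):
    """Illustrative transformer wrapping a Hugging Face summarization model."""

    def __init__(self, model_name="sshleifer/distilbart-cnn-12-6"):
        self.model_name = model_name

    def fit(self, X, y=None):
        from transformers import pipeline as hf_pipeline
        self.summarizer_ = hf_pipeline("summarization", model=self.model_name)
        return self

    def transform(self, X):
        return [self.summarizer_(t, max_length=60, min_length=10)[0]["summary_text"]
                for t in X]


X = [
    "I was skeptical at first, but this espresso machine exceeded all of my "
    "expectations: excellent build quality, a flawless milk frother, and "
    "coffee that tastes like it came from a professional café.",
    "The laptop arrived with a scratched lid, a battery that barely lasts two "
    "hours, a mushy keyboard, and a screen that flickers at low brightness. "
    "A very disappointing purchase overall.",
]
y = ["positive", "negative"]

pipe = Pipeline([
    ("summarizer", HFSummarizer()),
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

pipe.fit(X, y)           # summarizes, vectorizes, and trains the classifier
preds = pipe.predict(X)  # predicts on the (summarized) input texts
print(preds)
```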

That’s all! Try adapting the code above to a real, labeled text dataset for binary sentiment classification, and see how it works in practice.

Before we wrap up, if you are curious about what the summarized texts look like, you can inspect the output directly:
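One way to do this, under the same assumptions as before (illustrative `HFSummarizer` sketch, made-up texts, model downloaded on first run), is to fit the summarization step on its own and print what it produces:

```python
from sklearn.base import BaseEstimator, TransformerMixin


class HFSummarizer(BaseEstimator, TransformerMixin):
    """Illustrative transformer wrapping a Hugging Face summarization model."""

    def __init__(self, model_name="sshleifer/distilbart-cnn-12-6"):
        self.model_name = model_name

    def fit(self, X, y=None):
        from transformers import pipeline as hf_pipeline
        self.summarizer_ = hf_pipeline("summarization", model=self.model_name)
        return self

    def transform(self, X):
        return [self.summarizer_(t, max_length=60, min_length=10)[0]["summary_text"]
                for t in X]


X = [
    "I was skeptical at first, but this espresso machine exceeded all of my "
    "expectations: excellent build quality, a flawless milk frother, and "
    "coffee that tastes like it came from a professional café.",
    "The laptop arrived with a scratched lid, a battery that barely lasts two "
    "hours, a mushy keyboard, and a screen that flickers at low brightness. "
    "A very disappointing purchase overall.",
]

# Fit and apply only the summarization step, then print each summary.
summaries = HFSummarizer().fit(X).transform(X)
for summary in summaries:
    print(summary)
```

If the summarizer is part of a fitted pipeline instead, the same step can be reached through the pipeline’s `named_steps` attribute.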

The summaries are, of course, far from the quality you would get from ChatGPT or Google Gemini — the model we used is a free, lightweight pre-trained model, after all. That said, switching to a more powerful model will generally yield better results.

Summary

We bridged the gap between classic machine learning modeling and advanced text processing via pre-trained large language models, thanks to scikit-LLM: a library that leverages the best of both worlds.