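Text Summarization with Scikit-LLM
This article covers text summarization in a Scikit-LLM workflow: it builds a custom transformer around a pre-trained Hugging Face summarization model and embeds it in a scikit-learn pipeline, giving an end-to-end path from long text to classification.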
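Article Info
Key Points
- Scikit-LLM bridges traditional machine learning and large language models, offering zero-shot and few-shot classification as well as text summarization.
- The custom HuggingFaceSummarizer class inherits from BaseEstimator and TransformerMixin; it loads a pre-trained summarization model and generates summaries.
- A pipeline chains summarization, TF-IDF vectorization, and a logistic regression classifier, combining data preprocessing and model training in one object.
- The example uses the lightweight distilbart-cnn-12-6 model; its summary quality is limited, but it can be swapped for a stronger model.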
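Text Summarization and Scikit-LLM
In a previous article, we introduced Scikit-LLM, a library that connects traditional machine learning models with large language models (LLMs), and showed how to use it for zero-shot and few-shot classification. Now we turn to another powerful capability: text summarization. When a downstream task has to process large volumes of text, summarization can condense long documents into concise key points and make processing more efficient. This article walks you through building a data preparation pipeline that includes a summarization step.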
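Initial Setup
First install Scikit-LLM (in a cloud notebook environment, replace pip with !pip):
    pip install scikit-llm
By default, Scikit-LLM uses OpenAI models, which can be costly. As an alternative, you can use a free pre-trained summarization model from Hugging Face, such as sshleifer/distilbart-cnn-12-6. That also requires the Transformers library:
    pip install transformers==4.37.2
PyTorch must be available in the environment as well, since the class below imports torch.
Custom Summarization Transformer
The following class definition wraps the logic for loading the pre-trained model (fit) and running inference (transform):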
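    from sklearn.base import BaseEstimator, TransformerMixin
    from transformers import pipeline
    import torch


    # Mixins are listed before BaseEstimator, the order scikit-learn recommends.
    class HuggingFaceSummarizer(TransformerMixin, BaseEstimator):
        def __init__(self, model_name="sshleifer/distilbart-cnn-12-6", max_length=40, min_length=10):
            self.model_name = model_name
            self.max_length = max_length
            self.min_length = min_length
            self.summarizer = None  # loaded lazily in fit() or transform()
            # Use the first GPU if one is available, otherwise the CPU (-1).
            self.device = 0 if torch.cuda.is_available() else -1

        def fit(self, X, y=None):
            # Nothing is learned from X; fit only loads the pre-trained model.
            if self.summarizer is None:
                self.summarizer = pipeline("summarization", model=self.model_name, device=self.device)
            return self

        def transform(self, X):
            # Load the model on demand if transform is called before fit.
            if self.summarizer is None:
                self.summarizer = pipeline("summarization", model=self.model_name, device=self.device)
            # list(X) lets arrays or pandas Series be passed as well as lists.
            results = self.summarizer(list(X), max_length=self.max_length, min_length=self.min_length, truncation=True)
            return [res["summary_text"] for res in results]
The class inherits from scikit-learn's BaseEstimator and TransformerMixin base classes, which lets it plug directly into pipelines. The fit method only loads the model; the transform method generates one summary for each text in the input list.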
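To see what the two base classes contribute without downloading a model, here is a minimal sketch built on a hypothetical stand-in, FirstSentenceSummarizer, which is not part of the article or of any library. It assumes only that scikit-learn is installed and mirrors the fit/transform shape of the class above.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone


class FirstSentenceSummarizer(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in: keeps the first `n_sentences` sentences of each text."""

    def __init__(self, n_sentences=1):
        self.n_sentences = n_sentences

    def fit(self, X, y=None):
        # Nothing to learn; a model-backed summarizer would load its model here.
        return self

    def transform(self, X):
        summaries = []
        for text in X:
            sentences = [s.strip() for s in text.split(".") if s.strip()]
            summaries.append(". ".join(sentences[: self.n_sentences]) + ".")
        return summaries


summarizer = FirstSentenceSummarizer(n_sentences=1)

# BaseEstimator: constructor arguments are discoverable and settable by name.
params = summarizer.get_params()
summarizer.set_params(n_sentences=2)

# clone() builds a new, unfitted estimator with the same parameters.
fresh = clone(summarizer)

# TransformerMixin: fit_transform() is derived from fit() and transform().
texts = ["The bag arrived late. The zipper broke. I want a refund."]
summaries = summarizer.fit_transform(texts)
print(params, fresh.get_params(), summaries)
```

The same get_params/set_params machinery is what lets a pipeline or a parameter search address the real summarizer's model_name, max_length, and min_length by name.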
Building the End-to-End Pipeline
Suppose we have two long texts and their sentiment labels:
    X_long_texts = [
        "I've been using this vacuum cleaner for about three weeks now. At first, I struggled with the attachments, and the manual wasn't very clear. However, once I figured out how the motorized brush works, it easily picked up all the pet hair on my rugs. Overall, it's a solid machine, though a bit heavy to carry up the stairs.",
        "The delivery was delayed by four days, which was incredibly frustrating because I needed it for a weekend trip. When the backpack finally arrived, the zipper snagged immediately. I tried to fix it, but the fabric feels cheap and flimsy. I will definitely be returning this and asking for a full refund."
    ]
    y_labels = ["positive", "negative"]
Now create a pipeline that chains summarization, TF-IDF vectorization, and a logistic regression classifier:
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Each step's output feeds the next: long text -> summary -> TF-IDF vector -> label.
    classification_pipeline = Pipeline([
        ('summarizer', HuggingFaceSummarizer(max_length=30, min_length=10)),
        ('vectorizer', TfidfVectorizer()),
        ('classifier', LogisticRegression())
    ])
Train the pipeline:
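    classification_pipeline.fit(X_long_texts, y_labels)
    print("Pipeline trained successfully on summarized reviews!")
During training, the pipeline automatically downloads the summarization model, summarizes the long texts, vectorizes the summaries, and trains the classifier. The generated summaries look like this: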
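    ['Overall, it\'s a solid machine, though a bit heavy to carry up the stairs. At first, I struggled with the attachments,', 'The delivery was delayed by four days, which was incredibly frustrating. The zipper snagged immediately. The fabric feels cheap and flimsy.']
The lightweight distilbart-cnn-12-6 model used in this example produces summaries of limited quality. The first summary, for instance, ends mid-sentence, most likely because generation stopped at the max_length=30 limit. You can easily swap in a more powerful model (such as one from OpenAI's GPT series) for better results.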
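Summaries like the ones above can be inspected by calling the fitted summarizer step by name, and the step can be retuned through the pipeline's set_params. The sketch below illustrates both under stated assumptions: TruncatingSummarizer is a hypothetical offline stand-in (it simply keeps the first few words) so the example runs without a model download, and the two short texts are shortened from the reviews above for illustration.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class TruncatingSummarizer(TransformerMixin, BaseEstimator):
    """Hypothetical offline stand-in: keeps the first `max_words` words."""

    def __init__(self, max_words=8):
        self.max_words = max_words

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [" ".join(text.split()[: self.max_words]) for text in X]


texts = [
    "It easily picked up all the pet hair on my rugs and feels solid.",
    "The zipper snagged immediately and the fabric feels cheap and flimsy.",
]
labels = ["positive", "negative"]

pipe = Pipeline([
    ("summarizer", TruncatingSummarizer(max_words=8)),
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression()),
])
pipe.fit(texts, labels)

# Inspect what the later steps actually received: call the step by name.
summaries = pipe.named_steps["summarizer"].transform(texts)
print(summaries)

# Retune the step through the pipeline using the <step>__<param> syntax.
# With the article's class, the analogous call would target
# summarizer__model_name or summarizer__max_length.
pipe.set_params(summarizer__max_words=4)
pipe.fit(texts, labels)
shorter = pipe.named_steps["summarizer"].transform(texts)
print(shorter)
```

Because the summarizer is addressed only by its step name, replacing it with a stronger model does not require touching the vectorizer or the classifier.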
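Summary
With Scikit-LLM, or a custom transformer like the one above, you can integrate LLM-driven text summarization into a traditional scikit-learn workflow. The approach aims to keep the key information while shortening the text the vectorizer sees, which can make downstream machine learning tasks more efficient. Try applying the code above to a real sentiment classification dataset to see how LLMs and traditional machine learning work together.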