Text Summarization with Scikit-LLM

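This article explains how to use text summarization with the Scikit-LLM library: it builds a custom transformer that wraps a pretrained Hugging Face summarization model and embeds it in a scikit-learn pipeline, giving an end-to-end flow from long text to classification.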

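Key points

  • Scikit-LLM bridges traditional machine learning and large language models, offering zero-shot and few-shot classification as well as text summarization.
  • The custom HuggingFaceSummarizer class inherits from BaseEstimator and TransformerMixin; it loads a pretrained summarization model and generates summaries.
  • A pipeline chains summarization, TF-IDF vectorization, and a logistic regression classifier, so data preprocessing and model training run as one unit.
  • The example uses the lightweight distilbart-cnn-12-6 model, whose summaries are of limited quality, but it can be swapped freely for a stronger model.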

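Text summarization and Scikit-LLM

In a previous article, we introduced Scikit-LLM, a library that connects traditional machine learning models with large language models (LLMs), and showed how to use it for zero-shot and few-shot classification. Here we look at another capability: text summarization. When a downstream task is weighed down by large volumes of text, summarization condenses long passages into concise key points, which makes processing more efficient. This article walks you through building a data preparation pipeline that includes a summarization step.

Initial setup

First install Scikit-LLM (in a cloud notebook environment, replace pip with !pip):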

pip install scikit-llm

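By default, Scikit-LLM uses OpenAI models, which can be costly. As an alternative, you can use a free pretrained summarization model from Hugging Face, such as sshleifer/distilbart-cnn-12-6. In that case you also need the Transformers library. The code below also imports torch, so PyTorch must be available; install it separately if your environment does not already include it: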

pip install transformers==4.37.2

Custom summarization transformer

The following class wraps the logic for loading the pretrained model (fit) and running inference (transform):

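from sklearn.base import BaseEstimator, TransformerMixin
from transformers import pipeline
import torch

# Mixins are listed before BaseEstimator, the order scikit-learn's developer guide recommends.
class HuggingFaceSummarizer(TransformerMixin, BaseEstimator):
    """scikit-learn transformer that summarizes texts with a Hugging Face model."""

    def __init__(self, model_name="sshleifer/distilbart-cnn-12-6", max_length=40, min_length=10):
        # Constructor arguments are stored unchanged so get_params() and clone() work.
        self.model_name = model_name
        self.max_length = max_length
        self.min_length = min_length
        # The model itself is loaded lazily on first use.
        self.summarizer = None
        # GPU 0 if CUDA is available, otherwise CPU (-1).
        self.device = 0 if torch.cuda.is_available() else -1

    def _load_summarizer(self):
        # Load the model once; later calls reuse the cached pipeline.
        if self.summarizer is None:
            self.summarizer = pipeline("summarization", model=self.model_name, device=self.device)
        return self.summarizer

    def fit(self, X, y=None):
        # Nothing is learned from the data; fit only makes sure the model is loaded.
        self._load_summarizer()
        return self

    def transform(self, X):
        # Convert arrays or Series to a plain list of strings before calling the model.
        texts = list(X)
        results = self._load_summarizer()(
            texts, max_length=self.max_length, min_length=self.min_length, truncation=True
        )
        return [res["summary_text"] for res in results]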

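The class inherits from BaseEstimator and TransformerMixin, scikit-learn's base classes for custom transformers, which lets it plug directly into a pipeline. The fit method only loads the model, and the transform method generates a summary for each text in the input list.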
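To see what the two base classes contribute without downloading a model, here is a minimal sketch built on a hypothetical stand-in, FirstSentencesSummarizer. It is not part of the article's code or of any library; it simply keeps the first few sentences of each text. It has the same fit/transform shape as HuggingFaceSummarizer, so the inherited get_params, set_params, and fit_transform should behave the same way for both.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone


class FirstSentencesSummarizer(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in: keeps the first `max_sentences` sentences of each text."""

    def __init__(self, max_sentences=1):
        self.max_sentences = max_sentences

    def fit(self, X, y=None):
        # Nothing to learn; present only to satisfy the transformer interface.
        return self

    def transform(self, X):
        summaries = []
        for text in X:
            # Naive sentence split on ". ", good enough for an illustration.
            sentences = [s.strip().rstrip(".") for s in text.split(". ") if s.strip()]
            kept = sentences[: self.max_sentences]
            summaries.append(". ".join(kept) + ".")
        return summaries


texts = ["The zipper broke. The fabric feels cheap. I want a refund."]

summarizer = FirstSentencesSummarizer(max_sentences=2)
params = summarizer.get_params()             # inherited from BaseEstimator
summaries = summarizer.fit_transform(texts)  # inherited from TransformerMixin

# clone() builds an unfitted copy from the constructor parameters,
# and set_params() changes a parameter by name.
variant = clone(summarizer).set_params(max_sentences=1)
short_summaries = variant.fit_transform(texts)

print(params)
print(summaries)
print(short_summaries)
```

These inherited methods are what let a pipeline or a grid search treat the summarizer like any other scikit-learn step.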

Building an end-to-end pipeline

Suppose we have two long texts and their sentiment labels:

X_long_texts = [
    "I've been using this vacuum cleaner for about three weeks now. At first, I struggled with the attachments, and the manual wasn't very clear. However, once I figured out how the motorized brush works, it easily picked up all the pet hair on my rugs. Overall, it's a solid machine, though a bit heavy to carry up the stairs.",
    "The delivery was delayed by four days, which was incredibly frustrating because I needed it for a weekend trip. When the backpack finally arrived, the zipper snagged immediately. I tried to fix it, but the fabric feels cheap and flimsy. I will definitely be returning this and asking for a full refund."
]
y_labels = ["positive", "negative"]

Now create a pipeline that chains summarization, TF-IDF vectorization, and a logistic regression classifier:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

classification_pipeline = Pipeline([
    ('summarizer', HuggingFaceSummarizer(max_length=30, min_length=10)),
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression())
])

Train the pipeline:

classification_pipeline.fit(X_long_texts, y_labels)
print("Pipeline trained successfully on summarized reviews!")

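During training, the pipeline automatically downloads the summarization model, summarizes the long texts, vectorizes the summaries, and trains the classifier. Calling classification_pipeline.named_steps['summarizer'].transform(X_long_texts) returns the generated summaries: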

['Overall, it\'s a solid machine, though a bit heavy to carry up the stairs. At first, I struggled with the attachments,', 'The delivery was delayed by four days, which was incredibly frustrating. The zipper snagged immediately. The fabric feels cheap and flimsy.']
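The summaries are an intermediate result inside the pipeline. One way to look at intermediate output is to slice the fitted pipeline. The sketch below uses a hypothetical keep_first_sentence function and made-up reviews in place of the real summarizer, so nothing has to be downloaded. With the real pipeline, the equivalent slice would be classification_pipeline[:1].

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def keep_first_sentence(texts):
    # Hypothetical stand-in for a real summarizer: keep text up to the first period.
    return [text.split(".")[0].strip() + "." for text in texts]


reviews = [
    "It picked up all the pet hair. It is a bit heavy to carry upstairs.",
    "The zipper snagged immediately. The fabric feels cheap and flimsy.",
]
labels = ["positive", "negative"]

demo_pipeline = Pipeline([
    ("summarizer", FunctionTransformer(keep_first_sentence)),
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression()),
])
demo_pipeline.fit(reviews, labels)

# Slicing a Pipeline returns a sub-pipeline that reuses the same fitted steps.
stage_one_output = demo_pipeline[:1].transform(reviews)
# Everything except the classifier: the TF-IDF matrix the classifier was trained on.
features = demo_pipeline[:-1].transform(reviews)

print(stage_one_output)
print(features.shape)
```

Slicing avoids re-implementing the preprocessing by hand, so the inspected text is exactly what the later steps received.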

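The lightweight distilbart-cnn-12-6 model used here produces summaries of limited quality (the first one above ends mid-sentence), but you can easily swap in a stronger model, such as one of OpenAI's GPT models, for better results.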
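Because the summarizer is a named pipeline step, replacing it does not require touching the vectorizer or the classifier. The sketch below shows only the mechanics, using two hypothetical stand-in functions (first_sentence and last_sentence) and made-up reviews instead of real models. With the real pipeline, the new step would be another summarizer object, for example HuggingFaceSummarizer with a different model_name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def first_sentence(texts):
    # Hypothetical summarizer A: keeps only the opening sentence.
    return [t.split(". ")[0].rstrip(".") + "." for t in texts]


def last_sentence(texts):
    # Hypothetical summarizer B: keeps only the closing sentence.
    return [t.split(". ")[-1].rstrip(".") + "." for t in texts]


reviews = [
    "Setup was confusing. Overall it is a solid machine.",
    "Delivery was late. I will be asking for a refund.",
]
labels = ["positive", "negative"]

pipe = Pipeline([
    ("summarizer", FunctionTransformer(first_sentence)),
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression()),
])

# Replace the whole step by name; the other steps are left untouched.
pipe.set_params(summarizer=FunctionTransformer(last_sentence))
pipe.fit(reviews, labels)

new_summaries = pipe.named_steps["summarizer"].transform(reviews)
print(new_summaries)
```

The pipeline must be refitted after the swap, since the vectorizer and classifier were trained on the old summarizer's output.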

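Summary

With Scikit-LLM, you can integrate LLM-driven text summarization into a traditional scikit-learn workflow. The approach keeps the key information while substantially reducing the dimensionality of the text, which makes downstream machine learning tasks more efficient. Try the code above on a real sentiment classification dataset to see how LLMs and traditional machine learning work together.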
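As a rough check on the reduced dimensionality, the sketch below compares TF-IDF vocabulary sizes for the two original reviews and for the two summaries quoted earlier. The helper name vocabulary_size is hypothetical, and with only two documents the numbers are illustrative rather than a benchmark.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The two reviews and the two generated summaries, copied from the example above.
full_reviews = [
    "I've been using this vacuum cleaner for about three weeks now. At first, I struggled with the attachments, and the manual wasn't very clear. However, once I figured out how the motorized brush works, it easily picked up all the pet hair on my rugs. Overall, it's a solid machine, though a bit heavy to carry up the stairs.",
    "The delivery was delayed by four days, which was incredibly frustrating because I needed it for a weekend trip. When the backpack finally arrived, the zipper snagged immediately. I tried to fix it, but the fabric feels cheap and flimsy. I will definitely be returning this and asking for a full refund.",
]
summaries = [
    "Overall, it's a solid machine, though a bit heavy to carry up the stairs. At first, I struggled with the attachments,",
    "The delivery was delayed by four days, which was incredibly frustrating. The zipper snagged immediately. The fabric feels cheap and flimsy.",
]


def vocabulary_size(texts):
    # Number of distinct tokens, i.e. the number of TF-IDF feature columns.
    return len(TfidfVectorizer().fit(texts).vocabulary_)


full_dim = vocabulary_size(full_reviews)
summary_dim = vocabulary_size(summaries)
print(f"full reviews: {full_dim} features, summaries: {summary_dim} features")
```

On a real dataset the same comparison could be run on the whole corpus before and after summarization to see how much the feature space shrinks.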