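Text Summarization with Scikit-LLM

This article shows how to add a text summarization step to a Scikit-LLM workflow: a custom transformer wraps a pretrained Hugging Face summarization model and is embedded in a scikit-learn pipeline, giving an end-to-end flow from long text to classification.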

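Key Points

  • Scikit-LLM bridges traditional machine learning and large language models, offering zero-shot and few-shot classification as well as text summarization.
  • The custom HuggingFaceSummarizer class inherits from BaseEstimator and TransformerMixin; it loads a pretrained summarization model and generates summaries.
  • A pipeline chains summarization, TF-IDF vectorization, and a logistic regression classifier, so data preprocessing and model training run as a single unit.
  • The example uses the lightweight distilbart-cnn-12-6 model; its summaries are of limited quality, but it can be swapped for a stronger model.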

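Technical Impact

The choice of summarization model may affect model selection, inference cost, product capabilities, and evaluation benchmarks.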

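Text Summarization and Scikit-LLM

In a previous article, we introduced Scikit-LLM, a library that connects traditional machine learning models with large language models (LLMs), and showed how to use it for zero-shot and few-shot classification. Here we look at another capability: text summarization. When a downstream task is weighed down by large volumes of text, summarization condenses long documents into concise key points, which reduces the processing load. This article walks through building a data preparation pipeline that includes a summarization step.

Initial Setup

First install Scikit-LLM (in a cloud notebook environment, replace pip with !pip):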

pip install scikit-llm

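By default, Scikit-LLM uses OpenAI models, which can be costly. As an alternative, you can use a free pretrained summarization model from Hugging Face, such as sshleifer/distilbart-cnn-12-6. In that case you also need the Transformers library. The code later in this article additionally imports torch, so PyTorch must be available in the environment as well: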

pip install transformers==4.37.2
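
Before moving on, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name installed_versions is made up for illustration, and the package list simply mirrors the packages mentioned in this section.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


# Packages this article relies on (torch is imported by the transformer class).
report = installed_versions(["scikit-llm", "transformers", "torch"])
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```

If transformers reports a version other than the pinned 4.37.2, the pin was probably overridden by another install in the same environment.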

Custom Summarization Transformer

The following class wraps the logic for loading the pretrained model (fit) and running inference (transform):

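from sklearn.base import BaseEstimator, TransformerMixin
from transformers import pipeline
import torch

class HuggingFaceSummarizer(BaseEstimator, TransformerMixin):
    def __init__(self, model_name="sshleifer/distilbart-cnn-12-6", max_length=40, min_length=10):
        # Constructor arguments are stored unchanged, as scikit-learn expects
        self.model_name = model_name
        self.max_length = max_length
        self.min_length = min_length
        self.summarizer = None  # loaded lazily on the first fit/transform call
        # Use the first GPU if one is available, otherwise fall back to the CPU
        self.device = 0 if torch.cuda.is_available() else -1

    def _load(self):
        # Download (on first use) and cache the Hugging Face summarization pipeline
        if self.summarizer is None:
            self.summarizer = pipeline("summarization", model=self.model_name, device=self.device)

    def fit(self, X, y=None):
        # Nothing is learned from X; fitting only loads the pretrained model
        self._load()
        return self

    def transform(self, X):
        self._load()
        # X is a collection of strings; list(X) also accepts array-likes such as a pandas Series.
        # max_length/min_length are counted in tokens; truncation=True clips over-long inputs.
        results = self.summarizer(list(X), max_length=self.max_length, min_length=self.min_length, truncation=True)
        return [res['summary_text'] for res in results]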

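The class inherits from scikit-learn's BaseEstimator and TransformerMixin base classes, so it plugs into a pipeline like any other transformer. The fit method only loads the model; the transform method generates a summary for each text in the input list.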
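
To see what those two base classes contribute without downloading a model, here is a minimal offline sketch. LazySummarizer and its word-truncating "model" are hypothetical stand-ins, not part of Scikit-LLM or Transformers. The sketch also shows one possible variation: the name of the loaded model is recorded, so that a later change to model_name triggers a reload on the next fit.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone


class LazySummarizer(BaseEstimator, TransformerMixin):
    """Stand-in with the same fit/transform contract as HuggingFaceSummarizer."""

    def __init__(self, model_name="demo-model", max_words=8):
        # Storing arguments unchanged is what makes get_params/set_params/clone work.
        self.model_name = model_name
        self.max_words = max_words

    def _load(self):
        # Hypothetical loader: a real version would build a transformers pipeline here.
        def summarize(text):
            return " ".join(text.split()[: self.max_words])
        return summarize

    def fit(self, X, y=None):
        # Reload whenever the configured model differs from the one already loaded.
        if getattr(self, "loaded_model_", None) != self.model_name:
            self.summarizer_ = self._load()
            self.loaded_model_ = self.model_name
        return self

    def transform(self, X):
        if not hasattr(self, "summarizer_"):
            self.fit(X)
        return [self.summarizer_(text) for text in X]


est = LazySummarizer(max_words=3)
print(est.get_params())                      # provided by BaseEstimator
print(est.fit_transform(["a b c d e"]))      # provided by TransformerMixin
fresh = clone(est)                           # same parameters, no loaded state
print(hasattr(fresh, "summarizer_"))
```

Keeping loaded state in attributes with a trailing underscore, set during fit rather than in the constructor, follows the usual scikit-learn convention and is a design choice of this sketch rather than something the original class does.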

Building the End-to-End Pipeline

Suppose we have two long texts and their corresponding sentiment labels:

X_long_texts = [
    "I've been using this vacuum cleaner for about three weeks now. At first, I struggled with the attachments, and the manual wasn't very clear. However, once I figured out how the motorized brush works, it easily picked up all the pet hair on my rugs. Overall, it's a solid machine, though a bit heavy to carry up the stairs.",
    "The delivery was delayed by four days, which was incredibly frustrating because I needed it for a weekend trip. When the backpack finally arrived, the zipper snagged immediately. I tried to fix it, but the fabric feels cheap and flimsy. I will definitely be returning this and asking for a full refund."
]
y_labels = ["positive", "negative"]

Now create a pipeline that chains summarization, TF-IDF vectorization, and a logistic regression classifier:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

classification_pipeline = Pipeline([
    ('summarizer', HuggingFaceSummarizer(max_length=30, min_length=10)),
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression())
])

Train the pipeline:

classification_pipeline.fit(X_long_texts, y_labels)
print("Pipeline trained successfully on summarized reviews!")

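During training, the pipeline automatically downloads the summarization model, summarizes the long texts, vectorizes the summaries, and trains the classifier. Inspecting the generated summaries gives: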

['Overall, it\'s a solid machine, though a bit heavy to carry up the stairs. At first, I struggled with the attachments,', 'The delivery was delayed by four days, which was incredibly frustrating. The zipper snagged immediately. The fabric feels cheap and flimsy.']
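
The call that produces this output is not shown above. One way to obtain it, sketched here under assumptions, is to re-run the fitted first step through the pipeline's named_steps attribute. So that the sketch runs offline, a hypothetical FirstSentenceSummarizer stands in for HuggingFaceSummarizer, and the two short texts are invented for the demonstration.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class FirstSentenceSummarizer(BaseEstimator, TransformerMixin):
    """Hypothetical offline stand-in: keeps only the first sentence of each text."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [text.split(". ")[0].rstrip(".") + "." for text in X]


texts = [
    "The brush picked up all the pet hair. It is a bit heavy on the stairs.",
    "The zipper snagged immediately. The fabric feels cheap and flimsy.",
]
labels = ["positive", "negative"]

demo_pipeline = Pipeline([
    ("summarizer", FirstSentenceSummarizer()),
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression()),
])
demo_pipeline.fit(texts, labels)

# Re-run only the first step to see the text the vectorizer received.
summaries = demo_pipeline.named_steps["summarizer"].transform(texts)
print(summaries)

# Slice off the classifier to get the TF-IDF features of the summaries.
features = demo_pipeline[:-1].transform(texts)
print(features.shape)
```

With the pipeline from this article, the equivalent call would be classification_pipeline.named_steps['summarizer'].transform(X_long_texts), which runs the summarization model again on the inputs.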

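The lightweight distilbart-cnn-12-6 model used in this example produces summaries of limited quality, as the first summary above, which breaks off mid-sentence, shows. For better results you can swap in a stronger Hugging Face checkpoint by passing a different model_name, or use an OpenAI-backed summarizer such as Scikit-LLM's GPTSummarizer, which requires an API key.

Summary

With Scikit-LLM, you can integrate LLM-driven text summarization into a traditional scikit-learn workflow. The approach keeps the key information while shrinking the amount of text, and with it the feature space, that downstream machine learning steps have to process. Try applying the code above to a real sentiment classification dataset to see how LLMs and traditional machine learning work together.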
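
As a rough check on the claim about feature-space size, the sketch below fits a TfidfVectorizer once on the two full reviews and once on the two summaries shown earlier, then compares the number of features. The helper name vocabulary_size is made up for illustration, and a two-document sample is far too small to say anything about classification quality.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The two reviews and their summaries, copied from the example in this article.
full_texts = [
    "I've been using this vacuum cleaner for about three weeks now. At first, I struggled with the attachments, and the manual wasn't very clear. However, once I figured out how the motorized brush works, it easily picked up all the pet hair on my rugs. Overall, it's a solid machine, though a bit heavy to carry up the stairs.",
    "The delivery was delayed by four days, which was incredibly frustrating because I needed it for a weekend trip. When the backpack finally arrived, the zipper snagged immediately. I tried to fix it, but the fabric feels cheap and flimsy. I will definitely be returning this and asking for a full refund.",
]
summaries = [
    "Overall, it's a solid machine, though a bit heavy to carry up the stairs. At first, I struggled with the attachments,",
    "The delivery was delayed by four days, which was incredibly frustrating. The zipper snagged immediately. The fabric feels cheap and flimsy.",
]


def vocabulary_size(documents):
    """Number of TF-IDF features (distinct tokens) learned from the documents."""
    return len(TfidfVectorizer().fit(documents).vocabulary_)


full_size = vocabulary_size(full_texts)
summary_size = vocabulary_size(summaries)
print(f"features from full reviews: {full_size}")
print(f"features from summaries:    {summary_size}")
```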