Text Summarization with Scikit-LLM
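This article covers text summarization in the context of Scikit-LLM: it builds a custom transformer that wraps a pretrained Hugging Face summarization model and embeds it in a scikit-learn pipeline, giving an end-to-end flow from long text to classification.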
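Key points
- Scikit-LLM bridges traditional machine learning and large language models, offering zero-shot and few-shot classification as well as text summarization.
- A custom HuggingFaceSummarizer class inherits from BaseEstimator and TransformerMixin, loads a pretrained summarization model, and generates summaries.
- A pipeline chains summarization, TF-IDF vectorization, and a logistic regression classifier, so data preprocessing and model training live in a single object.
- The example uses the lightweight distilbart-cnn-12-6 model; its summaries are of limited quality, but it can be swapped for a stronger model.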
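Text summarization and Scikit-LLM
In a previous article we introduced Scikit-LLM, a library that connects traditional machine learning models with large language models (LLMs), and showed how to use it for zero-shot and few-shot classification. Here we look at another capability: text summarization. When a downstream task is weighed down by large volumes of text, summarization compresses long documents into concise key points, which makes later processing cheaper. This article walks through building a data preparation pipeline that includes a summarization step.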
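Initial setup
First install Scikit-LLM (in a cloud notebook environment, replace pip with !pip):

    pip install scikit-llm

By default, Scikit-LLM uses OpenAI models, which can be costly. As an alternative, you can use a free pretrained summarization model from Hugging Face, such as sshleifer/distilbart-cnn-12-6. This also requires the Transformers library:

    pip install transformers==4.37.2

The class below also imports torch, so PyTorch must be available in the environment.
Custom summarization transformer
The following class wraps the logic for loading a pretrained model (fit) and running inference (transform):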
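
    from sklearn.base import BaseEstimator, TransformerMixin
    from transformers import pipeline
    import torch


    class HuggingFaceSummarizer(BaseEstimator, TransformerMixin):
        """Summarize texts with a pretrained Hugging Face model."""

        def __init__(self, model_name="sshleifer/distilbart-cnn-12-6", max_length=40, min_length=10):
            self.model_name = model_name
            self.max_length = max_length
            self.min_length = min_length
            # The Hugging Face pipeline is created lazily, on the first fit or transform.
            self.summarizer = None
            # Use the first GPU if one is available, otherwise fall back to the CPU (-1).
            self.device = 0 if torch.cuda.is_available() else -1

        def fit(self, X, y=None):
            # Nothing is learned from X; fit only loads (and downloads, if needed) the model.
            if self.summarizer is None:
                self.summarizer = pipeline("summarization", model=self.model_name, device=self.device)
            return self

        def transform(self, X):
            # Load the model here as well, in case transform is called without a prior fit.
            if self.summarizer is None:
                self.summarizer = pipeline("summarization", model=self.model_name, device=self.device)
            # truncation=True cuts inputs that exceed the model's maximum input length.
            results = self.summarizer(X, max_length=self.max_length, min_length=self.min_length, truncation=True)
            return [res['summary_text'] for res in results]

The class inherits from scikit-learn's base classes for custom transformers, so it plugs directly into a pipeline. The fit method only loads the model; the transform method generates a summary for each text in the input list.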
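The same fit/transform contract can be tried without downloading a model. The following is a minimal sketch, assuming a hypothetical stand-in class `FirstSentencesSummarizer` (not part of Scikit-LLM, Transformers, or the original example) that simply keeps the leading sentences of each text. It shows what the two base classes contribute: `fit_transform` comes from TransformerMixin, while `get_params`, `set_params`, and compatibility with `clone` come from BaseEstimator.

```python
import re

from sklearn.base import BaseEstimator, TransformerMixin, clone


class FirstSentencesSummarizer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: keeps the first n_sentences of each text."""

    def __init__(self, n_sentences=1):
        self.n_sentences = n_sentences

    def fit(self, X, y=None):
        # Nothing to learn or load; returning self keeps the scikit-learn contract.
        return self

    def transform(self, X):
        summaries = []
        for text in X:
            # Naive sentence split: ., ! or ? followed by whitespace.
            sentences = re.split(r"(?<=[.!?])\s+", text.strip())
            summaries.append(" ".join(sentences[: self.n_sentences]))
        return summaries


texts = ["First point. Second point. Third point.", "Only one sentence here."]

stub = FirstSentencesSummarizer(n_sentences=2)
summaries = stub.fit_transform(texts)  # provided by TransformerMixin
params = stub.get_params()             # provided by BaseEstimator

# clone() rebuilds an unfitted copy from the constructor parameters.
shorter = clone(stub).set_params(n_sentences=1)
shorter_summaries = shorter.fit_transform(texts)

print(summaries)
print(params)
print(shorter_summaries)
```

Because the constructor only stores its arguments, tools that rely on `get_params` and `clone` (such as grid search) can treat the summarizer like any other estimator.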
Building the end-to-end pipeline
Suppose we have two long texts and their corresponding sentiment labels:

    X_long_texts = [
        "I've been using this vacuum cleaner for about three weeks now. At first, I struggled with the attachments, and the manual wasn't very clear. However, once I figured out how the motorized brush works, it easily picked up all the pet hair on my rugs. Overall, it's a solid machine, though a bit heavy to carry up the stairs.",
        "The delivery was delayed by four days, which was incredibly frustrating because I needed it for a weekend trip. When the backpack finally arrived, the zipper snagged immediately. I tried to fix it, but the fabric feels cheap and flimsy. I will definitely be returning this and asking for a full refund."
    ]
    y_labels = ["positive", "negative"]

Now create a pipeline that chains summarization, TF-IDF vectorization, and a logistic regression classifier:

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    classification_pipeline = Pipeline([
        ('summarizer', HuggingFaceSummarizer(max_length=30, min_length=10)),
        ('vectorizer', TfidfVectorizer()),
        ('classifier', LogisticRegression())
    ])

Train the pipeline:
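
    classification_pipeline.fit(X_long_texts, y_labels)
    print("Pipeline trained successfully on summarized reviews!")

During training, the pipeline automatically downloads the summarization model, summarizes the long texts, vectorizes the summaries, and trains the classifier. The generated summaries look like this:

    ['Overall, it\'s a solid machine, though a bit heavy to carry up the stairs. At first, I struggled with the attachments,',
     'The delivery was delayed by four days, which was incredibly frustrating. The zipper snagged immediately. The fabric feels cheap and flimsy.']

The lightweight distilbart-cnn-12-6 model used here produces summaries of limited quality (the first one is cut off mid-sentence), but you can swap in a stronger model, such as one from OpenAI's GPT series, for better results.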
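The article does not show the call that produced this output. One way to look at what an intermediate step passes on is to reach into the fitted pipeline through `named_steps`. The following is a minimal sketch under stated assumptions: `TruncatingSummarizer` is a hypothetical stand-in for HuggingFaceSummarizer that keeps only the first few words, so the example runs without downloading a model, and the two short texts are invented for illustration.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class TruncatingSummarizer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for HuggingFaceSummarizer: keeps the first max_words words."""

    def __init__(self, max_words=8):
        self.max_words = max_words

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [" ".join(text.split()[: self.max_words]) for text in X]


# Illustrative texts, not taken from the article.
texts = [
    "The vacuum picked up all the pet hair on my rugs and feels like a solid machine overall.",
    "The zipper snagged immediately and the fabric feels cheap, so I am asking for a refund.",
]
labels = ["positive", "negative"]

pipe = Pipeline([
    ("summarizer", TruncatingSummarizer(max_words=8)),
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression()),
])
pipe.fit(texts, labels)

# Re-run only the first step to see what the later steps received.
intermediate = pipe.named_steps["summarizer"].transform(texts)
for summary in intermediate:
    print(summary)

# The fitted vocabulary is built from the shortened texts, not the originals.
vocabulary = sorted(pipe.named_steps["vectorizer"].vocabulary_)
print(vocabulary)
```

With the real HuggingFaceSummarizer in the first step, the same `named_steps["summarizer"].transform(...)` call would run the model again on the inputs, so on large datasets it may be cheaper to summarize once and cache the result.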
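Summary
With Scikit-LLM, you can integrate LLM-driven text summarization into a traditional scikit-learn workflow. The approach keeps the key information while substantially shrinking the text, and with it the feature space, which makes downstream machine learning tasks more efficient. Try applying the code above to a real sentiment classification dataset to see how LLMs and traditional machine learning work together.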
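The size reduction can be checked roughly on the two example reviews and the summaries reported earlier. This is a minimal sketch, assuming a simple regex tokenizer (not the tokenizer TfidfVectorizer uses), so the counts are only a proxy for the dimensionality of the TF-IDF features.

```python
import re

reviews = [
    "I've been using this vacuum cleaner for about three weeks now. At first, I struggled with the attachments, and the manual wasn't very clear. However, once I figured out how the motorized brush works, it easily picked up all the pet hair on my rugs. Overall, it's a solid machine, though a bit heavy to carry up the stairs.",
    "The delivery was delayed by four days, which was incredibly frustrating because I needed it for a weekend trip. When the backpack finally arrived, the zipper snagged immediately. I tried to fix it, but the fabric feels cheap and flimsy. I will definitely be returning this and asking for a full refund.",
]
summaries = [
    "Overall, it's a solid machine, though a bit heavy to carry up the stairs. At first, I struggled with the attachments,",
    "The delivery was delayed by four days, which was incredibly frustrating. The zipper snagged immediately. The fabric feels cheap and flimsy.",
]


def tokens(text):
    """Lowercase word tokens; a rough stand-in for a real tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def corpus_stats(texts):
    """Return (total token count, number of distinct tokens) for a list of texts."""
    all_tokens = [tok for text in texts for tok in tokens(text)]
    return len(all_tokens), len(set(all_tokens))


review_total, review_vocab = corpus_stats(reviews)
summary_total, summary_vocab = corpus_stats(summaries)

print(f"tokens:     {review_total} -> {summary_total}")
print(f"vocabulary: {review_vocab} -> {summary_vocab}")
```

A smaller vocabulary means fewer TF-IDF columns for the classifier, though whether accuracy holds up depends on how much sentiment-bearing wording the summaries keep.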