Inference quickstart
This quickstart assumes that you've deployed a pcore cluster and have authenticated with the pipeline CLI (see the authentication guide).
# Make sure you're on the pre-release version of pipeline-ai
pip install --pre pipeline-ai
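If you want to confirm which version was installed, pip can report it:
# Check the installed pipeline-ai version
pip show pipeline-ai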
This guide demonstrates uploading GPT-Neo to your pcore cluster and running it there. The Pipeline Python SDK is used to create a basic inference pipeline, which is then uploaded:
from httpx import Response

from pipeline import Pipeline, Variable, pipeline_function, pipeline_model
from pipeline.v3 import upload_pipeline


@pipeline_model
class PipelineGPTNeo:
    def __init__(self):
        self.model = None
        self.tokenizer = None
        self.device = None

    @pipeline_function(run_once=True, on_startup=True)
    def load(self):
        import torch
        from transformers import GPT2Tokenizer, GPTNeoForCausalLM

        print(
            "Loading GPT-Neo model...",
            flush=True,
        )
        # Use the GPU if one is available, otherwise fall back to CPU.
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M").to(
            self.device
        )
        self.tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-125M")

    @pipeline_function
    def predict(self, input_data: str) -> str:
        # Tokenize the prompt and move it to the same device as the model.
        input_ids = self.tokenizer(input_data, return_tensors="pt").input_ids.to(
            self.model.device
        )
        gen_tokens = self.model.generate(
            input_ids,
            do_sample=True,
            temperature=0.9,
            max_length=100,
        )
        gen_text = self.tokenizer.batch_decode(gen_tokens)[0]
        return gen_text


# Build the pipeline graph: a single string input flows through load()
# (run once on startup) and predict(), whose output is returned.
with Pipeline("gptneo") as builder:
    in_1 = Variable(str, is_input=True)
    builder.add_variables(in_1)

    gpt_neo = PipelineGPTNeo()
    gpt_neo.load()
    out_str = gpt_neo.predict(in_1)

    builder.output(out_str)

gpt_neo_pipeline = Pipeline.get_pipeline("gptneo")
upload_response: Response = upload_pipeline(gpt_neo_pipeline)
print(f"Uploaded GPT-Neo, server response: {upload_response.text}")
Save the upload script with a name of your choosing; in this guide it is saved as upload_neo.py. Run it from your command line by entering:
python upload_neo.py
A pipeline ID will be returned. Assuming this is your first upload, the ID will be 1, as is the case in this guide. Run the model via the following code, saved as run_neo.py:
import datetime

from pipeline.v3 import run_pipeline

# The ID returned by upload_pipeline; "1" if this was your first upload.
pipeline_id = "1"

print(f"Running GPT-Neo pipeline... (pipeline_id: {pipeline_id})")

start_time = datetime.datetime.now()
result = run_pipeline(pipeline_id, "Hello, my name is")
end_time = datetime.datetime.now()

# Convert the elapsed time from seconds to milliseconds.
total_time = (end_time - start_time).total_seconds() * 1e3
print("Total time taken: %.3f ms, result: '%s'" % (total_time, result))
Run it with:
python run_neo.py
The model takes around 15-20 s to cold start. After that, inference takes a similar amount of time on a CPU, or around 1.4 s on an NVIDIA T4 instance.
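To observe the cold-start versus warm-run difference yourself, you can call run_pipeline a few times in a row and compare the timings. This sketch reuses the run script above:
# Time several consecutive runs; the first call pays the cold-start cost.
import datetime

from pipeline.v3 import run_pipeline

pipeline_id = "1"
for i in range(3):
    start = datetime.datetime.now()
    result = run_pipeline(pipeline_id, "Hello, my name is")
    elapsed_ms = (datetime.datetime.now() - start).total_seconds() * 1e3
    print(f"run {i}: {elapsed_ms:.3f} ms")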