Inference quickstart

This quickstart assumes that you've deployed a pcore cluster and have authenticated with the pipeline CLI (guide here).

# Make sure you're on the pre-release version of pipeline
pip install --pre pipeline-ai

This guide demonstrates uploading GPT-Neo to your pcore cluster and running it there. The Pipeline Python SDK is used to create a basic inference pipeline, which is then uploaded:

from httpx import Response
from pipeline import Pipeline, Variable, pipeline_function, pipeline_model
from pipeline.v3 import upload_pipeline


@pipeline_model
class PipelineGPTNeo:
    def __init__(self):
        self.model = None
        self.tokenizer = None
        self.device = None

    # run_once + on_startup: load the weights once when a worker starts,
    # rather than on every request.
    @pipeline_function(run_once=True, on_startup=True)
    def load(self):
        # Imports live inside the function so they resolve in the
        # environment where the pipeline actually executes.
        import torch
        from transformers import GPT2Tokenizer, GPTNeoForCausalLM

        print(
            "Loading GPT-Neo model...",
            flush=True,
        )
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M").to(
            self.device
        )

        self.tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-125M")

    @pipeline_function
    def predict(self, input_data: str) -> str:
        input_ids = self.tokenizer(input_data, return_tensors="pt").input_ids.to(
            self.model.device
        )

        gen_tokens = self.model.generate(
            input_ids,
            do_sample=True,
            temperature=0.9,
            max_length=100,
        )
        gen_text = self.tokenizer.batch_decode(gen_tokens)[0]
        return gen_text


with Pipeline("gptneo") as builder:
    # A single string input for the prompt.
    in_1 = Variable(str, is_input=True)
    builder.add_variables(in_1)

    gpt_neo = PipelineGPTNeo()
    gpt_neo.load()

    # Inside the Pipeline context, calls to pipeline functions are recorded
    # into the graph rather than executed immediately.
    out_str = gpt_neo.predict(in_1)

    builder.output(out_str)


gpt_neo_pipeline = Pipeline.get_pipeline("gptneo")
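
# Optional sanity check before uploading: some versions of the SDK let you
# execute the built graph locally. Whether `.run` is available here is an
# assumption that may depend on your pipeline-ai version.
# print(gpt_neo_pipeline.run("Hello, my name is"))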
upload_response: Response = upload_pipeline(gpt_neo_pipeline)
print(f"Uploaded GPT-Neo, server response: {upload_response.text}")

Save this file with a name of your choosing; in this guide it was saved as upload_neo.py. Run it from your command line by entering:

python upload_neo.py

A pipeline ID will be returned. Assuming this is your first upload, the ID will be 1, as it is in this guide. Run the model via the following code, saved as run_neo.py:

import datetime

from pipeline.v3 import run_pipeline

pipeline_id = "1"
print(f"Running GPTNeo pipeline... (pipeline_id: {pipeline_id})")

start_time = datetime.datetime.now()

result = run_pipeline(pipeline_id, "Hello, my name is")

end_time = datetime.datetime.now()

total_time = (end_time - start_time).total_seconds() * 1e3  # milliseconds

print("Total time taken: %.3f ms, result: '%s'" % (total_time, result))

Run it with:

python run_neo.py

The model takes around 15-20 s to cold start. Inference then takes a similar amount of time on a CPU, or around 1.4 s on an NVIDIA T4 instance.
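
To see the cold/warm difference on your own cluster, you can time two consecutive runs: the first call pays the cold-start cost, while the second hits an already-warm worker. This sketch reuses run_pipeline from the script above; the prompt and the printed numbers are illustrative:

import datetime

from pipeline.v3 import run_pipeline

pipeline_id = "1"


def timed_run(prompt: str) -> float:
    # Time a single remote run, in milliseconds.
    start = datetime.datetime.now()
    run_pipeline(pipeline_id, prompt)
    return (datetime.datetime.now() - start).total_seconds() * 1e3


cold_ms = timed_run("Hello, my name is")  # includes model load on a cold worker
warm_ms = timed_run("Hello, my name is")  # model is already in memory
print(f"Cold run: {cold_ms:.0f} ms, warm run: {warm_ms:.0f} ms")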


What’s Next