Alaya NeW Cloud

Deploy an OpenAI-compatible inference service with vLLM on CCI

Install vLLM in a CCI instance, download the model, launch an OpenAI-compatible inference server, and verify it with curl.

This tutorial walks through deploying a large language model in an Alaya NeW Cloud Container Instance (CCI) using a vLLM image. Baichuan2-7B-Chat serves as the example model.

Prerequisites

  • An Alaya NeW enterprise account. If you have not registered yet, see account registration.
  • An account balance sufficient to cover a CCI instance with one H800A GPU.

Step 1: Create a CCI instance

  1. Sign in to Alaya NeW and go to Product → Compute → Cloud Container Instance.

  2. Click New Cloud Container and configure:

    • Resource type: pick Cloud Container Instance — GPU — H800A — 1 card
    • Reference table:

    Field          Description                        Requirement                                            Required
    Instance name  Unique identifier                  Letter-prefixed; letters / digits / - / _; 4–20 chars  Yes
    Description    Free-form notes                                                                           No
    Region         Data center                        e.g. Beijing-3, Beijing-5                              Yes
    Billing        Pay-as-you-go                                                                             Yes
    Resources      Resource type / GPU / CPU / disk   As needed                                              Yes
    Storage        Mount NAS                          Optional                                               No
    Image          Public or private                                                                         Yes
    Other          Env vars, auto-stop, auto-release                                                         No

  3. Click Activate, confirm, and wait until the instance state becomes Running.
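
The instance-name rule in the reference table can be checked before you fill in the form. The following is a minimal Python sketch, assuming the rule means exactly this: the first character is an ASCII letter, the remaining characters are letters, digits, hyphens, or underscores, and the total length is 4 to 20. The function name is hypothetical, and the console may apply additional checks that are not covered here.

```python
import re

# Assumed reading of the naming rule: ASCII letter first, then
# letters/digits/-/_, 4 to 20 characters in total.
_INSTANCE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_-]{3,19}")


def is_valid_instance_name(name: str) -> bool:
    """Return True if name satisfies the assumed naming rule."""
    return _INSTANCE_NAME.fullmatch(name) is not None
```

For example, is_valid_instance_name("vllm-demo") accepts the name, while "1abc" (starts with a digit) and "abc" (too short) are rejected.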

Step 2: Deploy the model

  1. Find the instance in the instance list and click Web Connect to open a shell.

    Web Connect entry

  2. Install vLLM:

    pip install vllm
  3. Download the model:

    pip install modelscope
    modelscope download --model baichuan-inc/Baichuan2-7B-Chat --local_dir '/root/model/'
  4. Launch the OpenAI-compatible API server:

    python3 -m vllm.entrypoints.openai.api_server \
      --model /root/model/ \
      --host 0.0.0.0 \
      --port 8080 \
      --dtype auto \
      --trust-remote-code

    --model must point to the directory the model was downloaded to (/root/model/ in this tutorial); change it if you used a different location.

    Server ready

  5. Open a second Web Connect shell and call the API:

    curl http://localhost:8080/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "prompt": "What is a Cloud Container Instance?",
        "max_tokens": 512
      }'

    Inference response
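
The same request can be sent from Python instead of curl. The sketch below uses only the standard library and targets the /v1/completions endpoint from the step above. It assumes the server is reachable at localhost:8080 and that the served model name defaults to the value passed to --model (/root/model/); if the server rejects that name, GET /v1/models should report the name to use. The function names are illustrative and are not part of vLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # host and port from the launch command
MODEL_NAME = "/root/model/"         # assumption: served name equals the --model value


def build_completion_request(prompt: str, max_tokens: int = 512) -> urllib.request.Request:
    """Build the POST request the curl command sends, plus an explicit model name."""
    payload = {"model": MODEL_NAME, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, max_tokens: int = 512, timeout: float = 120.0) -> str:
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, calling complete("What is a Cloud Container Instance?") from the second shell should return the generated text. Splitting request construction from sending keeps the payload easy to inspect without a live server.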

Expose the API externally

To call the model from outside the instance, click Open ports in the instance list, copy the external address mapped to port 8080, and replace localhost in the request URL with that address.

See CCI port management.
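
To confirm that the mapped address works from outside the instance, one option is to query the server's model list. This is a rough sketch, assuming the external address is supplied through an environment variable named VLLM_BASE_URL (a hypothetical name chosen for this example) and that the server exposes GET /v1/models as part of its OpenAI-compatible API. The helper names are illustrative.

```python
import json
import os
import urllib.request


def endpoint(base_url: str, path: str) -> str:
    """Join a base URL and an API path without doubling or dropping slashes."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def base_url_from_env(default: str = "http://localhost:8080") -> str:
    """Read the hypothetical VLLM_BASE_URL variable, falling back to the local address."""
    return os.environ.get("VLLM_BASE_URL", default)


def list_models(base_url: str, timeout: float = 10.0) -> list[str]:
    """Return the model ids reported by GET /v1/models at base_url."""
    with urllib.request.urlopen(endpoint(base_url, "/v1/models"), timeout=timeout) as response:
        body = json.load(response)
    return [entry["id"] for entry in body["data"]]
```

Set VLLM_BASE_URL to the address copied from Open ports, then call list_models(base_url_from_env()). A non-empty list indicates the port mapping and the server are both reachable.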
