Skip to main content

Deploying Models with vLLM on Cloud Container Instances

Updated at: 2025-12-01 14:05:25

This section describes how to use the vLLM image on a Cloud Container Instance to deploy and invoke platform-provided large language models. This topic uses the deployment of the Baichuan2-7B-chat model as an example to illustrate the detailed procedure.

Prerequisites

  • You have obtained your Alaya NeW company account and password. If you need assistance or have not registered yet, you can complete the registration by following the instructions in User Registration.
  • Your company account has sufficient balance to use the Cloud Container Instance service. For the latest promotional details and pricing information, please Contact US.

Procedure

Step 1: Create a Cloud Container Instance

  1. Sign in to the Alaya NeW platform using your company account. Choose "Product" > "Computing" > "Cloud Container Instance" to open the Cloud Container Instance page.

  2. Choose [Create Cloud Container Instance] to open the instance creation page, and configure the instance name, description, AIDC name, and other parameters. In this example, configure the parameters as follows:

    • Resource Type:Select Cloud Container Instance – GPU H800A (1 GPU).

    • For other parameters, refer to the table below.

      Configuration ItemDescriptionRequirementsRequired
      Instance NameA unique identifier used to distinguish this Cloud Container Instance.Must start with a letter; supports letters, digits, hyphens (-), and underscores (_); length 4–20 characters.Yes
      Instance DescriptionA brief text description of the container’s purpose or usage.None.No
      AIDCThe data center used to support cloud container instance service.Select an available data center (for example, Beijing Region 3 or Beijing Region 5).Yes
      Payment MethodThe method for using data center resources.Select the supported payment method. Currently, Pay-As-You-Go is used.Yes
      Resource ConfigurationDetailed computing resource specifications, including resource type, GPU, compute resource, and disk configuration.Select resources that meet your requirements.Yes
      Storage ConfigurationOptional NAS storage that can be mounted to the Cloud Container Instance.Choose whether to mount NAS storage.No
      ImageYou can choose from public images (including base images and application images) or private images, depending on your needs.-Yes
      Other SettingsSupports configuring environment variables (key–value pairs), and enables auto-shutdown and auto-release for the Cloud Container Instance.-No
  3. After configuring the instance parameters, choose "Create Now". In the confirmation dialog box, review the configured parameters and choose "Confirm" to complete the creation of the Cloud Container Instance.

    You can view the created instances on the [Computing / Cloud Container Instance] page. When the instance status is "Running", the instance has been created successfully and is ready for use.

Step 2: Deploy the Model

  1. On the "Cloud Container Instance" page, open the "Container List" tab and locate the target instance. Choose the Web Connect icon on the right.

    alt text

  2. Run the following command to install vLLM:

    pip install vllm
  3. Run the following commands to install ModelScope and download the model:

    pip install modelscope
    ##Download using the CLI
    modelscope download --model baichuan-inc/Baichuan2-7B-Chat --local_dir '/root/model/'
  4. Run the following command to start the model inference service:

    python3 -m vllm.entrypoints.openai.api_server \
    --model /root/model/ \
    --host 0.0.0.0 \
    --port 8080 \
    --dtype auto \
    --trust-remote-code
    warning

    The --model parameter specifies the path to the model. Update this path based on the directory where you downloaded the model in the previous step.

    When output similar to the following appears, the model inference service has started successfully.

    image-20250921224851213

  5. You can reconnect through Web Connect and run the following command to invoke the model:

    curl http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
    "prompt": "What do you think a Cloud Container Instance is?",
    "max_tokens": 512
    }'

    image-20250921224812628