Serverless

Updated at: 2025-11-18 10:38:25

Alaya NeW provides users with the ability to run custom code, manage data, and integrate applications without worrying about the operation and management of the underlying infrastructure. This greatly simplifies development and deployment processes, enabling developers to focus on core business logic, improve efficiency, and reduce operational complexity.

To deploy a GPU inference service on Serverless, users only need to provide a standard Kubernetes manifest; aside from a minimal set of required dependencies, no additional installation or configuration is needed. When GPU inference instances are deployed on VKS (Virtual Kubernetes Services) with a standard Kubernetes manifest, high availability is built in and resources automatically scale in and out with traffic. Scaling down to zero means no resources are consumed and nothing is billed during idle periods, achieving GPU serverless for inference workloads.

info

Alaya NeW Serverless leverages Knative to provide manifest-based application deployment capabilities. For more information, see Knative.
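
For illustration, a minimal Knative Service manifest looks like the sketch below; the name, namespace, and image are placeholders rather than values from the downloadable samples.

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: hello            # placeholder service name
      namespace: default     # placeholder namespace
    spec:
      template:
        metadata:
          annotations:
            autoscaling.knative.dev/min-scale: "0"   # allow scale-to-zero when idle
        spec:
          containers:
            - image: ghcr.io/knative/helloworld-go:latest   # placeholder container image
              ports:
                - containerPort: 8080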

Prerequisites

Preparation

  1. Download the Serverless dependency package and extract the following files: kourier.yaml, serving-core-v1.17.0.yaml, serving-crds.yaml, and knative-nginx.yaml.

  2. Download the sample manifest files, which are used to deploy the Serverless service. After downloading, extract them to the corresponding directory.

    tip

    These files are examples only; adjust the configuration to match your actual business needs.

Operational Procedure

Install Components

  1. Run the following command to connect to the VKS, which is required for subsequent component and service deployment. For more details, see Use Virtual Kubernetes Services.

    export KUBECONFIG=[yourpath]
  2. Run the following command to apply the serving-crds.yaml configuration file in the VKS. This file installs the CRDs required by Knative.

    kubectl apply -f [serving-crds.yaml]
  3. Run the following command to apply the serving-core-v1.17.0.yaml resource configuration file in the VKS. This file installs the core resources required by Knative. Then run kubectl get pod -n knative-serving to confirm that the core Pods are in the Running state.

    kubectl apply -f [serving-core-v1.17.0.yaml]
  4. Run the following command to apply the kourier resource configuration file in the VKS. This file installs the networking components required by Knative.

    kubectl apply -f [kourier.yaml]

    Then run the following commands to confirm that the gateway Pods are in the Running state and that the gateway Service is up:

    kubectl get pod -n kourier-system
    kubectl get svc -n kourier-system
    tip

    The above dependency files only need to be deployed once within the cluster and will take effect across the entire cluster.

  5. Run the following command to apply the knative-nginx resource configuration file in the VKS. This file installs the public access proxy. Then run kubectl get all -n knative-nginx to verify that the public proxy Pods are in the Running state.

    kubectl apply -f [knative-nginx.yaml]
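
For convenience, the installation steps above can be combined into a single script. The sketch below assumes the four YAML files sit in the current directory and that KUBECONFIG points at your VKS kubeconfig; adjust the paths to your environment.

    #!/usr/bin/env bash
    set -euo pipefail

    # Assumption: replace with the path to your VKS kubeconfig.
    export KUBECONFIG=/path/to/your/kubeconfig

    kubectl apply -f serving-crds.yaml           # Knative CRDs
    kubectl apply -f serving-core-v1.17.0.yaml   # Knative core components
    kubectl apply -f kourier.yaml                # Kourier networking layer
    kubectl apply -f knative-nginx.yaml          # public access proxy

    # Block until the core and gateway Pods report Ready.
    kubectl wait --for=condition=Ready pod --all -n knative-serving --timeout=300s
    kubectl wait --for=condition=Ready pod --all -n kourier-system --timeout=300s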

Deploy Services

As mentioned above, when deploying a service with Serverless, users only need to provide a standard Kubernetes Manifest to complete the deployment. The following example demonstrates deploying a qwen7b model and enabling automatic scaling through configuration parameters.

  1. Run the following command in a terminal to apply the serverless.yaml configuration file in the VKS and deploy the model service.

    kubectl apply -f [serverless.yaml]
    tip

    For descriptions of the annotated parameters in the configuration file, see the Appendix. You can adjust these parameters to match your actual business requirements.

  2. Run the following command to view the Knative Service (ksvc) resources in the VKS and confirm that the service has been deployed successfully and is ready to accept external traffic.

    kubectl get ksvc -n knative-qwen7b
  3. (Optional) Run the following command to check the Knative Service (ksvc) Pods in the VKS cluster and verify that the parameter autoscaling.knative.dev/initial-scale: "1" has taken effect: exactly one Pod should be running immediately after deployment.

    kubectl get pod -n knative-qwen7b
  4. (Optional) Wait a while and run the previous command again to check the number of Pods. Because autoscaling.knative.dev/min-scale is set to 0 in the configuration file, the Pod count scales down to 0 after a period with no requests. The annotation sketch after this list shows where these parameters live in the manifest.
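
The behavior in steps 3 and 4 above is driven by autoscaling annotations on the Service template. The following condensed sketch shows where such annotations sit in a manifest like serverless.yaml; the container image and resource values are illustrative placeholders, not the contents of the downloadable sample.

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: knative-qwen7b-svc
      namespace: knative-qwen7b
    spec:
      template:
        metadata:
          annotations:
            autoscaling.knative.dev/initial-scale: "1"   # one replica at startup (verified in step 3)
            autoscaling.knative.dev/min-scale: "0"       # allow scale-to-zero when idle (observed in step 4)
        spec:
          containers:
            - image: your-registry/qwen7b:latest         # placeholder inference image
              resources:
                limits:
                  nvidia.com/gpu: "1"                    # assumption: one GPU per replica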

Access Service

Once the service has been deployed, it can be accessed from both inside and outside the cluster, as described below.

  1. Run the following command to obtain the gateway Service name. The target URL is constructed as service name + namespace + service path; for example, the kourier-internal Service in the kourier-system namespace yields http://kourier-internal.kourier-system/v1/chat/completions.

    kubectl get svc -n kourier-system
  2. Run the following command to obtain the local access domain name, shown in the URL column of the output.

    kubectl get service.serving.knative.dev -n knative-qwen7b
  3. After obtaining the above information, run kubectl exec -it [pod name] -- bash to enter a running Pod inside the cluster. Then run the service status check script below; the Host header tells the Kourier gateway which Knative Service should receive the request. If the call returns a model response, the service is callable.

    Service status check script:

    curl -X POST http://kourier-internal.kourier-system/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Host: knative-qwen7b-svc.knative-qwen7b.127.0.0.1.nip.io" \
      -d '{
        "model": "qwen7b",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "How to calculate 1 + 1."}
        ]
      }'

Uninstall Knative

  1. Delete all services deployed via Knative. Run the following command to remove the services defined in your_deploy.yaml (for example, serverless.yaml).

    kubectl delete -f [your_deploy.yaml]
  2. Uninstall the Knative components in the reverse order of their dependencies. Run the following commands to delete the related resource configuration files.

    kubectl delete -f [knative-nginx.yaml]
    kubectl delete -f [kourier.yaml]
    kubectl delete -f [serving-core-v1.17.0.yaml]
    kubectl delete -f [serving-crds.yaml]
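
To confirm the cleanup, you can check that the Knative namespaces and CRDs no longer exist; both commands below should return nothing once removal completes.

    kubectl get namespaces | grep -E 'knative|kourier'
    kubectl get crd | grep knative.dev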

Appendix

In Knative, autoscaling is controlled through a set of custom resources and annotations. The following table describes some of the autoscaling parameters used in the configuration file.

| Parameter | YAML Configuration | Global Configuration | Default | Description |
| --- | --- | --- | --- | --- |
| Autoscaler class | autoscaling.knative.dev/class | pod-autoscaler-class | kpa.autoscaling.knative.dev | The autoscaler implementation class. |
| Metric | autoscaling.knative.dev/metric | - | concurrency | The metric to monitor, such as concurrency or rps. |
| Target utilization percentage | autoscaling.knative.dev/target-utilization-percentage | container-concurrency-target-percentage | 70 | The percentage of the configured concurrency or rps target to maintain. |
| Target concurrency / requests per second | autoscaling.knative.dev/target | container-concurrency-target-default / requests-per-second-target-default | 100 / 200 | The effective target value depends on the metric in use.* |
| Initial replica count at startup | autoscaling.knative.dev/initial-scale | initial-scale | 1 | The number of replicas when the service first starts. |
| Minimum replicas | autoscaling.knative.dev/min-scale | min-scale | 0 | The minimum number of replicas while the service is running; 0 allows scale-to-zero. |
| Maximum replicas | autoscaling.knative.dev/max-scale | max-scale | 0 | The maximum number of replicas to which the service can scale; 0 means unlimited. |
| Scale-down delay | autoscaling.knative.dev/scale-down-delay | scale-down-delay | 60 | How long to wait after traffic decreases before scaling down. |
| Pod retention period when scaling to zero | autoscaling.knative.dev/scale-to-zero-pod-retention-period | scale-to-zero-pod-retention-period | 0s | The minimum time the last Pod is kept after the autoscaler decides to scale to zero. |
| Panic window percentage | autoscaling.knative.dev/panic-window-percentage | panic-window-percentage | 10.0 | The panic window length as a percentage of the stable window length. |
| Panic threshold percentage | autoscaling.knative.dev/panic-threshold-percentage | panic-threshold-percentage | 200.0 | The load, as a percentage of the target, at which panic mode is triggered. |
info

*: When the metric is rps, the global configuration requests-per-second-target-default applies (default: 200). When the metric is concurrency, the global configuration container-concurrency-target-default applies (default: 100). In both cases, the corresponding YAML parameter is autoscaling.knative.dev/target.
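
For example, to scale on requests per second instead of concurrency, with a target of 150 requests per second per replica, the Service template annotations would look like this (the target value is illustrative):

    spec:
      template:
        metadata:
          annotations:
            autoscaling.knative.dev/metric: "rps"   # scale on requests per second
            autoscaling.knative.dev/target: "150"   # ~150 requests per second per replica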

Summary

Serverless provides an efficient and flexible way to deploy and manage modern cloud-native applications: it removes infrastructure management from the deployment workflow, letting developers focus on core business logic while improving efficiency and reducing operational complexity.