Deploy Service (QuickStart)
API for deploying a base model.
POST
https://api.alayanew.com/api/serverless-infer/v1/deployment
Authorizations
Authorization: String, Header, Required
Authenticate with the Open API Key you have obtained, for example: plain Credential=[YOUR_AK],Signature=[YOUR_SK].
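As a minimal sketch, the header value can be assembled from an AK/SK pair as shown here. The function name build_auth_header is hypothetical; the value format is taken from the example above and from the cURL request on this page, where the square brackets are placeholders and do not appear in the real header.

```python
def build_auth_header(access_key: str, secret_key: str) -> dict[str, str]:
    """Return the Authorization header in the documented 'plain' format.

    Hypothetical helper: the brackets in [YOUR_AK]/[YOUR_SK] are only
    placeholders, so the raw key strings are inserted directly.
    """
    value = f"plain Credential={access_key},Signature={secret_key}"
    return {"Authorization": value}


if __name__ == "__main__":
    # Placeholder keys; substitute your own Open API Key pair.
    print(build_auth_header("YOUR_AK", "YOUR_SK"))
```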
Body
application/json
vksId: String, Required
Vital Kubernetes Engine (VKS) cluster ID.
namespace: String, Required
Vital Kubernetes Engine (VKS) namespace.
name: String, Required
Service name.
servedName: Array[String], Required
Internal model identifier.
modelId: String, Required
Model ID.
backend: String, Required
Backend inference engine, e.g., vllm or sglang.
backendVersion: String, Required
Backend version.
backendArgs: Array[String], Required
Backend startup arguments.
resource: Object, Required
Resource configuration. In the example request it contains workers, cpu, gpu (count, gpuType), and mem.
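Before sending a request, it can help to check the body client-side against the required fields listed above. The sketch below is a hypothetical helper (validate_body and REQUIRED_FIELDS are illustrative names, not part of the API); it only checks presence and that the two array fields are lists of strings, and it does not validate the contents of resource.

```python
# Required top-level fields of the request body, as listed in this section.
REQUIRED_FIELDS = (
    "vksId", "namespace", "name", "servedName", "modelId",
    "backend", "backendVersion", "backendArgs", "resource",
)

# Body copied from the cURL example on this page.
EXAMPLE_BODY = {
    "vksId": "vcacb50arkk4",
    "namespace": "default",
    "name": "testvllm",
    "servedName": ["testvllm"],
    "modelId": "c486cdee-c316-4fc1-9f75-0d1741940f27",
    "backend": "vllm",
    "backendVersion": "0.9.0.1",
    "backendArgs": [],
    "resource": {"workers": 2, "cpu": 4,
                 "gpu": {"count": 1, "gpuType": "nvidia.com/gpu-l40s"},
                 "mem": 10},
}


def validate_body(body: dict) -> list[str]:
    """Return a list of problems found in a request body (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in body]
    # servedName and backendArgs are documented as arrays of strings.
    for name in ("servedName", "backendArgs"):
        value = body.get(name)
        if value is not None and not (
            isinstance(value, list) and all(isinstance(v, str) for v in value)
        ):
            problems.append(f"{name} must be a list of strings")
    return problems


if __name__ == "__main__":
    print(validate_body(EXAMPLE_BODY))  # → []
```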
Response
Status code 200 (application/json)
code: Int
Common return value indicating the result of the operation. Possible values: 0, -1. 0 indicates that the operation completed successfully.
data: Object
Response payload. In the example response it contains serviceId.
msg: String, Required
Error information, returned when code is -1.
Request example (cURL)
curl --location --request POST 'https://api.alayanew.com/api/serverless-infer/v1/deployment' \
  --header 'Authorization: plain Credential=YOUR_AK,Signature=YOUR_SK' \
  --header 'Content-Type: application/json' \
  --data '{
    "vksId": "vcacb50arkk4",
    "namespace": "default",
    "name": "testvllm",
    "servedName": [
      "testvllm"
    ],
    "modelId": "c486cdee-c316-4fc1-9f75-0d1741940f27",
    "backend": "vllm",
    "backendVersion": "0.9.0.1",
    "backendArgs": [],
    "resource": {
      "workers": 2,
      "cpu": 4,
      "gpu": {
        "count": 1,
        "gpuType": "nvidia.com/gpu-l40s"
      },
      "mem": 10
    }
  }'
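The same request can be built with the Python standard library. This is a sketch, not an official SDK: build_deploy_request and send are hypothetical names, the 30-second timeout is an arbitrary choice, and building the request is kept separate from sending it so the request can be inspected without network access.

```python
import json
import urllib.request

ENDPOINT = "https://api.alayanew.com/api/serverless-infer/v1/deployment"


def build_deploy_request(body: dict, access_key: str, secret_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the cURL example."""
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"plain Credential={access_key},Signature={secret_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body.

    urlopen raises urllib.error.HTTPError for 4xx/5xx status codes.
    """
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Usage: `send(build_deploy_request(body, "YOUR_AK", "YOUR_SK"))`, where body has the same shape as the JSON in the cURL example.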
Possible HTTP status codes: 200, 400, 401, 403, 404, 500.
Response example (200)
{
  "code": "required,int, Status code (0 for success, others for failure)",
  "data": {
    "serviceId": ""
  },
  "msg": "optional,string, Error information (returned when code is -1)"
}
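A client typically branches on code rather than on the HTTP status alone. The sketch below assumes the response body follows the fields described above (code, data.serviceId, msg); DeployError and parse_deploy_response are hypothetical names used for illustration.

```python
import json


class DeployError(RuntimeError):
    """Raised when the API reports a non-zero code (hypothetical exception type)."""


def parse_deploy_response(raw: str) -> str:
    """Return data.serviceId on success; raise DeployError carrying msg otherwise."""
    payload = json.loads(raw)
    code = payload.get("code")
    if code != 0:
        # Per the field description, msg holds the error information when code is -1.
        raise DeployError(f"code={code}: {payload.get('msg', '')}")
    return payload["data"]["serviceId"]
```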
Parameter configuration differs by backend type. The details for each backend are given below.
- vLLM
- SGLang
vLLM version 0.9.0.1: startup behavior is described below.
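- At startup, the argument --enable-lora is added automatically.
- At startup, the environment variable "VLLM_ALLOW_RUNTIME_LORA_UPDATING": true is set automatically.
- By default, the service listens on port 30000 (--port 30000).
- When backend is vllm, backendVersion is 0.9.0.1, and no backendArgs are specified, the arguments --max-loras 16 --max-lora-rank 16 are added automatically at startup.
- The default scale parameters are as follows: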
{
"max": 5,
"min": 1,
"idleTime": 60,
"rpsValue": 100,
"inFlightValue": null
}
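The startup rules for vLLM 0.9.0.1 can be summarized as a small function. This is a hypothetical sketch, not the platform's implementation: the function name is invented, the port comes from the --port 30000 flag listed above, the environment variable value is written as the string "true" because process environments hold strings, and treating both a missing and an empty backendArgs as "not specified" is an assumption.

```python
def vllm_startup_defaults(
    backend: str, backend_version: str, backend_args=None
) -> tuple[list[str], dict[str, str]]:
    """Return (args, env) that the docs describe as added automatically for vLLM 0.9.0.1."""
    if backend != "vllm" or backend_version != "0.9.0.1":
        return [], {}
    # Always added for this version, per the list above.
    args = ["--enable-lora", "--port", "30000"]
    env = {"VLLM_ALLOW_RUNTIME_LORA_UPDATING": "true"}
    # Assumption: None and [] both count as "no backendArgs specified".
    if not backend_args:
        args += ["--max-loras", "16", "--max-lora-rank", "16"]
    return args, env
```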
SGLang version 0.4.6: startup behavior is described below.
- The default scale parameters are as follows:
{
"max": 5,
"min": 1,
"idleTime": 60,
"rpsValue": 100,
"inFlightValue": null
}
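- When backend is sglang, backendVersion is 0.4.6, and no backendArgs are specified, the argument --tp ${gpuCount} is added automatically at startup, where gpuCount = resource.workers * resource.gpu.count. If no GPU information is provided, this argument is omitted.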
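The --tp rule is a simple product of the resource fields. The sketch below is hypothetical (sglang_tp_args is an invented name) and treats both None and an empty list as "no backendArgs specified", which is an assumption. With the example request's resource (workers 2, gpu.count 1) it yields --tp 2.

```python
def sglang_tp_args(resource: dict, backend_args=None) -> list[str]:
    """Sketch of the documented --tp default for sglang 0.4.6."""
    # Assumption: None and [] both count as "no backendArgs specified".
    if backend_args:
        return []
    gpu = resource.get("gpu")
    # No GPU information: the --tp argument is omitted.
    if not gpu or "count" not in gpu:
        return []
    gpu_count = resource["workers"] * gpu["count"]
    return ["--tp", str(gpu_count)]
```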
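Tip
In the current deployment mode, QuickStart models cannot be scaled down to zero instances. When adjusting resources, keep the instance count at 1 or more.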
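A client-side guard can reject a zero minimum before a request is sent. This sketch assumes a scale object shaped like the default scale parameters shown for each backend; check_quickstart_scale is a hypothetical helper, merging user values over those defaults is an illustrative choice, and the max >= min check is an added assumption rather than a documented rule.

```python
# Default scale parameters, copied from the vLLM and SGLang sections above.
DEFAULT_SCALE = {"max": 5, "min": 1, "idleTime": 60, "rpsValue": 100, "inFlightValue": None}


def check_quickstart_scale(scale: dict) -> dict:
    """Merge user scale values over the defaults and reject scale-to-zero."""
    merged = {**DEFAULT_SCALE, **scale}
    if merged["min"] < 1:
        raise ValueError("QuickStart models cannot scale to zero; min must be >= 1")
    # Assumption: a maximum below the minimum is treated as invalid.
    if merged["max"] < merged["min"]:
        raise ValueError("max must be >= min")
    return merged
```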