create_evaluator

BedrockAgentCoreControl.Client.create_evaluator(**kwargs)

Creates a custom evaluator for agent quality assessment. Custom evaluators use LLM-as-a-Judge configurations with user-defined prompts, rating scales, and model settings to evaluate agent performance at tool call, trace, or session levels.

See also: AWS API Documentation

Request Syntax

response = client.create_evaluator(
    clientToken='string',
    evaluatorName='string',
    description='string',
    evaluatorConfig={
        'llmAsAJudge': {
            'instructions': 'string',
            'ratingScale': {
                'numerical': [
                    {
                        'definition': 'string',
                        'value': 123.0,
                        'label': 'string'
                    },
                ],
                'categorical': [
                    {
                        'definition': 'string',
                        'label': 'string'
                    },
                ]
            },
            'modelConfig': {
                'bedrockEvaluatorModelConfig': {
                    'modelId': 'string',
                    'inferenceConfig': {
                        'maxTokens': 123,
                        'temperature': ...,
                        'topP': ...,
                        'stopSequences': [
                            'string',
                        ]
                    },
                    'additionalModelRequestFields': {...}|[...]|123|123.4|'string'|True|None
                }
            }
        }
    },
    level='TOOL_CALL'|'TRACE'|'SESSION'
)
Parameters:
  • clientToken (string) –

    A unique, case-sensitive identifier to ensure that the API request completes no more than one time. If you don’t specify this field, a value is randomly generated for you. If this token matches a previous request, the service ignores the request, but doesn’t return an error. For more information, see Ensuring idempotency.

    This field is autopopulated if not provided.

  • evaluatorName (string) –

    [REQUIRED]

    The name of the evaluator. Must be unique within your account.

  • description (string) – The description of the evaluator that explains its purpose and evaluation criteria.

  • evaluatorConfig (dict) –

    [REQUIRED]

    The configuration for the evaluator, including LLM-as-a-Judge settings with instructions, rating scale, and model configuration.

    Note

    This is a Tagged Union structure. Only one of the following top level keys can be set: llmAsAJudge.

    • llmAsAJudge (dict) –

      The LLM-as-a-Judge configuration that uses a language model to evaluate agent performance based on custom instructions and rating scales.

      • instructions (string) – [REQUIRED]

        The evaluation instructions that guide the language model in assessing agent performance, including criteria and evaluation guidelines.

      • ratingScale (dict) – [REQUIRED]

        The rating scale that defines how the evaluator should score agent performance, either numerical or categorical.

        Note

        This is a Tagged Union structure. Only one of the following top level keys can be set: numerical, categorical.

        • numerical (list) –

          The numerical rating scale with defined score values and descriptions for quantitative evaluation.

          • (dict) –

            The definition of a numerical rating scale option that provides a numeric value with its description for evaluation scoring.

            • definition (string) – [REQUIRED]

              The description that explains what this numerical rating represents and when it should be used.

            • value (float) – [REQUIRED]

              The numerical value for this rating scale option.

            • label (string) – [REQUIRED]

              The label or name that describes this numerical rating option.

        • categorical (list) –

          The categorical rating scale with named categories and definitions for qualitative evaluation.

          • (dict) –

            The definition of a categorical rating scale option that provides a named category with its description for evaluation scoring.

            • definition (string) – [REQUIRED]

              The description that explains what this categorical rating represents and when it should be used.

            • label (string) – [REQUIRED]

              The label or name of this categorical rating option.

      • modelConfig (dict) – [REQUIRED]

        The model configuration that specifies which foundation model to use and how to configure it for evaluation.

        Note

        This is a Tagged Union structure. Only one of the following top level keys can be set: bedrockEvaluatorModelConfig.

        • bedrockEvaluatorModelConfig (dict) –

          The Amazon Bedrock model configuration for evaluation.

          • modelId (string) – [REQUIRED]

            The identifier of the Amazon Bedrock model to use for evaluation. Must be a supported foundation model available in your region.

          • inferenceConfig (dict) –

            The inference configuration parameters that control model behavior during evaluation, including temperature, token limits, and sampling settings.

            • maxTokens (integer) –

              The maximum number of tokens to generate in the model response during evaluation.

            • temperature (float) –

              The temperature value that controls randomness in the model’s responses. Lower values produce more deterministic outputs.

            • topP (float) –

              The top-p sampling parameter that controls the diversity of the model’s responses by limiting the cumulative probability of token choices.

            • stopSequences (list) –

              The list of sequences that will cause the model to stop generating tokens when encountered.

              • (string) –

          • additionalModelRequestFields (document) –

            Additional model-specific request fields to customize model behavior beyond the standard inference configuration.

  • level (string) –

    [REQUIRED]

    The evaluation level that determines the scope of evaluation. Valid values are TOOL_CALL for individual tool invocations, TRACE for single request-response interactions, or SESSION for entire conversation sessions.
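Putting the required parameters together, a request for a numerical tool-call evaluator might look like the sketch below. The evaluator name, instructions, and model ID are illustrative assumptions, not values taken from this reference; substitute a model available in your Region.

```python
# A minimal sketch of a create_evaluator request payload for a numerical
# LLM-as-a-Judge evaluator. Name, instructions, and modelId are assumptions.
request = {
    "evaluatorName": "tool-call-accuracy",  # must be unique within the account
    "description": "Scores whether each tool call matched the user's intent.",
    "evaluatorConfig": {
        # Tagged union: exactly one top-level key (llmAsAJudge).
        "llmAsAJudge": {
            "instructions": (
                "Rate how well the tool call satisfied the user's request, "
                "using the numerical scale provided."
            ),
            # Tagged union: set numerical OR categorical, never both.
            "ratingScale": {
                "numerical": [
                    {"value": 1.0, "label": "Poor",
                     "definition": "The tool call did not address the request."},
                    {"value": 5.0, "label": "Excellent",
                     "definition": "The tool call fully satisfied the request."},
                ]
            },
            "modelConfig": {
                "bedrockEvaluatorModelConfig": {
                    # Hypothetical model ID for illustration only.
                    "modelId": "anthropic.claude-3-5-sonnet-20240620-v1:0",
                    "inferenceConfig": {"maxTokens": 512, "temperature": 0.0},
                }
            },
        }
    },
    "level": "TOOL_CALL",
}

# Sanity-check the tagged-union constraints before sending:
assert list(request["evaluatorConfig"]) == ["llmAsAJudge"]
scale = request["evaluatorConfig"]["llmAsAJudge"]["ratingScale"]
assert len(scale.keys() & {"numerical", "categorical"}) == 1

# With AWS credentials configured, the call itself would be:
# client = boto3.client("bedrock-agentcore-control")
# response = client.create_evaluator(**request)
```

A categorical evaluator is built the same way, replacing the `numerical` list with a `categorical` list whose entries carry only `label` and `definition`.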

Return type:

dict

Returns:

Response Syntax

{
    'evaluatorArn': 'string',
    'evaluatorId': 'string',
    'createdAt': datetime(2015, 1, 1),
    'status': 'ACTIVE'|'CREATING'|'CREATE_FAILED'|'UPDATING'|'UPDATE_FAILED'|'DELETING'
}

Response Structure

  • (dict) –

    • evaluatorArn (string) –

      The Amazon Resource Name (ARN) of the created evaluator.

    • evaluatorId (string) –

      The unique identifier of the created evaluator.

    • createdAt (datetime) –

      The timestamp when the evaluator was created.

    • status (string) –

      The status of the evaluator creation operation.
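The evaluator is not necessarily usable as soon as `create_evaluator` returns: `status` may start as `CREATING` before settling into `ACTIVE` or `CREATE_FAILED`. A small helper for interpreting the status values listed in the response syntax above, as a sketch:

```python
# Status values taken from the response syntax of create_evaluator.
TERMINAL_OK = {"ACTIVE"}
TERMINAL_FAILED = {"CREATE_FAILED", "UPDATE_FAILED"}
IN_PROGRESS = {"CREATING", "UPDATING", "DELETING"}

def is_settled(status: str) -> bool:
    """Return True once the evaluator has reached a terminal state."""
    if status in TERMINAL_OK or status in TERMINAL_FAILED:
        return True
    if status in IN_PROGRESS:
        return False
    raise ValueError(f"unknown evaluator status: {status}")

# A typical polling loop would look like the following. Note that a companion
# get_evaluator read operation is assumed here; confirm its name and signature
# in the service's API reference before relying on it:
#
# import time
# while not is_settled(status):
#     time.sleep(5)
#     status = client.get_evaluator(evaluatorId=evaluator_id)["status"]
```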

Exceptions