create_evaluator

BedrockAgentCoreControl.Client.create_evaluator(**kwargs)

Creates a custom evaluator for agent quality assessment. Custom evaluators use LLM-as-a-Judge configurations with user-defined prompts, rating scales, and model settings to evaluate agent performance at tool call, trace, or session levels.

See also: AWS API Documentation

Request Syntax

response = client.create_evaluator(
    clientToken='string',
    evaluatorName='string',
    description='string',
    evaluatorConfig={
        'llmAsAJudge': {
            'instructions': 'string',
            'ratingScale': {
                'numerical': [
                    {
                        'definition': 'string',
                        'value': 123.0,
                        'label': 'string'
                    },
                ],
                'categorical': [
                    {
                        'definition': 'string',
                        'label': 'string'
                    },
                ]
            },
            'modelConfig': {
                'bedrockEvaluatorModelConfig': {
                    'modelId': 'string',
                    'inferenceConfig': {
                        'maxTokens': 123,
                        'temperature': ...,
                        'topP': ...,
                        'stopSequences': [
                            'string',
                        ]
                    },
                    'additionalModelRequestFields': {...}|[...]|123|123.4|'string'|True|None
                }
            }
        }
    },
    level='TOOL_CALL'|'TRACE'|'SESSION'
)
Parameters:
  • clientToken (string) –

    A unique, case-sensitive identifier to ensure that the API request completes no more than one time. If you don’t specify this field, a value is randomly generated for you. If this token matches a previous request, the service ignores the request, but doesn’t return an error. For more information, see Ensuring idempotency.

    This field is autopopulated if not provided.

  • evaluatorName (string) –

    [REQUIRED]

    The name of the evaluator. Must be unique within your account.

  • description (string) – The description of the evaluator that explains its purpose and evaluation criteria.

  • evaluatorConfig (dict) –

    [REQUIRED]

    The configuration for the evaluator, including LLM-as-a-Judge settings with instructions, rating scale, and model configuration.

    Note

    This is a Tagged Union structure. Only one of the following top level keys can be set: llmAsAJudge.

    • llmAsAJudge (dict) –

      The LLM-as-a-Judge configuration that uses a language model to evaluate agent performance based on custom instructions and rating scales.

      • instructions (string) – [REQUIRED]

        The evaluation instructions that guide the language model in assessing agent performance, including criteria and evaluation guidelines.

      • ratingScale (dict) – [REQUIRED]

        The rating scale that defines how the evaluator should score agent performance, either numerical or categorical.

        Note

        This is a Tagged Union structure. Only one of the following top level keys can be set: numerical, categorical.

        • numerical (list) –

          The numerical rating scale with defined score values and descriptions for quantitative evaluation.

          • (dict) –

            The definition of a numerical rating scale option that provides a numeric value with its description for evaluation scoring.

            • definition (string) – [REQUIRED]

              The description that explains what this numerical rating represents and when it should be used.

            • value (float) – [REQUIRED]

              The numerical value for this rating scale option.

            • label (string) – [REQUIRED]

              The label or name that describes this numerical rating option.

        • categorical (list) –

          The categorical rating scale with named categories and definitions for qualitative evaluation.

          • (dict) –

            The definition of a categorical rating scale option that provides a named category with its description for evaluation scoring.

            • definition (string) – [REQUIRED]

              The description that explains what this categorical rating represents and when it should be used.

            • label (string) – [REQUIRED]

              The label or name of this categorical rating option.

      • modelConfig (dict) – [REQUIRED]

        The model configuration that specifies which foundation model to use and how to configure it for evaluation.

        Note

        This is a Tagged Union structure. Only one of the following top level keys can be set: bedrockEvaluatorModelConfig.

        • bedrockEvaluatorModelConfig (dict) –

          The Amazon Bedrock model configuration for evaluation.

          • modelId (string) – [REQUIRED]

            The identifier of the Amazon Bedrock model to use for evaluation. Must be a supported foundation model available in your region.

          • inferenceConfig (dict) –

            The inference configuration parameters that control model behavior during evaluation, including temperature, token limits, and sampling settings.

            • maxTokens (integer) –

              The maximum number of tokens to generate in the model response during evaluation.

            • temperature (float) –

              The temperature value that controls randomness in the model’s responses. Lower values produce more deterministic outputs.

            • topP (float) –

              The top-p sampling parameter that controls the diversity of the model’s responses by limiting the cumulative probability of token choices.

            • stopSequences (list) –

              The list of sequences that will cause the model to stop generating tokens when encountered.

              • (string) –

          • additionalModelRequestFields (document) –

            Additional model-specific request fields to customize model behavior beyond the standard inference configuration.

  • level (string) –

    [REQUIRED]

    The evaluation level that determines the scope of evaluation. Valid values are TOOL_CALL for individual tool invocations, TRACE for single request-response interactions, or SESSION for entire conversation sessions.
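Putting the required parameters together, a request for a numerical tool-call evaluator might look like the sketch below. The evaluator name, instructions, and model ID are illustrative assumptions, not values taken from this reference; substitute a model available in your Region.

```python
# A minimal sketch of a create_evaluator request payload for a numerical
# LLM-as-a-Judge evaluator. Name, instructions, and modelId are assumptions.
request = {
    "evaluatorName": "tool-call-accuracy",  # must be unique within the account
    "description": "Scores whether each tool call matched the user's intent.",
    "evaluatorConfig": {
        # Tagged union: exactly one top-level key (llmAsAJudge).
        "llmAsAJudge": {
            "instructions": (
                "Rate how well the tool call satisfied the user's request, "
                "using the numerical scale provided."
            ),
            # Tagged union: set numerical OR categorical, never both.
            "ratingScale": {
                "numerical": [
                    {"value": 1.0, "label": "Poor",
                     "definition": "The tool call did not address the request."},
                    {"value": 5.0, "label": "Excellent",
                     "definition": "The tool call fully satisfied the request."},
                ]
            },
            "modelConfig": {
                "bedrockEvaluatorModelConfig": {
                    # Hypothetical model ID for illustration only.
                    "modelId": "anthropic.claude-3-5-sonnet-20240620-v1:0",
                    "inferenceConfig": {"maxTokens": 512, "temperature": 0.0},
                }
            },
        }
    },
    "level": "TOOL_CALL",
}

# Sanity-check the tagged-union constraints before sending:
assert list(request["evaluatorConfig"]) == ["llmAsAJudge"]
scale = request["evaluatorConfig"]["llmAsAJudge"]["ratingScale"]
assert len(scale.keys() & {"numerical", "categorical"}) == 1

# With AWS credentials configured, the call itself would be:
# client = boto3.client("bedrock-agentcore-control")
# response = client.create_evaluator(**request)
```

A categorical evaluator is built the same way, replacing the `numerical` list with a `categorical` list whose entries carry only `label` and `definition`.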

Return type:

dict

Returns:

Response Syntax

{
    'evaluatorArn': 'string',
    'evaluatorId': 'string',
    'createdAt': datetime(2015, 1, 1),
    'status': 'ACTIVE'|'CREATING'|'CREATE_FAILED'|'UPDATING'|'UPDATE_FAILED'|'DELETING'
}

Response Structure

  • (dict) –

    • evaluatorArn (string) –

      The Amazon Resource Name (ARN) of the created evaluator.

    • evaluatorId (string) –

      The unique identifier of the created evaluator.

    • createdAt (datetime) –

      The timestamp when the evaluator was created.

    • status (string) –

      The status of the evaluator creation operation.
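The evaluator is not necessarily usable as soon as `create_evaluator` returns: `status` may start as `CREATING` before settling into `ACTIVE` or `CREATE_FAILED`. A small helper for interpreting the status values listed in the response syntax above, as a sketch:

```python
# Status values taken from the response syntax of create_evaluator.
TERMINAL_OK = {"ACTIVE"}
TERMINAL_FAILED = {"CREATE_FAILED", "UPDATE_FAILED"}
IN_PROGRESS = {"CREATING", "UPDATING", "DELETING"}

def is_settled(status: str) -> bool:
    """Return True once the evaluator has reached a terminal state."""
    if status in TERMINAL_OK or status in TERMINAL_FAILED:
        return True
    if status in IN_PROGRESS:
        return False
    raise ValueError(f"unknown evaluator status: {status}")

# A typical polling loop would look like the following. Note that a companion
# get_evaluator read operation is assumed here; confirm its name and signature
# in the service's API reference before relying on it:
#
# import time
# while not is_settled(status):
#     time.sleep(5)
#     status = client.get_evaluator(evaluatorId=evaluator_id)["status"]
```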

Exceptions