Python: Fix: Implement __deepcopy__ on KernelFunction to handle non-serializable OTEL metrics #12803


Merged
9 commits merged into microsoft:main on Aug 7, 2025

Conversation

MMoshtaghi
Contributor

Description

This PR fixes the TypeError: cannot pickle '_thread.RLock' object error that occurs when kernel.clone() is called while OpenTelemetry metrics are enabled.

Motivation and Context

When using features like HandoffOrchestration, the agent's Kernel is cloned. This process uses copy.deepcopy, which fails on the KernelFunction's OpenTelemetry Histogram fields (invocation_duration_histogram and streaming_duration_histogram). These histogram objects contain thread locks, making them non-serializable.

Previous solutions, like the one in PR #9292, used Field(exclude=True, default_factory=...). While that was a necessary step, it was insufficient because deepcopy does not respect the exclude flag in the same way that Pydantic's serialization does. This bug blocks the use of certain agentic features in any application that has metrics-based observability enabled.
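
To illustrate the gap with a minimal standalone example (not code from this PR; the Model class below is a hypothetical stand-in): Pydantic's exclude flag hides a field from serialization, but deepcopy still traverses it.

import copy
import threading
from typing import Any

from pydantic import BaseModel, Field


class Model(BaseModel):
    # exclude=True hides the field from serialization, but not from deepcopy.
    lock: Any = Field(default_factory=threading.RLock, exclude=True)


m = Model()
print(m.model_dump())  # {} -- the excluded field is skipped when serializing
copy.deepcopy(m)       # TypeError: cannot pickle '_thread.RLock' object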

Description of the change

This solution adds a custom __deepcopy__ method to the KernelFunction Pydantic model. This method correctly handles the deep-copy process for KernelFunction instances:

  1. It intercepts the deepcopy call for KernelFunction objects.
  2. It uses self.model_copy(deep=False) to create a new instance of the Pydantic model. Using deep=False allows the default_factory for the histogram fields to create new, clean instances on the copied object, while other fields are shallow-copied.
  3. It then manually deep-copies all other attributes from the original object to the new one, ensuring that any other mutable objects (like metadata) are correctly duplicated. The histogram fields are explicitly skipped in this step, as they have already been recreated by model_copy.
  4. Standard deepcopy memoization is used to prevent infinite recursion in the case of cyclic object graphs.

This approach ensures that the non-serializable fields are re-initialized instead of copied, resolving the TypeError while maintaining the integrity of the cloned object.
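
A minimal runnable sketch of these steps (KernelFunctionSketch and its fields are illustrative stand-ins: RLocks play the role of the non-copyable histograms, and the re-initialization is spelled out via update= here, whereas the actual fix relies on the fields' default_factory):

import copy
import threading
from typing import Any

from pydantic import BaseModel, Field

HISTOGRAM_FIELDS = {"invocation_duration_histogram", "streaming_duration_histogram"}


class KernelFunctionSketch(BaseModel):
    metadata: dict[str, Any] = Field(default_factory=dict)
    invocation_duration_histogram: Any = Field(default_factory=threading.RLock, exclude=True)
    streaming_duration_histogram: Any = Field(default_factory=threading.RLock, exclude=True)

    def __deepcopy__(self, memo: dict[int, Any]) -> "KernelFunctionSketch":
        # Steps 1-2: intercept deepcopy and build a shallow copy in which the
        # histogram fields are fresh instances rather than shared references.
        new = self.model_copy(
            deep=False,
            update={name: threading.RLock() for name in HISTOGRAM_FIELDS},
        )
        # Step 4: memoize before recursing so cyclic object graphs terminate.
        memo[id(self)] = new
        # Step 3: deep-copy every other attribute so mutable state is duplicated.
        for name, value in self.__dict__.items():
            if name in HISTOGRAM_FIELDS:
                continue
            setattr(new, name, copy.deepcopy(value, memo))
        return new


f = KernelFunctionSketch(metadata={"kind": "prompt"})
clone = copy.deepcopy(f)  # no TypeError: histograms re-created, not copied
assert clone.metadata == f.metadata and clone.metadata is not f.metadata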

How has this been tested?

The fix was validated using the reproduction steps outlined in the associated issue. Running an agent orchestration that triggers kernel.clone() with OpenTelemetry metrics enabled now completes successfully without any errors.

Contribution Checklist

@MMoshtaghi MMoshtaghi requested a review from a team as a code owner July 25, 2025 17:35
@moonbox3 moonbox3 added the python Pull requests for the Python Semantic Kernel label Jul 25, 2025
@github-actions github-actions bot changed the title from "Fix: Implement __deepcopy__ on KernelFunction to handle non-serializable OTEL metrics" to "Python: Fix: Implement __deepcopy__ on KernelFunction to handle non-serializable OTEL metrics" Jul 25, 2025
@MMoshtaghi
Contributor Author

@microsoft-github-policy-service agree company="Digital Workforce Services Oyj"

@moonbox3
Collaborator

moonbox3 commented Aug 4, 2025

Python Test Coverage

Python Test Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | --- | --- | --- | --- |
| connectors/azure_ai_search.py | 314 | 31 | 90% | 159–160, 208, 210–213, 215, 244, 246, 248, 339, 361–362, 423, 425, 439–440, 451, 522, 544, 554, 566, 586–587, 682, 684, 686, 689, 792–793 |
| connectors/chroma.py | 262 | 96 | 63% | 104, 107–111, 126–127, 135–136, 165, 167, 175, 187–190, 197, 216, 220, 241, 249, 253, 266, 270, 275–276, 285–288, 304–305, 313, 332, 343, 345, 349, 367–369, 371–386, 388–398, 400–406, 408, 410, 412–413, 416–417, 419–420, 423–432, 454–455, 457 |
| connectors/sql_server.py | 504 | 89 | 82% | 152, 349–351, 371, 390, 401, 403, 426, 472–475, 491, 503–506, 528, 530, 574–580, 586–587, 589–590, 594–608, 611–616, 619–623, 628, 634–635, 638, 644–645, 647–650, 789, 808–810, 816, 822, 827, 832–835, 841, 875, 917, 924–926, 1092–1093, 1095, 1097, 1118 |
| connectors/weaviate.py | 342 | 149 | 56% | 110, 315–318, 327–328, 330–332, 337, 341–342, 353–355, 364–372, 375–376, 380–381, 384–386, 389–390, 395–396, 399–400, 402, 411–413, 415–424, 426, 428–443, 447–453, 455, 457, 459–460, 463–464, 466–467, 470–479, 489–490, 493–494, 500–501, 504–505, 509, 512–516, 531, 546, 555–556, 570, 574–577, 580–581, 587–588, 592, 595–596, 599, 602, 622, 641, 657, 666–667, 672–673, 678, 772, 783–788, 793–798, 803–804 |
| functions/kernel_function.py | 141 | 2 | 98% | 106, 108 |
| TOTAL | 26829 | 4608 | 82% | |

Python Unit Test Overview

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 3681 | 22 💤 | 0 ❌ | 0 🔥 | 1m 58s ⏱️ |

@moonbox3
Collaborator

moonbox3 commented Aug 4, 2025

Hi @MMoshtaghi, thanks for the contribution. Please make sure you install the pre-commit hooks locally. You can find the directions here. Especially the part around:

# Install pre-commit hooks
uv run pre-commit install -c python/.pre-commit-config.yaml

There are pre-commit check failures right now that need to be addressed. Thanks.

@TaoChenOSU
Contributor

Could we also add tests?

@MMoshtaghi
Contributor Author

Hi @moonbox3, thanks for the review. I fixed the pre-commit check failures.

@moonbox3
Collaborator

moonbox3 commented Aug 6, 2025

Per Tao's comment, which I agree with:

  1. Let's add tests
  2. There's a conflict with the uv.lock that you'll want to resolve. Once you pull latest main, you can remove the current uv.lock and then run uv sync --all-extras --dev to regenerate it.

@MMoshtaghi
Contributor Author

1- Ok, I'll create a new test file in tests/integration/telemetry/test_telemetry_integration.py, and use the sample in MS Learn for telemetry.

2- I actually followed the DEV_SETUP.md (running make install), and the Makefile already has uv sync --all-extras --dev --prerelease=if-necessary-or-explicit ...!

@moonbox3
Collaborator

moonbox3 commented Aug 6, 2025

> 1- Ok, I'll create a new test file in tests/integration/telemetry/test_telemetry_integration.py, and use the sample in MS Learn for telemetry.

I think unit tests should be enough to cover the overridden deepcopy method you added. We don't need integration tests.

@MMoshtaghi
Contributor Author

Ok, I'll create a unit test to cover deepcopy for KernelFunction. In which file should I write it? How about test_kernel_function_from_prompt.py? Would this be ok?

from copy import deepcopy

from semantic_kernel.functions import KernelFunctionFromPrompt


def test_kernel_function_from_prompt_deepcopy():
    """Test deepcopying a KernelFunctionFromPrompt."""
    function = KernelFunctionFromPrompt(
        function_name="test_function",
        plugin_name="test_plugin",
        prompt="Hello, world!",
        description="A test function.",
    )
    copied_function = deepcopy(function)
    assert copied_function is not function
    assert copied_function.name == function.name
    assert copied_function.plugin_name == function.plugin_name
    assert copied_function.description == function.description
    assert copied_function.prompt_template.prompt_template_config.template == (
        function.prompt_template.prompt_template_config.template
    )
    assert copied_function.prompt_template is not function.prompt_template

@TaoChenOSU
Contributor

@MMoshtaghi
Contributor Author

Hi @TaoChenOSU,
Yeah, I actually explained that in the issue #12802. It doesn't matter whether it's streaming or not.

@MMoshtaghi
Contributor Author

All tests and pre-commit checks passed on my end before the push.


@moonbox3
Collaborator

moonbox3 commented Aug 7, 2025

@MMoshtaghi not sure why mypy is failing.

There is a unit test failure, though, that I don't see in main. Could be related to the override __deepcopy__: https://github.com/microsoft/semantic-kernel/actions/runs/16792165215/job/47568903867?pr=12803#step:5:623.

@MMoshtaghi
Contributor Author

MMoshtaghi commented Aug 7, 2025

> There is a unit test failure, though, that I don't see in main. Could be related to the override __deepcopy__: https://github.com/microsoft/semantic-kernel/actions/runs/16792165215/job/47568903867?pr=12803#step:5:623.

@moonbox3 I know, but as you can see the unit test error occurred after you merged branch 'main' into main.
Before that there was only the mypy error, which I also get when running uv run mypy -p semantic_kernel --config-file mypy.ini. I think it's because the type signature for ast.Constant.value is very broad, and the connector code in weaviate.py, sql_server.py, azure_ai_search.py, and chroma.py does not properly handle all of these possibilities. I may be able to fix it here too.
https://github.com/microsoft/semantic-kernel/actions/runs/16779393946/job/47555115866#step:6:11

The unit test error is because of a timeout when getting the orchestration result in test_magentic.py, only in Python 3.12 (my version is 3.10):

=================================== FAILURES ===================================
___________________ test_invoke_with_agent_raising_exception ___________________

...

async def test_invoke_with_agent_raising_exception():
        """Test the invoke method of the MagenticOrchestration with a list of messages which raises an error."""
        with (
            patch.object(
                MockChatCompletionService, "get_chat_message_content", new_callable=AsyncMock
            ) as mock_get_chat_message_content,
            patch.object(
                StandardMagenticManager, "create_progress_ledger", new_callable=AsyncMock, side_effect=ManagerProgressList
            ),
        ):
            mock_get_chat_message_content.return_value = ChatMessageContent(role="assistant", content="mock_response")
            chat_completion_service = MockChatCompletionService(ai_model_id="test")
            prompt_execution_settings = MockPromptExecutionSettings()
    
            manager = StandardMagenticManager(
                chat_completion_service=chat_completion_service,
                prompt_execution_settings=prompt_execution_settings,
            )
    
            agent_a = MockAgentWithException(name="agent_a", description="test agent")
            agent_b = MockAgent(name="agent_b", description="test agent")
    
            runtime = InProcessRuntime()
            runtime.start()
    
            orchestration = MagenticOrchestration(members=[agent_a, agent_b], manager=manager)
            try:
                orchestration_result = await orchestration.invoke(task="test_message", runtime=runtime)
                with pytest.raises(RuntimeError, match="Mock agent exception"):
>                   await orchestration_result.get(1.0)

tests/unit/agents/orchestration/test_magentic.py:277: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
semantic_kernel/agents/orchestration/orchestration_base.py:57: in get
    await asyncio.wait_for(self.event.wait(), timeout=timeout)
/opt/homebrew/Cellar/[email protected]/3.12.11/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/tasks.py:519: in wait_for
    async with timeouts.timeout(timeout):
               ^^^^^^^^^^^^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

...

=========================== short test summary info ============================
FAILED tests/unit/agents/orchestration/test_magentic.py::test_invoke_with_agent_raising_exception - TimeoutError
1 failed, 3675 passed, 5 xfailed, 471 warnings in 88.87s (0:01:28)

@MMoshtaghi
Contributor Author

MMoshtaghi commented Aug 7, 2025

I just fixed the mypy errors, and here I explain the root cause and the solution:

Problem:
The mypy type checker found errors because the code was not correctly handling all possible data types for constant values (ast.Constant) when building database filters. A constant's value can be a string, number, boolean, or even bytes, but the original code treated them all the same, leading to type mismatches and potential formatting errors.

Solution Overview:
The fix was to replace the simple return node.value with more robust logic in the _lambda_parser method of four different connector files. This new logic explicitly checks the type of the constant's value and handles it correctly (a runnable sketch follows the lists below).

  • Specific Type Handling:

    • Strings (str): String values are now properly escaped (e.g., single quotes are doubled up) and enclosed in single quotes to make them valid string literals in the target database's query language.
    • Bytes (bytes): Byte strings are first decoded into regular strings (using UTF-8) and then handled just like normal strings. This resolved the str-bytes-safe mypy error.
    • Booleans (bool): Boolean values (True/False) are converted to their lowercase string equivalents (true/false) as required by some of the vector stores.
    • None: The Python None object is converted to the null keyword for database queries.
    • Numbers (int, float): Numbers are converted to their string representation without any special formatting.
    • Unsupported Types: If a constant has a type that the vector store cannot handle (like a complex number), the code now raises a VectorStoreOperationException with a clear error message instead of failing silently or causing unexpected behavior.
  • Affected Files: These changes were applied to the filtering logic in the following vector store connectors:

    • semantic_kernel/connectors/azure_ai_search.py
    • semantic_kernel/connectors/chroma.py
    • semantic_kernel/connectors/sql_server.py
    • semantic_kernel/connectors/weaviate.py
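
As a hedged sketch of this per-type handling (render_constant is a hypothetical free-standing helper; in the real connectors the logic is embedded in each _lambda_parser, and the stand-in exception mirrors the library's VectorStoreOperationException):

import ast


class VectorStoreOperationException(Exception):
    """Stand-in for semantic_kernel.exceptions.VectorStoreOperationException."""


def render_constant(node: ast.Constant) -> str:
    value = node.value
    if isinstance(value, bytes):
        # Decode byte strings to UTF-8 first (resolves the str-bytes-safe error).
        value = value.decode("utf-8")
    if isinstance(value, str):
        # Escape embedded single quotes and wrap as a string literal.
        return "'" + value.replace("'", "''") + "'"
    if isinstance(value, bool):
        # Check bool before int/float: bool is a subclass of int in Python.
        return "true" if value else "false"
    if value is None:
        return "null"
    if isinstance(value, (int, float)):
        return str(value)
    raise VectorStoreOperationException(f"Unsupported constant type: {type(value)}.")


assert render_constant(ast.Constant(value="it's")) == "'it''s'"
assert render_constant(ast.Constant(value=True)) == "true"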

Tests:

  • uv run mypy -p semantic_kernel --config-file mypy.ini runs without errors
  • All unit tests pass.

@TaoChenOSU
Contributor

> Hi @TaoChenOSU, Yeah, I actually explained that in the issue #12802. It doesn't matter whether it's streaming or not.

This is weird because I cannot reproduce locally.

@MMoshtaghi
Contributor Author

> This is weird because I cannot reproduce locally.

Hmm! How about with https://github.com/microsoft/semantic-kernel/blob/main/python/samples/getting_started_with_agents/multi_agent_orchestration/step4b_handoff_streaming_agent_response_callback.py ?

@TaoChenOSU
Contributor

Still no. I approved to unblock you.

@moonbox3 moonbox3 added this pull request to the merge queue Aug 7, 2025
Merged via the queue into microsoft:main with commit 73ad6e8 Aug 7, 2025
28 checks passed
@moonbox3
Collaborator

moonbox3 commented Aug 8, 2025

Thanks again for your contribution, @MMoshtaghi. We appreciate the support.

@MMoshtaghi
Contributor Author

Thanks @moonbox3 and @TaoChenOSU for the review.

@MMoshtaghi
Contributor Author

> Still no.

Basically you don't necessarily need HandoffOrchestration; just clone a kernel with telemetry through the console. I wrote this as an integration test for telemetry and kernel, but Evan said a unit test would be enough.

@MMoshtaghi
Contributor Author

MMoshtaghi commented Aug 8, 2025

> just clone a kernel with telemetry through the console. I wrote this as an integration test for telemetry and kernel

You can comment out the __deepcopy__ method and test this:

import logging

import pytest
from opentelemetry._logs import set_logger_provider
from opentelemetry.metrics import set_meter_provider
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor, ConsoleLogExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader
from opentelemetry.sdk.metrics.view import DropAggregation, View
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.trace import set_tracer_provider

from semantic_kernel.kernel import Kernel

# Create a resource to represent the service/sample
resource = Resource.create({"service.name": "telemetry-console-customer-service-agent"})


def set_up_logging():
    exporter = ConsoleLogExporter()

    # Create and set a global logger provider for the application.
    logger_provider = LoggerProvider(resource=resource)
    # Log processors are initialized with an exporter which is responsible
    # for sending the telemetry data to a particular backend.
    logger_provider.add_log_record_processor(BatchLogRecordProcessor(exporter))
    # Sets the global default logger provider
    set_logger_provider(logger_provider)

    # Create a logging handler to write logging records, in OTLP format, to the exporter.
    handler = LoggingHandler()
    # Add filters to the handler to only process records from semantic_kernel.
    handler.addFilter(logging.Filter("semantic_kernel"))
    # Attach the handler to the root logger. `getLogger()` with no arguments returns the root logger.
    # Events from all child loggers will be processed by this handler.
    logger = logging.getLogger()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)


def set_up_tracing():
    exporter = ConsoleSpanExporter()

    # Initialize a trace provider for the application. This is a factory for creating tracers.
    tracer_provider = TracerProvider(resource=resource)
    # Span processors are initialized with an exporter which is responsible
    # for sending the telemetry data to a particular backend.
    tracer_provider.add_span_processor(BatchSpanProcessor(exporter))
    # Sets the global default tracer provider
    set_tracer_provider(tracer_provider)


def set_up_metrics():
    exporter = ConsoleMetricExporter()

    # Initialize a metric provider for the application. This is a factory for creating meters.
    meter_provider = MeterProvider(
        metric_readers=[PeriodicExportingMetricReader(exporter, export_interval_millis=5000)],
        resource=resource,
        views=[
            # Dropping all instrument names except for those starting with "semantic_kernel"
            View(instrument_name="*", aggregation=DropAggregation()),
            View(instrument_name="semantic_kernel*"),
        ],
    )
    # Sets the global default meter provider
    set_meter_provider(meter_provider)


# This must be done before any other telemetry calls and before creating any kernels.
set_up_logging()
set_up_tracing()
set_up_metrics()


@pytest.mark.asyncio
async def test_kernel_with_function_clone_with_otel_enabled():
    kernel = Kernel()
    kernel.add_function(
        plugin_name="test_plugin",
        function_name="test_function",
        prompt="Hello, world!",
    )

    kernel_clone = kernel.clone()

    assert kernel_clone is not None
    assert kernel_clone.plugins is not None and len(kernel_clone.plugins) > 0

Successfully merging this pull request may close these issues.

Python: Bug: TypeError: cannot pickle '_thread.RLock' object when cloning Kernel with OpenTelemetry enabled