Python: Fix: Implement __deepcopy__ on KernelFunction to handle non-serializable OTEL metrics #12803


Merged
9 commits merged into microsoft:main on Aug 7, 2025

Conversation

MMoshtaghi
Contributor

Description

This PR fixes the TypeError: cannot pickle '_thread.RLock' object error that occurs when kernel.clone() is called while OpenTelemetry metrics are enabled.

Motivation and Context

When using features like HandoffOrchestration, the agent's Kernel is cloned. This process uses copy.deepcopy, which fails on the KernelFunction's OpenTelemetry Histogram fields (invocation_duration_histogram and streaming_duration_histogram). These histogram objects contain thread locks, making them non-serializable.

Previous solutions, like the one in PR #9292, used Field(exclude=True, default_factory=...). While that was a necessary step, it was insufficient because deepcopy does not respect the exclude flag in the same way that Pydantic's serialization does. This bug blocks the use of certain agentic features in any application that has metrics-based observability enabled.
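
To illustrate the gap with a minimal standalone example (not code from this PR; the Model class below is a hypothetical stand-in): Pydantic's exclude flag hides a field from serialization, but deepcopy still traverses it.

import copy
import threading
from typing import Any

from pydantic import BaseModel, Field


class Model(BaseModel):
    # exclude=True hides the field from serialization, but not from deepcopy.
    lock: Any = Field(default_factory=threading.RLock, exclude=True)


m = Model()
print(m.model_dump())  # {} -- the excluded field is skipped when serializing
copy.deepcopy(m)       # TypeError: cannot pickle '_thread.RLock' object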

Description of the change

This solution adds a custom __deepcopy__ method to the KernelFunction Pydantic model. This method correctly handles the deep-copy process for KernelFunction instances:

  1. It intercepts the deepcopy call for KernelFunction objects.
  2. It uses self.model_copy(deep=False) to create a new instance of the Pydantic model. Using deep=False allows the default_factory for the histogram fields to create new, clean instances on the copied object, while other fields are shallow-copied.
  3. It then manually deep-copies all other attributes from the original object to the new one, ensuring that any other mutable objects (like metadata) are correctly duplicated. The histogram fields are explicitly skipped in this step, as they have already been recreated by model_copy.
  4. Standard deepcopy memoization is used to prevent infinite recursion in the case of cyclic object graphs.

This approach ensures that the non-serializable fields are re-initialized instead of copied, resolving the TypeError while maintaining the integrity of the cloned object.
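
A minimal runnable sketch of these steps (KernelFunctionSketch and its fields are illustrative stand-ins: RLocks play the role of the non-copyable histograms, and the re-initialization is spelled out via update= here, whereas the actual fix relies on the fields' default_factory):

import copy
import threading
from typing import Any

from pydantic import BaseModel, Field

HISTOGRAM_FIELDS = {"invocation_duration_histogram", "streaming_duration_histogram"}


class KernelFunctionSketch(BaseModel):
    metadata: dict[str, Any] = Field(default_factory=dict)
    invocation_duration_histogram: Any = Field(default_factory=threading.RLock, exclude=True)
    streaming_duration_histogram: Any = Field(default_factory=threading.RLock, exclude=True)

    def __deepcopy__(self, memo: dict[int, Any]) -> "KernelFunctionSketch":
        # Steps 1-2: intercept deepcopy and build a shallow copy in which the
        # histogram fields are fresh instances rather than shared references.
        new = self.model_copy(
            deep=False,
            update={name: threading.RLock() for name in HISTOGRAM_FIELDS},
        )
        # Step 4: memoize before recursing so cyclic object graphs terminate.
        memo[id(self)] = new
        # Step 3: deep-copy every other attribute so mutable state is duplicated.
        for name, value in self.__dict__.items():
            if name in HISTOGRAM_FIELDS:
                continue
            setattr(new, name, copy.deepcopy(value, memo))
        return new


f = KernelFunctionSketch(metadata={"kind": "prompt"})
clone = copy.deepcopy(f)  # no TypeError: histograms re-created, not copied
assert clone.metadata == f.metadata and clone.metadata is not f.metadata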

How has this been tested?

The fix was validated using the reproduction steps outlined in the associated issue. Running an agent orchestration that triggers kernel.clone() with OpenTelemetry metrics enabled now completes successfully without any errors.

Contribution Checklist

@MMoshtaghi MMoshtaghi requested a review from a team as a code owner July 25, 2025 17:35
@moonbox3 moonbox3 added the python Pull requests for the Python Semantic Kernel label Jul 25, 2025
@github-actions github-actions bot changed the title from "Fix: Implement __deepcopy__ on KernelFunction to handle non-serializable OTEL metrics" to "Python: Fix: Implement __deepcopy__ on KernelFunction to handle non-serializable OTEL metrics" Jul 25, 2025
@MMoshtaghi
Contributor Author

@microsoft-github-policy-service agree company="Digital Workforce Services Oyj"

@moonbox3
Collaborator

moonbox3 commented Aug 4, 2025

Python Test Coverage

Python Test Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | --- | --- | --- | --- |
| connectors/azure_ai_search.py | 314 | 31 | 90% | 159–160, 208, 210–213, 215, 244, 246, 248, 339, 361–362, 423, 425, 439–440, 451, 522, 544, 554, 566, 586–587, 682, 684, 686, 689, 792–793 |
| connectors/chroma.py | 262 | 96 | 63% | 104, 107–111, 126–127, 135–136, 165, 167, 175, 187–190, 197, 216, 220, 241, 249, 253, 266, 270, 275–276, 285–288, 304–305, 313, 332, 343, 345, 349, 367–369, 371–386, 388–398, 400–406, 408, 410, 412–413, 416–417, 419–420, 423–432, 454–455, 457 |
| connectors/sql_server.py | 504 | 89 | 82% | 152, 349–351, 371, 390, 401, 403, 426, 472–475, 491, 503–506, 528, 530, 574–580, 586–587, 589–590, 594–608, 611–616, 619–623, 628, 634–635, 638, 644–645, 647–650, 789, 808–810, 816, 822, 827, 832–835, 841, 875, 917, 924–926, 1092–1093, 1095, 1097, 1118 |
| connectors/weaviate.py | 342 | 149 | 56% | 110, 315–318, 327–328, 330–332, 337, 341–342, 353–355, 364–372, 375–376, 380–381, 384–386, 389–390, 395–396, 399–400, 402, 411–413, 415–424, 426, 428–443, 447–453, 455, 457, 459–460, 463–464, 466–467, 470–479, 489–490, 493–494, 500–501, 504–505, 509, 512–516, 531, 546, 555–556, 570, 574–577, 580–581, 587–588, 592, 595–596, 599, 602, 622, 641, 657, 666–667, 672–673, 678, 772, 783–788, 793–798, 803–804 |
| functions/kernel_function.py | 141 | 2 | 98% | 106, 108 |
| TOTAL | 26829 | 4608 | 82% | |

Python Unit Test Overview

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 3681 | 22 💤 | 0 ❌ | 0 🔥 | 1m 58s ⏱️ |

@moonbox3
Collaborator

moonbox3 commented Aug 4, 2025

Hi @MMoshtaghi, thanks for the contribution. Please make sure you install the pre-commit hooks locally. You can find the directions here. Especially the part around:

# Install pre-commit hooks
uv run pre-commit install -c python/.pre-commit-config.yaml

There are pre-commit check failures right now that need to be addressed. Thanks.

@TaoChenOSU
Contributor

Could we also add tests?

@MMoshtaghi
Contributor Author

Hi @moonbox3, thanks for the review. I fixed the pre-commit check failures.

@moonbox3
Collaborator

moonbox3 commented Aug 6, 2025

Per Tao's comment, which I agree with:

  1. Let's add tests
  2. There's a conflict with the uv.lock that you'll want to resolve. Once you pull latest main, you can remove the current uv.lock and then run uv sync --all-extras --dev to regenerate it.

@MMoshtaghi
Contributor Author

1- Ok, I'll create a new test file in tests/integration/telemetry/test_telemetry_integration.py, and use the sample in MS Learn for telemetry.

2- I actually followed the DEV_SETUP.md (running make install), and the Makefile already has uv sync --all-extras --dev --prerelease=if-necessary-or-explicit ...!

@moonbox3
Collaborator

moonbox3 commented Aug 6, 2025

> 1- Ok, I'll create a new test file in tests/integration/telemetry/test_telemetry_integration.py, and use the sample in MS Learn for telemetry.

I think unit tests should be enough to cover the overridden deepcopy method you added. We don't need integration tests.

@MMoshtaghi
Contributor Author

Ok, I'll create a unit test to cover deepcopy for KernelFunction. In which file should I write it? How about test_kernel_function_from_prompt.py? Would this be ok?

from copy import deepcopy

from semantic_kernel.functions import KernelFunctionFromPrompt


def test_kernel_function_from_prompt_deepcopy():
    """Test deepcopying a KernelFunctionFromPrompt."""
    function = KernelFunctionFromPrompt(
        function_name="test_function",
        plugin_name="test_plugin",
        prompt="Hello, world!",
        description="A test function.",
    )
    copied_function = deepcopy(function)
    assert copied_function is not function
    assert copied_function.name == function.name
    assert copied_function.plugin_name == function.plugin_name
    assert copied_function.description == function.description
    assert copied_function.prompt_template.prompt_template_config.template == (
        function.prompt_template.prompt_template_config.template
    )
    assert copied_function.prompt_template is not function.prompt_template

@TaoChenOSU
Contributor

@MMoshtaghi
Contributor Author

Hi @TaoChenOSU,
Yeah, I actually explained that in the issue #12802. It doesn't matter whether it's streaming or not.

@MMoshtaghi
Contributor Author

All tests and pre-commit checks passed on my end before the push.


@moonbox3
Collaborator

moonbox3 commented Aug 7, 2025

@MMoshtaghi not sure why mypy is failing.

There is a unit test failure, though, that I don't see in main. Could be related to the override __deepcopy__: https://github.com/microsoft/semantic-kernel/actions/runs/16792165215/job/47568903867?pr=12803#step:5:623.

@MMoshtaghi
Contributor Author

MMoshtaghi commented Aug 7, 2025

> There is a unit test failure, though, that I don't see in main. Could be related to the override __deepcopy__: https://github.com/microsoft/semantic-kernel/actions/runs/16792165215/job/47568903867?pr=12803#step:5:623.

@moonbox3 I know, but as you can see the unit test error occurred after you merged branch 'main' into main.
Before that there was only the mypy error, which I also get when running uv run mypy -p semantic_kernel --config-file mypy.ini. I think it's because the type signature for ast.Constant.value is very broad, and the connector code in weaviate.py, sql_server.py, azure_ai_search.py, and chroma.py does not properly handle all of these possibilities. I may be able to fix it here too.
https://github.com/microsoft/semantic-kernel/actions/runs/16779393946/job/47555115866#step:6:11

The unit test error is because of a timeout when getting the orchestration result in test_magentic.py, only in Python 3.12 (my version is 3.10):

=================================== FAILURES ===================================
___________________ test_invoke_with_agent_raising_exception ___________________

...

async def test_invoke_with_agent_raising_exception():
        """Test the invoke method of the MagenticOrchestration with a list of messages which raises an error."""
        with (
            patch.object(
                MockChatCompletionService, "get_chat_message_content", new_callable=AsyncMock
            ) as mock_get_chat_message_content,
            patch.object(
                StandardMagenticManager, "create_progress_ledger", new_callable=AsyncMock, side_effect=ManagerProgressList
            ),
        ):
            mock_get_chat_message_content.return_value = ChatMessageContent(role="assistant", content="mock_response")
            chat_completion_service = MockChatCompletionService(ai_model_id="test")
            prompt_execution_settings = MockPromptExecutionSettings()
    
            manager = StandardMagenticManager(
                chat_completion_service=chat_completion_service,
                prompt_execution_settings=prompt_execution_settings,
            )
    
            agent_a = MockAgentWithException(name="agent_a", description="test agent")
            agent_b = MockAgent(name="agent_b", description="test agent")
    
            runtime = InProcessRuntime()
            runtime.start()
    
            orchestration = MagenticOrchestration(members=[agent_a, agent_b], manager=manager)
            try:
                orchestration_result = await orchestration.invoke(task="test_message", runtime=runtime)
                with pytest.raises(RuntimeError, match="Mock agent exception"):
>                   await orchestration_result.get(1.0)

tests/unit/agents/orchestration/test_magentic.py:277: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
semantic_kernel/agents/orchestration/orchestration_base.py:57: in get
    await asyncio.wait_for(self.event.wait(), timeout=timeout)
/opt/homebrew/Cellar/[email protected]/3.12.11/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/tasks.py:519: in wait_for
    async with timeouts.timeout(timeout):
               ^^^^^^^^^^^^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

...

=========================== short test summary info ============================
FAILED tests/unit/agents/orchestration/test_magentic.py::test_invoke_with_agent_raising_exception - TimeoutError
1 failed, 3675 passed, 5 xfailed, 471 warnings in 88.87s (0:01:28)

@MMoshtaghi
Contributor Author

MMoshtaghi commented Aug 7, 2025

I just fixed the mypy errors, and here I explain the root cause and the solution:

Problem:
The mypy type checker found errors because the code was not correctly handling all possible data types for constant values (ast.Constant) when building database filters. A constant's value can be a string, number, boolean, or even bytes, but the original code treated them all the same, leading to type mismatches and potential formatting errors.

Solution Overview:
The fix was to replace the simple return node.value with more robust logic in the _lambda_parser method of four different connector files. This new logic explicitly checks the type of the constant's value and handles it correctly (a runnable sketch follows the lists below).

  • Specific Type Handling:

    • Strings (str): String values are now properly escaped (e.g., single quotes are doubled up) and enclosed in single quotes to make them valid string literals in the target database's query language.
    • Bytes (bytes): Byte strings are first decoded into regular strings (using UTF-8) and then handled just like normal strings. This resolved the str-bytes-safe mypy error.
    • Booleans (bool): Boolean values (True/False) are converted to their lowercase string equivalents (true/false) as required by some of the vector stores.
    • None: The Python None object is converted to the null keyword for database queries.
    • Numbers (int, float): Numbers are converted to their string representation without any special formatting.
    • Unsupported Types: If a constant has a type that the vector store cannot handle (like a complex number), the code now raises a VectorStoreOperationException with a clear error message instead of failing silently or causing unexpected behavior.
  • Affected Files: These changes were applied to the filtering logic in the following vector store connectors:

    • semantic_kernel/connectors/azure_ai_search.py
    • semantic_kernel/connectors/chroma.py
    • semantic_kernel/connectors/sql_server.py
    • semantic_kernel/connectors/weaviate.py
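
As a hedged sketch of this per-type handling (render_constant is a hypothetical free-standing helper; in the real connectors the logic is embedded in each _lambda_parser, and the stand-in exception mirrors the library's VectorStoreOperationException):

import ast


class VectorStoreOperationException(Exception):
    """Stand-in for semantic_kernel.exceptions.VectorStoreOperationException."""


def render_constant(node: ast.Constant) -> str:
    value = node.value
    if isinstance(value, bytes):
        # Decode byte strings to UTF-8 first (resolves the str-bytes-safe error).
        value = value.decode("utf-8")
    if isinstance(value, str):
        # Escape embedded single quotes and wrap as a string literal.
        return "'" + value.replace("'", "''") + "'"
    if isinstance(value, bool):
        # Check bool before int/float: bool is a subclass of int in Python.
        return "true" if value else "false"
    if value is None:
        return "null"
    if isinstance(value, (int, float)):
        return str(value)
    raise VectorStoreOperationException(f"Unsupported constant type: {type(value)}.")


assert render_constant(ast.Constant(value="it's")) == "'it''s'"
assert render_constant(ast.Constant(value=True)) == "true"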

Tests:

  • uv run mypy -p semantic_kernel --config-file mypy.ini runs without errors
  • All unit tests pass.

@TaoChenOSU
Contributor

> Hi @TaoChenOSU, Yeah, I actually explained that in the issue #12802. It doesn't matter whether it's streaming or not.

This is weird because I cannot reproduce locally.

@MMoshtaghi
Contributor Author

> This is weird because I cannot reproduce locally.

Hmm! How about with https://github.com/microsoft/semantic-kernel/blob/main/python/samples/getting_started_with_agents/multi_agent_orchestration/step4b_handoff_streaming_agent_response_callback.py ?

@TaoChenOSU
Contributor

Still no. I approved to unblock you.

@moonbox3 moonbox3 added this pull request to the merge queue Aug 7, 2025
Merged via the queue into microsoft:main with commit 73ad6e8 Aug 7, 2025
28 checks passed
@moonbox3
Collaborator

moonbox3 commented Aug 8, 2025

Thanks again for your contribution, @MMoshtaghi. We appreciate the support.

@MMoshtaghi
Contributor Author

Thanks @moonbox3 and @TaoChenOSU for the review.

@MMoshtaghi
Contributor Author

> Still no.

Basically you don't necessarily need HandoffOrchestration; just clone a kernel with telemetry through the console. I wrote this as an integration test for telemetry and kernel, but Evan said a unit test would be enough.

@MMoshtaghi
Contributor Author

MMoshtaghi commented Aug 8, 2025

> just clone a kernel with telemetry through the console. I wrote this as an integration test for telemetry and kernel

You can comment out the __deepcopy__ method and test this:

import logging

import pytest
from opentelemetry._logs import set_logger_provider
from opentelemetry.metrics import set_meter_provider
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor, ConsoleLogExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader
from opentelemetry.sdk.metrics.view import DropAggregation, View
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.trace import set_tracer_provider

from semantic_kernel.kernel import Kernel

# Create a resource to represent the service/sample
resource = Resource.create({"service.name": "telemetry-console-customer-service-agent"})


def set_up_logging():
    exporter = ConsoleLogExporter()

    # Create and set a global logger provider for the application.
    logger_provider = LoggerProvider(resource=resource)
    # Log processors are initialized with an exporter which is responsible
    # for sending the telemetry data to a particular backend.
    logger_provider.add_log_record_processor(BatchLogRecordProcessor(exporter))
    # Sets the global default logger provider
    set_logger_provider(logger_provider)

    # Create a logging handler to write logging records, in OTLP format, to the exporter.
    handler = LoggingHandler()
    # Add filters to the handler to only process records from semantic_kernel.
    handler.addFilter(logging.Filter("semantic_kernel"))
    # Attach the handler to the root logger. `getLogger()` with no arguments returns the root logger.
    # Events from all child loggers will be processed by this handler.
    logger = logging.getLogger()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)


def set_up_tracing():
    exporter = ConsoleSpanExporter()

    # Initialize a trace provider for the application. This is a factory for creating tracers.
    tracer_provider = TracerProvider(resource=resource)
    # Span processors are initialized with an exporter which is responsible
    # for sending the telemetry data to a particular backend.
    tracer_provider.add_span_processor(BatchSpanProcessor(exporter))
    # Sets the global default tracer provider
    set_tracer_provider(tracer_provider)


def set_up_metrics():
    exporter = ConsoleMetricExporter()

    # Initialize a metric provider for the application. This is a factory for creating meters.
    meter_provider = MeterProvider(
        metric_readers=[PeriodicExportingMetricReader(exporter, export_interval_millis=5000)],
        resource=resource,
        views=[
            # Dropping all instrument names except for those starting with "semantic_kernel"
            View(instrument_name="*", aggregation=DropAggregation()),
            View(instrument_name="semantic_kernel*"),
        ],
    )
    # Sets the global default meter provider
    set_meter_provider(meter_provider)


# This must be done before any other telemetry calls and before creating any kernels.
set_up_logging()
set_up_tracing()
set_up_metrics()


@pytest.mark.asyncio
async def test_kernel_with_function_clone_with_otel_enabled():
    kernel = Kernel()
    kernel.add_function(
        plugin_name="test_plugin",
        function_name="test_function",
        prompt="Hello, world!",
    )

    kernel_clone = kernel.clone()

    assert kernel_clone is not None
    assert kernel_clone.plugins is not None and len(kernel_clone.plugins) > 0

Successfully merging this pull request may close these issues.

Python: Bug: TypeError: cannot pickle '_thread.RLock' object when cloning Kernel with OpenTelemetry enabled