What is the best way to support multiple (hundreds or even thousands of) requests to trafilatura? #515

krstp · 2024-02-26T15:00:34Z

Currently, Trafilatura works in a largely synchronous way because of its underlying urllib dependency. I am finding that when I reach approx. 60-80 req/sec, the engine more or less locks up. How do people manage many extraction requests so that they run concurrently and efficiently?

PS. I could not find any existing conversations on the subject.
PS2. I see there is a Trafilatura API that supports large-volume requests... I wonder how the backend handles this given the underlying urllib 🤔 Ref: https://rapidapi.com/trafapi/api/trafilatura

What is interesting: on the "large volume" RapidAPI endpoint that Trafilatura hosts, the latency of a single request is approx. 6000 ms, i.e. 6 sec, which is not a blazing-fast turnaround for high-volume data extraction (see screenshot). Note that this is not a complaint about Trafilatura itself, but rather about the backend solution for handling multiple concurrent requests (ideally in a fully async way, i.e. aiohttp/asyncio). I care less about how long the extraction or the full extract-and-return loop takes; it could be 1 sec, it could be 30 sec. I would rather have good support for concurrent requests, but from what I see at the moment, Trafilatura works in a largely synchronous way.
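To make the question concrete, here is a minimal sketch of one common workaround: keeping the blocking fetch-and-extract call as it is and running it on a bounded thread pool driven by asyncio. This is not something Trafilatura provides; `process_all`, `blocking_job` and `max_workers` are hypothetical names used only for illustration.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


async def process_all(
    urls: Iterable[str],
    blocking_job: Callable[[str], Optional[str]],
    max_workers: int = 32,
) -> Dict[str, Optional[str]]:
    """Run a blocking per-URL job concurrently on a bounded thread pool.

    `blocking_job` is a hypothetical callable (for example one that
    downloads a page and extracts its text); it is assumed to be thread-safe.
    """
    loop = asyncio.get_running_loop()
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each blocking call runs in a worker thread, so the event loop stays free.
        tasks = [loop.run_in_executor(pool, blocking_job, url) for url in url_list]
        # return_exceptions=True keeps one failing URL from aborting the whole batch.
        results = await asyncio.gather(*tasks, return_exceptions=True)
    # Map failures to None so callers get one entry per input URL.
    return {
        url: (None if isinstance(res, Exception) else res)
        for url, res in zip(url_list, results)
    }
```

With Trafilatura, `blocking_job` could be a small function that calls `trafilatura.fetch_url(url)` and passes a non-empty result to `trafilatura.extract(...)`. The pool size caps how many downloads are in flight at once, which may matter more than raw throughput when the target sites are slow.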

Ref:

related issue: Trafilatura to support more robust async library than standard request #514

adbar (Maintainer) · 2024-03-01T12:30:35Z

Hi @krstp, there has been a problem with the latency and I'm investigating; it was between 0.5 and 1 sec before that.

The server is lean and Trafilatura distributes requests among hosts to allow for fluid downloads and also to avoid being blocked.
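As a rough illustration of what distributing requests among hosts can mean, the sketch below reorders a URL list so that consecutive requests go to different hosts where possible. It is not Trafilatura's actual implementation, which ships its own download helpers; `interleave_by_host` is a hypothetical name.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, List
from urllib.parse import urlsplit


def interleave_by_host(urls: Iterable[str]) -> List[str]:
    """Reorder URLs round-robin across hosts (illustrative helper only)."""
    # One FIFO queue per host, in order of first appearance.
    queues: Dict[str, Deque[str]] = defaultdict(deque)
    for url in urls:
        queues[urlsplit(url).netloc].append(url)

    ordered: List[str] = []
    pending = deque(queues.values())
    while pending:
        queue = pending.popleft()
        ordered.append(queue.popleft())
        # A host with URLs left goes to the back of the line.
        if queue:
            pending.append(queue)
    return ordered
```

Feeding the reordered list to a pool of workers spreads the load so that no single site receives a burst of requests, which is the usual way to reduce the risk of being throttled or blocked.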

1 reply

adbar Mar 1, 2024
Maintainer

I checked, for the record: if a web page is slow to respond, then the API response will be slow to come, but that has nothing to do with the API itself. So these statistics are not really relevant for assessing the performance of the package itself.
