Hi @krstp, there has been a problem with the latency that I'm investigating; it was between 0.5 and 1 sec before that. The server is lean, and Trafilatura distributes requests among hosts to allow for fluid downloads and to avoid being blocked.
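To illustrate what distributing requests among hosts can look like, here is a minimal standard-library sketch. It is not Trafilatura's actual implementation: the function name `interleave_by_host` and the example URLs are hypothetical, and it only reorders a URL list so that consecutive requests go to different hosts where possible.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


def interleave_by_host(urls):
    """Reorder URLs round-robin by host (hypothetical helper, not a Trafilatura API)."""
    # Group URLs per host, keeping the order in which hosts were first seen.
    queues = defaultdict(deque)
    for url in urls:
        queues[urlsplit(url).netloc].append(url)

    ordered = []
    pending = deque(queues.values())
    while pending:
        queue = pending.popleft()
        ordered.append(queue.popleft())
        if queue:
            # This host still has URLs left: send it to the back of the line.
            pending.append(queue)
    return ordered


# Made-up example URLs, for illustration only.
URLS = [
    "https://a.example/1",
    "https://a.example/2",
    "https://a.example/3",
    "https://b.example/1",
    "https://c.example/1",
]
ORDERED = interleave_by_host(URLS)
print(ORDERED)
```

Spreading the load this way keeps any single host from receiving a burst of back-to-back requests, which is one common way to reduce the chance of being rate-limited or blocked.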
-
Currently Trafilatura works in a largely synchronous way because of its underlying `urllib` dependency. I am finding that at roughly 60-80 req/sec the engine more or less locks up. How do people handle many extraction requests so that they run concurrently and efficiently?

PS. I could not find any existing conversations on the subject.

PS2. I see there is a Trafilatura API that supports large-volume requests. I wonder how the backend handles this given the underlying `urllib` 🤔 Ref: https://rapidapi.com/trafapi/api/trafilatura

What is interesting is that, on the "large volume" RapidAPI endpoint that Trafilatura hosts, the latency of a single request is approximately `6000ms`, i.e. `6sec`, which is not a blazing-fast turnaround for high-volume data extraction (see screenshot). Note that this is not a complaint about Trafilatura itself, but rather about the backend solution for handling multiple concurrent requests (ideally in a fully `async` way, i.e. `aiohttp`/`asyncio`). I care less about how long the extraction or the full extract-and-return loop takes; it could be `1sec`, it could be `30sec`. I would rather have good support for concurrent requests; however, from what I see at the moment, Trafilatura works in a largely synchronous way.

Ref:
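One possible workaround, sketched here under assumptions rather than taken from Trafilatura itself, is to keep the blocking calls but run them in worker threads from an `asyncio` event loop, with a semaphore capping the number of requests in flight. The names `process_all`, `fake_fetch` and `fake_extract` are hypothetical; the stand-ins only simulate work so the sketch runs without network access. In real use, `trafilatura.fetch_url` and `trafilatura.extract` could presumably be passed in their place. Requires Python 3.9+ for `asyncio.to_thread`.

```python
import asyncio
import time


async def process_all(urls, fetch, extract, max_concurrency=20):
    """Run blocking fetch/extract callables concurrently in worker threads."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def process_one(url):
        async with semaphore:
            # Offload the blocking download so the event loop stays responsive.
            html = await asyncio.to_thread(fetch, url)
            if html is None:
                return url, None
            # Extraction is also blocking, so it runs in a thread as well.
            text = await asyncio.to_thread(extract, html)
            return url, text

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(process_one(u) for u in urls))


def fake_fetch(url):
    """Stand-in for a blocking downloader (assumption: returns HTML or None)."""
    time.sleep(0.05)  # simulate network latency
    if url.endswith("/missing"):
        return None
    return f"<html><body><p>{url}</p></body></html>"


def fake_extract(html):
    """Stand-in for a blocking extractor (assumption: returns plain text)."""
    return html.replace("<html><body><p>", "").replace("</p></body></html>", "")


# Made-up example URLs, for illustration only.
URLS = [f"https://example.org/page{i}" for i in range(5)]
URLS.append("https://example.org/missing")
RESULTS = asyncio.run(process_all(URLS, fake_fetch, fake_extract, max_concurrency=3))
print(RESULTS)
```

Threads suit the download step because it is I/O-bound; the extraction step is CPU-bound, so at high volume a process pool may scale better than threads for that part.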