Top 10 websites in the world (July 2025)

Arun Shah™@lemmy.world · 11 months ago


Pechente@feddit.org · 11 months ago

Wikipedia going down like that makes me sad, especially since their traffic costs have gone up significantly because of AI crawlers.

clb92@feddit.dk · 11 months ago

Why would anyone crawl Wikipedia when you can freely download the complete databases in one go, likely served on a CDN…

But sure, crawlers, go ahead and spend a week doing the same thing in a much more expensive, disruptive and error-prone way…

Eager Eagle@lemmy.world · 11 months ago

There are valid reasons for not wanting the whole database e.g. storage constraints, compatibility with ETL pipelines, and incorporating article updates.

What bothers me is that they – apparently – crawl instead of just… using the API, like:

https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Lemmy_(social_network)&formatversion=2

I’m guessing they just crawl the whole web and don’t bother to add a special case to turn Wikipedia URLs into their API versions.
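A rough sketch of what such a special case could look like, assuming the crawler only needs to handle the plain `/wiki/<title>` URL form. The function name `wikipedia_api_url` is made up for illustration; the query parameters are the ones from the example URL above.

```python
from typing import Optional
from urllib.parse import unquote, urlencode, urlsplit


def wikipedia_api_url(article_url: str) -> Optional[str]:
    """Map a Wikipedia article URL to its action=parse API URL.

    Hypothetical helper: returns None for anything that is not a
    plain /wiki/<title> URL on a *.wikipedia.org host.
    """
    parts = urlsplit(article_url)
    host = parts.hostname or ""
    prefix = "/wiki/"
    if not host.endswith(".wikipedia.org") or not parts.path.startswith(prefix):
        return None
    # Decode the title first so urlencode does not double-escape it.
    title = unquote(parts.path[len(prefix):])
    if not title:
        return None
    query = urlencode(
        {"action": "parse", "format": "json", "page": title, "formatversion": 2}
    )
    return f"https://{host}/w/api.php?{query}"
```

Note that `urlencode` percent-encodes the parentheses in a title like `Lemmy_(social_network)`, which is equivalent to the unescaped form in the URL above.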

clb92@feddit.dk · 11 months ago

“valid reasons for not wanting the whole database e.g. storage constraints”

If you’re training AI models, surely you have a couple TB to spare. It’s not like Wikipedia takes up petabytes or anything.

limer@lemmy.ml · 11 months ago

Vibe coding

Pechente@feddit.org · 11 months ago

My comment was based on a podcast I listened to (Tech Won’t Save Us, I think?). My guess is they also wanna crawl all the edits, discussions, etc., which are usually not included in the complete dumps.

clb92@feddit.dk · 11 months ago

Dumps with complete page edit history can be downloaded too, as far as I can see, so no need to crawl that.
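For what it's worth, a downloaded history dump can be read as a stream rather than loaded all at once. Here is a minimal sketch, assuming the usual MediaWiki export XML layout (`<page>` elements containing a `<title>` and one or more `<revision>` elements) and a bz2-compressed file. The function names are made up, and the XML namespace version varies between dumps, so the code strips it instead of hard-coding one.

```python
import bz2
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Optional, Tuple


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream: BinaryIO) -> Iterator[Tuple[Optional[str], int]]:
    """Yield (title, revision_count) for each <page> in an export XML stream."""
    title: Optional[str] = None
    revisions = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            revisions += 1
            elem.clear()  # free the revision text as soon as it is counted
        elif name == "page":
            yield title, revisions
            title, revisions = None, 0
            elem.clear()  # keep memory roughly flat across pages
```

Usage would look like `with bz2.open("history-dump.xml.bz2", "rb") as f: for title, n in iter_pages(f): ...`, where the file name is a placeholder for whichever dump file was downloaded.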

mesa@piefed.social · 11 months ago

Good podcast

ThePantser@sh.itjust.works · 11 months ago

Yes, they should really block crawlers or force them to pay. The only way I can think of is requiring a registered account to access content, but that goes against what they originally intended. But these are new times and it’s probably for the best. It wouldn’t be hard to flag obvious AI scrapers.

skvlp@lemmy.wtf · 11 months ago

It seems there are ways to stop crawlers. Do a web search for “stop ai crawlers” or similar to learn more. I hope it doesn’t escalate into an arms race, but I realise I might be disappointed.
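One of the simplest options such a search turns up is a `robots.txt` rule. A small sketch of how a crawler would evaluate one, using Python's standard `urllib.robotparser`; the user agent `ExampleAIBot` and the URLs are invented for illustration. The big caveat is that `robots.txt` is only advisory, so this stops well-behaved crawlers and nothing else.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A crawler that ignores this check is exactly where the arms race starts, since the site then has to fall back on rate limiting or blocking.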

SebaDC@discuss.tchncs.de · 11 months ago

And click-through rates are dropping.