Yeah, this really took its toll. It basically brought the site down: my MySQL server completely freaked out and hit 100% resource usage. The webservers were actually OK, but they bottlenecked because of MySQL. I upgraded my DB cluster as a result, but it's like the WP-CLI command runs away with the resources on a big crawl. I think you're right: to do this on a big site you need huge resources, so this could get pretty expensive. It's a whole mix of things, I think, but it does make me cautious about moving forward with it. I feel like there should be controls so that you can only fire it when it's told to, but also some chunking in place to prevent resource over-use.
I'm not sure if I left it in there in recent edits, but in the v6 stream (now Static HTML Output) there was a configurable crawl delay, to avoid flooding shared hosting resources, for example. You could try adding in a sleep somewhere if it keeps timings acceptable, but as you're finding, there are conflicting aims when the same server is handling production and dev serving.
I don't know enough about WP-CLI to say how much it differs from running via the webserver, just that it usually bypasses php-fpm configs, for example, and runs as PHP CLI. When the plugin itself is crawling the site, though, those requests will still be made like a regular user's.
Something I've never tried, but maybe there's a place for it, is to avoid the webserver completely and just render the WP pages, capturing them to the output buffer. I don't think it's reliable enough to cover the majority of sites, where other plugins may be using the output buffer and WP2Static comes in too late, but performance-wise, perhaps it would be a significant enough improvement to make sense for some sites. We'd not use any of the webserver's resources, just PHP and the database… Anyway, that's a good weekend distraction for me at some point.
WP2Static aside, I don't think you'd get any better results by using any other outside-in crawler/scraper, like HTTrack or wget. If you run those from a server far enough away from your instances, the delay may be enough to give your server time to breathe between each crawl request.
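For example, something like this with wget (the URL and the delay are placeholders to adjust for your setup):

    # Mirror the site, throttled with a delay between requests to ease server load
    wget --mirror --page-requisites --convert-links --adjust-extension \
         --wait=2 --random-wait \
         https://www.example.com/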
You can try this if you've got bash and curl available: https://appi.sh - it's another weekend project that works fine, and I've used it a few times for sites that aren't in WP or where the WP setup is too cumbersome to want to deal with.
Last thought: when crawling is slow, I often recommend people try running the wrk benchmarking tool, to see how quickly the site actually responds to frequent, concurrent requests. There could be some quick wins in WordPress/theme/plugins to speed the actual site up, which should only help further in crawling (unless more requests lead to the DB falling over!).
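Something like the below gives a rough idea (thread/connection counts, duration and URL are placeholders - start low if you're pointing it at production):

    # 4 threads, 20 open connections, run for 30 seconds
    wrk -t4 -c20 -d30s https://www.example.com/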
To pinpoint WP resource bottlenecks and some quick wins, Xdebug and KCacheGrind can be a bit fiddly initially, but then they do what Blackfire.io charges a hefty amount for in profiling.
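If you go that route, a minimal way to capture a profile from a WP-CLI run, assuming Xdebug 3 is installed and that wp resolves to the WP-CLI PHP script, is to pass the directives on the command line and open the resulting cachegrind.out.* file in KCacheGrind:

    # Output dir is a placeholder; profiles land there as cachegrind.out.* files
    php -d xdebug.mode=profile \
        -d xdebug.start_with_request=yes \
        -d xdebug.output_dir=/tmp/xdebug-profiles \
        "$(command -v wp)" wp2static crawl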
That's a good point. I was looking for this actually in the 7 build.
I might try switching back to 6, running as WP-CLI with the crawl limit on. Is there documentation for 6 anywhere that I can have a read through, rather than keep bothering you here?
Bother away - my penance for not providing documentation. There is some outdated stuff still up at https://docs.wp2static.com that may work, but there's not as much going on as in 7 to be caught out by.
Does 7 write a lot to the DB? I seem to have serious slowdown on my site, post-crawl…
There is, but the name is different - not wp wp2static, but wp statichtmloutput (or statichtml; sorry, away from PC).
If you run just wp, it should give you the available commands. They're not a 1:1 match for wp2static, but enough to do the export.
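For example (the subcommand names may differ between versions, so trust the help output over my memory):

    wp --help                   # lists every registered command, including the plugin's
    wp help statichtmloutput    # shows that command's subcommands, if it's registered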
So I’m back! Sorry for going dark. I hope you are doing well?
My aim remains the same: to create a static copy of a large, live production site via WP-CLI and offload the files to an S3 bucket (with a CloudFront distribution sat in front) with a custom URL. Route53 will monitor the health of the live production site, and in the event of the site becoming unhealthy (down), it will automatically re-route to the static copy.
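Roughly, the failover piece on the AWS side will be something like this (hosted zone ID, domain and file names below are placeholders, not my real config):

    # Health check against the live origin
    aws route53 create-health-check \
        --caller-reference my-failover-check-1 \
        --health-check-config '{"Type":"HTTPS","FullyQualifiedDomainName":"www.example.com","ResourcePath":"/","RequestInterval":30,"FailureThreshold":3}'

    # Failover record set: PRIMARY points at the live site and references the health
    # check; SECONDARY aliases the CloudFront distribution in front of the S3 bucket
    aws route53 change-resource-record-sets \
        --hosted-zone-id ZEXAMPLEHOSTEDZONE \
        --change-batch file://failover-records.json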
My concerns: database performance/bloat
So my question coming back to this is: Should I be using wp2static or static-html-output?
No worries re time, I've been in coder's block for a couple of weeks now :{
I'd go with WP2Static, based on the CLI usage and the more techie-sounding approach. The Advanced Crawling Addon will probably be required; you can check out that repo, but I believe it's still tied to 7.1.6 of WP2Static.
Give it all a go and let me know any issues you encounter.
The latest master of Advanced Crawling Addon has revamped logic for the “Crawl only changed URLs” option that should help, but it’s new and hasn’t really been tested on live sites. It does make very fast deploys possible if you’ve only changed a few posts.
Unfortunately, performance is pretty horrendous on huge sites if you change the menu or something and have to reprocess the entire site. This is always going to be inherently tough since we’re working with WP and not a framework that’s actually designed for creating static sites, but I think there is room for a lot more optimization in WP2Static around minimizing DB usage. My changes to enable partial crawls use more DB calls to facilitate, and that seems to result in slower crawling when you have to do a full site crawl.
So crawling is pretty slow, as you predicted - it's taken the best part of a week of solid crawling to get to where it is. It's particularly hard to know how far along it is. I'm at circa 45k records and I can't really say how close this is to complete. But running it over the CLI, I do seem to have been able to overcome the performance hogging that was going on. Are there any ways to know how far along the process is?
I'm trying to run a process command now, but again, it's really hard to know if it is actually doing anything. The big thing preventing me from deploying at least something is the fact that my wp-content folder isn't in the wp2static-processed-site folder.
Is there any way to exclude certain patterns (i.e. anything including "events") from being crawled/processed? Doing this, I think I could at least deploy something quite quickly and then add in additional sections once I have something up.
Hi @nathansmonk,
Running via CLI, there's a stale PR to show progress, but it wasn't quite the same as the UI, where we report progress every 300 files.
I have another terminal with a watch command doing something like:
du -sh DIR
Or a find to show incrementing size/files.
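Concretely, something along these lines (the directory is a placeholder - point it at wherever the crawled files are landing):

    DIR=wp-content/uploads/wp2static-crawled-site   # placeholder path, adjust to your setup
    watch -n 30 "du -sh $DIR; find $DIR -type f | wc -l"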
A week is intense. I'd rather clone the site locally and give it more power / cut out network bottlenecks.
Yup, I've put together a little script which kind of does this now, so at least I can see it's working. It's certainly doing its thing now. My hope is that once this first crawl is out of the way, the follow-up ones will be much quicker. I'm not actually seeing a bottleneck, which is weird - what parameters can I provide to give WP2Static a bit of a turbo boost? The crawl chunk size?
I have two immediate problems:

1. It will not seem to crawl my actual CSS file. I've added the path into the Additional Paths to Crawl in the Advanced Crawling Addon, but when I run wp wp2static crawl, it doesn't add this path. Any ideas what I'm doing wrong here? Do I need to run the detect stage again?
2. Because of my infrastructure, when the CPU gets too high, it just terminates the instance and starts a new one. This also terminates the long-running script that is doing the crawl.

I think I can get around 2 if I address 1.
It will not seem to crawl my actual CSS file. I've added the path into the Additional Paths to Crawl in the Advanced Crawling Addon, but when I run wp wp2static crawl, it doesn't add this path. Any ideas what I'm doing wrong here? Do I need to run the detect stage again?
You don't have to run detect again, as additional paths are added during the crawl step. You should see a log message from WsLog::l( count( $additional_paths ) . ' additional paths added.' ); at the start of the crawl. It always adds "/", so the reported number may be higher than the number you've added.
The paths have to be the exact, full filenames, not a directory location (since this case is intended for when auto-discovery isn’t working out).
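As a rough illustration, entries would look something like this (made-up paths; I'm assuming root-relative URLs here, so match however the files are actually referenced on your site):

    /wp-content/themes/my-theme/style.css
    /wp-content/themes/my-theme/js/main.js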
Because of my infrastructure, when the CPU gets too high, it just terminates the instance and starts a new one. This also terminates the long-running script that is doing the crawl.
I haven’t heard about this before. Why is the infrastructure like that?
I'm perhaps oversimplifying a bit, but ultimately it's a self-healing type affair, so servers come and go depending on traffic requirements, and there's no control over which instances get binned when scaling down happens.
Good to know regarding detection. I think I’m almost there!
Hi @nathansmonk,
Sounds like a fun project!
Re the self-healing: it should be OK, as long as WP2Static is running on a permanent instance and the auto-scaling stuff is just serving the URLs being crawled, but then this sounds like a setup with additional DNS trips (crawling from the same server WP2Static runs on is the only setup that's really supported/recommended at the moment).
For giving WP2Static an extra boost, yeah, increasing the crawl chunk size is good and at least used to be an option via the UI; not sure if that's in at the moment. I'm keen to see if Spatie's PHP crawler library can be dropped into WP2Static, which could give some really nice crawl performance improvements.
If you want to bypass WP2Static, or use it in conjunction with more optimised web crawlers, there are some nice CLI tools which come up when searching for ways to optimize cURL/wget - maybe some from the BruceDone/awesome-crawler list on GitHub (https://github.com/BruceDone/awesome-crawler).
If you can calculate the requests per second that WP2Static is processing using those tailing scripts, then you can compare with other servers (maybe try https://lokl.dev) as a consistent way to benchmark.
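A crude way to turn that tailing into a crawl-rate number (the directory is a placeholder again):

    DIR=wp-content/uploads/wp2static-crawled-site   # placeholder - wherever crawled files land
    BEFORE=$(find "$DIR" -type f | wc -l)
    sleep 60
    AFTER=$(find "$DIR" -type f | wc -l)
    echo "$((AFTER - BEFORE)) files crawled in the last 60 seconds"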
I like wrk (https://github.com/wg/wrk) for general requests-per-second testing of servers. That can be a good indicator of the server in general: if it's slow to load test, it will be slow to crawl.
Crawling is expected to be the slowest part of the process (detect, crawl, post_process, deploy), simply due to the network requests, so I'd look into any lag there as the lowest-hanging fruit to speed up the whole export.
Well, it's now about 99.99% done.
I got a complete crawl, process and deploy done.
It’s definitely missing a bunch of resources despite them being added in the “Additional Paths to Crawl”. I’ll try another crawl, but if there’s something I should know about this, I’m all ears.
I've also noticed that subdirectories produced (and shipped to S3) don't default to serving the index.html file inside them. Is there some additional config I need there?
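I'm guessing this is something to handle on the S3/CloudFront side rather than in WP2Static - e.g. enabling static website hosting on the bucket with an index document and pointing CloudFront at the bucket's website endpoint rather than the REST endpoint (bucket name below is a placeholder) - but correct me if the plugin is meant to cover it:

    # Serve index.html for directory requests via S3 static website hosting
    aws s3 website s3://my-static-bucket/ --index-document index.html --error-document 404.html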