Long crawl times?

Hi, I have the latest 7 Alpha installed directly from git. After an initial export, if I edit just one page, the jobs are automatically queued, which is great, but the crawl portion is taking 20 minutes. The other steps are almost instant. Even though only a single page was edited, it’s touching a ton of files. To give you an idea of the size of the site, it has 331 index.html files generated. The total size of the output folder is 337MB. I’m testing this out on our much smaller domain, but we really want to get this going on the corporate marketing site. I’m scared how long the processing will take though when the marketing team does many edits per day.

Should it be taking that long to crawl after only a single page edit? The site is using Divi Builder which I suspect may have a hand in it (I’m not a fan). However we are not using Divi on the larger site, so I may just clone that site and give it a shot in a test environment.

Thanks,
~Nate

That is not normal. It could be inefficient plugins or themes, but it could also be a crappy host.

Do you have PHP 7.4 installed? It seems to be significantly faster.

Thanks for the reply. It is 7.4. It’s on a smallish Digital Ocean server (2cpu, 4G RAM).

I just tried again, editing a single page, then kicked off the processing while looking at some of the data in the db. It’s not touching every record in the crawl cache, but there are only 331 actual pages.

SELECT count(*) FROM wp_wp2static_crawl_cache where time > '2020-07-05'
=> 592
SELECT count(*) FROM wp_wp2static_crawl_cache where time < '2020-07-05'
=> 10489

When I look at the timestamps on all the ‘index.html’ pages generated for the whole site, I can see that every single one got rebuilt.
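
For anyone wanting to repeat the timestamp check without eyeballing a directory listing, here is a rough Python sketch. It assumes the generated pages live under a folder such as wp-content/uploads/wp2static-processed-site; the function name `rebuilt_since` is just made up for illustration and is not part of WP2Static.

```python
from pathlib import Path


def rebuilt_since(output_dir, cutoff_epoch):
    """Split index.html files under output_dir by modification time.

    Returns (rebuilt, untouched): files modified at or after cutoff_epoch,
    and files last modified before it.
    """
    rebuilt, untouched = [], []
    for path in Path(output_dir).rglob("index.html"):
        if path.stat().st_mtime >= cutoff_epoch:
            rebuilt.append(path)
        else:
            untouched.append(path)
    return rebuilt, untouched
```

Calling it with `time.time() - 3600` as the cutoff right after a run shows how many pages were rewritten in the last hour. If `untouched` comes back empty after a single-page edit, every page really was rewritten.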

Maybe I’ll try disabling all plugins then trying the test again after enabling them one-by-one.

I haven’t used Digital Ocean myself, but it’s supposed to be decent hosting. I can run WP2Static on the smallest AWS VM with 512MB RAM, and it’s blazing fast. So I think your hosting is fine. Is the DB on that server, or separate?

Those SELECT results look wrong, but it’s hard to say what is causing it. There were some changes to the tables recently, and WordPress’s dbDelta just silently fails if it can’t make a change. (I’m trying to move away from dbDelta, but it’s what we have for now). I’d suggest dropping wp_wp2static_urls and wp_wp2static_crawl_cache, and then deactivating/reactivating WP2Static (so it recreates the tables).

It’s normal at present for WP2Static to crawl the entire site. I actually like that since it’s usually very quick, and WordPress’s design means there’s no reliable way to tell if a site-wide change has been made. I’m not sure if Leon has any plans to support partial crawling.

Thanks again for the help. The db is local. I dropped the tables and re-initialized but it still took 20 minutes to crawl. I then switched themes off DIVI builder as a test and it took just over 10 minutes to crawl. I’m going to disable every plugin I can and see what happens, though that’s not really an option for the real site.

It still took nearly 10 minutes to process the queue after a single page edit with all plugins disabled and using the default WP theme. Is this what you guys would expect on a site with ~300 pages?

I’m looking at the crawl code, and it definitely should not be writing every file every time. Do you have “Use CrawlCache” checked on the Options page? You’ll still see activity in the crawl cache table even with that option off, since WP2Static needs the data, but that option controls whether the files are written to disk each time. (Sorry, I forgot that option had been added. It should be on by default in new installs, but upgrades might leave it off).

Yep, that option is checked. I’ll try with it off (even though I think you mean that it should be on). One of the log entries in the previous run said:
Crawling complete. 595 crawled, 7593 skipped (cached).

The time it’s taking on this test site is not a deal breaker, but I’m hoping to get it as quick as possible, and I’m a little worried about how long it will take on our bigger site.

Thanks for all the help.

Interesting that w/out the crawl cache turned on, it took the same amount of time, but the log reports:
Crawling complete. 8189 crawled, 0 skipped (cached).

I guess that’s as fast as it’s going to get for now. I know Leon has plans for an Advanced Crawling Addon in the future, but currently crawling always hits the entire site. I’m not sure if it would work for your use case, but if you need to crawl a very large site, it might be more practical to do it in a cron job rather than on every change. Or if you have the dev resources, I don’t think it would be terribly hard to modify WP2Static to record changed posts and crawl just those.
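
To make the "record changed posts and crawl just those" idea a bit more concrete, here is a minimal sketch in Python rather than PHP. Everything in it is hypothetical: `ChangedUrlQueue` and `crawl_changed` are illustration names, not WP2Static code, and in a real plugin the `record_change` call would presumably be driven by a WordPress post-save hook.

```python
class ChangedUrlQueue:
    """Hypothetical queue of URLs whose posts were edited since the last crawl."""

    def __init__(self):
        self._pending = set()

    def record_change(self, url):
        # Assumption: something like a post-save hook calls this per edit.
        # A set means repeated edits to one page are crawled only once.
        self._pending.add(url)

    def drain(self):
        """Return the pending URLs in sorted order and clear the queue."""
        urls = sorted(self._pending)
        self._pending.clear()
        return urls


def crawl_changed(queue, fetch):
    """Fetch only the URLs recorded as changed; returns {url: body}."""
    return {url: fetch(url) for url in queue.drain()}
```

The obvious gap is the one mentioned earlier in the thread: a site-wide change (a menu, a widget, a theme setting) never shows up as a changed post, so a periodic full crawl would still be needed alongside something like this.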

Hi @ncrosno, sorry for the late reply from me (bit of update of situation: Where did Leon disappear to? Project updates)


Those times are definitely slow for a DO VPS with that much grunt and for so few pages.

Did you use a 1-click WP or other application, or did you set up your own Nginx/Litespeed/Apache instance?

In case you got a dodgy box on DO by chance, it may be worth spinning up a minimal Vultr or EC2 box, clone site and compare there.

Another consideration is the DNS resolution. The plugin will make requests to whatever is in your WordPress > Settings > Site URL, so let’s say that’s https://dev.mydomain.com - is that resolving locally within the VPS or does it go outside, then get pointed back to the IP? That could greatly slow down each crawl request.
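
One quick way to check the DNS point from the VPS itself is to resolve the Site URL's hostname and see what kind of address comes back. The sketch below uses only the Python standard library. The function names are made up for illustration, and "local" here just means loopback or a private range, which is an assumption about what counts as staying inside the box.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_local_address(ip):
    """True for loopback or private-range addresses."""
    addr = ipaddress.ip_address(ip)
    return addr.is_loopback or addr.is_private


def site_url_resolves_locally(site_url):
    """Resolve the Site URL's hostname and return (ip, is_local)."""
    host = urlparse(site_url).hostname
    ip = socket.gethostbyname(host)
    return ip, is_local_address(ip)
```

Running `site_url_resolves_locally("https://dev.mydomain.com")` on the server shows where crawl requests are headed. A public IP is not automatically a problem (it may be the server's own address), but it is a hint that a hosts-file entry for the hostname could be worth trying.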

The next thing I’d probably look at is the site itself - ie, run something like wrk against the site and see how many requests per second it manages - if this is slow, then any site activity will be slow.
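
If wrk isn't handy, a crude stand-in is to time a batch of sequential requests. To be clear, this is not what wrk does (wrk drives many concurrent connections); it is only a rough single-connection measurement, which is arguably closer to a one-URL-at-a-time crawl. The function names and defaults below are assumptions for illustration.

```python
import time
import urllib.request


def fetch(url, timeout=10):
    """GET url and return the number of body bytes read."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return len(response.read())


def requests_per_second(url, count=20, fetch_fn=fetch):
    """Issue `count` sequential requests and return the average rate."""
    start = time.perf_counter()
    for _ in range(count):
        fetch_fn(url)
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")
```

For a sense of scale: if a crawl of roughly 8,000 URLs takes 10 to 20 minutes, that works out to somewhere around 7 to 14 requests per second. If the site can't serve pages much faster than that when hit directly, the crawl can't either.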

Thinking out loud, I may be confusing how things work behind the scenes, but if you have some external assets, i.e. Google Fonts, remote JS or CSS, then it could (but shouldn’t) be waiting on those for each crawl iteration.

There are a few optimizations to go out in the next release, along with what John mentioned, so there should be some speedups, but what you’ve described does sound exceptionally slow.

Cheers,

Leon

I’ve made some changes to the crawl behavior at https://github.com/WP2Static/wp2static/pull/633. Crawls are instant for me since they are only crawling posts that were changed. Can you give that PR a test drive?

Hi John,
I gave it a quick try. After a small edit to the text of our “about” page, it took less than a minute for the job to run, but then when I view the file “wp-content/uploads/wp2static-processed-site/about/index.html” the edit is not there. I can see from the file’s timestamp it was at least touched, but there is no change. The file ‘wp-content/uploads/wp2static-crawled-site/about/index.html’ was not touched and still had an old timestamp.

Thanks for testing it out. I haven’t been able to produce that problem myself, but I’ve created an Advanced Crawling Addon to do more long-term testing. The plan is to add new & niche features to that addon, and possibly add the more commonly used features into core after they are tested & proven.

I’m experiencing horribly long crawl problems, i.e., this has been running all night & I’m not 1/3 done. It’s probly me bein’ stupid, but… I’m trying to make a static site on my local box for export. The site is using NextGen gallery & that plug won’t accept galleries on an external server. The shared hosting space the client’s using has hit its limits, & I’ve gotta put these galleries on an external s3 server. So my thought was to make links to the images that can later be changed. I downloaded v. 6.21 from Github, I’m running php 7.3 on a win 10 box w/16g ram & about a half tb of free space. Latest version of WP & all plugs, the majority of which I’ve disabled.

I’ve ramped up some of the processing options, but it doesn’t seem to be helping. I’m never gonna finish this project by the Oct 1 deadline the hosting company’s given me if I can’t get this moving along a bit faster.

Any help so very much appreciated.

That is indeed very long.

If you have Win 10 Pro, you can try this new Docker-based setup, which seems very fast for me across multiple sites. It is very new though:

https://lokl.dev

I don’t have a Windows box at the moment, but @gulshan was able to get this going before in Windows, by installing Docker and cURL and I think using WSL or Git bash as the terminal…

How are you running WordPress within Windows now? ie, MAMP, XAMPP, Local by FlyWheel, etc?

If your site uses Windows file paths, there’s a fix that allows the plugin to work with those. It’s not released yet, but it’s available in this build: https://github.com/WP2Static/static-html-output-plugin/files/4943854/gulshanwinziptest.zip

Even on very large sites, crawl times should not take that long. If you inspect the directory it’s generating (in wp-content/uploads), do you see the size of the directory increasing? If not, it may be stuck on an error. There may also be something interfering with the crawling, such as a Windows firewall.
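
If watching the folder by hand is awkward, a small script can sample its size twice and report whether it grew. This is only a sketch using the Python standard library; the function names and the 30-second default wait are arbitrary choices for illustration.

```python
import os
import time


def directory_size(root):
    """Total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # file vanished or was unreadable mid-scan
    return total


def is_growing(root, wait_seconds=30):
    """Sample the size twice and return (before, after, grew)."""
    before = directory_size(root)
    time.sleep(wait_seconds)
    after = directory_size(root)
    return before, after, after > before
```

Point it at the export folder under wp-content/uploads while a crawl is running. If `grew` stays False over a few samples, the export is probably stuck rather than just slow.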

Unfortunately, there are infinite variables when it comes to people’s setups, hence my recent work to make Lokl a consistent environment across all major operating systems and optimized for generating static sites.

I know about the infinite variables. I have win 10 home, not pro. The directory does not seem to be increasing in size, so yeah–it appears to be a problem.

Can I ask you this–the s3 deployment options–are those just for Amazon or can they be used on Digitalocean Spaces also?

I read your post on disappearing w/some concern, Leon. As a former physician, I think there’s reason for such. Delegate where possible. Confine yourself to a precious 1 or 2 projects, ie, decide what’s better left for someone else & concentrate on your passion. The old saying back in the 60’s “if it feels good, do it” is certainly applicable here. Take breaks. Hard as it is to fathom, your family & friends like to see you around. If you’re really depressed, please see someone. The hardest patients to lose for me were those who needed to do so but wouldn’t. Cancer & heart disease & neurological conditions were all justifiable in my book. Those conditions where the patient suicided or became addicted were much harder because it didn’t have to be like that. We computer folks tend to shove our feelings down pretty hard, & we do so to our peril. I’m talkin’ to myself here as well, understand. Many patients who had mental health issues said they found meditation of any form useful.

& I am now done preachin to you & back at this @#$% website generating problem. Thank you. Please do not stress on my account. I can do that on my own. Lol.

Thanks @abletec, really!

I think I’m low-risk for the typical suicide category, but definitely prone to falling into self-destructive patterns of eating/lifestyle. The last week has been good, with 2 productive days so far. That’s 2 more than in the last few months, so that’s an improvement, along with some other housekeeping type things. When I say productive, that’s on my own projects that I want to do, not client work or such, so those do give me a mental boost. Shifting my focus to what really matters to me, in terms of open source, minimalist tech and accessibility, is helping. It will be a slower/longer journey, but in a direction where I feel the progress :)

Back to the issue at hand - there’s an error or (less likely) an endless loop somewhere.

If you can, trace the errors starting from the browser (open the console to the Network tab and wait to see a failed request during export), then go through to the webserver and look for the PHP or webserver error logs.

But, I’d suggest switching to that more Windows-friendly build I sent you the link to (gulshanwinziptest.zip).

Re S3 compatible deployer - not in this project or WP2Static. We discussed in GitHub Issues a little while back whether we should add that to the new WP2Static S3 Add-On, but decided it makes more sense to make a standalone “S3 Compatible” add-on, as the S3 add-on has a lot of very AWS-specific stuff in it now, which is great, but we can’t be sure what’s supported for each service like MinIO, DO Spaces, Rackspace, etc.

I imagine we’ll end up with:

  • S3 specific add-on
  • S3 compatible add-on
  • other vendor specific add-ons when I get around to it or if someone contributes (easier now with WP2Static vs Static HTML Output to create a custom add-on).

Also, I didn’t learn what specific way you’re hosting WordPress within Win 10, but chances are it’s the bug around Windows filepath support, for which the aforementioned link should get you there, unless there are some other errors in the log.

Also, once it’s running for you, we can discuss ways to speed up the process/tweak the webserver a bit more, too.

Cheers,

Leon

Unfortunately, Leon, I’m in a serious time crunch right now. I can try to troubleshoot this after my project is complete, & I will, but right now I need to explore other options, including using it on a DO droplet or another plug (boo, hiss).

I also did want to thank you for your accessibility work. I use a screen reader now because me & my eyesight had a labor dispute & it walked out & did a 301 redirect to a black hole lol–or not?. At any rate, from a screen reader user’s perspective, it was dead simple–no issues. Please keep that up. You don’t have any idea how much I appreciate that, including enough to provide a little $$$ your way, whether I actually use the plug during this project or not. I believe in rewarding folks for that sort of kindness & consideration. Thank you! I will be in touch.

I understand the crunchiness!

Some alternative options listed here to go static: https://wp2static.com/alternatives/

If you just want to rip the current WP site and get it into S3, I’d probably go with HTTrack, which has a GUI for Windows. Then use any tool to sync the files into S3. I have yet another project which can do the ripping part for you, alas, Windows is the least supported of my things, so would require extra hurdles to get working there.

I haven’t looked at the accessibility of any of the plugins yet, so if it’s usable already, that’s awesome and probably just a neat side effect of keeping code minimal.

My personal site is now using a new theme for the Hugo static site generator, which I built with the help of @rkleinert, called Accessible Minimalism. If you like the screen-reader “look” of that site, I am considering porting it to WordPress, if I hear there’s some interest.

Edit: we built the theme, not Hugo - that’s a massive project with a lot of contributors!