WP2Static crawls every page on every run?

Hi

I work for a client who uses WP2Static on a site with 150,000 links detected by the Detection phase.

The Crawling phase then takes a very long time. I read the source and saw that CrawlCache is only used to skip rewriting the processed-site/webpage. If I read it correctly, WP2Static needs to fetch every URL, calculate a checksum of the output, and then compare it against CrawlCache.

I couldn’t tell from the code whether the Detection phase should detect every URL of the website or only recently modified pages.

My question: is WP2Static working as described, or is there a problem with my website, and every run should only detect modified URLs?

Thanks!

PS: the crawl phase takes several hours and it’s quite difficult to see what it’s doing (log output only every 300 steps is too infrequent).

With WordPress, it’s impossible to reliably detect which URLs have changed without actually crawling them. So the default should be to crawl everything, because most sites are not huge, and that will always work.
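To illustrate why the cache doesn’t shorten the crawl itself, here is a rough Python sketch of the checksum comparison described in the question. It is not WP2Static’s actual code (which is PHP); the `crawl` function, the `fetch` callback, the dictionary used as a cache and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
from typing import Callable, Dict, Iterable, List


def crawl(urls: Iterable[str],
          fetch: Callable[[str], bytes],
          crawl_cache: Dict[str, str]) -> List[str]:
    """Fetch every URL and return only those whose output changed.

    crawl_cache maps URL -> checksum of the last crawled output
    (a hypothetical stand-in for the real CrawlCache).
    """
    changed = []
    for url in urls:
        body = fetch(url)  # the expensive step: runs for every URL
        checksum = hashlib.sha256(body).hexdigest()
        if crawl_cache.get(url) == checksum:
            continue  # output unchanged: skip rewriting the stored page
        crawl_cache[url] = checksum
        changed.append(url)  # a real crawler would write the page here
    return changed
```

On a second run with no changes this returns an empty list, but `fetch` has still been called once per URL, which is where the time goes on a large site.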

The Advanced Crawling Add-on (works with WP2Static 7.1.6) has an option to “Crawl only changed URLs”. This option still needs some work, however. It probably needs a full review of the detection logic, and it definitely needs battle-testing. I may be able to take a look at it tomorrow.

I’ve made some changes to the “Crawl only changed URLs” option in #11. It should do a better job of picking up changes to pagination URLs, etc. It’s still experimental and needs some testing on large sites.

In #12, I’ve added an option to set the progress reporting interval.

Thanks!

I’ve sent it to my client. If he wants to install it, I’ll let you know how it goes, and I should be able to debug it with some instructions from you, if you like.

Have a nice day.


Sure. Let me know how it goes.