Failed during "Crawling initial file list", please consult the Help tab for where to get assistance

With WP2Static version 6.6.7 (from the next release, the plugin is renamed to Static HTML Output), this is a common (and rather unhelpful) error message to receive:

Failed during “Crawling initial file list”, please consult the Help tab for where to get assistance

With the upcoming release, some of the causes of this issue are mitigated and steps to troubleshoot it are made easier, but here’s my general advice on how to troubleshoot this after having helped countless users over the years (and 2 more today!).

Why is it happening?

A server error has occurred that the plugin cannot recover from, so it cannot proceed with the crawl.

Has the user done something wrong?

No. If anyone's at fault, it's me, for not building the plugin and documentation in a way that avoids getting into this state. This sucks, especially for new users, who are unfamiliar with the plugin and think they've done something wrong. So, sorry about that, but let's try to remedy it!

How to solve it and allow the crawling to complete?

If you’re unsure about any of the info here, please reply with a question and I’m happy to help!

Troubleshooting this is usually a process of elimination. We check all the things that could be causing the issue. In my experience, there are quite a few different causes for this, such as server setup, browser extensions, plugin configuration and site content issues.

Here’s how we try to pinpoint the cause in order to know what to change.

Does it happen immediately after you start the export?

This means it’s unlikely to be related to any content on your site and more to do with the server setup.

When crawling, the plugin makes requests to your website, very similar to how you do when you access the site from your browser. Similar, but not identical. When you view your site, you are doing it from your browser, on your local computer. When the plugin crawls your site, it does this from within the server itself. And to do this, the plugin infers the site's URL from your WordPress Settings > Site URL address.

This is where there can be some differences between you using your browser and the plugin making its requests to your site, including:

  • DNS resolution
  • User-agent
  • HTTP version
  • IP address

Check your WordPress Site URL

For DNS resolution, let's say your WP site address is http://mydevsite.com and that's just an entry you've put in your /etc/hosts file on your local computer to point to some IP address of a development server you're using. Your browser uses the DNS from your local computer to look up that domain and map it to the IP address you told it to. The plugin has no idea about that mapping. WP will continue to work fine for you from your browser; maybe your wp-config.php file even has a line telling it to use whatever domain it's accessed from as its Site URL (often useful for running the same regular WP site on staging/production).

Unfortunately, the plugin isn’t aware of any custom setups you’ve got to access your site from your browser, so we need to ensure that it (running on the same server as WordPress) can also access the site from the same http://mydevsite.com. This may be as simple as adding an entry for that domain in your server’s /etc/hosts file.
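
For example, here's a minimal sketch of what that could look like, assuming the site's address is http://mydevsite.com (swap in your own domain; the IP is a placeholder for wherever your webserver listens):

    # /etc/hosts on the server running WordPress
    # map the site's domain back to this machine
    127.0.0.1   mydevsite.com

    # then confirm the server can reach itself by that name
    curl -I http://mydevsite.com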

As a rule, always try to ensure that your WordPress Site URL is set to the actual address that you access it from and that it can resolve to itself. For Docker-based setups, this can be a bit trickier, as the common way to run WP in the official Docker images is to map different ports from host to guest, i.e. you access http://localhost:8000 from your browser, but the container is actually serving http://localhost (port 80). In that case, the plugin is trying to connect to http://localhost:8000, which isn't being served from within the container (only port 80 is).
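
If you suspect this port-mapping mismatch, a quick test (assuming curl is available inside the container; the container name wordpress is just an example, docker ps will show yours) is:

    # the Site URL as seen from the browser (host port 8000) won't resolve inside the container
    docker exec -it wordpress curl -I http://localhost:8000

    # but port 80 inside the container should respond
    docker exec -it wordpress curl -I http://localhost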

To quickly confirm if this is the issue, SSH into your dev server and run cURL or wget, such as curl http://mydevsite.com, and see if it succeeds in accessing the site without errors.

Security measures

Any of these security measures may be in place on your dev site and could need accommodating within the plugin or adjusting within the security tool:

  • Web Application Firewall (WAF)
  • WordPress security plugins
  • HTTP basic authentication

For the WAF and WP security plugins, you may need to whitelist the user-agent for the plugin (how it identifies itself to the server when requesting a page). Your browser will use something like Mozilla/5.0 (platform; rv:geckoversion) Gecko/geckotrail Firefox/firefoxversion, while the plugin will use WP2Static.com or StaticHTMLOutput.com (above version 6.6.8).
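
One quick way to see whether the user-agent is what's being blocked is to make the same request from the server with curl, once pretending to be the plugin and once as a browser (the user-agent strings below are just the examples mentioned above):

    # request the site with the plugin's user-agent
    curl -I -A "WP2Static.com" http://mydevsite.com

    # compare with a browser-like user-agent
    curl -I -A "Mozilla/5.0" http://mydevsite.com

    # a 403 for the first but not the second suggests the user-agent is being blocked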

If you have HTTP basic authentication enabled on your dev server (a good way to keep it private), then you'd be entering an additional username and password when you access your WordPress site, before you reach the WordPress login screen. You may have saved this or not had to log in for a long time, so it's easy to forget that it also needs to be entered into the plugin, in the options page's Crawling tab.
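
To double-check the credentials before entering them into the plugin, you can test them from the server with curl (myuser and mypassword are placeholders):

    # without credentials, you should get a 401 if basic auth is enabled
    curl -I http://mydevsite.com

    # with the correct credentials, the same request should return 200
    curl -I -u myuser:mypassword http://mydevsite.com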

For some WAFs or even some hosting providers, the plugin's many frequent requests while crawling may get it flagged and its requests blocked. This would usually be seen after crawling proceeds for a while, but if you're attempting to run it again, it could fail immediately if still blocked. This is an unlikely cause, but worth including.

To rule out WP security plugins or WAFs within your hosting setup, you can try temporarily disabling them when you crawl. In reality, your development site should not be publicly accessible to bots/malicious users, so with HTTP basic authentication enabled, WordPress security plugins or WAFs should not be required, but that's a security trade-off for you to weigh for yourself.
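
If you have WP-CLI on the server, temporarily deactivating a security plugin for a test crawl is quick and easy to reverse (the wordfence slug is only an example; wp plugin list shows the exact slugs on your site):

    # list plugins and their slugs
    wp plugin list

    # deactivate a security plugin (example slug), run the export, then reactivate it
    wp plugin deactivate wordfence
    wp plugin activate wordfence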

Browser extensions

Though rare, I have seen one case where a browser extension was preventing the plugin from running properly, so you can quickly re-attempt the crawl using an incognito/private browser window without extensions, or even a different browser.

Check logs

This is usually a fast way to find any obvious errors that the server is reporting. We have a few levels of logs we can check:

  • the plugin's export log
  • WordPress debug log
  • webserver’s log
  • any additional load balancer/proxy logs
  • PHP error log
  • server error log

Note, you can also try running the plugin using WP-CLI commands if you're comfortable doing that, as most errors will then show directly in the terminal where you're running the command, so you don't need to dig around in log files.

Plugin’s error log

To get the plugin (versions 6.6.7/8) to generate an error log, we need to enable Debug Mode from the Advanced tab and re-run our export. This should then produce an Export Log, accessible from within the Advanced tab. I noticed today that this Export Log can sometimes show cached results, so you can open it up in a new tab instead, at https://mydevsite.com/wp-content/uploads/WP-STATIC-EXPORT-LOG.txt. You can then refresh this as the export is running, or at the end, to see the results. If that doesn't show any obvious errors, then we'll need to look at the other logs.
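
If the in-page viewer seems to be caching, you can also pull the log from the command line and just re-run this while the export is in progress (the domain is a placeholder for your own site):

    # fetch the export log and show the last few lines
    curl -s https://mydevsite.com/wp-content/uploads/WP-STATIC-EXPORT-LOG.txt | tail -n 20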

WordPress Debug Log

If you have WordPress debugging enabled (see https://wordpress.org/support/article/debugging-in-wordpress/), then you may find errors have been caught there and not sent to the plugin's Export Log.
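
If you have shell access, one way to switch this on is with WP-CLI (you can also define the same constants by hand in wp-config.php); with WP_DEBUG_LOG enabled, errors are written to wp-content/debug.log by default:

    # enable WordPress debugging and log errors to wp-content/debug.log
    wp config set WP_DEBUG true --raw
    wp config set WP_DEBUG_LOG true --raw
    # keep errors out of the rendered pages
    wp config set WP_DEBUG_DISPLAY false --raw

    # watch the debug log while re-running the export
    tail -f wp-content/debug.log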

Webserver’s error log

By this, I refer to your webserver, which could be Nginx, Apache (httpd), LiteSpeed or another, or a combination of these for setups with proxying in place.

Load balancer error logs

If a load balancer sits in front of your webserver, you can check its error logs, as it may have blocked requests before they reached your webserver.

PHP Error Log

Most setups will have PHP write error logs to the webserver's logfile. Sometimes this is written to another location or disabled completely. You can force an error somewhere in your WordPress site, say in your theme's functions.php, like error_log('FORCING AN ERROR');, and check that you can find it in a logfile somewhere; that's where any plugin PHP errors should also end up.
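
After adding that error_log() call and loading any page on your site, a search across the usual locations should turn it up (paths vary by distribution and setup, so treat these as examples):

    # search common log locations for the forced error
    grep -R "FORCING AN ERROR" /var/log/ 2>/dev/null

    # also check WordPress' own debug log, if WP_DEBUG_LOG is enabled
    grep "FORCING AN ERROR" wp-content/debug.log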

Server error log

Having exhausted all the logs above, you may find something in your overall system error logs, though I don't think I've had to resort to this yet; the cause usually shows up in one of the logs above.

Finding error logs

If you're using a hosting provider with a control panel, you can usually access the error logs within the control panel. For other setups, there are common locations; a quick internet search for "HOSTING_PROVIDER_NAME error logs" should give some clues, else ask here for help finding them.
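
For self-managed servers, these are some common default locations to start with; your distribution or stack may use different paths:

    # Nginx and Apache error logs (Debian/Ubuntu style paths)
    tail -n 50 /var/log/nginx/error.log
    tail -n 50 /var/log/apache2/error.log

    # PHP-FPM log (the version in the filename varies)
    tail -n 50 /var/log/php7.4-fpm.log

    # overall system log
    tail -n 50 /var/log/syslog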

Other places to look for the cause of the issue

Browser console

So, having covered error log locations, there is one more place, also uncommon, but worth checking: your browser's Console. This can show JavaScript or request-related errors that you should ask about here if not obvious. JavaScript issues would likely be caused by a conflict with another plugin or a browser extension, as mentioned above.

Your browser's Console > Network tab will also show all the requests/responses made if you open it and refresh the plugin page before starting the crawl. It will at least tell us if the request made by the plugin resulted in a 403 (forbidden) or 500 (server error) type response, which should give you hints on where to look next.

Site setup / content

Less common, but sometimes site content can cause issues. To eliminate this as a possible cause, try to export a default new WordPress installation on the same server, with no other plugins enabled and using the default TwentyTwenty theme. If that completes and your own site doesn't on the same server, then we can have more confidence the failure is due to one of your plugins, your theme or your content. Try re-enabling them one by one until you can reproduce the failure, then report the offending plugin, theme or content here.

Server setup

When the crawling progresses for a while before the failure occurs, this is often due to a server timeout or resource limit being reached, which could be one of:

  • PHP’s max_execution_time limit
  • Webserver’s timeout settings
  • CGI timeout settings
  • Load balancer timeout settings
  • Server max open files

PHP settings

Besides having the required PHP extensions installed (you’ll see such errors in the error logs), there are 2 main PHP settings which will help the plugin to export:

  • max_execution_time - increase this, or reduce the plugin's Advanced > Crawl Increment, until the export completes successfully. On many servers, this is set to 30, 60 or 90 seconds, which isn't a lot of time to crawl a whole site. The plugin's Crawl Increment setting allows each iteration of the crawling to process fewer URLs, which can help keep it under these execution limits. Better still is to increase the setting on your dev server (I usually have it at some crazy high setting like an hour. It's a development server, so I'm not too concerned about rogue long-running PHP functions). See the example snippet after this list.
  • memory_limit - I'll generally put this up to about 75% of my web server's total RAM, if it's not doing much besides hosting WordPress.
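
As a rough sketch of the kind of values I'd use on a dev server (the php.ini location varies by setup, and the web SAPI such as PHP-FPM may load a different ini file than the CLI, so adjust to taste):

    ; php.ini (example dev server values)
    max_execution_time = 3600
    memory_limit = 1536M

You can then confirm the values WordPress actually sees with WP-CLI:

    wp eval 'echo ini_get("max_execution_time") . " / " . ini_get("memory_limit") . PHP_EOL;'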

Some restrictive hosts won't allow you to adjust those settings - in which case, I say find a better host. There is some truth in the idea that plugins shouldn't need to execute long-running processes, but if you're paying for a server, you should have control over it. (This is often their way of overloading many users on one shared server and then upselling you to upgrade when things are slow, but I'll save that rant for now. I accept some blame for the plugin being resource hungry, but if all we need to do is crawl a site as quickly as possible, I want to use all the resources I'm paying for!)

Webserver timeout settings

There can be a bunch of these to configure for Nginx, Apache or whichever webserver you use. The quickest way is to open the webserver config file, search for any config option with timeout in the name and increase it. If stuck, reply here and I'll dig out a comprehensive list.
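
As a starting point, these are the directives I'd look for first; exact names and config files depend on your stack, so treat the values as examples:

    # Nginx (e.g. in the http or server block)
    fastcgi_read_timeout 3600;
    proxy_read_timeout 3600;
    send_timeout 3600;

    # Apache (e.g. in apache2.conf / httpd.conf)
    Timeout 3600
    ProxyTimeout 3600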

CGI timeout settings

I think most webservers now are using PHP-FPM, which can have its own connection timeout settings you can increase.
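
For PHP-FPM, the main one to look at is in the pool configuration (the path and service name below are common Debian/Ubuntu defaults; yours may differ):

    ; e.g. /etc/php/7.4/fpm/pool.d/www.conf
    ; set to match max_execution_time, or 0 to disable the limit
    request_terminate_timeout = 3600

    # reload PHP-FPM after changing it
    sudo systemctl reload php7.4-fpm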

Load balancer timeout settings

Again, if using a load balancer, this can have its own connection timeout settings, as with one recent troubleshooting case with AWS's ELB.
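
With AWS, for example, the relevant setting on an Application Load Balancer is the idle timeout, which can be raised from the console or the CLI (the ARN below is a placeholder; classic ELBs have an equivalent Idle Timeout setting):

    # raise the idle timeout on an Application Load Balancer (seconds)
    aws elbv2 modify-load-balancer-attributes \
      --load-balancer-arn arn:aws:elasticloadbalancing:REGION:ACCOUNT:loadbalancer/app/my-lb/LB_ID \
      --attributes Key=idle_timeout.timeout_seconds,Value=600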

Server max open files

This is one of the issues that shouldn't occur after version 6.6.8 of the plugin, but due to the way I designed earlier versions to use flat files to store lists of URLs to crawl and error logs, frequently accessing those files can hit your server's max open files limit. If this error shows up in your logs, you can do an internet search on how to increase max open files for your specific server setup.
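
To see where you currently stand, and what raising the limit might look like (exact steps depend on your distro and on whether the webserver/PHP-FPM runs under systemd):

    # check the current per-process limit
    ulimit -n

    # example /etc/security/limits.conf entries to raise it for the web user
    www-data soft nofile 65535
    www-data hard nofile 65535

    # for services managed by systemd, a LimitNOFILE= setting in the unit override applies instead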

Your internet connection

This is more a cause of issues during the deployment phase, where, on slow connections, an API such as S3 or BunnyCDN can sometimes time out when data isn't being sent fast enough (or too fast, but that's less related to this crawling failure troubleshooting guide).

That's what I can think of for now. Please reply to this if you've gone through these possibilities and are still not having any luck, or need help remedying an identified issue.

Cheers,

Leon

Good day Leon,

I've tried everything but am still getting the "Crawling initial file list" error. Will really appreciate it if you can help solve the issue as I really like your plugin and would love to use it on our site.

Hi Everyone,

I was having a Crawling initial file list error as well. The error log was telling me it was failing due to a NULL $url passed in HTMLProcessor.php on line 631. I added a conditional if (!is_null($url)) to that section of code and I no longer have the crawling error. Do you see a similar error?

Thanks for reporting, @brad!

It looks related to the bit of code which processes an image srcset tag, which there’s another open issue with at the moment.

I’ve created an issue to track this specifically

@brad, in your case, while that may have suppressed the error, I'd suggest checking that any srcsets you have are serving the correct images in your static site.

Also had the Failed during "Crawling initial file list" error. It happened after I downloaded an Astra site, then deleted all the pages I didn't want and removed all the included WooCommerce plugins and WooCommerce tables. Pulled my hair out for many hours testing online and locally. Finally tried to crawl the original unmodified Astra site and it crawled perfectly! My conclusion is that broken links (to plugins?) can throw it, even though my broken link checker hadn't found anything. Anyway, thanks for this script Leon, you deserve a gong!

P.S. Online seems to work generally better than offline (with Windows Bitnami), e.g. for me "Exclude certain URLs" works online but not offline. Fun & games!