The Problem: The scanner that we are required to use does not recognize spider traps. Since the site I have to scan contains over 4 million records and thus over 4 million URLs, the scanner will take hours to complete a scan and then report the same issue 4 million times.
The Fix (First Iteration): When the site was smaller, I would just save an example of each page to a directory on the Web server, then scan that directory for issues. The problem with this approach was that 1) it required me to manually save these HTML files and 2) it required me to remember each file that needed to be checked.
The Fix (Second Iteration): I decided a better approach would be to create a list of all URLs that needed to be scanned, then write a server-side script to retrieve each of those pages using wget and save them to the server. But I ran into a problem. Authentication on this site is handled by a single sign-on (SSO) application that redirects the user to a login page on a different server, then redirects back after the user has successfully logged in. Maybe that could be handled in the server-side script, but I didn't want to figure out the code to do that.
The Fix (Final Iteration): It then occurred to me that a better solution would be to retrieve these files through AJAX. I would require authentication for my main page and then the AJAX calls would include my credentials. Of course client-side JavaScript can't save files to the server, but I found a way around that. Here's an overview of the solution:
My main ColdFusion page contains a list of relative URLs to test. It loops through those to generate AJAX requests, each of which returns a page's HTML. I pass the file name and encoded HTML back to ColdFusion through a second AJAX call, and that ColdFusion page saves the HTML to my specified directory. Here's a snippet of the code:
<cfset variables.count = 0>
<cfloop array="#variables.pagesToCheck#" index="variables.url">
    <cfset variables.count = variables.count + 1>
    <script>
    $(function() {
        $("#current-file").text('Processing <cfoutput>#variables.url#</cfoutput>');
        // First call: retrieve the page through the already-authenticated session.
        $.ajax({
            url: '<cfoutput>#variables.url#</cfoutput>',
            async: false,
            dataType: 'html',
            error: function (jqXHR, textStatus, errorThrown) {
                $("#pbar").progressbar({value:<cfoutput>#variables.count#</cfoutput>});
                $('#file-list').append('<li style="font-weight: bold; color: red">Could not retrieve <cfoutput>#variables.url#</cfoutput>. Error: ' + errorThrown + '</li>');
            },
            success: function (data, textStatus, jqXHR) {
                // Second call: post the file name and encoded HTML back to ColdFusion.
                $.ajax({
                    async: false,
                    type: 'POST',
                    url: 'save_html.cfm',
                    dataType: 'json',
                    data: {
                        source: escape('<cfoutput>#variables.url#</cfoutput>'),
                        html: escape(data)
                    },
                    error: function (jqXHR, textStatus, errorThrown) {
                        $("#pbar").progressbar({value:<cfoutput>#variables.count#</cfoutput>});
                        $('#file-list').append('<li>Could not process <cfoutput>#variables.url#</cfoutput>. Error: ' + errorThrown + '</li>');
                    },
                    success: function (data, textStatus, jqXHR) {
                        $("#pbar").progressbar({value:<cfoutput>#variables.count#</cfoutput>});
                        $('#file-list').append('<li>' + data.source + ' - ' + data.success + '</li>');
                    }
                });
            }
        });
    });
    </script>
</cfloop>

And then here is the source of the save_html.cfm page:
<cfset variables.file_name = ReReplace(form.source, "[^\w\_]", "-", "ALL")>
{
    "source": "<cfoutput>#form.source#</cfoutput>",
    "success":
    <cftry>
        <cffile action="write" file="#application.webroot#/508/#variables.file_name#.html" output="#URLDecode(form.html)#">
        "Saved"
        <cfcatch type="any">
            "Failed to save"
        </cfcatch>
    </cftry>
}

Now when I'm ready for 508 testing, I just run my page to create all of my HTML pages, then set my scanner to my /508 directory.
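Because save_html.cfm builds each file name by replacing every non-word character in the posted source with a hyphen, it can help to preview which names a URL list will produce before running the page. The following is a minimal JavaScript sketch of that naming rule, not part of the original solution. It assumes form.source reaches ColdFusion as the escape()d URL, and the helper names (toFileName, findCollisions) and the sample URLs are hypothetical.

```javascript
// Hypothetical helper: mirrors ReReplace(form.source, "[^\w\_]", "-", "ALL")
// from save_html.cfm, assuming form.source arrives as the escape()d URL.
function toFileName(relativeUrl) {
  // escape() is what the main page applies to `source` before posting it.
  const source = escape(relativeUrl);
  // Replace every character that is not a letter, digit, or underscore.
  return source.replace(/[^\w]/g, '-') + '.html';
}

// Hypothetical helper: group URLs whose generated file names collide,
// since a later write to the same name would replace the earlier file.
function findCollisions(relativeUrls) {
  const byName = new Map();
  for (const url of relativeUrls) {
    const name = toFileName(url);
    if (!byName.has(name)) byName.set(name, []);
    byName.get(name).push(url);
  }
  // Keep only names that more than one URL maps to.
  return [...byName.entries()].filter(([, urls]) => urls.length > 1);
}

// Sample URLs are made up for illustration.
const pages = ['/records/view.cfm?id=1', '/records/list.cfm', '/records-list.cfm'];
console.log(toFileName(pages[0])); // -records-view-cfm-3Fid-3D1.html
console.log(findCollisions(pages)); // one entry: both list URLs share a name
```

If findCollisions returns anything, two pages in the list would end up in the same file in the /508 directory, so one of them would never be scanned. Renaming one URL's entry or adding a counter to the file name would be one way around that.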