
Behind the Form: Google, Crawling the Deep Web, and the Impact on Search Engine Visibility

Crazy Things Really Rich Companies Do

Sort of like that weird guy at the party with an acoustic guitar and a Pink Floyd T-shirt, Google is going DEEP. Some would say…uncomfortably deep. After an already busy year, in which Google released an open source mobile operating system and a browser that is rapidly gaining market share, they recently announced that they had mapped the seabed, including the Mariana Trench. And hey, why not found a school with some of the top scientific minds and see what happens?

So Google has been more visible than ever lately, and there’s no doubt this will continue as they get into more and more projects. But let’s go down a few floors and look at something that should drastically affect how Google indexes the web. Google’s indexing programs (“spiders” or “crawlers”) collect data, analyze websites, and present the results. For all the work that BEM Interactive’s search engine marketing team puts into making sites attractive to spiders (and there’s a lot we can do to make those spiders love a site), the spider programs themselves are pretty straightforward: access a site’s page index, check the structure and content, and compare it against what Google has determined to be “relevant” or “popular.”
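To make that concrete, here’s a minimal sketch of what a link-following crawler boils down to. This is a toy, not anything resembling Google’s actual code: it uses only the Python standard library, and the start URL and page limit are placeholders.

```python
# A toy breadth-first crawler: fetch a page, record it, queue its links.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect the href target of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, max_pages=50):
    seen, queue, index = set(), deque([start_url]), {}
    while queue and len(index) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue  # unreachable page or unsupported link type
        index[url] = html  # a real crawler scores and stores content here
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            queue.append(urljoin(url, link))  # resolve relative links
    return index


if __name__ == "__main__":
    pages = crawl("https://example.com")
    print(f"Indexed {len(pages)} pages")
```

A real crawler layers on politeness rules, duplicate detection, and the relevance and popularity scoring mentioned above, but the core loop really is that simple: follow links, fetch pages, repeat.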

But because of the way these programs are written, there are certain areas they just can’t get to…namely, pages that require information, input, or human action. As a basic example, there is usually a confirmation page after a user submits a “Contact Us” or “Newsletter Signup” form; it might contain a promotional code or some other unique piece of information. This dynamically generated content (it could also be a search results page, a calculation or conversion, even the output of a symptom tool on a medical site) simply doesn’t exist until the user creates it. Depending on what you entered into the form, the resulting page is yours and yours alone, so try to ignore that tingle of omnipotence the next time you Google something.

But search engine spiders can’t understand what a form is asking for or what information it delivers to the user, and even if they could, how would they figure out what to enter to generate relevant content? Dropdown boxes, category selections, postal code fields – any of these can keep data from being indexed. Collectively, this locked-away data is known as the “Deep Web.” By some estimates, the Deep Web contains a staggering amount of data, several orders of magnitude more than is currently searchable. Because they rely primarily on sitemaps and hyperlinks, search engine crawlers simply have no way to reach it.
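To see why, here’s a made-up page of the kind that stops a link-only crawler cold. The markup is invented for this example, but the shape is typical of an inventory-search or zip-code form:

```python
# A hypothetical search form: the results live behind a form submission,
# not behind an <a href>, so the toy crawler above finds nothing here to follow.
SEARCH_PAGE = """
<form action="/inventory" method="get">
  <select name="category">
    <option value="minivans">Minivans</option>
    <option value="trucks">Trucks</option>
  </select>
  <input type="text" name="zip" placeholder="Zip code">
  <input type="submit" value="Search">
</form>
"""
# The results page – /inventory?category=minivans&zip=48226, say – only
# exists once somebody (or something) fills in the form and submits it.
```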

So can Google really expect to find, record, and interpret this data? Well, between mapping the ocean and opening a school that’s likely to discover the meaning of life before lunch, Google did just that. Working with scientists at Cornell and UCSD, Google researchers (who I just hope won’t turn into supervillains at some point) have devised a method for their spiders to fill out and submit HTML forms with intelligently chosen content. The resulting pages are then indexed, treated like any other indexed data, and displayed in search results; in fact, right now, content collected from behind an HTML form appears on the first page of Google search results a thousand times per second. The methods the bots use are fascinating, but then I’m a Nerd McNerdleson about this kind of thing, so we won’t dive into the technicalities here – check out the article if you’re interested.
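The general idea of “surfacing” a form can, though, be sketched in a few lines. Everything below is my own simplification – the function name, the candidate values, and the example site are invented – but it captures the gist: for a form that submits via GET, pick plausible values for each input, generate the combinations, and turn each one into an ordinary URL that the regular crawler can fetch and index.

```python
# A rough sketch of form "surfacing" – an illustration of the general
# technique, not Google's actual implementation.
from itertools import product
from urllib.parse import urlencode, urljoin


def surface_form(page_url, form_action, fields):
    """fields maps each form input name to a list of candidate values."""
    names = list(fields)
    urls = []
    for combo in product(*(fields[name] for name in names)):
        query = urlencode(dict(zip(names, combo)))
        urls.append(urljoin(page_url, form_action) + "?" + query)
    return urls


# Candidate values: <select> options come straight from the markup; the
# guesses for the free-text zip field are invented for this example.
candidates = {
    "category": ["minivans", "trucks"],
    "zip": ["48226", "10001"],
}
for url in surface_form("https://example-dealer.com/search", "/inventory", candidates):
    print(url)  # each of these becomes a normal page in the crawl queue
```

The hard part, presumably, is choosing candidate values that yield distinct, genuinely useful result pages rather than millions of empty ones – which is exactly the sort of problem you bring in Cornell and UCSD researchers to solve.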

That’s cool… NERD. But what does it mean?

Everyone knows that Google loves relevance – its entire business model is based on it. This technology is all about getting exactly what the user is looking for and providing it immediately without even requiring them to visit any pages outside of the Google results page! Chilling.

Say you’re not feeling well. Instead of typing “symptom checker” and finding a WebMD-like page, you type “cough, runny nose, strange swelling similar to bubonic plague” directly into the search engine. Google – whose spiders have already filled out every medical symptom form out there, queried them in endless varieties and combinations, and determined the relevance and popularity of the results – immediately comes back with “You have the Black Death,” and you recover (or… maybe not).

From a retail standpoint, many sites have features that generate product listings based on user input. As it stands now, a buyer looking for a red, American-made minivan with less than 30,000 miles finds the appropriate website and enters their criteria, after which the website queries its database and returns the results. If Google continues to move forward with its deep web crawls, this information could be displayed directly in the search results without the user going anywhere other than Google (and if the user makes a purchase, does Google get a cut? Hmm….)
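In practice, “surfacing” that minivan search just means the buyer’s criteria become an ordinary GET URL. The sketch below is hypothetical – the site and parameter names are made up – but once a URL like this has been discovered and indexed, it’s a page Google can serve straight from its results:

```python
# The buyer's criteria, expressed as a crawlable, indexable URL.
from urllib.parse import urlencode

criteria = {"make": "chrysler", "color": "red", "max_mileage": 30000}
results_url = "https://example-dealer.com/search?" + urlencode(criteria)
print(results_url)
# https://example-dealer.com/search?make=chrysler&color=red&max_mileage=30000
```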

This is obviously a huge step forward in search technology, and in an industry that seems to change by the hour, it represents a new method of obtaining and presenting information. As web marketers, this is another variable, another challenge to consider in our work: how do we optimize pages that can be generated in a seemingly unlimited number of ways? With search engines becoming more powerful and their data mining capabilities reaching ever deeper, will there come a time when all data is presented through one aggregated portal? That may be years down the road, but the technology and the foundation are here now; forward-thinking companies and web marketers need to be there too.
