Search Engines demystified.


By Fixxxer

A search engine is a tool for finding information on the internet. It is a coordinated set of programs that searches an index (a database of indexed web pages) and returns the matches for a specific keyword or phrase. All search engines consist of three parts: A) a database of web documents, B) a search program operating on that database, and C) a program (or series of programs) that determines how the results are displayed. The results are then displayed on a search engine results page (SERP).
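To make the three parts a little more concrete, here is a minimal Python sketch. It is an illustration only, not how any real search engine is built: the example URLs, the DOCUMENTS table and the function names build_index, search and render_results are all invented for this example.

```python
from collections import defaultdict

# Hypothetical miniature "database of web documents": URL -> page text.
DOCUMENTS = {
    "http://a.example/": "search engines index web pages",
    "http://b.example/": "crawlers copy web pages for the index",
    "http://c.example/": "robots follow links",
}


def build_index(documents):
    """Part A: map each lowercase word to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Part B: return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return matches


def render_results(urls):
    """Part C: format the matches as a numbered plain-text results page."""
    return "\n".join(f"{n}. {url}" for n, url in enumerate(sorted(urls), start=1))


INDEX = build_index(DOCUMENTS)
```

Calling render_results(search(INDEX, "web pages")) lists the two example pages that contain both words. A real engine would also rank the matches, which this sketch leaves out.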

Where does the information come from? Search engines are classified by the way they gather their data. There are three types: those that use crawlers (also called robots or spiders), those that rely on human submissions, and those that are a hybrid of the two. The most important tool for gathering information is the crawler. Although many websites are indexed because of human submissions, it is still the crawler that goes to the website to index the data. Human, or manual, submission is normally done when you have a new website with no inbound links to it. You register your website with a search engine, which in turn adds your website’s URL to the list of websites to be crawled.

Crawler

A search engine crawler is an automated software program used to locate and collect data from web pages for inclusion in a search engine’s database. Crawlers browse the web in a methodical, automated manner. The first web crawler was called the “World Wide Web Wanderer”. It was developed at MIT (Massachusetts Institute of Technology), initially to measure the growth of the web. Soon after that, an index was created from the results, which was effectively the first “search engine”. The first crawlers were simple creatures. They were limited in the information they could gather from a website: only the content of the meta tags. Search engines (or the companies behind them) soon realized that this method was very ineffective. Crawlers were redesigned to index more than just meta tags. They started indexing other information, like visible text, alt text (used on images and other visual elements) and non-HTML content such as PDF files and word processor documents.
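As a rough illustration of the kinds of information mentioned above, the following Python sketch pulls the meta tags, the alt text and the visible text out of a small page using the standard library’s html.parser module. The class name IndexableTextParser and the sample page are made up for this example, and real crawlers are certainly more elaborate.

```python
from html.parser import HTMLParser


class IndexableTextParser(HTMLParser):
    """Collects meta tag content, image alt text and visible text."""

    def __init__(self):
        super().__init__()
        self.meta = {}          # meta name -> content
        self.alt_texts = []     # alt text of images
        self.visible_text = []  # text outside <script> and <style>
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name")
        if tag == "meta" and name:
            self.meta[name.lower()] = attrs.get("content") or ""
        elif tag == "img" and attrs.get("alt"):
            self.alt_texts.append(attrs["alt"])
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.visible_text.append(text)


# Hypothetical sample page, used only to exercise the parser.
PAGE = """<html><head>
<meta name="keywords" content="search, crawler">
<style>p { color: red; }</style>
</head><body>
<p>Crawlers read text.</p>
<img src="spider.png" alt="A spider">
</body></html>"""

parser = IndexableTextParser()
parser.feed(PAGE)
parser.close()
```

After feeding the sample page, parser.meta holds the keywords, parser.alt_texts holds “A spider” and parser.visible_text holds the paragraph text, while the style rules are skipped.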

So how exactly do they work?

Web crawlers are a central part of search engines, and the details of their algorithms and architecture are kept as business secrets. We do, however, know a little about the manner in which they collect their data. Generally, the crawler gets a list of URLs to visit and store. The crawler doesn’t rank the pages; it only goes out and gets copies, which it stores or forwards to the search engine to be indexed and ranked later according to various criteria.

Let’s imagine there are only three websites on this list: websites A, B and C. Starting with website A, the crawler will go out, get a copy of the homepage and send it back to the repository. It will then go through each hyperlink and follow it to locate every other page of website A. These pages will also be copied and sent back to the search engine for indexing. (It is important to remember here that the crawler will send copies of a page to the search engine, but not of the images on the page. Crawlers only read text. It is the search engine that will then go through each page, including the meta tags, content and alt text, and index this page for the search results.)

If website A links to website B, then the crawler will automatically follow this link when it comes across it, and start indexing website B. This might happen even while the crawler is still busy with website A. Remember that the crawler follows each link and copies as it goes along. The crawler can also just store the link and visit website B at a later stage. (We will assume for argument’s sake, though, that it follows it immediately.) Because of the change in URL from website A to website B, however, the contents of website B won’t be indexed as part of website A. If in this case website B links to website C, the crawler will follow the link and the same process will happen on website C.

Now you might wonder: “What happens if website C links back to website A? Will the crawler get stuck in a loop?” Luckily, the answer is no. Crawlers keep track of the URLs they have already visited, and they usually perform some type of URL normalization in order to avoid crawling the same resource or URL more than once in the same crawl run. The term URL normalization, also called URL canonicalization, refers to the process of modifying and standardizing a URL in a consistent manner. Examples include converting the scheme and host name of a URL to lowercase, removing “.” and “..” segments, and adding trailing slashes to non-empty paths. This helps the crawler avoid getting stuck in a loop between any number of websites.
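The loop question can be played through in a few lines of Python. This is only a sketch under stated assumptions: the LINKS table is an invented stand-in for the web (websites A, B and C), the function names normalize and crawl are hypothetical, and the normalization is deliberately partial. It lowercases the scheme and host, resolves “.” and “..” segments, drops the fragment, and adds a slash only to an empty path.

```python
from collections import deque
from posixpath import normpath
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Return a canonical form of url (a partial normalization, see lead-in)."""
    parts = urlsplit(url)
    path = parts.path or "/"            # an empty path becomes "/"
    keep_slash = path.endswith("/")
    path = normpath(path)               # resolves "." and ".." segments
    if keep_slash and not path.endswith("/"):
        path += "/"                     # normpath strips a trailing slash
    # Assumes the URL carries no case-sensitive user info before the host.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))


# Hypothetical link graph standing in for the web: page URL -> links on it.
LINKS = {
    "http://a.example/": ["http://a.example/about", "http://B.example/"],
    "http://a.example/about": ["http://a.example/./"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": ["HTTP://A.EXAMPLE/"],  # website C links back to A
}


def crawl(seeds, fetch_links):
    """Visit every reachable URL once and return the visiting order."""
    seen = set()
    queue = deque()
    order = []
    for url in seeds:
        url = normalize(url)
        if url not in seen:
            seen.add(url)
            queue.append(url)
    while queue:
        url = queue.popleft()
        order.append(url)               # a real crawler would store a copy here
        for link in fetch_links(url):
            link = normalize(link)
            if link not in seen:        # already queued or visited: skip it
                seen.add(link)
                queue.append(link)
    return order


visited = crawl(["http://a.example/"], lambda url: LINKS.get(url, []))
```

Here visited ends up containing four URLs, each exactly once. The link from website C back to “HTTP://A.EXAMPLE/” normalizes to a URL that is already in the seen set, so the crawl stops instead of looping.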

Can crawlers be controlled?

The answer is yes: in a way, they can be. Often there are certain parts of your site that you don’t want indexed, or certain folders, containing images and such, that you don’t want the search engines to have access to. In these cases you can control the crawler. There are two ways to do this:

Robots Meta Tag

<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">

The robots meta tag is used as a command to the spiders so that they know what to do when encountering a specific page on your web site. If the command instructs the robot to “index”, then your page will be crawled and indexed. If the command instructs the robot to “follow”, then the spider will follow the links from this page. If the command instructs the robot not to index a page, “noindex”, the page will NOT be added to the index. If the command instructs the robot not to follow, “nofollow”, then links on this page will not be followed. Only one set of commands should be given to the robot. Decide which commands to give and place that tag on the corresponding page of your web site. Not every robot supports this tag, however. A more effective way is through the robots.txt file.
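To show how a well-behaved robot might read this tag, here is a small Python sketch using the standard library’s html.parser module. The names RobotsMetaParser and robots_policy are invented for this example. It also assumes that a page without the tag is treated as “index,follow”, which is the usual default.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of any <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(
                item.strip().lower() for item in content.split(",") if item.strip()
            )


def robots_policy(html):
    """Return (may_index, may_follow); both are True unless the tag forbids it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    may_index = "noindex" not in parser.directives
    may_follow = "nofollow" not in parser.directives
    return may_index, may_follow
```

For example, robots_policy('<meta name="robots" content="noindex,follow">') returns (False, True): the page stays out of the index, but its links may still be followed.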

Robots.txt

To control crawlers effectively, you use a file called robots.txt. This is a file with instructions that the crawler will follow. The file is placed in the root of your domain, and the crawler will find it automatically. The robots.txt file is a plain text file, with entries giving the instructions. Each entry has just two lines:

User-Agent: [Spider or Bot name]
Disallow: [Directory or File Name]

The Disallow line can be repeated for each directory or file you want to exclude, and the whole entry can be repeated for each spider or bot you want to give commands to.

1. Exclude a file from an individual Search Engine

You have a file, privatefile.htm, in a directory called ‘private’ that you do not wish to be indexed by Google. You know that the spider that Google sends out is called ‘Googlebot’. You would add these lines to your robots.txt file:

User-Agent: Googlebot
Disallow: /private/privatefile.htm

2. Exclude a section of your site from all spiders and bots

You are building a new section of your site in a directory called ‘newsection’ and do not wish it to be indexed before you are finished. In this case you do not need to specify each robot that you wish to exclude; you can simply use a wildcard character, ‘*’, to exclude them all:

User-Agent: *
Disallow: /newsection/

Should you have more than one directory that you want to disallow, you do not need to make a new entry for each one. You can just list them one after the other, like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /_borders/
Disallow: /_derived/
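If you want to check what a set of rules like these actually allows, Python’s standard library includes a robots.txt parser, urllib.robotparser. The sketch below combines the examples above into one hypothetical file. The host www.example.com, the bot name OtherBot and the helper load_rules are made up for the illustration, and a real crawler may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt combining the examples above.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/privatefile.htm

User-agent: *
Disallow: /newsection/
Disallow: /cgi-bin/
"""


def load_rules(text):
    """Parse robots.txt content given as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Googlebot is kept away from the private file ...
googlebot_private = rules.can_fetch(
    "Googlebot", "http://www.example.com/private/privatefile.htm")

# ... while any other bot is kept out of the new section.
otherbot_newsection = rules.can_fetch(
    "OtherBot", "http://www.example.com/newsection/index.htm")
```

Both checks come back False, meaning the fetch is not allowed. One thing worth noticing: with this file, rules.can_fetch("Googlebot", "http://www.example.com/newsection/index.htm") returns True, because a bot that has its own entry follows that entry and not the ‘*’ one. If Googlebot should stay out of ‘newsection’ too, that Disallow line has to be repeated in the Googlebot entry.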

 

 
