
Survey History

The survey was started back in May of 1998. It started as a simple web server survey and SSL server survey, and has over the years grown to include other technology surveys (e.g. mail servers, real time streaming protocol servers).

Raw data is collected each month and added to the database of information we have accumulated over the years, which is then used to produce the monthly reports and the querying capabilities now available on-line.

Survey Methodology

1.1 Guiding Principles

Early on, several key design decisions were made regarding our surveys.

  • Attempt to avoid bias whenever possible.
  • Be consistent.
  • Only use information that is free from any restrictions on re-use.

Some examples of techniques that we explicitly avoid because they break one of the above principles include:

  • Never perform domain zone transfers to discover additional hosts. Zone transfers are unreliable. Many name servers block zone transfers, and as web site security has become more important, fewer and fewer zones are accessible. Because the ability to do zone transfers may be driven by security concerns, performing transfers would introduce a bias on the hostname discovery process, and is thus rejected.

  • Never guess host names. The top 100 host names account for 65% of all known hostnames in use. So if we find a domain "yourdomain.com", the odds are really good that one of several hostnames is in use in the domain, such as "www", or "secure", or "mail". That makes guessing a particularly tempting way to discover hosts. However, doing so overlays a bias (the selection of a precanned set of hostnames) onto an existing data set, and as such, is rejected.

  • Never use top level domain name server data.[1] The problem with top level domain (TLD) data is that we cannot get access to ALL top level domains, and as such, the acquisition of TLD data introduces a bias based on which TLDs we include. We have already seen in our data sets that certain countries demonstrate clear trends towards certain technologies. In order to fairly represent market share, the same collection methodology must be used across all countries, languages, etc.

    And if that wasn't enough, TLD data can come with licensing restrictions that would taint our entire data set should we choose to use it. So even if we could get access to all TLD data, it is unlikely we would use it, because it would restrict what we could do with the data we have collected.

    [1] It is highly suspected that another popular set of surveys uses TLD data to acquire host names, based on the fact that a number of domains purchased by us were queried by said surveying organization before the domains were ever publicly announced.

The result is a methodology in which a host is included in our survey if and only if we find the host referenced by someone else on the web through a link of one form or another.
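The inclusion rule above can be illustrated with a small Python sketch. It is not the survey's actual crawler code; the names `LinkCollector` and `discover_hosts` are hypothetical, and the choice to look only at `href` and `src` attributes is an assumption made for brevity. The point it shows is that host names come only from links actually present in a page, never from guessing, zone transfers, or TLD data.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the raw href/src attribute values found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def discover_hosts(html, base_url):
    """Return the set of host names that the page explicitly references."""
    collector = LinkCollector()
    collector.feed(html)
    hosts = set()
    for link in collector.links:
        # Resolve relative links against the page URL before splitting.
        parts = urlsplit(urljoin(base_url, link))
        # Only web links count; mailto:, javascript:, etc. are ignored.
        if parts.scheme in ("http", "https") and parts.hostname:
            hosts.add(parts.hostname.lower())
    return hosts
```

Note that a relative link resolves to the host of the page itself, so that host appears in the result as well. A host that no page links to never enters the set.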

1.2 Crawlers and Polling

We have two distinct types of data collection activities that operate on an ongoing basis.

  • Crawlers run non-stop and work much as any web crawler does, spidering a web site. Crawlers are responsible for finding links to new web sites and services, making note of the types of embedded content found in web pages (e.g. frames, applets, image types), and recording other interesting attributes of pages and their headers (cookies, privacy policies, etc.). Approximately 10% of the known web sites are crawled each month.

  • Polling processes are scheduled monthly and are responsible for refreshing an entire data set each month. For example, every IP address known to host a web server is visited to determine the type of web server operating on it. Every domain has several DNS queries issued against it: resolving the IP addresses of known hosts, determining the location of name servers and mail servers, and querying for any TXT records (e.g. for SPF usage).

All data collected for our surveys comes exclusively from one of the above two activity types.
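As a rough illustration of what a single polling step might look like, the Python sketch below reads the `Server` response header from a web server and checks a list of TXT strings for an SPF policy. It is a minimal sketch under stated assumptions, not the survey's implementation: the function names are hypothetical, identifying a server by its `Server` header is only one possible technique, and the TXT strings are assumed to have been fetched already with whatever DNS library is at hand (the Python standard library does not issue TXT queries).

```python
import http.client


def fetch_server_header(host, port=80, timeout=10):
    """Issue a HEAD request and return the raw Server header, or None."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", "/")
        return conn.getresponse().getheader("Server")
    finally:
        conn.close()


def server_product(server_header):
    """Reduce a header such as 'Apache/2.2.8 (Unix)' to 'Apache'."""
    tokens = server_header.split() if server_header else []
    if not tokens:
        return "unknown"
    # The first token is 'product/version'; keep only the product name.
    return tokens[0].split("/")[0]


def has_spf(txt_records):
    """True if any TXT string is an SPF policy (version tag 'v=spf1')."""
    for record in txt_records:
        text = record.strip().lower()
        # The version tag must stand alone or be followed by a space.
        if text == "v=spf1" or text.startswith("v=spf1 "):
            return True
    return False
```

A monthly poll would call something like `fetch_server_header` once for every known IP address and tally the results of `server_product`. Servers that omit or falsify the header are counted as reported, which is a limitation of this simplified approach.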




© 1998-2008 E-Soft Inc. All rights reserved.