Web Sites

1.1 Description

Forming the basis of many collectivities, this table contains the list of all web sites supporting the HTTP protocol. The complete record set is updated each month with up-to-date IP addresses and server types.

This is our oldest continuously updated data set, with the original data collected in April, 1998, and updated every month since then. Our free survey report archive features web server survey reports dating back to June 1st, 1998 based on this data.

1.2 Schema

Name        Type          Description
hostname    varchar(80)   The hostname of the web site as originally found by our web spiders.
domain      varchar(80)   The domain name of the site, derived from the hostname. This may be the same as the hostname.
port        int           The port number on which the web site's HTTP service resides. A value of -1 indicates the default of 80.
ipaddr      varchar(15)   The IP address of the web server. If multiple IP addresses are returned when resolving the hostname, this will be the first IP address.
servertype  varchar(255)  The server signature string, as determined via the "Server:" HTTP header line.
jtime       varchar(13)   The julian timestamp. For historical reasons, it is represented in milliseconds, but the last 3 digits are currently always set to 0.
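
As a minimal sketch of how a consumer of this table might interpret the port and jtime fields, the Python below models one row. The class name WebSiteRow and its helper methods are hypothetical and not part of the data set. The epoch of jtime is not stated here, so the sketch only converts milliseconds to whole seconds and does not map the value to a calendar date.

```python
from dataclasses import dataclass
from typing import Tuple

DEFAULT_HTTP_PORT = 80  # the schema says port == -1 means the default of 80


@dataclass(frozen=True)
class WebSiteRow:
    """One row of the table (hypothetical helper, field names from the schema)."""
    hostname: str
    domain: str
    port: int
    ipaddr: str
    servertype: str
    jtime: str  # varchar(13): a millisecond timestamp stored as text

    def effective_port(self) -> int:
        # Translate the -1 placeholder into the real default port.
        return DEFAULT_HTTP_PORT if self.port == -1 else self.port

    def unique_key(self) -> Tuple[str, int]:
        # The unique key is hostname + port, using the stored port value.
        return (self.hostname, self.port)

    def jtime_seconds(self) -> int:
        # Drop the millisecond digits, which are currently always 000.
        return int(self.jtime) // 1000
```

For example, a row with port -1 reports an effective port of 80, while its unique key still carries the stored value -1.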

1.3 Unique Keys

hostname + port

1.4 Additional Keys

domain
port
ipaddr
servertype

1.5 Data Collection Methodology

Data collection is broken down into two distinct phases. Phase 1 consists of IP address resolution of all known hosts. Phase 2 consists of contacting the remote servers in order to collect server signature strings. This two-phase process is used because in many cases several hostnames share a web server on a common IP address. Phase 1 allows us to identify all distinct IP addresses, each of which can subsequently be contacted once and only once for the server signature.
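
The two phases described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual collection code: the function names are hypothetical, and the resolver and fetcher are passed in as plain callables so the sketch runs without network access. Hosts that do not resolve are marked NO_RESOLVE, the special value documented in section 1.7.

```python
from typing import Callable, Dict, Iterable, List

NO_RESOLVE = "NO_RESOLVE"  # special value documented for ipaddr and servertype


def resolve_hosts(hostnames: Iterable[str],
                  resolve: Callable[[str], List[str]]) -> Dict[str, str]:
    """Phase 1: map each hostname to the first IP address it resolves to."""
    host_to_ip = {}
    for host in hostnames:
        addresses = resolve(host)
        host_to_ip[host] = addresses[0] if addresses else NO_RESOLVE
    return host_to_ip


def collect_signatures(host_to_ip: Dict[str, str],
                       fetch: Callable[[str], str]) -> Dict[str, str]:
    """Phase 2: contact each distinct IP once and share the result."""
    distinct_ips = {ip for ip in host_to_ip.values() if ip != NO_RESOLVE}
    ip_to_signature = {ip: fetch(ip) for ip in distinct_ips}
    # Every hostname on a shared IP receives the same signature string.
    return {host: ip_to_signature.get(ip, NO_RESOLVE)
            for host, ip in host_to_ip.items()}
```

With two hostnames resolving to the same address, fetch is called a single time for that address and both hostnames receive the same signature.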

When the web server is contacted, an HTTP/1.0 "GET /robots.txt" request is issued.
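
To make the request concrete, here is a small sketch of what such an exchange might look like on the wire, and how the "Server:" header could be pulled out of the reply. The helper names are hypothetical, and the exact headers sent by the survey crawler are not documented here, so the request below carries none.

```python
from typing import Optional


def build_request(path: str = "/robots.txt") -> bytes:
    """Build a bare HTTP/1.0 GET request (no extra headers assumed)."""
    return f"GET {path} HTTP/1.0\r\n\r\n".encode("ascii")


def parse_server_header(raw_response: bytes) -> Optional[str]:
    """Return the value of the Server header, or None if it is absent."""
    # Headers end at the first blank line; the body is ignored.
    head = raw_response.split(b"\r\n\r\n", 1)[0]
    lines = head.decode("iso-8859-1").split("\r\n")
    for line in lines[1:]:  # skip the status line
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "server":
            return value.strip()
    return None
```

Note that the signature is taken from the response headers, so it is available whether or not the server actually has a robots.txt file.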

Each web site in the database is processed in the above fashion each month.

1.6 Known Limitations

The two-phase data collection approach means that there is a lag of up to about one week between the IP address resolution phase and the signature collection phase. If a web site changes its IP address during this interval, the IP address resolved in phase 1 may not match the site's actual IP address at the time the signature string was retrieved from the server, which is the time reported by jtime.

1.7 Other Notes

Field ipaddr can have the special value NO_RESOLVE, indicating that no name resolution could be done.

Field servertype can have the special value NO_RESOLVE, indicating that no name resolution could be done, or NO_CONTACT, indicating that no server response could be obtained when issuing an HTTP GET request.
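
When analysing the data, these special values usually need to be separated from real signatures. A minimal sketch, with hypothetical function names and status labels chosen only for illustration:

```python
from collections import Counter
from typing import Iterable, Tuple

NO_RESOLVE = "NO_RESOLVE"
NO_CONTACT = "NO_CONTACT"


def row_status(ipaddr: str, servertype: str) -> str:
    """Classify a row by the special values in ipaddr and servertype."""
    if ipaddr == NO_RESOLVE or servertype == NO_RESOLVE:
        return "unresolved"   # the hostname could not be resolved
    if servertype == NO_CONTACT:
        return "unreachable"  # resolved, but no HTTP response obtained
    return "ok"               # servertype holds a real signature string


def summarize(rows: Iterable[Tuple[str, str]]) -> Counter:
    """Count rows per status, given (ipaddr, servertype) pairs."""
    return Counter(row_status(ip, sig) for ip, sig in rows)
```

Server market-share figures computed from this table would normally be based only on the rows classified as "ok".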

At the start of each month, a fresh table with nil data values is created. At that point, new unique hosts not already in the table (newhosts) are added, and web sites that were non-responsive for the previous 3 months (expired) are removed.
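
The monthly rollover can be pictured with the sketch below. It is only an illustration: how non-responsive months are actually tracked is not described here, so the silent_months counter, the function name, and the use of None for nil values are all assumptions.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

Key = Tuple[str, int]  # (hostname, port), the table's unique key
EXPIRY_MONTHS = 3      # sites silent for this many months are expired


def start_new_month(previous_keys: Set[Key],
                    newhosts: Iterable[Key],
                    silent_months: Dict[Key, int]
                    ) -> Dict[Key, Dict[str, Optional[str]]]:
    """Build the fresh table: drop expired sites, add new hosts, nil data."""
    expired = {key for key in previous_keys
               if silent_months.get(key, 0) >= EXPIRY_MONTHS}
    # Using a set means hosts already in the table are not duplicated.
    keys = (previous_keys - expired) | set(newhosts)
    return {key: {"ipaddr": None, "servertype": None, "jtime": None}
            for key in sorted(keys)}
```

The data columns stay empty until the month's resolution and signature collection phases fill them in.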




© 1998-2010 E-Soft Inc. All rights reserved.