Several early web datasets have been made public at the Internet Archive to celebrate a partnership between Archive-It (a commercial service at the Internet Archive) and Archives Unleashed, a global research initiative.
The collections may be accessed directly here:
GeoCities Collection (1994–2009)
Early Web Language Datasets (1996–1999)
More information on the collections may be found here.