Language encodes geographical information

Cogn Sci. 2009 Jan;33(1):51-73. doi: 10.1111/j.1551-6709.2008.01003.x.

Abstract

Population counts and longitude and latitude coordinates were estimated for the 50 largest cities in the United States by computational linguistic techniques and by human participants. The mathematical technique Latent Semantic Analysis applied to newspaper texts produced similarity ratings between the 50 cities that allowed for a multidimensional scaling (MDS) of these cities. MDS coordinates correlated with the actual longitude and latitude of these cities, showing that cities that are located together share similar semantic contexts. This finding was replicated using a first-order co-occurrence algorithm. The computational estimates of geographical location as well as population were akin to human estimates. These findings show that language encodes geographical information that language users in turn may use in their understanding of language and the world.