Models and Methods
LI Lingling, LIU Jinsong, LI Zhi, WEN Peizhang, LI Yancheng, LIU Yi
The random forest model is a mainstream method for accurately describing the laws and impact mechanisms of regional population distribution. Taking Shijiazhuang as the experimental area and its endowment zones as the modeling units, we carried out stratified sampling at the hectare grid scale and conducted systematic experiments to determine the factors influencing population density. An optimized random forest model was applied throughout the entire process of zoning modeling, stratified sampling, factor selection, and output weighting. Four main conclusions can be drawn: (1) Zoning before modeling prevented the model from confusing the population distribution laws of different zones. Sampling at the raster-unit level not only freed the training samples from the modifiable areal unit problem (MAUP), but also formally reduced the negative effect of the ecological fallacy. Stratified sampling ensured the stability of the maximum population density in the training samples. (2) The experiments to determine the factors influencing population density were conducted separately in each zone, and introducing these factors significantly improved the fit (R²) of the model. Distance to a settlement was the dominant factor influencing population density in every zone. The geographical mechanisms influencing population distribution differed significantly between regions: innovation endowment factors had the strongest impact on population density in urban areas, while natural endowment factors had the strongest impact in rural areas. (3) The optimized combination of the population density prediction datasets significantly improved the robustness of the model. (4) The population density datasets exhibited multi-scale superposition. 
At the large scale, population density in the plain area was higher than in the mountain area, whereas at the small scale, population density in urban areas was higher than in rural areas, consistent with the characteristics of a core-periphery model. The optimized scheme of the population density random forest model provided a unified technical framework for determining the factors that control the local population distribution and the geographical mechanisms that influence it.
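
The stratified sampling step described above can be illustrated with a minimal sketch. The abstract does not state the stratification variable or the sample sizes, so everything here is an assumption: strata are taken to be equal-count quantiles of grid-cell population density, the function name `stratified_sample` and its parameters are hypothetical, and the density values are synthetic. The point is only to show why drawing evenly from each stratum keeps high-density cells in the training sample, which simple random sampling of a skewed distribution does not guarantee.

```python
import random


def stratified_sample(densities, n_strata=4, per_stratum=5, seed=0):
    """Return indices of grid cells drawn evenly from density strata.

    Assumption: strata are equal-count quantiles of the density values;
    the source does not specify how its strata were defined.
    """
    rng = random.Random(seed)
    # Rank cell indices from lowest to highest density.
    order = sorted(range(len(densities)), key=lambda i: densities[i])
    n = len(order)
    sample = []
    for k in range(n_strata):
        # Stratum k holds the k-th equal-count slice of the ranked cells.
        stratum = order[k * n // n_strata:(k + 1) * n // n_strata]
        take = min(per_stratum, len(stratum))
        # Sample without replacement inside the stratum.
        sample.extend(rng.sample(stratum, take))
    return sample


# Synthetic, skewed densities (persons per hectare): many sparse cells,
# few dense ones. These values are illustrative, not data from the study.
gen = random.Random(42)
densities = [gen.expovariate(1 / 20) for _ in range(1000)]

idx = stratified_sample(densities, n_strata=4, per_stratum=25)
top_quartile_cut = sorted(densities)[750]
n_dense = sum(1 for i in idx if densities[i] >= top_quartile_cut)
print(len(idx), n_dense)
```

In this sketch, a quarter of the sampled cells always come from the densest quartile, so the upper end of the density range is represented in every draw regardless of the random seed.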