Special Topic Paper

Data Generation and Validation Based on Large Language Models

侯亚杰 庄亚儿
2025, No. 4
Abstract

Traditional statistical surveys face systemic difficulties such as high costs, sample attrition, and lagging timeliness, while generative artificial intelligence (GAI) driven by large language models (LLMs) offers a new path toward transforming the data-collection paradigm. Taking the Chinese Longitudinal Healthy Longevity Survey (CLHLS) as its empirical setting, this paper constructs an LLM-based framework for generating older-adult health data, using knowledge-enhancement techniques to inject prior rules and produce high-fidelity simulations of self-rated health and activities of daily living (ADL) for the 2021 follow-up sample. The study finds that knowledge enhancement effectively overcomes the limitations of general-purpose large models, corrects model biases, and reproduces with reasonable accuracy the association patterns between health indicators and both health behaviors and demographic factors. Nevertheless, practical deployment still faces three challenges. On this basis, the paper proposes a new human-AI collaborative paradigm for social research: building "human-machine mutual feedback" at the conceptual level, "human-machine joint review" at the methodological level, and "human-machine symbiosis" at the ecosystem level.

Keywords
Large language models; data generation; validation; CLHLS
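The abstract's workflow of injecting prior rules into LLM prompts ("knowledge enhancement"), generating synthetic survey responses, and then validating whether known association patterns are reproduced can be sketched as follows. This is a minimal illustration under stated assumptions: the rule list, respondent profiles, and the `mock_llm` stand-in for an actual model call are all hypothetical, not the authors' implementation.

```python
# Sketch: knowledge-enhanced prompting for synthetic survey responses,
# followed by a validation check on an age-ADL association pattern.
# All names here (PRIOR_RULES, mock_llm, the profiles) are illustrative.
import random

# Prior rules injected into the prompt: the "knowledge enhancement" step.
PRIOR_RULES = [
    "Self-rated health tends to decline with age.",
    "ADL limitations are more prevalent above age 80.",
]

def build_prompt(profile: dict) -> str:
    """Compose a persona prompt with the prior rules prepended."""
    rules = "\n".join(f"- {r}" for r in PRIOR_RULES)
    return (
        f"Known rules:\n{rules}\n"
        f"You are a {profile['age']}-year-old {profile['sex']} respondent.\n"
        "Rate your health (good/fair/poor) and report any ADL limitation."
    )

def mock_llm(prompt: str, rng: random.Random) -> dict:
    """Stand-in for a real LLM call; samples answers consistent with the rules."""
    age = int(prompt.split("You are a ")[1].split("-")[0])
    p_poor = min(0.1 + 0.01 * max(age - 60, 0), 0.6)
    srh = rng.choices(["good", "fair", "poor"],
                      weights=[1 - p_poor - 0.3, 0.3, p_poor])[0]
    adl_limited = rng.random() < (0.5 if age >= 80 else 0.15)
    return {"srh": srh, "adl_limited": adl_limited}

def simulate(profiles, seed=0):
    """Generate one synthetic response per respondent profile."""
    rng = random.Random(seed)
    return [mock_llm(build_prompt(p), rng) for p in profiles]

def association_check(responses, profiles):
    """Validation: does generated data show higher ADL limitation at 80+?"""
    old = [r["adl_limited"] for r, p in zip(responses, profiles) if p["age"] >= 80]
    young = [r["adl_limited"] for r, p in zip(responses, profiles) if p["age"] < 80]
    return sum(old) / len(old), sum(young) / len(young)

profiles = [{"age": a, "sex": s}
            for a in (65, 70, 85, 90) for s in ("male", "female")] * 50
responses = simulate(profiles)
rate_old, rate_young = association_check(responses, profiles)
```

In a real pipeline, `mock_llm` would be replaced by an API call to a large model, and the validation step would compare the full set of association patterns (health behaviors, demographics) against the observed CLHLS data rather than a single marginal contrast.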