Pretraining

A Map for Studying Pre-training in LLMs

Data Collection
    General Text Data
    Specialized Data

Data Preprocessing
    Quality Filtering
    Deduplication