New ask Hacker News story: Ask HN: What does the data engineering behind LLMs look like?

6 by lostpharoah | 0 comments on Hacker News.
I've seen a lot of discussion about key aspects of LLMs like ML (research, architecture), infrastructure (GPUs, cloud), and product (ChatGPT et al.), but not much on the data engineering side. There's a lot of hand-waving, like you "just" train on the entire public Internet. There must be a ton of complexity here as well. What is the difference between web scraping and crawling? These systems are not simply indexing websites; they must be extracting and storing vast amounts of data from the sites they crawl (hence Reddit, Twitter, etc. calling foul). Do these systems rely on tons of proxy IPs? There's probably not too much going on after ingestion beyond storing all this data as text or images in an optimal format for the training system(s) to use.
