New ask Hacker News story: Ask HN: What does the data engineering behind LLMs look like?

April 19, 2023

Ask HN: What does the data engineering behind LLMs look like?
6 by lostpharoah | 0 comments on Hacker News.
I've seen a lot of discussion about key aspects of LLMs such as ML (research, architecture), infrastructure (GPUs, cloud), and product (ChatGPT et al.), but not much on the data engineering side. There's a lot of hand-waving, like you "just" train on the entire public Internet. There must be a ton of complexity here as well. What is the difference between web scraping and crawling? These systems are not simply indexing websites; they must be extracting and storing vast amounts of data from the crawled sites (hence Reddit, Twitter, etc. calling foul). Do these systems rely on tons of proxy IPs? There's probably not too much going on after ingestion beyond storing all this data as text or images in an optimal format for the training system(s) to use.
