New ask Hacker News story: Tadpole the Language for Scraping 0.2.0 – Complex Control Flow, Stealth and More
4 by zachperkitny | 0 comments on Hacker News.
Hello, I posted a few weeks ago about my custom scraping language. It definitely got some traction, which was very exciting to see.

GitHub Repo: https://ift.tt/rMmP1qR

Docs: https://tadpolehq.com/

Over the past 2 weeks, I've been focusing my efforts on introducing specific stealth actions, more complicated control flow actions and a variety of evaluators for cleaning data.

Here is an example for scraping from `books.toscrape.com`:

  main {
    new_page {
      goto "https://ift.tt/dZbPgIN"
      loop {
        do {
          $$ article.product_pod {
            extract "books[]" {
              title { $ "h3 a"; attr title }
              rating {
                $ ".star-rating"
                attr "class"
                extract "star-rating (One|Two|Three|Four|Five)" caseInsensitive=#true
                func "(v) => ({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}[v.toLowerCase()] || null)"
              }
              price { $ "p.price_color"; text; as_float }
              in_stock { $ "p.availability"; text; matches "In stock" caseInsensitive=#true }
            }
          }
        }
        while { $ "li.next" }
        next {
          $ "li.next a" { click }
          wait_until
        }
      }
    }
  }

I've introduced actions like `apply_identity` to override User Agent Headers and User Agent Metadata. Here is an example module to selectively create different identities:

  module stealth {
    // Apple M2 Pro
    action apply_apple_m2 {
      apply_identity mac
      set_webgl_vendor "Apple Inc." "Apple M2"
      set_device_memory 16
      set_hardware_concurrency 8
      set_viewport 1440 900 deviceScaleFactor=2
    }

    // Windows Desktop
    action apply_windows_16_8 {
      apply_identity windows
      set_webgl_vendor "Google Inc. (Intel)" "ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0)"
      set_device_memory 16
      set_hardware_concurrency 8
      set_viewport 1920 1080
    }

    // Windows Budget Laptop
    action apply_windows_8_4 {
      apply_identity windows
      set_webgl_vendor "Google Inc. (Intel)" "ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0)"
      set_device_memory 8
      set_hardware_concurrency 4
      set_viewport 1366 768
    }
  }

The full release changelog is available here: https://ift.tt/2ftLIYy

My goals for the next release, 0.3.0, are to focus heavily on Plugins, Distributed Execution through Message Queues, Redis Support for Crawling, and Static Parsing as an alternative to running exclusively over CDP/Chrome.

I will keep trying to hold my release cadence at one release every 2 weeks!
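
As a rough illustration of what the `rating`, `price` and `in_stock` evaluator chains in the books example compute, here is a minimal TypeScript sketch. It is not Tadpole's implementation: the function names (`parseRating`, `parsePrice`, `parseInStock`) are hypothetical, and the behavior assumed for `as_float` (dropping non-numeric characters such as a currency symbol before parsing) is a guess rather than something the post states.

```typescript
// Hypothetical helpers mirroring the evaluator chains in the books example.
// None of these names come from Tadpole; they only illustrate the data cleaning.

const RATING_WORDS: Record<string, number> = {
  one: 1,
  two: 2,
  three: 3,
  four: 4,
  five: 5,
};

// rating: attr "class" -> extract (case-insensitive regex capture) -> func (word to number)
function parseRating(classAttr: string): number | null {
  const match = /star-rating (One|Two|Three|Four|Five)/i.exec(classAttr);
  if (match === null) return null;
  const word = (match[1] ?? "").toLowerCase();
  return RATING_WORDS[word] ?? null;
}

// price: text -> as_float (ASSUMPTION: non-numeric characters such as "£" are dropped)
function parsePrice(text: string): number | null {
  const cleaned = text.replace(/[^0-9.\-]/g, "");
  const value = Number.parseFloat(cleaned);
  return Number.isNaN(value) ? null : value;
}

// in_stock: text -> matches "In stock" (case-insensitive)
function parseInStock(text: string): boolean {
  return /In stock/i.test(text);
}

console.log(parseRating("star-rating Three")); // 3
console.log(parsePrice("£51.77")); // 51.77
console.log(parseInStock("  In stock  ")); // true
```

The sketch only covers the per-field cleaning. Element selection, the `loop`/`while`/`next` pagination and the `books[]` accumulation are handled by the language itself in the example.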