The Traders' Den  

  The Traders' Den > Where we go to learn ..... > Technobabble
 

  #1  
Old 2024-09-13, 09:12 AM
mjcrossuk mjcrossuk is offline
424.65 GB/1.88 TB/4.54
 
Join Date: Mar 2007
Any good Python coders here? Especially with regard to web scraping?

I'm trying to scrape the torrent details from TTD to put into a database which I would make available online, in the event that TTD closes down.

Please see this thread for more details: http://www.thetradersden.org/forums/...d.php?t=202461

I've not used the requests package before, and I'm having problems.

I'm starting from this page: http://www.thetradersden.org/forums/....php/f-11.html, then going into the list pages for some of the categories (e.g. Audio, Audio Inactive, Audio Pulled). The next-level pages list torrents, 250 per page, each with a link to the torrent detail thread. For example, http://www.thetradersden.org/forums/....php/f-12.html is page 1 of 147 listing Active Audio torrents.
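
A rough sketch of how the thread links could be pulled out of one list page, assuming the links really do end in `t-<number>.html` as in the example URLs in this thread. The names `ThreadLinkParser` and `extract_thread_links` and the sample markup are made up for illustration; only the standard library is used.

```python
import re
from html.parser import HTMLParser

# Assumption: thread links end in "t-<digits>.html", like the URLs quoted in this thread.
THREAD_LINK = re.compile(r"t-(\d+)\.html$")


class ThreadLinkParser(HTMLParser):
    """Collect (thread_id, href) pairs from the <a> tags of a list page."""

    def __init__(self):
        super().__init__()
        self.threads = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value can be None.
        href = dict(attrs).get("href") or ""
        match = THREAD_LINK.search(href)
        if match:
            self.threads.append((int(match.group(1)), href))


def extract_thread_links(html):
    """Return every (thread_id, href) pair found in the given HTML text."""
    parser = ThreadLinkParser()
    parser.feed(html)
    return parser.threads


# Hypothetical sample markup, not copied from the real site.
sample = '<a href="t-203252.html">Some show</a> <a href="f-12.html">Audio</a>'
print(extract_thread_links(sample))
```

Category links such as `f-12.html` are skipped because they do not match the thread pattern; a second pattern would be needed to walk the list pages themselves.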

Sometimes my scraping code retrieves an index page; other times the request returns a status of 200 but an empty response.

Trying to retrieve a torrent thread, e.g. http://www.thetradersden.org/forums/.../t-203252.html, always gives a 200 status and an empty response.

If the request fails, I should get a non-200 status code, but I don't.

Could it be authentication? Caching? Is the TTD backend blocking scraping of torrent detail threads?
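
One way to narrow this down might be to log more than the status code for each request. The sketch below is only an assumption about what is worth checking: `diagnose` and `fetch_and_diagnose` are hypothetical helpers, the `my-archive-script/0.1` User-Agent string is a placeholder, and whether the site treats the default requests User-Agent differently is something to test, not a known fact.

```python
def diagnose(status_code, headers, body):
    """Return a short label describing one HTTP response.

    status_code is an int, headers is a mapping of header name to value,
    and body is the raw response as bytes.
    """
    if status_code != 200:
        return "http-error"
    if len(body) == 0:
        # A 200 with no bytes at all: the server answered but sent nothing.
        return "empty-200"
    content_type = headers.get("Content-Type", "")
    if "html" not in content_type.lower():
        return "not-html"
    return "ok"


def fetch_and_diagnose(url, user_agent="my-archive-script/0.1"):
    """Fetch url with requests and report what came back (makes a network call)."""
    import requests  # imported here so diagnose() can be used without the package

    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=30)
    # Print the pieces that usually explain an "empty" page.
    print(response.status_code, len(response.content), dict(response.headers))
    return diagnose(response.status_code, response.headers, response.content)


# Offline check of the classifier with made-up values.
print(diagnose(200, {"Content-Type": "text/html"}, b""))
```

Comparing the printed headers for a page that works against one that comes back empty should show whether cookies, redirects, or the content length differ between the two.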

TIA

I realise that web scraping these pages may not be the best way to get the info; a better way would be a dump/extract of the backend database(s). If the scraping does work, I would be mindful of NOT scraping 100,000+ pages quickly.
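
As a sketch of what "not quickly" could look like in code: a loop that pauses between requests. The function name `crawl_slowly` and the 5-second default are assumptions, not anything TTD has asked for. At that rate 100,000 pages would take about 139 hours, or a little under six days.

```python
import time


def crawl_slowly(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing delay_seconds between calls.

    fetch and sleep are passed in so the loop can be tried without a network.
    Returns a dict mapping each URL to whatever fetch returned for it.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # no pause before the very first request
        results[url] = fetch(url)
    return results


# Dry run with a fake fetch and a sleep that only records the pauses.
pauses = []
pages = crawl_slowly(["a", "b", "c"], fetch=str.upper, delay_seconds=2, sleep=pauses.append)
print(pages, pauses)
```

In real use `fetch` would be whatever function downloads one page, and `sleep` would be left at its default.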
  #2  
Old 2024-09-25, 11:26 AM
brentter brentter is offline
104.03 GB/290.97 GB/2.80
 
Join Date: Feb 2005
Re: Any good Python coders here? Especially with regard to web scraping?

What's happening is that your scraper is trying to grab the page data before it has loaded, because there's a JS file that loads before the content.
For example: http://www.thetradersden.org/forums/.../t-203252.html
If you open your browser's dev tools and go to the console, this is the error:

Quote:
Layout was forced before the page was fully loaded. If stylesheets are not yet loaded this may cause a flash of unstyled content. markup.js:250:53
Clicking on the markup.js file highlights:
Quote:
try {
  // If we didn't wait for the document to load, we want to force a layout update
  // to ensure the anonymous content will be rendered (see Bug 1580394).
  const forceSynchronousLayoutUpdate = !this.waitForDocumentToLoad;
  this._content = this.anonymousContentDocument.insertAnonymousContent(
    forceSynchronousLayoutUpdate
  );
The page hasn't been rendered. So there is a page there, hence the 200, just nothing on it yet. Adding a wait timer might fix it, or you could switch to a tool like Selenium, which works well with JavaScript and has a built-in option to wait automatically until a certain page element has loaded:

Quote:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Path to your WebDriver executable (e.g., chromedriver)
driver_path = '/path/to/chromedriver'

# Initialize WebDriver (Chrome in this case).
# Selenium 4 takes the driver path through a Service object;
# recent releases no longer accept webdriver.Chrome(executable_path=...).
driver = webdriver.Chrome(service=Service(executable_path=driver_path))

try:
    # Open the web page
    driver.get('http://example.com')

    # Wait until a specific element is present (adjust selector as needed)
    wait = WebDriverWait(driver, 10)  # Timeout after 10 seconds
    element = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'your-target-element-selector')))

    # Now that the target element is present, get the page source
    html_content = driver.page_source

    # You can now parse html_content with BeautifulSoup or any other parser as needed

finally:
    driver.quit()  # Close browser when done
This example uses Chrome btw (Selenium can drive Firefox and other browsers too), and it's obv just for a single page scrape.
https://selenium-python.readthedocs.io/waits.html
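
For the simpler wait-timer idea mentioned above, a retry loop is one possible shape. This is a sketch under assumptions: `fetch_with_retries` is a made-up helper, the attempt count and wait times are arbitrary, and if the server is sending empty bodies to scripts on purpose, waiting will not change the result.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, wait_seconds=10.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-empty body or attempts run out.

    fetch is any callable that returns the page body as str or bytes.
    Returns the body, or None if every attempt came back empty.
    """
    for attempt in range(1, attempts + 1):
        body = fetch(url)
        if body:
            return body
        if attempt < attempts:
            sleep(wait_seconds * attempt)  # wait a little longer each time
    return None


# Dry run: a fake fetch that is empty twice, then returns a page.
fake_bodies = iter(["", "", "<html>ok</html>"])
waits = []
print(fetch_with_retries(lambda url: next(fake_bodies), "http://example.com", sleep=waits.append))
print(waits)
```

With requests, `fetch` could be something like `lambda url: requests.get(url, timeout=30).text`.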
  #3  
Old 2024-09-25, 11:52 AM
mjcrossuk mjcrossuk is offline
424.65 GB/1.88 TB/4.54
 
Join Date: Mar 2007
Re: Any good Python coders here? Especially with regard to web scraping?

Thanks so much for your reply.

I will study what you've suggested, and see what happens.

Would you be willing to engage in a direct conversation via email?

all the best,

Mike
  #4  
Old 2024-09-25, 12:45 PM
brentter brentter is offline
104.03 GB/290.97 GB/2.80
 
Join Date: Feb 2005
Re: Any good Python coders here? Especially with regard to web scraping?

Yes. Actually, after I closed the browser (I'm at work atm) I realized that I could probably just scrape it all tonight for you.
I will DM you my email.

Tags
archive, python, ttd
