[TriEmbed] Pointers for attempting very basic Web Scraping?

Adam S. Crane ascrane at gmail.com
Mon Mar 10 13:59:50 CDT 2014


I didn’t explain my response much, but you really should look at BeautifulSoup. It abstracts away all the pain points of scraping pages, and Requests makes HTTP requests in Python easy, even for beginners. There is no need to write your own parser or pick through the page DOM by hand.
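
A minimal sketch of what that looks like (untested; it assumes the newsletter URL and the "print-icon" link class that show up later in this thread, and just prints the per-article links):

[code]

import requests
from bs4 import BeautifulSoup

base = "http://www.tmsacademy.org/"
url = base + "index.php?option=com_content&view=category&id=120&Itemid=553"

page = requests.get(url)                        # fetch the newsletter index page
soup = BeautifulSoup(page.text, "html.parser")  # parse it

# pull out the per-article "print" links
for li in soup.find_all("li", class_="print-icon"):
    a = li.find("a")
    if a and a.get("href"):
        print(base + a["href"].lstrip("/"))

[/code]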

--

–Adam



On Mar 10, 2014, at 2:55 PM, Geoffrey Tattersfield <gtatters at gmail.com> wrote:

> You have to do this on a raspi or beagleboard or something so this thread stays on topic. :)
> 
> I grabbed the print links for each article.  They point to clean html without all the web page window dressing.
> 
> Here's a quick and dirty script to get you started.  I just redirected console output to a file and got a document with broken image elements.  You could create a nested images directory and save them there, but if you want to email the doc to others, you might rather prepend the hostname to the src attribute of each image element.  The xpath would be something simple like "//img"; there's a sketch of that after the code below.
> 
> I'm not a Python programmer, so I bailed before I had to go messing with file IO.
> 
> I'm sure the Python indentation will be gone by the time it gets to your inbox.
> 
> -gt
> 
> [code]
> 
> import urllib3
> from lxml import etree
>  
> domain = "http://www.tmsacademy.org/"
> initurl = domain + "index.php?option=com_content&view=category&id=120&Itemid=553"
>  
> http = urllib3.PoolManager()  # connection pool
> code = http.request('GET', initurl)  # get the newsletter page
> html = etree.HTML(code.data)  # parse it into an element tree (lxml tidies up the markup)
> result = html.xpath('//li[@class="print-icon"]/a/@href')  # get the article links (the print ones)
>  
> print("<html><head></head><body>")
> for item in result:    # for each article url...
>    article = http.request('GET', domain + item)  # get the doc
>    article_html = etree.HTML(article.data)  # parse it too
>    article_result = article_html.xpath('//body/child::*')  # get the contents of just the body element
>    article_contents = map(etree.tostring, article_result)  # serialize each element back to a string
>    for thingie in article_contents:  # it's a list, so dump each element
>       print(thingie)
> 
> print("</body></html>")
> 
> [/code]
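> 
> And here's a rough sketch of the image fix mentioned above, in case it helps. It assumes every relative src just needs the hostname prepended, and it borrows the article URL from Rodney's excerpt further down:
> 
> [code]
> 
> import urllib3
> from lxml import etree
> 
> domain = "http://www.tmsacademy.org/"
> http = urllib3.PoolManager()
> 
> # grab one article and rewrite its relative image links so they still
> # resolve once the page is saved or emailed somewhere else
> article = http.request('GET', domain + 'index.php?option=com_content&view=article&id=528:1st-annual-tmsa-sudoku-tournament&catid=120&Itemid=553')
> article_html = etree.HTML(article.data)
> 
> for img in article_html.xpath('//img'):          # "//img", as above
>    src = img.get('src')
>    if src and not src.startswith('http'):        # leave absolute URLs alone
>       img.set('src', domain + src.lstrip('/'))   # prepend the hostname
> 
> print(etree.tostring(article_html))
> 
> [/code]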
> 
> 
> On Mon, Mar 10, 2014 at 12:14 PM, Rodney Radford <ncgadgetry at gmail.com> wrote:
> Since you are looking at a very specific case, you probably don't need a generic XML parser - often this can be done by looking at the HTML source for a bit, spotting a pattern, and writing code to find what you need.
> 
> Looking at the source for the page, all of the sub-articles have the format listed below, where you see a "<p class="readmore">" tag at each location.
> 
> You could simply look for the readmore tag, and then grab the very next href to get the list of URLs you need for your final document.
> 
> The trick then becomes how to automate pulling down each of those pages and putting them into a document.
> 
> This is an excerpt from the page (I hope all these HTML tags are not going to get eaten and turned into garbage when this email goes out):
> 
> <p class="readmore">
> <a href="/index.php?option=com_content&view=article&id=528:1st-annual-tmsa-sudoku-tournament&catid=120&Itemid=553">
> Read more: 1st Annual TMSA Sudoku Tournament</a>
> </p>
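> 
> A rough sketch of that, in case it helps. It is just a regex over the raw HTML, so it assumes the markup stays exactly like the excerpt above, and the captured hrefs may come back with &amp; entities that need unescaping:
> 
> [code]
> 
> import re
> import urllib3
> 
> base = "http://www.tmsacademy.org"
> url = base + "/index.php?option=com_content&view=category&id=120&Itemid=553"
> page = urllib3.PoolManager().request('GET', url).data.decode('utf-8', 'ignore')
> 
> # look for each readmore paragraph and grab the very next href
> for link in re.findall(r'<p class="readmore">\s*<a href="([^"]+)"', page):
>     print(base + link)
> 
> [/code]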
> 
> 
> On Mon, Mar 10, 2014 at 11:31 AM, Shane Trent <shanedtrent at gmail.com> wrote:
> I am looking for pointers for attempting what I hope will be a very simple web scraping project. Our elementary school has a newsletter that puts every article under a separate link, requiring 15 clicks to read the whole newsletter. Not a great UI experience, in my opinion. Here is an example newsletter.
> 
> http://www.tmsacademy.org/index.php?option=com_content&view=category&id=120&Itemid=553
> 
> I would like to find a way to get all of the newsletter content on a single page (and learn a few "teach a man to fish" skills). Pulling it into a local document would be acceptable, but I would like to be able to share the single-page view with other parents at the school. I am not sure of the best way to do this either!
> 
> A casual web search points to Python and a few extensions, but most references I found target data harvesting. I wonder if there is a simpler approach.
> 
> I suspect Carl can point me in the right direction but wanted to shout-out to the list on the chance that someone has already done something similar. 
> 
> Thanks,
> Shane
> 
> 
> 
