[TriEmbed] Pointers for attempting very basic Web Scraping?

Geoffrey Tattersfield gtatters at gmail.com
Mon Mar 10 13:55:07 CDT 2014


You have to do this on a raspi or beagleboard or something so this thread
stays on topic. :)

I grabbed the print links for each article.  They point to clean HTML
without all the web page window dressing.

Here's a quick-and-dirty script to get you started (below).  I just redirected
the console output to a file and got a document with broken image elements.
You could create a nested images directory and save the images there, but if
you want to email the doc to others, you might rather prepend the hostname to
the src attribute of each image element.  The XPath would be something simple
like "//img".

I'm not a Python programmer, so I bailed before I had to go messing with
file I/O.
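If someone does want the script to write the file itself, here's a minimal,
untested sketch of what that part would look like (the filename is just an
example):

[code]

# Untested sketch of the file I/O I skipped: write straight to newsletter.html
# instead of redirecting the console output.
with open('newsletter.html', 'w') as out:
    out.write("<html><head></head><body>")
    # ... out.write() each article's contents here, as in the loop below ...
    out.write("</body></html>")

[/code]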

I'm sure the Python indentation will be gone by the time it gets to your
inbox.

-gt

[code]

import urllib3
from lxml import etree

domain = "http://www.tmsacademy.org/"
initurl = domain + "index.php?option=com_content&view=category&id=120&Itemid=553"

http = urllib3.PoolManager()                   # connection pool
code = http.request('GET', initurl)            # get the newsletter page
html = etree.HTML(code.data)                   # parse it (I think lxml cleans it up too)
result = html.xpath('//li[@class="print-icon"]/a/@href')   # get the article links (the print ones)

print "<html><head></head><body>"
for item in result:                            # for each article url...
    article = http.request('GET', domain + item)             # get the doc
    article_html = etree.HTML(article.data)                   # parse it
    article_result = article_html.xpath('//body/child::*')    # get just the body element's children
    article_contents = map(etree.tostring, article_result)    # serialize each back to a string
    for thingie in article_contents:           # it's a list, so dump each element of the list
        print thingie

print "</body></html>"

[/code]
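To use it, redirect the console output to a file and open that in a browser,
e.g. (assuming you saved the script as scrape_newsletter.py -- the names are
just examples):

python scrape_newsletter.py > newsletter.html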


On Mon, Mar 10, 2014 at 12:14 PM, Rodney Radford <ncgadgetry at gmail.com>wrote:

> Since you are looking for a very specific case, you probably don't need a
> generic XML parser - oftentimes this can be done with a little bit of
> looking at the HTML code, looking for a pattern, and writing code to find
> what you need.
>
> In looking at the source for the page, all the sub-articles have the
> format listed below, where you see a "<p class="readmore">" tag at each
> location.
>
> You could simply look for the readmore tag, and then grab the very next
> href to get the list of URLs you need for your final document.
>
> The trick then becomes how to automate pulling down each of those pages
> and putting them into a document.
>
> This is an excerpt from the page (I hope all these html tags are not going
> to get eaten and turned into garbage in this email send):
>
> <p class="readmore"> <a href="/index.php?option=com_content&view=article&id=528:1st-annual-tmsa-sudoku-tournament&catid=120&Itemid=553"> Read more: 1st Annual TMSA Sudoku Tournament</a> </p>
>
>
> On Mon, Mar 10, 2014 at 11:31 AM, Shane Trent <shanedtrent at gmail.com>wrote:
>
>> I am looking for pointers on attempting what I hope will be a very simple
>> web scraping project. Our elementary school has a newsletter that puts every
>> article under a separate link, requiring 15 clicks to read the whole
>> newsletter. Not a great UI experience in my opinion. Here is an example
>> newsletter.
>>
>>
>> http://www.tmsacademy.org/index.php?option=com_content&view=category&id=120&Itemid=553
>>
>> I would like to find a way to get all of the newsletter content on a
>> single page (and learn a few "teach a man to fish" skills). Pulling it into
>> a local document would be acceptable, but I would like to be able to share
>> the single-page view with other parents at the school. I am not sure of the
>> best way to do that either!
>>
>> A casual web-search points to Python and a few extensions but most
>> references I found target data harvesting. I wonder if there is a simpler
>> approach.
>>
>> I suspect Carl can point me in the right direction but wanted to
>> shout-out to the list on the chance that someone has already done something
>> similar.
>>
>> Thanks,
>> Shane
>>
>>
>>
>

