[TriEmbed] Pointers for attempting very basic Web Scraping?

Rodney Radford ncgadgetry at gmail.com
Mon Mar 10 11:14:58 CDT 2014


Since you are looking at a very specific case, you probably don't need a
generic XML parser. Often this can be done by looking at the HTML source,
finding a pattern, and writing code to pick out what you need.

Looking at the source for the page, the sub-articles all follow the
format listed below, with a <p class="readmore"> tag at each
location.

You could simply look for the readmore tag, and then grab the very next
href to get the list of URLs you need for your final document.
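
As a rough sketch of that idea in Python (the language your search turned
up), something like the following might do. It assumes the links are
relative to http://www.tmsacademy.org/ and that every readmore paragraph is
followed by a double-quoted href; the function and variable names are just
made up for illustration.

```python
import re
from html import unescape
from urllib.parse import urljoin

# Assumed base URL, taken from the newsletter link in the original question.
BASE_URL = "http://www.tmsacademy.org/"

# Find a readmore paragraph, then capture the first href that follows it.
# Whitespace (including line breaks) around the URL inside the quotes is skipped.
READMORE_RE = re.compile(
    r'<p\s+class="readmore">.*?href\s*=\s*"\s*([^"]+?)\s*"',
    re.IGNORECASE | re.DOTALL,
)

def find_article_urls(page_html, base_url=BASE_URL):
    """Return absolute article URLs in page order, without duplicates."""
    urls = []
    for match in READMORE_RE.finditer(page_html):
        href = unescape(match.group(1))   # turn any &amp; back into &
        url = urljoin(base_url, href)     # make the relative link absolute
        if url not in urls:
            urls.append(url)
    return urls

# A snippet in the same format as the excerpt from the page.
sample = (
    '<p class="readmore"> <a href="/index.php?option=com_content&view=article'
    '&id=528:1st-annual-tmsa-sudoku-tournament&catid=120&Itemid=553">'
    ' Read more: 1st Annual TMSA Sudoku Tournament</a> </p>'
)

for url in find_article_urls(sample):
    print(url)
```

A regular expression is brittle if the site changes its markup, but for one
known page it keeps things simple.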

The trick then becomes how to automate pulling down each of those pages and
putting them into a document.
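
One possible way to automate that step, again only as a sketch: fetch each
URL with the standard library and paste the pages one after another into a
single HTML file. The names fetch_html and combine_pages are hypothetical,
and this version keeps each page whole (menus and all); trimming down to
just the article body would depend on markup I have not looked at.

```python
import html
import urllib.request

def fetch_html(url, timeout=30):
    """Download one page and return its HTML as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")

def combine_pages(urls, fetch=fetch_html):
    """Fetch each URL in order and join the results into one HTML document."""
    parts = ['<html><head><meta charset="utf-8">'
             '<title>Newsletter</title></head><body>']
    for url in urls:
        try:
            body = fetch(url)
        except OSError as err:
            # URLError and socket timeouts are both OSError subclasses.
            body = "<p>Could not fetch this article: {}</p>".format(
                html.escape(str(err)))
        # Label each section with a link back to where it came from.
        parts.append('<hr><p>Source: <a href="{0}">{0}</a></p>'.format(
            html.escape(url)))
        parts.append(body)
    parts.append("</body></html>")
    return "\n".join(parts)
```

Given a list of article URLs, the result can be saved with something like
open("newsletter.html", "w", encoding="utf-8").write(combine_pages(urls))
and then opened in a browser or shared as a single file.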

This is an excerpt from the page (I hope all these html tags are not going
to get eaten and turned into garbage in this email send):

<p class="readmore"> <a href="/index.php?option=com_content&view=article&id=528:1st-annual-tmsa-sudoku-tournament&catid=120&Itemid=553"> Read more: 1st Annual TMSA Sudoku Tournament</a> </p>


On Mon, Mar 10, 2014 at 11:31 AM, Shane Trent <shanedtrent at gmail.com> wrote:

> I am looking for pointers at attempting what I hope will be a very simple
> web scraping project. Our elementary school has a newsletter that has every
> article under a separate link, requiring 15 clicks to read the whole
> newsletter. Not a great UI experience in my opinion. Here is an example
> newsletter.
>
>
> http://www.tmsacademy.org/index.php?option=com_content&view=category&id=120&Itemid=553
>
> I would like to find a way to get all of the newsletter content on a
> single page (and learn a few "teach a man to fish" skills). Pulling into a
> local document would be acceptable but I would like to be able to share the
> single page view with other parents at the school. I am not sure of the
> best way to do this either!
>
> A casual web-search points to Python and a few extensions but most
> references I found target data harvesting. I wonder if there is a simpler
> approach.
>
> I suspect Carl can point me in the right direction but wanted to shout-out
> to the list on the chance that someone has already done something similar.
>
> Thanks,
> Shane