[TriEmbed] Pointers for attempting very basic Web Scraping?

Mon Mar 10 10:55:56 CDT 2014

Shane,

I read your email with a grin on my face. Scraping a web page is not a
trivial task, and any code you write will break as soon as they make a
slight change to how the page is displayed or formatted. With that said,
there are two methods commonly used to do this.

1) Write a set of regexes (regular expressions) that match the parts of
the page you want. This will execute quickly, but will take more time to
write.

2) Use an XML parser. As more and more HTML becomes XML compliant, XML
parsers will work, but they run much slower than regexes.

Both of these approaches will break as soon as the HTML varies from what
you expect, turning the result into gobbledygook.
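
To make the two approaches concrete, here is a minimal sketch that pulls
link targets out of a small piece of HTML both ways. The sample markup and
the function names are made up for illustration, and Python's built-in
html.parser stands in for the XML parser because it tolerates HTML that is
not strictly XML. The second link uses single quotes and a different
attribute order, which is the kind of small variation that trips up a
regex written for the first.

```python
import re
from html.parser import HTMLParser

# Made-up sample markup. The second link has an extra attribute and
# single quotes, a small variation from the first.
SAMPLE = """
<ul>
  <li><a href="/news/1">From the Principal</a></li>
  <li><a class='title' href='/news/2'>Science Fair</a></li>
</ul>
"""

# Approach 1: a regex that assumes href is the only attribute and is
# written with double quotes.
LINK_RE = re.compile(r'<a href="([^"]+)">')


def links_by_regex(html):
    """Return every href the regex happens to match."""
    return LINK_RE.findall(html)


# Approach 2: let a parser deal with quoting and attribute order.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def links_by_parser(html):
    """Return the href of every <a> tag the parser sees."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    print("regex: ", links_by_regex(SAMPLE))
    print("parser:", links_by_parser(SAMPLE))
```

The regex misses the second link while the parser finds both. The parser
is still only as good as your assumption that the things you want are
plain <a> tags.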

Here is a link to a Wikipedia page that may be useful:
http://en.wikipedia.org/wiki/Data_scraping

As the Wikipedia page indicates, there are APIs that can be found to
help with this.

Here is a link to a Python method of doing it:
http://www.pythonforbeginners.com/python-on-the-web/web-scraping-with-beautifulsoup/
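
For the newsletter itself, a rough sketch of the idea might look like the
following. The linked tutorial uses BeautifulSoup; this version sticks to
the Python standard library so it runs without installing anything. The
names (ArticleLinkCollector, build_single_page, fetch) are hypothetical,
and the "view=article" marker is only a guess at how the article links on
that site look, so check the real page source first.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class ArticleLinkCollector(HTMLParser):
    """Collect absolute URLs of <a> tags whose href contains a marker."""

    def __init__(self, base_url, marker):
        super().__init__()
        self.base_url = base_url
        self.marker = marker
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            url = urljoin(self.base_url, href)
            if url not in self.links:  # skip duplicate links
                self.links.append(url)


def fetch(url):
    """Download a page and return it as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def build_single_page(index_url, marker, fetcher=fetch):
    """Fetch the index page, follow each matching link, join the results."""
    collector = ArticleLinkCollector(index_url, marker)
    collector.feed(fetcher(index_url))
    sections = [fetcher(url) for url in collector.links]
    return "\n<hr>\n".join(sections)


# Example use (the marker is an assumption, not checked against the site):
# html = build_single_page(newsletter_url, "view=article")
# with open("newsletter.html", "w", encoding="utf-8") as out:
#     out.write(html)
```

This pastes each page in whole, navigation and all; trimming each one down
to just the article body would need another pass, and that is exactly the
part that breaks when the site layout changes.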

Carl

On Mon, Mar 10, 2014 at 11:31 AM, Shane Trent <shanedtrent at gmail.com> wrote:

> I am looking for pointers on attempting what I hope will be a very simple
> web scraping project. Our elementary school has a newsletter that has every
> article under a separate link, requiring 15 clicks to read the whole
> newsletter. Not a great UI experience in my opinion. Here is an example
> newsletter.
>
>
> http://www.tmsacademy.org/index.php?option=com_content&view=category&id=120&Itemid=553
>
> I would like to find a way to get all of the newsletter content on a
> single page (and learn a few "teach a man to fish" skills). Pulling into a
> local document would be acceptable but I would like to be able to share the
> single page view with other parents at the school. I am not sure of the
> best way to do this either!
>
> A casual web-search points to Python and a few extensions but most
> references I found target data harvesting. I wonder if there is a simpler
> approach.
>
> I suspect Carl can point me in the right direction but wanted to shout-out
> to the list on the chance that someone has already done something similar.
>
> Thanks,
> Shane
>
>
>
> _______________________________________________
> Triangle, NC Embedded Computing mailing list
> TriEmbed at triembed.org
> http://mail.triembed.org/mailman/listinfo/triembed_triembed.org
> TriEmbed web site: http://TriEmbed.org
>
>

-- 
-------------------------------------------------------------------------------
Carl J. Nobile (Software Engineer)
carl.nobile at gmail.com
-------------------------------------------------------------------------------