<div dir="ltr">Since you are looking for a very specific case, you probably don't need a generic XML parser. Often this can be done by looking at the HTML source, finding a pattern, and writing code to pull out what you need.<div>
<br></div><div>Looking at the source for the page, the sub-articles all follow the format listed below, with a "&lt;p class="readmore"&gt;" tag at each location.<div><br></div><div>You could simply look for the readmore tag, and then grab the very next href to get the list of URLs you need for your final document.</div>
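<div><br></div><div>As a rough sketch of that idea in Python, using only the standard library: the regular expression below finds each readmore tag and captures the first href that follows it. The base URL is an assumption taken from the newsletter link in this thread, and the function name find_article_urls is just an illustrative choice. It is pattern matching rather than real parsing, so it would need adjusting if the page markup changes.</div>

```python
import re
from html import unescape
from urllib.parse import urljoin

# Assumption: the site root, taken from the newsletter link in this thread.
BASE_URL = "http://www.tmsacademy.org/"

# Match a <p class="readmore"> tag, then lazily skip ahead to the first href="...".
READMORE_HREF = re.compile(
    r'<p\s+class="readmore"\s*>.*?href="([^"]+)"',
    re.IGNORECASE | re.DOTALL,
)


def find_article_urls(page_html, base_url=BASE_URL):
    """Return absolute URLs for the link that follows each readmore tag."""
    urls = []
    for match in READMORE_HREF.finditer(page_html):
        href = unescape(match.group(1))       # turn &amp; back into &
        urls.append(urljoin(base_url, href))  # resolve relative links
    return urls


# Sample modeled on the excerpt from the newsletter page.
sample = """
<p class="readmore">
  <a href="/index.php?option=com_content&amp;view=article&amp;id=528:1st-annual-tmsa-sudoku-tournament&amp;catid=120&amp;Itemid=553">
    Read more: 1st Annual TMSA Sudoku Tournament</a>
</p>
"""

if __name__ == "__main__":
    for url in find_article_urls(sample):
        print(url)
```

<div>Feeding it the full category page source instead of the sample should give one URL per article, in page order. One caveat: if a readmore paragraph ever had no link inside it, the lazy match would run on to the next href on the page.</div>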
<div><br></div><div>The trick then becomes how to automate pulling down each of those pages and putting them into a document.<br><div><br></div><div>This is an excerpt from the page (I hope all these html tags are not going to get eaten and turned into garbage in this email send):</div>
<div><br></div><div><table><tbody><tr><td>&lt;p class="readmore"&gt;</td></tr><tr><td>&lt;a href="<a target="_blank" href="http://www.tmsacademy.org/index.php?option=com_content&view=article&id=528:1st-annual-tmsa-sudoku-tournament&catid=120&Itemid=553">/index.php?option=com_content&view=article&id=528:1st-annual-tmsa-sudoku-tournament&catid=120&Itemid=553</a>"&gt;</td>
</tr><tr><td>Read more: 1st Annual TMSA Sudoku Tournament&lt;/a&gt;</td></tr><tr><td>&lt;/p&gt;</td>
</tr></tbody></table></div></div></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Mar 10, 2014 at 11:31 AM, Shane Trent <span dir="ltr"><<a href="mailto:shanedtrent@gmail.com" target="_blank">shanedtrent@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">I am looking for pointers on attempting what I hope will be a very simple web scraping project. Our elementary school has a newsletter that has every article under a separate link, requiring 15 clicks to read the whole newsletter. Not a great UI experience in my opinion. Here is an example newsletter.<div>
<br></div><div><a href="http://www.tmsacademy.org/index.php?option=com_content&view=category&id=120&Itemid=553" target="_blank">http://www.tmsacademy.org/index.php?option=com_content&view=category&id=120&Itemid=553</a><br>
</div><div><br></div><div>I would like to find a way to get all of the newsletter content on a single page (and learn a few "teach a man to fish" skills). Pulling into a local document would be acceptable but I would like to be able to share the single page view with other parents at the school. I am not sure of the best way to do this either!</div>
<div><br></div><div>A casual web-search points to Python and a few extensions but most references I found target data harvesting. I wonder if there is a simpler approach. </div><div><br></div><div>I suspect Carl can point me in the right direction but wanted to shout-out to the list on the chance that someone has already done something similar. </div>
<div><br></div><div>Thanks,</div><div>Shane</div><div><br></div><div><br></div></div>
<br>_______________________________________________<br>
Triangle, NC Embedded Computing mailing list<br>
<a href="mailto:TriEmbed@triembed.org">TriEmbed@triembed.org</a><br>
<a href="http://mail.triembed.org/mailman/listinfo/triembed_triembed.org" target="_blank">http://mail.triembed.org/mailman/listinfo/triembed_triembed.org</a><br>
TriEmbed web site: <a href="http://TriEmbed.org" target="_blank">http://TriEmbed.org</a><br>
<br></blockquote></div><br></div>