
Thread: Parsing HTML with C++

  1. #1
    Join Date
    Dec 2008
    Beans
    77
    Distro
    Ubuntu 8.10 Intrepid Ibex

    Parsing HTML with C++

    I'm using libcurl to fetch an HTML dump of a webpage. I then want to go through the HTML and find the lists, so I'm only looking for stuff between <li> and <\li>.

    I thought it would be easier to do this myself than to download a library for it (libxml is for this stuff, right?). I'm basically too lazy to download and link in another library. libcurl gave me a few headaches, but I got them sorted.

    The problem is how do I deal with the escape character? '\' is just ignored in my string. Also, weirdly, my 'g's turn out as smiley faces; I have no idea why.

    Can I do this myself or should I find an html parsing lib?
    Code:
    if (REAL_LIFE){
       PANIC();
    }

  2. #2
    Join Date
    Aug 2006
    Location
    60°27'48"N 24°48'18"E
    Beans
    3,458

    Re: Parsing HTML with C++

    How about just regular expressions?
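
    For what it's worth, here is a minimal sketch of the regex route using std::regex from C++11. The helper name extract_list_items is made up for illustration, and it assumes the list items are not nested:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from any library): returns the text between each
// <li ...> and the nearest following </li>. Assumes items are not nested.
std::vector<std::string> extract_list_items(const std::string& html) {
    // [^>]* allows attributes on the opening tag; [\s\S]*? is a non-greedy
    // match that also spans newlines ('.' stops at line breaks in the
    // default ECMAScript grammar).
    static const std::regex item_re(R"(<li\b[^>]*>([\s\S]*?)</li\s*>)",
                                    std::regex::icase);
    std::vector<std::string> items;
    for (std::sregex_iterator it(html.begin(), html.end(), item_re), end;
         it != end; ++it) {
        items.push_back((*it)[1].str());  // capture group 1 = item contents
    }
    return items;
}
```

    Calling extract_list_items("<ul><li>one</li><li>two</li></ul>") should give a vector holding "one" and "two". The non-greedy quantifier is what keeps the match from running on to the last </li> in the page.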
    LambdaGrok. | #ubuntu-programming on FreeNode

  3. #3
    Join Date
    Nov 2009
    Beans
    1,081

    Re: Parsing HTML with C++

    libxml's XML parser isn't recommended for this; HTML is a looser standard than XML, structurally speaking, so real-world pages often aren't well-formed XML.

    Regular expressions will work, although you may have to consider whether you'll need to handle lists within lists and other more complex cases.
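
    To illustrate the lists-within-lists case: a simple non-greedy pattern stops at the first </li> it sees, so an outer item gets cut off at the inner item's closing tag. One way around that is to count nesting depth by hand. A rough sketch follows; top_level_items is a hypothetical name, and it assumes lowercase tags without attributes and that every <li> is closed:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the contents of each outermost <li>...</li>,
// leaving any nested lists inside the returned text. Assumes lowercase tags
// with no attributes and that every <li> has a matching </li>.
std::vector<std::string> top_level_items(const std::string& html) {
    const std::string open = "<li>";
    const std::string close = "</li>";
    std::vector<std::string> items;
    std::size_t pos = 0, start = 0;
    int depth = 0;
    while (pos < html.size()) {
        if (html.compare(pos, open.size(), open) == 0) {
            pos += open.size();
            if (depth++ == 0) start = pos;   // entering an outermost item
        } else if (html.compare(pos, close.size(), close) == 0) {
            if (depth > 0 && --depth == 0)   // leaving an outermost item
                items.push_back(html.substr(start, pos - start));
            pos += close.size();
        } else {
            ++pos;                           // ordinary character, skip it
        }
    }
    return items;
}
```

    For "<li>a<ul><li>b</li></ul></li><li>c</li>" this returns two items, "a<ul><li>b</li></ul>" and "c". Running the function again on the first item would pull out the inner "b".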

  4. #4
    Join Date
    Feb 2009
    Location
    Dallas
    Beans
    1,494

    Re: Parsing HTML with C++

    I realize the post is asking about C++, but maybe there's a better tool for parsing this kind of information (e.g. Python or Perl). It's worth noting I'm not very familiar with C++ but I am with Python, so I'm a little biased. Cheers!

  5. #5
    Join Date
    Feb 2010
    Location
    Silicon Valley
    Beans
    1,898
    Distro
    Xubuntu 12.04 Precise Pangolin

    Re: Parsing HTML with C++

    Quote: Originally posted by Drone022
    so I'm only looking for stuff between <li> and <\li>.

    The problem is how do I deal with the escape character? '\' is just ignored in my string.
    Is there some forward/backward slash confusion here? HTML uses a forward slash to terminate tags, not a backslash; <li> ... </li>.
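
    To make that concrete: in a C++ string literal a backslash starts an escape sequence, and "\l" is not a standard one, so compilers typically warn about it (GCC, for instance, warns and keeps just the 'l'). That would fit the "ignored" backslash, though it's a guess without seeing the code. With the forward slash no escaping is needed at all. A minimal sketch using std::string::find, with a made-up helper name and assuming plain <li> tags without attributes:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: collects the text between each "<li>" and the next
// "</li>". Assumes plain lowercase tags with no attributes and no nesting.
std::vector<std::string> items_between_tags(const std::string& html) {
    const std::string open = "<li>";
    const std::string close = "</li>";  // forward slash: nothing to escape
    std::vector<std::string> items;
    std::size_t pos = 0;
    while ((pos = html.find(open, pos)) != std::string::npos) {
        const std::size_t start = pos + open.size();
        const std::size_t stop = html.find(close, start);
        if (stop == std::string::npos) break;  // unterminated item: stop
        items.push_back(html.substr(start, stop - start));
        pos = stop + close.size();
    }
    return items;
}
```

    Given "<ul><li>tea</li><li>milk</li></ul>" this returns "tea" and "milk".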

    Regarding the parsing: I do this sort of thing in Perl with the HTML::TreeBuilder module. Also http fetching is easy with the LWP::UserAgent module.

  6. #6
    Join Date
    Feb 2007
    Location
    Tuxland
    Beans
    Hidden!
    Distro
    Ubuntu Development Release

    Re: Parsing HTML with C++

    Qt includes a pretty powerful XML parser (QDomDocument) that also handles HTML as long as it is well-formed, e.g. XHTML:

    Code:
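    // Needs #include <QDomDocument> and the Qt XML module.
    QDomDocument d;
    // someFile: a QIODevice* (e.g. an opened QFile) or the markup as a
    // QString/QByteArray; setContent() expects well-formed XML.
    d.setContent(someFile);
    QDomNodeList e = d.elementsByTagName("li");
    for (int i = 0; i < e.length(); i++) {
       QString text = e.item(i).toElement().text();  // text inside this <li>
    }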
    Proud GNU/Linux zealot and lover of penguins
    "Value your freedom or you will lose it, teaches history." --Richard Stallman

  7. #7
    Join Date
    Apr 2008
    Beans
    507

    Re: Parsing HTML with C++

    If you must parse foreign HTML (by foreign I mean that you did not compose it), then consider first tidying it up with HTMLTidy http://tidy.sourceforge.net/.

    As someone else pointed out, doing this in Perl or Python is trivial and either language would be a better fit for your problem.

    In Perl you can simply grab everything between the '<i>' and '</i>' tags by evaluating a global match in list context, e.g.

    Code:
    use strict;
    use warnings;
    my $s = '<i>hello</i> there <i>world</i>';
    my @p = ($s =~ m/<i>(.*?)<\/i>/g);
    print join '|', @p;
    Yields:-

    Code:
    hello|world
    If you go with Python, have a look at BeautifulSoup http://www.crummy.com/software/BeautifulSoup/.
    Go you good thing!
