Commit 24080dd

jhycode4craft
authored and committed
Site
Work Suds: inherit meta from parent dir ignore Merge Updated readme Updated readme to be more useful. Project init. readme update Initial implementation of TokenStream, to pull tokens from HTML for parsing. Initial implementation of TokenStream. Initial implementation of Tag. Draft implementation of Parser. Stack, not Queue, so use getLast() Ignore tag "html" in parse tree as created by doc Initial implementation of AttributeParser. URL in readme Updated TokenStream to deal with < or > within attributes. Implemented comment handling. Not very happy with token and parser. Would like to reimplement with some kind of expect / consumeTo behavior. Getting a bit too hacky. Element child(int) method, attr(key) method Set parent on add Child Complain if moving elements in tree (to implement) Implemented getElementsByTagName Test zero return Implemented Element.getElementById Implemented text() Implemented sibling methods Renamed JSoup -> Jsoup Reimplemented parser. Pulled string matching stuff out of Parser ad into TokenQueue, for reuse in selector parser. Added som texts and corrected behaviour of parser. Implemented: * Element.getElementsByClass * Element.getElementsWithAttribute * Element.getElementsWithAttributeValue * Element.className * Element.hasClass * Elmenet.classNames Handle HTML encoded (escaped) data in text nodes and attributes. Nodes get to html() method. All nodes have outerHtml method, elements have (inner) HTML. Initial, partial implementation of Selector. Implemented select(query, query, query) or group selector. Implemented ElementList.select() Renamed ElementList to Elements Renamed ElementList to Elements Implemented select("ancestor descendant"). Deeper descendant test. Support for data only tags (script, textarea etc) Removed scratch test. Neatened Parser Implicit parent creation for elements, more general document structure creation. 
Introduced a DataNode, and SCRIPT, TEXTAREA, TITLE etc goes into DataNodes, so that Element.text() does not get clouded with script or style inners. Fixed head canContain todo note Have Element DOM methods return Elements instead of List<Element>, to give ready access to .select(query) Implemented Element.text(string) method, to set the text of an element, and clear out existing html. Selector: added * (all elements) and parent > child. Implement baseURIs for all Nodes, and absolute URL attribute getter. Initial test suite from real world html Updated parser to support namespaced attributes (i.e xml:lang=en). Implemented Elements first(), last(), attr(), hasAttr() methods. Git ignore Eclipse Explicit empty String[], to remove warning on null as vararg. Fixed selector so that "parent child" doesn't contain parent, and "element.class" is an AND on parent element, and doesn't match .class in child element (i.e. I re-read the CSS3 selector doc, which is clearer than jquery doc). Attribute helpers in Element and Elements. Test confirms selectors are case insensitive. Parser updated to handle CDATA, and rogue < in text nodes. Output HTML correctly for <! ... > and <? ... ?> xml tags. Implemented advanced attribute selectors (!= ^= $= *=) and element methods. Extend selector test for multi classes and attributes. E.g. div.foo.bar[title=qux][name=bar] matches: <div class="foo bar" title="qux" name="bar"> Don't register unknown tags. Unknown tags created with Tags.valueOf(String) were being registered, so that further .valueOf()s would return an == tag. But that's a potential memory leak, particularly with malicious input HTML, and serves no real purpose (as .equals() still works), so that functionality has been removed. Tests parser for unknown tags. Handle empty (self closing) blocks. <div/><div></div> was parsing as <div><div></div></div>. No longer. Implemented Elements methods text(), eq(), and is(). Removed unused Select#groupOp noop. 
Renamed Elements#select to Elements#filter. Renamed so the API is better self describing. Initial bits of HTML cleaner. Initial implementation of Whitelist cleaner config. Initial Cleaner implementation. Added integration test for www.news.com.au Fixed select parser for childs. Selector cleanup Google search result parse test. Fixed selector for multi descenders. Google parse test Implemented "abs:" virtual attribute prefix for absolute URLs. Modified Node#absURL to return only absolute URLs, or "". Previously if there was no baseURI, it would return a relative URL from the attribute value, which is unreliable. Also documented method. Modified parser to add elements found past </body> into body. Test for binary content. Parse Test for Yahoo Japan. Extra Cleaner tests. Adds Parser.parseBodyFragment method. Makes static members final. Updates pom and adds license Updates description Encourage Maven to copy using UTF-8. Otherwise it seems to copy java source (but not other resources?) using local encoding, which means that some of the char encoding tests break. Maven plugin to build javadoc Rename Element.addChild to appendChild. Also implement Element.addElement(String tagName) Implemented Element.append(html) Element documentation. Doc tidy. Benchmark script. Create tests with Thrash for sandbox benchmarking. Removed StartTag Moved Evaluator to nodes from select, to close down public methods. Doc Doc Copyright date Knock access down Version is 0.1 until first beta release Removed Element children list, and create on fly from nodes. Attributes values back to Attribute Attributes format Implemented Element prepend methods Simplified Document bean methods for consistency. Linked Document title methods with HTML structure Fixed parse of unclosed <dl><dt>Foo<dd>Bar</dl> runs. Force compile source from UTF-8. Otherwise it defaults to something odd. And the property already set (project.build.sourceEncoding) doesn't seem to be used by the mvn compiler. Which is... 
also odd. Or more likely, I don't 100% understand it. POM update to build source and javadoc jars Implemented Jsoup.parse(File), and javadocced. Flipped integration test to use Parse(File) No default constructor for Jsoup Dropped "get" Elements Don't escape text in data nodes, to preserve " chars Simplifed Tag creator. Text normalisation. Use string builders for HTML creation. Whitespace tests Preserve whitespace in children of <pre> Implemented parse from URL. Doc typo Doc Parse <frameset> outside of <body> Updated URL integration test Javadoc Parser javadoc Cleaner javadoc Selector documentation Document.createElement(String) Support for inline font tag pom update [maven-release-plugin] prepare release jsoup-0.1.1 [maven-release-plugin] prepare for next development iteration Readme update Parse unknown tags as inline elements that can contain blocks. Ensures <p><custom>Test</custom></p> parses like that, and not <p></p><custom>Test</custom>. Closes #1 Changelog Release 0.1.2 prep. [maven-release-plugin] prepare release jsoup-0.1.2 [maven-release-plugin] prepare for next development iteration Fix absolute URL resolution issue when a base tag has no href. Example program: list links Implemented Element#wrap and #Elements#wrap Also protected Node.replaceChild, removeChild, addChild. New: E + F adjacent sibling selector, E ~ F preceding sibling. Corrected change note Maven Sonatype setup [maven-release-plugin] prepare release jsoup-0.2.1 [maven-release-plugin] prepare for next development iteration [maven-release-plugin] prepare release jsoup-0.2.1 Release prep [maven-release-plugin] prepare release jsoup-0.2.1a [maven-release-plugin] prepare for next development iteration Sonatype release machinations [maven-release-plugin] prepare release jsoup-0.2.1b [maven-release-plugin] prepare for next development iteration Add addClass, removeClass, toggleClass, hasClass to Element and Elements. Closes #2 Improved document normalisation. 
hasText Improved HTML output (pretty-print) Changelog for release prep [maven-release-plugin] prepare release jsoup-0.2.2 [maven-release-plugin] prepare for next development iteration Assert attribute values are not null, not not empty. Closes #7. Changed Elements#attr(key) to scan all elements for attribute. Closes #4. Implemented Elements html(), html(string), append, and prepend. Closes #5. Changelog Normalise head by prepending, not appending. Closes #9. Cleaner.isValid() method. Closes #6. IsValid test for OK attribute Test self is not descender Deploy prep Release prep [maven-release-plugin] prepare release jsoup-0.3.1 [maven-release-plugin] prepare for next development iteration Allow - and _ in CSS ID selectors. Closes #10. Changelog Changelog Resolve relative links when cleaning. Closes #12. Allow combinators at start of selector query Closes #13 Added val() and val(string) to Element and Elements. Treat contents of textarea as text, not data. Closes #14 Added Node#remove and Node#replaceWith. Closes #19 Throw exception if trying to parse non-text content Closes #17 Added TextNode#text and TextNode#text(String) Closes #18 Added selector support for :eq, :lt, and gt Closes #16 String.isEmpty() and LinkedList.peekFirst() is not part of the Java 5.0 API. Updated ignore list Preparing 1.1.1 release [maven-release-plugin] prepare release jsoup-1.1.1 [maven-release-plugin] prepare for next development iteration Change notes Fixed test package Fix an issue where text order was incorrect when parsing pre-document HTML. Fixes #23 Clean up the parse stack correctly when parsing data-nodes. Fixes #22. Fixed javadoc typo Added :has(selector) pseudo-selector. Added Element#parents() and Elements#parents() methods. Fixes #20 Chanelog release date Improved implicit close tag heuristic detection when parsing malformed HTML. Fixes an issue where appending / prepending rows to a table (or to similar implicit element structures) would create a redundant wrapping elements. 
Fixes #21 Cleanup Element and Node add mechanism Added .before(html) and .after(html) methods to Element and Elements, to insert sibling HTML Added :contains(text) selector [maven-release-plugin] prepare release jsoup-1.2.1 [maven-release-plugin] prepare for next development iteration Changelog release date Fixed javadoc for :eq(n) Upgraded the selector query parser to allow nested selectors like 'div:has(p:has(span))' Updated TokenQueue so :contains(text) can be escaped, if looking for ( or ) within text Implemented :matches(regex) selector. Changelog Parsing optimisation. Modified TokenQueue to use a StringBuilder + offset to back the queue, instead of a linked list. Reduces memory and CPU use. Parsing performance optimisation. Modified TokenQueue chompTo method to use indexOf to allow rapid scan for next token. Parsing performance optimisation. Intern attribute keys (often shared), and dropped back default bucket sizes for attributes and element children so as to conserve memory. TextNode performance tweaks Performance optimisation in parsing. Use a Visitor instead of recursion for HTML and selectors. Performance tweaks. Tidy Added [key~=regex] attribute selector by regular expression Tidy Changelog Test update [maven-release-plugin] prepare release jsoup-1.2.2 [maven-release-plugin] prepare for next development iteration Automatically determine charset when parsing from URL or File. Auto detect charset from HTML5 <meta charset> tag if present Changed DT & DD tags to block-mode tags, to follow practise over spec. Added support for [^attributePrefix] selector query. Useful for finding elements with HTML5 datasets: [^data] Implemented Element.dataset(), to retrieve a map of custom data attributes. Improved tag definitions to allow limited children and excluded children. Improved implicit table element creation, particularly around tbody tags. Cleaned tag definitions to make head and dl parsing more generic. Implicit close for <caption> tags. 
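The TokenQueue optimisations noted above (a single backing buffer plus a read offset, and an indexOf-based chompTo) can be sketched as follows. This is a minimal illustration only: the class name `SimpleTokenQueue` and its methods are hypothetical, and it backs the queue with an immutable String rather than a StringBuilder for brevity. It is not jsoup's actual TokenQueue.

```java
/** Hypothetical, simplified token queue: one backing string plus a read offset. */
final class SimpleTokenQueue {
    private final String queue; // the whole input; consuming never copies or shifts it
    private int pos = 0;        // index of the next unread character

    SimpleTokenQueue(String data) {
        this.queue = data;
    }

    /** True once every character has been consumed. */
    boolean isEmpty() {
        return pos >= queue.length();
    }

    /**
     * Returns everything up to (but not including) seq and consumes seq as well.
     * If seq does not occur, returns and consumes the rest of the queue.
     */
    String chompTo(String seq) {
        int found = queue.indexOf(seq, pos); // one scan instead of a char-by-char loop
        if (found == -1) {
            String rest = queue.substring(pos);
            pos = queue.length();
            return rest;
        }
        String consumed = queue.substring(pos, found);
        pos = found + seq.length(); // advancing the offset is all "consuming" costs
        return consumed;
    }
}
```

Consuming a token only moves `pos`, so no list nodes are allocated per character, which is the memory and CPU saving the commit message describes.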
Changelog Testcase for malformed meta http-equiv charset. HTML5 tag support Added support for namespaced elements (<fb:name>) and selectors (fb|name) Improved HTML output format for empty elements and auto-detected self closing tags. Closes #27 Added support for tag names with - and _ (<abc_foo>, <abc-foo>) Removed obsolete nodeDepth method Implemented Node.ownerDocument DOM API method. Fixed support for character class regular expressions in [attr=~regex] selector Fixed support for character class regular expressions in [attr=~regex] selector Note <tag > fix Draft implementation of Entities, for customisable entity escaping. Working on escape/unescape routine. Simplified Entity unescaper Added ability to configure the document's output charset. Re-ordered changelog [maven-release-plugin] prepare release jsoup-1.2.3 [maven-release-plugin] prepare for next development iteration Use jsoup escaper for attributes, not Apache's. Optimise adding nodes to end of childnode list. TokenQueue optimisations Optimised document normalisation Mini optimisations Restored public access for Entities.EscapeMode Javadoc fix Removed dependency on Apache Commons-lang. Jsoup now has no external dependencies. Optimised normaliseWhitespace Optimised attribute html Micro-optimise tag ancestor Optimised textnodes to not hold attributes or childnodes unless required on use. Fixed support for case-sensitive HTML escape entities. Fixes #31 Fixed issue when parsing tags with keyless attributes. Fixes #32 Entity doc Draft / in progress implementation of Connection Initial implementation of Connection Working on http connection implementation Implemented request headers Implemented query string from data Fixed Attributes.hmtl() Added support for gzipped output. 
Fixes #28 Connection timeout specified in millis, not seconds Documented Connection interface methods Tidied up Connection and Jsoup use URL connection tests Implemented Element#ownText() Changelog Added support for non-pretty-printed HTML output, to more closely mirror the input HTML. Fixes #8 Changelog Fixed html() method of Attribute Added support for selectors :containsOwn(text) and :matchesOwn(regex), to supplement Element.ownText(). Updated the link example program to use Jsoup.connect() Validations for Connection Changelog release prep [maven-release-plugin] prepare release jsoup-1.3.1 [maven-release-plugin] prepare for next development iteration Doc Treat HTTP headers as case insensitive in Jsoup.Connection. Improves compatibility for HTTP responses. Tweaks Improved malformed table parsing by implementing ignorable end tags. More tests for Jsoup.Connection [maven-release-plugin] prepare release jsoup-1.3.2 [maven-release-plugin] prepare for next development iteration Implement Elements.empty() and Elements.remove(). Javadoc note for Elements.get(int) Selector documentation tweak Relaxed parse rules of H1 - H6 to allow nested content. Relaxed parse rule of SPAN to treat as block, to allow nested block content. Added ability to load and parse HTML from an input stream. Test fix Fixed issue in Entities when unescaping &#36; ("$") Fixes #34 added EscapeMode.minimum Added restricted XHTML output entity option Changelog [maven-release-plugin] prepare release jsoup-1.3.3 [maven-release-plugin] prepare for next development iteration Implemented DataNode.setWholeData() to allow updating of script and style data contents. Fixed support for jsoup.connect to follow redirects between http & https URLs. Fixes #37 Fixed issue in jsoup.connect when extracting character set from content-type header; now supports quoted charset declaration. Javadoc example on absUrl Document normalisation now more correctly enforces document structure. 
- ensure only one head and one body element, both under html el - allow html/head/noscript/img for some site's analytic pattern Fixes #43 Support node.outerHtml() method when node has no parent. Fixes #45 Fixed support for HTML entities with numbers in name (e.g. &frac34, &sup1) Fixes #46 Fixes IndexArrayOutOfBoundException on response with empty headers Implemented Node.clone() to create deep, independent copies of Nodes, Elements, and Documents. Fixes #47 Testcase to confirm doctypes get cloned Fixed absolute URL generation from relative URLs which are only query strings. Fixes #49 Output format tweak Added :not() selector, to find elements that do not match the selector. E.g. div:not(.logo) finds divs that do not have the "logo" class name. Fixes #36 Added Elements.not(selector) method, to remove undesired results from selector results. Changelog update in launce preperation Changelog tweak [maven-release-plugin] prepare release jsoup-1.4.1 [maven-release-plugin] prepare for next development iteration added .clone() for Elements Initial add of new generation selectors(faster than original) Added attribute selectors added AttrSelector.AttrNamePrefixSelector fix bug in element selector: incorrect behavior on multiple classes removing as exists matching evaluator classes changes wrt existing Evaluator class Selectors update removing boxing/unboxing evaluators made public Adding evaluators tests added new tests Renaming in some selectors adding new ctors to And and Or adding toString() added base container added .toString() to basic Evaluators Adding Selector parser removing char boxing adding empty and .addAll implemented :has selector Working parser except the root node selector. 
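For the query-only relative URL fix mentioned above (#49), one possible workaround is to prepend the base path before resolving, so that the last path segment cannot be dropped. The sketch below uses only `java.net.URI`. The class name `QueryOnlyUrls` and the method `resolve` are hypothetical, and this is not necessarily how jsoup implements the fix.

```java
import java.net.URI;

/** Hypothetical helper: resolves relative URLs, treating "?query" references specially. */
final class QueryOnlyUrls {
    private QueryOnlyUrls() {}

    /**
     * Resolves relUrl against baseUrl. A reference that is only a query string
     * ("?page=2") keeps the full base path and replaces just the query.
     */
    static String resolve(String baseUrl, String relUrl) {
        URI base = URI.create(baseUrl);
        if (relUrl.startsWith("?")) {
            // Turn "?q=2" into "/path/of/base?q=2" so the resolver sees an absolute path.
            relUrl = base.getRawPath() + relUrl;
        }
        return base.resolve(relUrl).toString();
    }
}
```

With this helper, `resolve("http://example.com/b/c/d?x=1", "?q=2")` keeps the `/b/c/d` path and swaps only the query string, while ordinary relative paths are resolved as usual.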
Added basic tests removed unused constructor parser update: normal order of selectors fix non-void element parsing such as <a href=/https/github.com/link/>link text</a> Evaluator.match(Element test) -> Evaluator.match(Element root, Element test) change Character -> char change restored all tests. added RootSelector updated tree selectors wrt subtree matching update evaluator wrt subtree matching Added RootSelector support small optimizations added javadocs Added javadocs for Evaluators. Updated tests. Updated parser Fixed issue when using descendant regex attribute selectors. Fixes #52 Added a test to confirm combinators don't match in balanced contains queries Updated OutputSettings inside of Document to be a static inner class. This addresses the issue with using jsoup and scala 2.8 discussed here: http://groups.google.com/group/jsoup/browse_thread/thread/3f7ec2fa41dfb87f This change should be safe to commit and won't make a backwards incompatible change to the public interface of jsoup (you can always reference a static member via a non-static path) for any existing users. A recompilation of their code won't even be necessary. All tests continue to pass for me after this change. Fixed tokeniser optimisation when scanning for missing data element close tags. Fixes #67 There are no valid (x)html tags that start with numbers Reverted changes that only allow empty tags in pre-defined instances. Markup like <tag /> needs to be parsed as an empty element. Integrated new single-pass selector evaluators, contributed by knz (Anton Kazennikov). Removed com.sun.xml.internal.ws.util.StringUtils to fix https://github.com/jhy/jsoup/issues/#issue/69 "jsoup/src/main/java/org/jsoup/select/selectors/AndSelector.java:[8,35] package com.sun.xml.internal.ws.util does not exist" Force strict entity matching (must be &xxx; and not &xxx) in element attributes. 
Fixes #71 Ensure that Jsoup.Connect handles relative redirects in cases where the underlying HTTP stack doesn't automatically follow them. Fixes #73 Allow Jsoup.Connect to parse application/xml and application/xhtml+xml responses. Fixes #72 Defined U (underline) element as an inline tag. Cleanup of selector class files Updated Jsoup.Connection so that cookies set on a redirect response will be included on the redirected request and response. Prevent infinite redirection loops in jsoup.connect. Implemented TextNode.splitText Moved .wrap, .before, and .after from Element to Node for flexibility. Overriding implementations in Element still return Element. Don't run URL connectivity tests by default. Added ability to change an element's tag with Element.tagName(String), and to change many at once with Elements.tagName(String). Test to confirm that abs URL method works on img src attributes. Generify empty child list. Removed redundant empty array Changelog updates Readme update [maven-release-plugin] prepare release jsoup-1.5.1 [maven-release-plugin] prepare for next development iteration Fixed issue with selector parser where some boolean AND + OR combined queries (e.g. "meta[http-equiv], meta[content]") were being parsed incorrectly as OR only queries (e.g. former as "meta, [http-equiv], meta[content]") Fixed issue where a content-tye specified in a meta tag may not be reliably detected, due to the above issue. Allow <a> and <font> elements to be treated as flow/block elements, to match browser parse trees. Updated copyright date Updated Element.text() method to ensure <br> tags output as whitespace. Tweaked Element.outerHtml() method to not generate initial newline on first output element. Have <br> output as " " for Element.ownText() Testcase for bug #63. Fixes #63 Test to ensure that charset detection from <meta> tag works when preceeded by an irrelevant <meta> tag. 
Changelog prep for 1.5.2 [maven-release-plugin] prepare release jsoup-1.5.2 [maven-release-plugin] prepare for next development iteration Reimplementation of parser and tokeniser, to make jsoup an HTML5 conformant parser, against the http://whatwg.org/html spec. Added test to verify that solidus as end of unquoted attribute in tag is handled as part of attribute, and not a self-closing tag, which was the old behaviour of jsoup. Fixes #66 Added test to confirm that tbody in span does not create a new table. Fixes #64 Improved "abs:" absolute URL handling in Elements.attr("abs:href") and Node.hasAttr("abs:href"). Fixes #97 Fixed issue in TokeniserState where the tokeniser could get trapped at EOF if in RCDataEndTag state. Fixed cookie handling issue in jsoup.Connect where empty cookies would cause a validation exception. Fixes #87 Allow 400-500 errors and response with no content-type to be parsed. Documentation and test cases for jsoup.Connect ignoreHttpErrors and ignoreContentType options. Cleanup datastream test. Handle unclosed <title> tags in document by breaking out of the title at the next start tag, instead of eating up to the end of the document. Fixes #82 Added OSGi bundle support to the jsoup package jar. Fixes #98 Test for #101 JavaDoc update Node.after and Node.before(node) change log Test to confirm that <textarea>s do not get implicit <form> tags with the new parser. Verifies #102 Added Node.unwrap() and Elements.unwrap(), to remove a node but keep its contents. Fixes #100 JavaDoc usage example [maven-release-plugin] prepare release jsoup-1.6.0 [maven-release-plugin] prepare for next development iteration Marked release date. Fix Java 1.5 compatibility Specify Felix Maven plugin version Javadoc update for DescendableLinkedList Fixes #103 Fix an incorrect case fall-through, and add some not-null validations to prevent warnings. 
For #103 Changelog for Java 1.5 fix Changelog correction Fixed an issue when parsing <script> tags. When in body where the tokeniser wouldn't switch to the InScript state, which meant that data in a <script> wouldn't parse correctly. Fixes #104 Refactor of script and rawtext end tag name states. Fixed CharacterReader to handle unconsuming at EOF correctly. Additional <script> test at EOF. Fixed an issue with a missing quote when serialising DocumentType nodes. Fixes #109 Fixed issue where a single 0 character was lexed incorrectly as a null character. Fixes #107 Fixed normalisation of carriage returns to newlines on input HTML Fixes #110 Disabled memory mapped files Disabled memory mapped files when loading files from disk, to improve compatibility in Windows environments. Release date Updated changelog Updated POM for github push [maven-release-plugin] prepare release jsoup-1.6.1 [maven-release-plugin] prepare for next development iteration Javadoc typo fix. Follow POST redirects as GETs Fixes #120 Optionally preserve related links in elements when cleaning Fixed handling of null characters within comments. Fixes #121 Added jsoup.connect.cookies(Map) method, to set multiple cookies at once, possibly from a prior request. Tweaked escaped entity detection in attributes to not treat &entity_... as an entity form. Updated the Cleaner to support custom allowed protocols such as "cid:" and "data:". Fixes #127 Fixed doctype tokeniser to allow whitespace between name and public identifier. Tweaked HTML output of closing script and style tags to not add an extraneous newline when pretty-printing. Tweaked Element#select documentation to reinforce CSS selector syntax. Added Element.textNodes() and Element.dataNodes(), to easily access an element's children text nodes and data nodes. Implemented an example HTML to plain-text converter. Updated changelog for HtmlToPlainText example Added documentation for NodeVisitor and NodeTraversor. 
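The NodeVisitor / NodeTraversor idea referred to above (depth-first visitation with a callback on entering and on leaving each node, without recursion) can be illustrated with a small self-contained sketch. The types `TreeWalk`, `SimpleNode`, and `Visitor` are hypothetical stand-ins, not jsoup's classes, and the explicit stack is just one way to avoid recursion.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical depth-first walker that uses an explicit stack instead of recursion. */
final class TreeWalk {
    /** Minimal stand-in for a DOM node. */
    static final class SimpleNode {
        final String name;
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String name, SimpleNode... kids) {
            this.name = name;
            for (SimpleNode kid : kids) children.add(kid);
        }
    }

    /** head() fires when a node is first reached, tail() once all its children are done. */
    interface Visitor {
        void head(SimpleNode node, int depth);
        void tail(SimpleNode node, int depth);
    }

    /** One stack entry: a node plus the index of its next unvisited child. */
    private static final class Frame {
        final SimpleNode node;
        int nextChild = 0;
        Frame(SimpleNode node) { this.node = node; }
    }

    static void traverse(SimpleNode root, Visitor visitor) {
        Deque<Frame> stack = new ArrayDeque<>();
        visitor.head(root, 0);
        stack.push(new Frame(root));
        while (!stack.isEmpty()) {
            Frame top = stack.peek();
            if (top.nextChild < top.node.children.size()) {
                SimpleNode child = top.node.children.get(top.nextChild++);
                visitor.head(child, stack.size()); // stack size equals the child's depth
                stack.push(new Frame(child));
            } else {
                stack.pop();
                visitor.tail(top.node, stack.size()); // after the pop, size equals this node's depth
            }
        }
    }
}
```

Because the pending work lives on the heap rather than the call stack, a very deep tree cannot overflow the thread stack, which is the motivation given elsewhere in this log for traversing rather than recursing.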
Corrected documentation of NodeTraversor to reflect depth-first order of node visitation. Added Node.traverse() and Elements.traverse() methods, to iterate through a node's descendants. Made Evaluator constructor public to allow custom implementations Act on only the first base href in parse. And make node.setBaseUri() recurse down to descendants. First draft of a simple XML treebuilder / parser. This provides an alternative to the HTML5 parser which enforces HTML semantics on the parsed input. The simple XML parser has no understanding of HTML, and will parse the input as-is into a DOM. Added test coverage for XML parser. Allow an alternate parser to be supplied for core use cases. Fixed URL tests. Fixed invocation of alternative parser in Jsoup.Connection. Updated test to confirm. Added test for invocation of alternate parser when loading from file input stream. Added support to optionally keep track of errors while tokenising and tree-building. Test XML (un)known self-closing behaviour. Changelog for XML and error tracking. Change what considered as "whitespace" Changelog and code tweak for whitespace test. Fixes issue #126 in jsoup, where comments inside table were replicated inside body Changelog and test for pull #165 (comments in tables). Fixes #126 Updated parser error tracking to cap the max size of errors tracked. Defaults to 0 (disabled). Make ParseError public. Drop BOM at start of byte data if present after decode. Fixes #134. Correctly handle content (ignore it) after frameset end. Fixes #162 Reduced memory pre-allocation in Node.outerHtml from 32KB to 128B, to reduce memory pressure. Fixes #143. Changelog release prep POM update for git url syntax [maven-release-plugin] prepare release jsoup-1.6.2 [maven-release-plugin] prepare for next development iteration Tidied up readme, for github presentation. Fixed parsing of group-or commas in CSS selectors. Fixes #179 Updated license date Fixed precedence parsing of group OR (,) in CSS selectors. 
Added tests, and repaired cheekily / hastily / incorrectly modified test. If a node has no parent, return null on previousSibling and nextSibling instead of throwing a null pointer exception. Fixes #184 Correct Elements.not() javadoc. Fixes #177 Fixed HTML entity parser to correctly parse entities like frac14 (letter + number combo). Fixes #145 Fixed issue where contents of a script tag within a comment could be incorrectly parsed. Fixes #115 Fixed GAE support: load HTML entities from a file on startup, instead of embedding in the class. Fix NPE in style fragment parse Fixes #189 Fixed issue with :all pseudo-tag in HTML sanitizer when cleaning tags Fixes #156 NPE changelog Fixed NPE in Parser.parseFragment() when context parameter is null. Fixes #195. In HTML whitelists, when defining allowed attributes for a tag, automatically add the tag to the allowed list. Fixes #192. Copy .properties files, to fix mvn build of Entities. Fixes #203 Fixed missing <td> in javadoc Splelling. Correct date for the last release. Release prep. [maven-release-plugin] prepare release jsoup-1.6.3 [maven-release-plugin] prepare for next development iteration Reflect actual release date. Javadoc align headers left Use English-locale settings when uppercasing charset. Fixes a problem with Turkish characters, see: http://mattryall.net/blog/2009/02/the-infamous-turkish-locale-bug Set Locale.ENGLISH when running upper/lowercase methods, to ensure locale independence. Fixed whitespace preservation in <textarea> tags. Fixes #167. In jsoup.connect, fail faster if the return content type is not supported. Fixes #153. In jsoup.clean, allow custom OutputSettings, to control pretty printing, character set, and entity escaping. Fixes #148 Use a StringBuilder to accumulate attribute values. When initially implemented I expected attribute values to normally be read in one sweep, removing the need to accumulate values. 
However profiling shows that attributes are often accumulated, and so the string concatination implementation was very slow. Using a StringBuilder here gives reduces parse times > 50%. Reuse StringBuilders No longer strip \r before parsing. This saves memory and CPU time at start of parse. Changelog for performance improvement. Make copies of all strings returned, vs returning pointers to substrings of input. The former method is a little faster, and creates less garbage. However when the input is large, and apps retain data pulled from the DOM, the app may perceive a memory leak, as even a small string is actually as large as the original input (although multiple strings are all backed by the one original input). So, this implementation is a little less performant, but has a potential for greater safety, depending on how the library is used. Replaced Strings with char array in CharacterReader, for well improved parse times. Faster to scan, and less garbage created. Only create attribute objects for end tag tokens when required. Saves a bit of GC time. Don't create Iterator objects in these tight Evaluator loops. Saves a fair bit of GC time when selecting. To save GC time in select, don't wrap childnodes in unmodifiable list. Check for null body, possible in framesets. Fixes #154 Changelog catch-up. Javadoc fix Upgrade to maven-resources-plugin version 2.4 Eclipse m2e doesn't work with the old version 2.3. Updated Maven pom.xml to specify jar plugin version. Fixed an issue when normalising whitespace for strings containing high-surrogate characters. Don't throw an exception if no content-type specified. Fixes #213 If the charset from the server is not supported, fail-back to UTF8. Fixes #215 Remove unnecessary synchronisation in Tag.valueOf Fixes #238 Micro-optimised Tag.valueOf Only lower-case and trim tag names if their original form is not found in the registered tags. Shaves parse time for the bulk of cases where the tag is already in lowercase. 
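The Tag.valueOf notes above (no synchronisation, and lower-casing or trimming only when the original form is not found) combine naturally with two earlier entries in this log: using Locale.ENGLISH for case conversion, and not registering unknown tags. A minimal sketch under those assumptions follows. `TagRegistry` and `SimpleTag` are hypothetical names and the tag list is illustrative; this is not jsoup's Tag class.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical tag registry: populated once, then only read, so lookups need no locking. */
final class TagRegistry {
    static final class SimpleTag {
        final String name;
        SimpleTag(String name) { this.name = name; }
    }

    private static final Map<String, SimpleTag> KNOWN = new HashMap<>();
    static {
        // Illustrative subset; a real registry would list every known tag.
        for (String n : new String[] {"div", "p", "span", "title"}) {
            KNOWN.put(n, new SimpleTag(n));
        }
    }

    private TagRegistry() {}

    static SimpleTag valueOf(String tagName) {
        // Fast path: most input is already lower-case and trimmed.
        SimpleTag tag = KNOWN.get(tagName);
        if (tag == null) {
            // Slow path: normalise with a fixed locale so "TITLE" never becomes
            // a dotless-i variant under a Turkish default locale.
            String normal = tagName.trim().toLowerCase(Locale.ENGLISH);
            tag = KNOWN.get(normal);
            if (tag == null) {
                // Unknown tags are returned but never registered, so hostile
                // input cannot grow the map without bound.
                tag = new SimpleTag(normal);
            }
        }
        return tag;
    }
}
```

The trade-off is that two lookups of the same unknown tag return distinct objects, so callers should compare unknown tags by name rather than by identity.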
In DataUtil, check if body length > 0 before looking at docData Prevents a string out of bounds exception. Fixes #230 Refactored entity decoding. Modified the heuristic entity decoder to be less greedy; does not repeatedly chomp down the string until a match is found, and requires a semicolon terminator for extended entities. Updated Entities to use the entity decoder in Tokeniser, vs the legacy decoder. Fixes #224. Extra wrap/unwrap tests Whitespace normalise document.title() output. Fixes #168 Tidy up javadoc one-liners for Element Introduced finer granularity to Jsoup.connect exceptions. Fixes #229 Don't run URL tests by default. Confirm cleans Russian characters OK [maven-release-plugin] prepare release jsoup-1.7.1 [maven-release-plugin] prepare for next development iteration Refactored the Cleaner to traverse rather than recurse child nodes Avoids the risk of overflowing the stack Fixes #246 Typo in Doctype node When parsing in XML mode, preserve XML declarations (<?xml ... ?>). Fixes #242 Allow Whitelist test methods to be extended Fixes #85 Javadoc typo Fixed an issue when parsing <textarea>/RCData tags containing unescaped closing tags that would drop the traling >. Added a maximum body response size to Jsoup.Connection The Connection API is no longer in beta. Modified maxBodySize to truncate at precise max. Rather than previous implementation which was up to the internal buffer size (130K) larger. Corrected the javadoc for Element#child() to note that it throws IndexOutOfBounds. Fixes #277 Added test to verify absolute URLs for file:// URIs Fixes #255 Added Element.insertChildren Also tidied up JavaDoc, and returned Node.childNodes to a unmodifiable list. Fixes #239 (with alternative implementation) Added Node.childNodesCopy() Don't clone the element's classnames Fixes #278 Limit how far up the stack the formatting adoption agency algorithm will travel Fixes #234 Modified Element.text() to build text by traversing child nodes rather than recursing. 
Fixes #271 Introduced Parser.parseXmlFragment(), to allow easy parsing of XML fragments. Updated copyright year. Support ins, del and s tags inline Tweaked koz's changes in merge prep. Adds outline mode to Document.OutputSettings. Fixes #274 Fixed overzealous indenting in outerHtmlTail When parsing, allow all tags to self-close. Tags that aren't expected to self-close will get an end tag. Fixes #2458 changed return type of Tokeniser.consumeCharacterReference from Character to char[], and also changed TokeniserState accordingly changed Entities.escape to escape String with supplementary characters correctly, and added two test cases to verify added test cases to verify supplementary characters can be used as attribute name and value, as well as be selected by selector added a containsOwn test fixed incorrect code copy-and-paste Tweaked mingfai's surrogate pair implementation for efficiency. On the core cases where characters are not surrogate pairs, I've kept to pushing chars around. This is to try and minimize the number of short-lived String objects. Output escape codes in hex instead of decimal. For characters that need to be escaped, this produces a smaller more legible escape code, and Unicode chars are generally described in hex, not dec. Output valid hex escapes this time. Changelog for supplementary characters Added support for css pseudo classes :first-child, :last-child, :nth-child, :nth-last-child, :first-of-type, :last-of-type, :nth-of-type, :nth-last-of-type, :only-child, :only-of-type, :empty, :root Changelog for structural pseudo selectors Note the Jsoup.Connect default max size [maven-release-plugin] prepare release jsoup-1.7.2 [maven-release-plugin] prepare for next development iteration Changelog 1.7.2 release date Test that hostless URIs resolve to absolute URLs correctly. 
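The notes above on supplementary characters and hex escape codes can be illustrated with a small sketch that walks a string by code point, so a surrogate pair yields a single hex escape. This is an assumption-laden illustration and not jsoup's Entities.escape: the class and method names are hypothetical, it escapes every non-ASCII code point, and it does not handle `&`, `<`, or `>`.

```java
final class HexEscapeSketch {
    // Escapes each non-ASCII code point as a hexadecimal character reference.
    static String escapeNonAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            // codePointAt combines a surrogate pair into one code point.
            int codePoint = input.codePointAt(i);
            if (codePoint < 0x80) {
                out.append((char) codePoint);
            } else {
                out.append("&#x").append(Integer.toHexString(codePoint)).append(';');
            }
            // Advance by one char, or by two for a supplementary character.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1D538 is written here as its surrogate pair.
        System.out.println(escapeNonAscii("a\uD835\uDD38"));
    }
}
```

Iterating by `char` instead would emit two escapes for the two surrogate halves, which is not a valid reference to the intended character.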
Removed code duplication in data end tag handlers Reduce code dupes for ScriptDataDoubleEscapeStart and ScriptDataDoubleEscapeEnd First pass at a FormElement The FormElement extends Element to provide ready access to a form's controls, and to allow the form to be submitted. It also connects forms to their controls in situations when the DOM tree created does not have the form element be a parent of the control, like when the form tag is in a TR but the control in a TD. In that case the form tag gets reparented. FormElement changelog note Don't auto submit buttons Because they weren't clicked on, we shouldn't submit their values. Do need to find a convenient way to simulate a button press. Added a forms() convenience method to Elements This allows one to get at FormElements without casting.
1 parent b25eb52 commit 24080dd

102 files changed

+29908
-1
lines changed

.gitignore

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+.idea/
+jsoup.iml
+jsoup.ipr
+jsoup.iws
+target/
+.classpath
+.project
+.settings/
+*Thrash*
+

CHANGES

Lines changed: 484 additions & 0 deletions
Large diffs are not rendered by default.

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+The MIT License
+
+Copyright (c) 2009, 2010, 2011, 2012, 2013 Jonathan Hedley <[email protected]>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.

README

Lines changed: 14 additions & 1 deletion
@@ -1,2 +1,15 @@
-Private github project to keep all personal work.
+jsoup: Java HTML parser that makes sense of real-world HTML soup.
 
+jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.
+
+jsoup implements the WHATWG HTML5 specification (http://whatwg.org/html), and parses HTML to the same DOM as modern browsers do.
+
+* parse HTML from a URL, file, or string
+* find and extract data, using DOM traversal or CSS selectors
+* manipulate the HTML elements, attributes, and text
+* clean user-submitted content against a safe white-list, to prevent XSS
+* output tidy HTML
+
+jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup; jsoup will create a sensible parse tree.
+
+See http://jsoup.org/ for downloads and documentation.

pom.xml

Lines changed: 197 additions & 0 deletions
@@ -0,0 +1,197 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+  <modelVersion>4.0.0</modelVersion>
+  <name>jsoup</name>
+
+  <groupId>org.jsoup</groupId>
+  <artifactId>jsoup</artifactId>
+  <version>1.7.3-SNAPSHOT</version>
+  <description>jsoup HTML parser</description>
+  <url>http://jsoup.org/</url>
+  <inceptionYear>2009</inceptionYear>
+  <issueManagement>
+    <system>GitHub</system>
+    <url>http://github.com/jhy/jsoup/issues</url>
+  </issueManagement>
+  <licenses>
+    <license>
+      <name>The MIT License</name>
+      <url>http://jsoup.com/license</url>
+      <distribution>repo</distribution>
+    </license>
+  </licenses>
+  <scm>
+    <url>http://github.com/jhy/jsoup</url>
+    <connection>scm:git:http://github.com/jhy/jsoup.git</connection>
+    <!-- <developerConnection>scm:git:[email protected]:jhy/jsoup.git</developerConnection> -->
+  </scm>
+  <organization>
+    <name>Jonathan Hedley</name>
+    <url>http://jonathanhedley.com/</url>
+  </organization>
+
+  <build>
+    <plugins>
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-compiler-plugin</artifactId>
+        <version>2.0.2</version>
+        <configuration>
+          <source>1.5</source>
+          <target>1.5</target>
+          <encoding>UTF-8</encoding>
+        </configuration>
+      </plugin>
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-javadoc-plugin</artifactId>
+        <version>2.6.1</version>
+        <configuration>
+        </configuration>
+        <executions>
+          <execution>
+            <id>attach-javadoc</id>
+            <phase>verify</phase>
+            <goals>
+              <goal>jar</goal>
+            </goals>
+          </execution>
+        </executions>
+      </plugin>
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-source-plugin</artifactId>
+        <version>2.1.1</version>
+        <configuration>
+        </configuration>
+        <executions>
+          <execution>
+            <id>attach-sources</id>
+            <phase>verify</phase>
+            <goals>
+              <goal>jar</goal>
+            </goals>
+          </execution>
+        </executions>
+      </plugin>
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-jar-plugin</artifactId>
+        <version>2.2</version>
+        <configuration>
+          <archive>
+            <manifestFile>${project.build.outputDirectory}/META-INF/MANIFEST.MF</manifestFile>
+          </archive>
+        </configuration>
+      </plugin>
+      <plugin>
+        <groupId>org.apache.felix</groupId>
+        <artifactId>maven-bundle-plugin</artifactId>
+        <version>2.1.0</version>
+        <executions>
+          <execution>
+            <id>bundle-manifest</id>
+            <phase>process-classes</phase>
+            <goals>
+              <goal>manifest</goal>
+            </goals>
+          </execution>
+        </executions>
+        <configuration>
+          <instructions>
+            <Bundle-DocURL>http://jsoup.org/</Bundle-DocURL>
+          </instructions>
+        </configuration>
+      </plugin>
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-resources-plugin</artifactId>
+        <version>2.4</version>
+      </plugin>
+    </plugins>
+    <resources>
+      <resource>
+        <directory>src/main/java</directory>
+        <includes>
+          <include>**/*.properties</include>
+        </includes>
+      </resource>
+    </resources>
+  </build>
+
+  <distributionManagement>
+    <snapshotRepository>
+      <id>sonatype-nexus-snapshots</id>
+      <name>Sonatype Nexus Snapshots</name>
+      <url>http://oss.sonatype.org/content/repositories/snapshots</url>
+    </snapshotRepository>
+    <repository>
+      <id>sonatype-nexus-staging</id>
+      <name>Nexus Release Repository</name>
+      <url>http://oss.sonatype.org/service/local/staging/deploy/maven2/</url>
+    </repository>
+  </distributionManagement>
+
+  <profiles>
+    <profile>
+      <id>release-sign-artifacts</id>
+      <activation>
+        <property>
+          <name>performRelease</name>
+          <value>true</value>
+        </property>
+      </activation>
+      <build>
+        <plugins>
+          <plugin>
+            <groupId>org.apache.maven.plugins</groupId>
+            <artifactId>maven-gpg-plugin</artifactId>
+            <executions>
+              <execution>
+                <id>sign-artifacts</id>
+                <phase>verify</phase>
+                <goals>
+                  <goal>sign</goal>
+                </goals>
+              </execution>
+            </executions>
+          </plugin>
+        </plugins>
+      </build>
+    </profile>
+  </profiles>
+
+  <dependencies>
+
+    <dependency>
+      <!-- junit -->
+      <groupId>junit</groupId>
+      <artifactId>junit</artifactId>
+      <version>4.5</version>
+      <scope>test</scope>
+    </dependency>
+
+  </dependencies>
+
+  <dependencyManagement>
+    <dependencies>
+    </dependencies>
+  </dependencyManagement>
+
+  <properties>
+    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
+  </properties>
+
+  <developers>
+    <developer>
+      <id>jhy</id>
+      <name>Jonathan Hedley</name>
+      <email>[email protected]</email>
+      <roles>
+        <role>Lead Developer</role>
+      </roles>
+      <timezone>+11</timezone>
+    </developer>
+  </developers>
+
+</project>

0 commit comments