Commit 24080dd
Site
Work
Suds: inherit meta from parent dir
ignore
Merge
Updated readme
Updated readme to be more useful.
Project init.
readme update
Initial implementation of TokenStream, to pull tokens from HTML for parsing.
Initial implementation of TokenStream.
Initial implementation of Tag.
Draft implementation of Parser.
Stack, not Queue, so use getLast()
Ignore tag "html" in parse tree as created by doc
Initial implementation of AttributeParser.
URL in readme
Updated TokenStream to deal with < or > within attributes.
Implemented comment handling.
Not very happy with token and parser. Would like to reimplement with some kind of expect / consumeTo behavior. Getting a bit too hacky.
Element child(int) method, attr(key) method
Set parent on add Child
Complain if moving elements in tree (to implement)
Implemented getElementsByTagName
Test zero return
Implemented Element.getElementById
Implemented text()
Implemented sibling methods
Renamed JSoup -> Jsoup
Reimplemented parser.
Pulled string matching stuff out of Parser and into TokenQueue, for reuse in selector parser. Added some tests and corrected behaviour of parser.
Implemented:
* Element.getElementsByClass
* Element.getElementsWithAttribute
* Element.getElementsWithAttributeValue
* Element.className
* Element.hasClass
* Element.classNames
Handle HTML encoded (escaped) data in text nodes and attributes.
Nodes get to html() method.
All nodes have outerHtml method, elements have (inner) HTML.
Initial, partial implementation of Selector.
Implemented select(query, query, query) or group selector.
Implemented ElementList.select()
Renamed ElementList to Elements
Implemented select("ancestor descendant").
Deeper descendant test.
Support for data only tags (script, textarea etc)
Removed scratch test.
Neatened Parser
Implicit parent creation for elements, more general document structure creation.
Introduced a DataNode, and SCRIPT, TEXTAREA, TITLE etc goes into DataNodes, so that Element.text() does not get clouded with script or style inners.
Fixed head canContain
todo note
Have Element DOM methods return Elements instead of List<Element>, to give ready access to .select(query)
Implemented Element.text(string) method, to set the text of an element, and clear out existing html.
Selector: added * (all elements) and parent > child.
Implement baseURIs for all Nodes, and absolute URL attribute getter.
Initial test suite from real world html
Updated parser to support namespaced attributes (i.e xml:lang=en).
Implemented Elements first(), last(), attr(), hasAttr() methods.
Git ignore Eclipse
Explicit empty String[], to remove warning on null as vararg.
Fixed selector so that "parent child" doesn't contain parent, and "element.class" is an AND on parent element, and doesn't match .class in child element (i.e. I re-read the CSS3 selector doc, which is clearer than jquery doc).
Attribute helpers in Element and Elements.
Test confirms selectors are case insensitive.
Parser updated to handle CDATA, and rogue < in text nodes.
Output HTML correctly for <! ... > and <? ... ?> xml tags.
Implemented advanced attribute selectors (!= ^= $= *=) and element methods.
Extend selector test for multi classes and attributes.
E.g.
div.foo.bar[title=qux][name=bar] matches:
<div class="foo bar" title="qux" name="bar">
Don't register unknown tags.
Unknown tags created with Tags.valueOf(String) were being registered, so that further
.valueOf()s would return an == tag. But that's a potential memory leak, particularly with
malicious input HTML, and serves no real purpose (as .equals() still works), so that
functionality has been removed.
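The trade-off described above can be sketched as follows. This is a minimal illustration, not jsoup's actual Tag class: `TagSketch`, its three-entry registry, and `registrySize()` are hypothetical names chosen for the example. Known tags are shared from a registry, while unknown tags get a fresh instance that is never stored, so arbitrary input cannot grow the map, and `.equals()` still compares by name.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for a tag registry (not jsoup's real Tag class).
final class TagSketch {
    private static final Map<String, TagSketch> KNOWN = new HashMap<>();
    static {
        for (String name : new String[] {"div", "p", "span"}) {
            KNOWN.put(name, new TagSketch(name));
        }
    }

    private final String name;

    private TagSketch(String name) {
        this.name = name;
    }

    // Known tags come from the registry; unknown tags are created but never registered.
    static TagSketch valueOf(String name) {
        TagSketch known = KNOWN.get(name);
        return known != null ? known : new TagSketch(name);
    }

    static int registrySize() {
        return KNOWN.size();
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof TagSketch && ((TagSketch) o).name.equals(name);
    }

    @Override
    public int hashCode() {
        return name.hashCode();
    }

    public static void main(String[] args) {
        TagSketch a = valueOf("custom");
        TagSketch b = valueOf("custom");
        System.out.println(a == b);         // false: separate instances
        System.out.println(a.equals(b));    // true: equality is by name
        System.out.println(registrySize()); // 3: the registry did not grow
    }
}
```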
Tests parser for unknown tags.
Handle empty (self closing) blocks.
<div/><div></div> was parsing as <div><div></div></div>. No longer.
Implemented Elements methods text(), eq(), and is().
Removed unused Select#groupOp noop.
Renamed Elements#select to Elements#filter.
Renamed so the API is better self describing.
Initial bits of HTML cleaner.
Initial implementation of Whitelist cleaner config.
Initial Cleaner implementation.
Added integration test for www.news.com.au
Fixed select parser for childs.
Selector cleanup
Google search result parse test.
Fixed selector for multi descenders.
Google parse test
Implemented "abs:" virtual attribute prefix for absolute URLs.
Modified Node#absURL to return only absolute URLs, or "".
Previously if there was no baseURI, it would return a relative URL from the
attribute value, which is unreliable.
Also documented method.
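The "absolute URL or empty string" rule above can be sketched with the JDK's `java.net.URI`. This is an illustration under assumptions, not the library's code: `AbsUrlSketch.absUrl` is a hypothetical helper, and treating malformed input as "" is a choice made for the example.

```java
import java.net.URI;

// Hypothetical helper illustrating "return only absolute URLs, or an empty string".
final class AbsUrlSketch {
    static String absUrl(String baseUri, String attributeValue) {
        if (attributeValue == null || attributeValue.isEmpty()) {
            return "";
        }
        try {
            URI resolved = (baseUri == null || baseUri.isEmpty())
                    ? URI.create(attributeValue)                   // no base: value must already be absolute
                    : URI.create(baseUri).resolve(attributeValue); // resolve relative to the base
            return resolved.isAbsolute() ? resolved.toString() : "";
        } catch (IllegalArgumentException e) {
            return ""; // assumption for this sketch: malformed input yields ""
        }
    }

    public static void main(String[] args) {
        System.out.println(absUrl("http://example.com/a/b.html", "c.html")); // http://example.com/a/c.html
        System.out.println(absUrl("", "c.html").isEmpty());               // true
    }
}
```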
Modified parser to add elements found past </body> into body.
Test for binary content.
Parse Test for Yahoo Japan.
Extra Cleaner tests.
Adds Parser.parseBodyFragment method.
Makes static members final.
Updates pom and adds license
Updates description
Encourage Maven to copy using UTF-8.
Otherwise it seems to copy java source (but not other resources?) using local encoding,
which means that some of the char encoding tests break.
Maven plugin to build javadoc
Rename Element.addChild to appendChild.
Also implement Element.addElement(String tagName)
Implemented Element.append(html)
Element documentation.
Doc tidy.
Benchmark script.
Create tests with Thrash for sandbox benchmarking.
Removed StartTag
Moved Evaluator to nodes from select, to close down public methods.
Doc
Doc
Copyright date
Knock access down
Version is 0.1 until first beta release
Removed Element children list, and create on fly from nodes.
Attributes values back to Attribute
Attributes format
Implemented Element prepend methods
Simplified Document bean methods for consistency.
Linked Document title methods with HTML structure
Fixed parse of unclosed <dl><dt>Foo<dd>Bar</dl> runs.
Force compile source from UTF-8.
Otherwise it defaults to something odd.
And the property already set (project.build.sourceEncoding) doesn't seem to be
used by the mvn compiler. Which is... also odd. Or more likely, I don't 100%
understand it.
POM update to build source and javadoc jars
Implemented Jsoup.parse(File), and javadocced.
Flipped integration test to use Parse(File)
No default constructor for Jsoup
Dropped "get"
Elements
Don't escape text in data nodes, to preserve " chars
Simplified Tag creator.
Text normalisation.
Use string builders for HTML creation.
Whitespace tests
Preserve whitespace in children of <pre>
Implemented parse from URL.
Doc typo
Doc
Parse <frameset> outside of <body>
Updated URL integration test
Javadoc
Parser javadoc
Cleaner javadoc
Selector documentation
Document.createElement(String)
Support for inline font tag
pom update
[maven-release-plugin] prepare release jsoup-0.1.1
[maven-release-plugin] prepare for next development iteration
Readme update
Parse unknown tags as inline elements that can contain blocks.
Ensures <p><custom>Test</custom></p> parses like that, and not
<p></p><custom>Test</custom>.
Closes #1
Changelog
Release 0.1.2 prep.
[maven-release-plugin] prepare release jsoup-0.1.2
[maven-release-plugin] prepare for next development iteration
Fix absolute URL resolution issue when a base tag has no href.
Example program: list links
Implemented Element#wrap and #Elements#wrap
Also protected Node.replaceChild, removeChild, addChild.
New: E + F adjacent sibling selector, E ~ F preceding sibling.
Corrected change note
Maven Sonatype setup
[maven-release-plugin] prepare release jsoup-0.2.1
[maven-release-plugin] prepare for next development iteration
[maven-release-plugin] prepare release jsoup-0.2.1
Release prep
[maven-release-plugin] prepare release jsoup-0.2.1a
[maven-release-plugin] prepare for next development iteration
Sonatype release machinations
[maven-release-plugin] prepare release jsoup-0.2.1b
[maven-release-plugin] prepare for next development iteration
Add addClass, removeClass, toggleClass, hasClass to Element and Elements.
Closes #2
Improved document normalisation.
hasText
Improved HTML output (pretty-print)
Changelog for release prep
[maven-release-plugin] prepare release jsoup-0.2.2
[maven-release-plugin] prepare for next development iteration
Assert attribute values are not null, rather than not empty.
Closes #7.
Changed Elements#attr(key) to scan all elements for attribute.
Closes #4.
Implemented Elements html(), html(string), append, and prepend.
Closes #5.
Changelog
Normalise head by prepending, not appending.
Closes #9.
Cleaner.isValid() method.
Closes #6.
IsValid test for OK attribute
Test self is not descender
Deploy prep
Release prep
[maven-release-plugin] prepare release jsoup-0.3.1
[maven-release-plugin] prepare for next development iteration
Allow - and _ in CSS ID selectors.
Closes #10.
Changelog
Changelog
Resolve relative links when cleaning.
Closes #12.
Allow combinators at start of selector query
Closes #13
Added val() and val(string) to Element and Elements.
Treat contents of textarea as text, not data.
Closes #14
Added Node#remove and Node#replaceWith.
Closes #19
Throw exception if trying to parse non-text content
Closes #17
Added TextNode#text and TextNode#text(String)
Closes #18
Added selector support for :eq, :lt, and :gt
Closes #16
String.isEmpty() and LinkedList.peekFirst() are not part of the Java 5.0 API.
Updated ignore list
Preparing 1.1.1 release
[maven-release-plugin] prepare release jsoup-1.1.1
[maven-release-plugin] prepare for next development iteration
Change notes
Fixed test package
Fix an issue where text order was incorrect when parsing pre-document HTML.
Fixes #23
Clean up the parse stack correctly when parsing data-nodes.
Fixes #22.
Fixed javadoc typo
Added :has(selector) pseudo-selector.
Added Element#parents() and Elements#parents() methods.
Fixes #20
Changelog release date
Improved implicit close tag heuristic detection when parsing malformed HTML.
Fixes an issue where appending / prepending rows to a table (or to similar implicit
element structures) would create redundant wrapping elements.
Fixes #21
Cleanup Element and Node add mechanism
Added .before(html) and .after(html) methods to Element and Elements, to insert sibling HTML
Added :contains(text) selector
[maven-release-plugin] prepare release jsoup-1.2.1
[maven-release-plugin] prepare for next development iteration
Changelog release date
Fixed javadoc for :eq(n)
Upgraded the selector query parser to allow nested selectors like 'div:has(p:has(span))'
Updated TokenQueue so :contains(text) can be escaped, if looking
for ( or ) within text
Implemented :matches(regex) selector.
Changelog
Parsing optimisation.
Modified TokenQueue to use a StringBuilder + offset to back the queue,
instead of a linked list. Reduces memory and CPU use.
Parsing performance optimisation.
Modified TokenQueue chompTo method to use indexOf to allow rapid
scan for next token.
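The two TokenQueue optimisations above (an offset into a backing buffer instead of a linked list, and `indexOf` for scanning) can be sketched like this. `TokenQueueSketch` is a hypothetical, much-reduced class; it uses an immutable String plus an int offset where the commit mentions a StringBuilder, and only the `chompTo` name is taken from the commit message.

```java
// Hypothetical sketch of an offset-backed token queue (not jsoup's TokenQueue).
final class TokenQueueSketch {
    private final String queue; // backing data, never mutated
    private int pos = 0;        // offset of the next unconsumed character

    TokenQueueSketch(String data) {
        this.queue = data;
    }

    boolean isEmpty() {
        return pos >= queue.length();
    }

    // Returns everything before the delimiter and consumes the delimiter too.
    // If the delimiter is absent, the rest of the queue is consumed and returned.
    String chompTo(String delimiter) {
        int idx = queue.indexOf(delimiter, pos); // single scan, no per-character dequeue
        if (idx == -1) {
            String rest = queue.substring(pos);
            pos = queue.length();
            return rest;
        }
        String data = queue.substring(pos, idx);
        pos = idx + delimiter.length();
        return data;
    }

    public static void main(String[] args) {
        TokenQueueSketch q = new TokenQueueSketch("<b>one</b>two");
        q.chompTo("<b>");
        System.out.println(q.chompTo("</b>")); // one
        System.out.println(q.chompTo("<"));    // two
    }
}
```

Consuming is then just an integer increment, which is where the memory and CPU saving over a linked list of characters would come from.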
Parsing performance optimisation.
Intern attribute keys (often shared), and dropped back default
bucket sizes for attributes and element children so as to conserve
memory.
TextNode performance tweaks
Performance optimisation in parsing.
Use a Visitor instead of recursion for HTML and selectors.
Performance tweaks.
Tidy
Added [key~=regex] attribute selector by regular expression
Tidy
Changelog
Test update
[maven-release-plugin] prepare release jsoup-1.2.2
[maven-release-plugin] prepare for next development iteration
Automatically determine charset when parsing from URL or File.
Auto detect charset from HTML5 <meta charset> tag if present
Changed DT & DD tags to block-mode tags, to follow practice over spec.
Added support for [^attributePrefix] selector query. Useful for finding
elements with HTML5 datasets: [^data]
Implemented Element.dataset(), to retrieve a map of custom data attributes.
Improved tag definitions to allow limited children and excluded children.
Improved implicit table element creation, particularly around tbody tags.
Cleaned tag definitions to make head and dl parsing more generic.
Implicit close for <caption> tags.
Changelog
Testcase for malformed meta http-equiv charset.
HTML5 tag support
Added support for namespaced elements (<fb:name>) and selectors (fb|name)
Improved HTML output format for empty elements and auto-detected self closing tags.
Closes #27
Added support for tag names with - and _ (<abc_foo>, <abc-foo>)
Removed obsolete nodeDepth method
Implemented Node.ownerDocument DOM API method.
Fixed support for character class regular expressions in [attr=~regex] selector
Note <tag > fix
Draft implementation of Entities, for customisable entity escaping.
Working on escape/unescape routine.
Simplified Entity unescaper
Added ability to configure the document's output charset.
Re-ordered changelog
[maven-release-plugin] prepare release jsoup-1.2.3
[maven-release-plugin] prepare for next development iteration
Use jsoup escaper for attributes, not Apache's.
Optimise adding nodes to end of childnode list.
TokenQueue optimisations
Optimised document normalisation
Mini optimisations
Restored public access for Entities.EscapeMode
Javadoc fix
Removed dependency on Apache Commons-lang. Jsoup now has no external dependencies.
Optimised normaliseWhitespace
Optimised attribute html
Micro-optimise tag ancestor
Optimised textnodes to not hold attributes or childnodes unless required on use.
Fixed support for case-sensitive HTML escape entities.
Fixes #31
Fixed issue when parsing tags with keyless attributes.
Fixes #32
Entity doc
Draft / in progress implementation of Connection
Initial implementation of Connection
Working on http connection implementation
Implemented request headers
Implemented query string from data
Fixed Attributes.html()
Added support for gzipped output.
Fixes #28
Connection timeout specified in millis, not seconds
Documented Connection interface methods
Tidied up Connection and Jsoup use
URL connection tests
Implemented Element#ownText()
Changelog
Added support for non-pretty-printed HTML output, to more closely mirror the input HTML.
Fixes #8
Changelog
Fixed html() method of Attribute
Added support for selectors :containsOwn(text) and :matchesOwn(regex), to supplement Element.ownText().
Updated the link example program to use Jsoup.connect()
Validations for Connection
Changelog release prep
[maven-release-plugin] prepare release jsoup-1.3.1
[maven-release-plugin] prepare for next development iteration
Doc
Treat HTTP headers as case insensitive in Jsoup.Connection. Improves compatibility for HTTP responses.
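One common way to get case-insensitive header lookup in Java is a `TreeMap` ordered by `String.CASE_INSENSITIVE_ORDER`. The sketch below shows that technique only; it is an assumption that this resembles the approach taken, and `HeaderSketch` is a hypothetical name.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a header map whose keys compare without regard to case.
final class HeaderSketch {
    static Map<String, String> newHeaderMap() {
        return new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    }

    public static void main(String[] args) {
        Map<String, String> headers = newHeaderMap();
        headers.put("Content-Type", "text/html");
        System.out.println(headers.get("content-type")); // text/html
    }
}
```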
Tweaks
Improved malformed table parsing by implementing ignorable end tags.
More tests for Jsoup.Connection
[maven-release-plugin] prepare release jsoup-1.3.2
[maven-release-plugin] prepare for next development iteration
Implement Elements.empty() and Elements.remove().
Javadoc note for Elements.get(int)
Selector documentation tweak
Relaxed parse rules of H1 - H6 to allow nested content.
Relaxed parse rule of SPAN to treat as block, to allow nested block content.
Added ability to load and parse HTML from an input stream.
Test fix
Fixed issue in Entities when unescaping $ ("$")
Fixes #34
added EscapeMode.minimum
Added restricted XHTML output entity option
Changelog
[maven-release-plugin] prepare release jsoup-1.3.3
[maven-release-plugin] prepare for next development iteration
Implemented DataNode.setWholeData() to allow updating of script and style data contents.
Fixed support for jsoup.connect to follow redirects between http & https URLs.
Fixes #37
Fixed issue in jsoup.connect when extracting character set from content-type header; now supports quoted
charset declaration.
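Extracting a possibly quoted charset from a Content-Type header can be sketched with a regular expression. The pattern and the `CharsetSketch` class are illustrative assumptions, not the library's exact implementation; only double quotes are handled here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: pull the charset out of a Content-Type header value.
final class CharsetSketch {
    // Optional double quote before the value; the value stops at whitespace, ';' or '"'.
    private static final Pattern CHARSET =
            Pattern.compile("(?i)\\bcharset=\\s*\"?([^\\s;\"]+)");

    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(charsetFromContentType("text/html; charset=\"UTF-8\"")); // UTF-8
        System.out.println(charsetFromContentType("text/html;charset=iso-8859-1")); // iso-8859-1
    }
}
```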
Javadoc example on absUrl
Document normalisation now more correctly enforces document structure.
- ensure only one head and one body element, both under html el
- allow html/head/noscript/img for some site's analytic pattern
Fixes #43
Support node.outerHtml() method when node has no parent.
Fixes #45
Fixed support for HTML entities with numbers in name (e.g. ¾, ¹)
Fixes #46
Fixes ArrayIndexOutOfBoundsException on response with empty headers
Implemented Node.clone() to create deep, independent copies of Nodes, Elements, and Documents.
Fixes #47
Testcase to confirm doctypes get cloned
Fixed absolute URL generation from relative URLs which are only query strings.
Fixes #49
Output format tweak
Added :not() selector, to find elements that do not match the selector. E.g. div:not(.logo) finds divs that
do not have the "logo" class name.
Fixes #36
Added Elements.not(selector) method, to remove undesired results from selector results.
Changelog update in launch preparation
Changelog tweak
[maven-release-plugin] prepare release jsoup-1.4.1
[maven-release-plugin] prepare for next development iteration
added .clone() for Elements
Initial add of new generation selectors(faster than original)
Added attribute selectors
added AttrSelector.AttrNamePrefixSelector
fix bug in element selector: incorrect behavior on multiple classes
removing as exists matching evaluator classes
changes wrt existing Evaluator class
Selectors update
removing boxing/unboxing
evaluators made public
Adding evaluators tests
added new tests
Renaming in some selectors
adding new ctors to And and Or
adding toString()
added base container
added .toString() to basic Evaluators
Adding Selector parser
removing char boxing
adding empty and .addAll
implemented :has selector
Working parser except the root node selector.
Added basic tests
removed unused constructor
parser update: normal order of selectors
fix non-void element parsing such as <a href=/https/github.com/link/>link text</a>
Evaluator.match(Element test) ->
Evaluator.match(Element root, Element test)
change
Character -> char change
restored all tests.
added RootSelector
updated tree selectors wrt subtree matching
update evaluator wrt subtree matching
Added RootSelector support
small optimizations
added javadocs
Added javadocs for Evaluators.
Updated tests.
Updated parser
Fixed issue when using descendant regex attribute selectors.
Fixes #52
Added a test to confirm combinators don't match in balanced contains queries
Updated OutputSettings inside of Document to be a static inner class.
This addresses the issue with using jsoup and scala 2.8 discussed here:
http://groups.google.com/group/jsoup/browse_thread/thread/3f7ec2fa41dfb87f
This change should be safe to commit and won't make a backwards incompatible
change to the public interface of jsoup (you can always reference a static member
via a non-static path) for any existing users. A recompilation of their code
won't even be necessary.
All tests continue to pass for me after this change.
Fixed tokeniser optimisation when scanning for missing data element close tags.
Fixes #67
There are no valid (x)html tags that start with numbers
Reverted changes that only allow empty tags in pre-defined instances.
Markup like <tag /> needs to be parsed as an empty element.
Integrated new single-pass selector evaluators, contributed by knz (Anton Kazennikov).
Removed com.sun.xml.internal.ws.util.StringUtils to fix https://github.com/jhy/jsoup/issues/#issue/69
"jsoup/src/main/java/org/jsoup/select/selectors/AndSelector.java:[8,35] package com.sun.xml.internal.ws.util does not exist"
Force strict entity matching (must be &xxx; and not &xxx) in element attributes.
Fixes #71
Ensure that Jsoup.Connect handles relative redirects in cases where the
underlying HTTP stack doesn't automatically follow them.
Fixes #73
Allow Jsoup.Connect to parse application/xml and application/xhtml+xml responses.
Fixes #72
Defined U (underline) element as an inline tag.
Cleanup of selector class files
Updated Jsoup.Connection so that cookies set on a redirect response will be included on the redirected request and response.
Prevent infinite redirection loops in jsoup.connect.
Implemented TextNode.splitText
Moved .wrap, .before, and .after from Element to Node for flexibility. Overriding implementations in Element still return Element.
Don't run URL connectivity tests by default.
Added ability to change an element's tag with Element.tagName(String), and to change many at once with Elements.tagName(String).
Test to confirm that abs URL method works on img src attributes.
Generify empty child list.
Removed redundant empty array
Changelog updates
Readme update
[maven-release-plugin] prepare release jsoup-1.5.1
[maven-release-plugin] prepare for next development iteration
Fixed issue with selector parser where some boolean AND + OR combined queries (e.g. "meta[http-equiv], meta[content]") were being parsed incorrectly as OR only queries (e.g. former as "meta, [http-equiv], meta[content]")
Fixed issue where a content-type specified in a meta tag may not be reliably detected, due to the above issue.
Allow <a> and <font> elements to be treated as flow/block elements, to match browser parse trees.
Updated copyright date
Updated Element.text() method to ensure <br> tags output as whitespace.
Tweaked Element.outerHtml() method to not generate initial newline on first output element.
Have <br> output as " " for Element.ownText()
Testcase for bug #63.
Fixes #63
Test to ensure that charset detection from <meta> tag works when preceded by an irrelevant <meta> tag.
Changelog prep for 1.5.2
[maven-release-plugin] prepare release jsoup-1.5.2
[maven-release-plugin] prepare for next development iteration
Reimplementation of parser and tokeniser, to make jsoup an HTML5 conformant parser, against the
http://whatwg.org/html spec.
Added test to verify that solidus as end of unquoted attribute in tag is handled as part of attribute, and
not a self-closing tag, which was the old behaviour of jsoup.
Fixes #66
Added test to confirm that tbody in span does not create a new table.
Fixes #64
Improved "abs:" absolute URL handling in Elements.attr("abs:href") and Node.hasAttr("abs:href").
Fixes #97
Fixed issue in TokeniserState where the tokeniser could get trapped at EOF if in RCDataEndTag state.
Fixed cookie handling issue in jsoup.Connect where empty cookies would cause a validation exception.
Fixes #87
Allow 400-500 errors and response with no content-type to be parsed.
Documentation and test cases for jsoup.Connect ignoreHttpErrors and ignoreContentType options.
Cleanup datastream test.
Handle unclosed <title> tags in document by breaking out of the title at the next start tag, instead of
eating up to the end of the document.
Fixes #82
Added OSGi bundle support to the jsoup package jar.
Fixes #98
Test for #101
JavaDoc update
Node.after and Node.before(node) change log
Test to confirm that <textarea>s do not get implicit <form> tags with the new parser.
Verifies #102
Added Node.unwrap() and Elements.unwrap(), to remove a node but keep its contents.
Fixes #100
JavaDoc usage example
[maven-release-plugin] prepare release jsoup-1.6.0
[maven-release-plugin] prepare for next development iteration
Marked release date.
Fix Java 1.5 compatibility
Specify Felix Maven plugin version
Javadoc update for DescendableLinkedList
Fixes #103
Fix an incorrect case fall-through, and add some not-null validations to prevent warnings.
For #103
Changelog for Java 1.5 fix
Changelog correction
Fixed an issue when parsing <script> tags.
When in body, the tokeniser wouldn't switch to the InScript state, which meant that data in a <script> wouldn't parse correctly.
Fixes #104
Refactor of script and rawtext end tag name states.
Fixed CharacterReader to handle unconsuming at EOF correctly. Additional <script> test at EOF.
Fixed an issue with a missing quote when serialising DocumentType nodes.
Fixes #109
Fixed issue where a single 0 character was lexed incorrectly as a null character.
Fixes #107
Fixed normalisation of carriage returns to newlines on input HTML
Fixes #110
Disabled memory mapped files
Disabled memory mapped files when loading files from disk, to improve compatibility in Windows environments.
Release date
Updated changelog
Updated POM for github push
[maven-release-plugin] prepare release jsoup-1.6.1
[maven-release-plugin] prepare for next development iteration
Javadoc typo fix.
Follow POST redirects as GETs
Fixes #120
Optionally preserve related links in elements when cleaning
Fixed handling of null characters within comments.
Fixes #121
Added jsoup.connect.cookies(Map) method, to set multiple cookies at once, possibly from a prior request.
Tweaked escaped entity detection in attributes to not treat &entity_... as an entity form.
Updated the Cleaner to support custom allowed protocols such as "cid:" and "data:".
Fixes #127
Fixed doctype tokeniser to allow whitespace between name and public identifier.
Tweaked HTML output of closing script and style tags to not add an extraneous newline when pretty-printing.
Tweaked Element#select documentation to reinforce CSS selector syntax.
Added Element.textNodes() and Element.dataNodes(), to easily access an element's children text nodes and data nodes.
Implemented an example HTML to plain-text converter.
Updated changelog for HtmlToPlainText example
Added documentation for NodeVisitor and NodeTraversor.
Corrected documentation of NodeTraversor to reflect depth-first order of node visitation.
Added Node.traverse() and Elements.traverse() methods, to iterate through a node's descendants.
Made Evaluator constructor public to allow custom implementations
Act on only the first base href in parse.
And make node.setBaseUri() recurse down to descendants.
First draft of a simple XML treebuilder / parser.
This provides an alternative to the HTML5 parser which enforces HTML
semantics on the parsed input. The simple XML parser has no understanding
of HTML, and will parse the input as-is into a DOM.
Added test coverage for XML parser.
Allow an alternate parser to be supplied for core use cases.
Fixed URL tests.
Fixed invocation of alternative parser in Jsoup.Connection.
Updated test to confirm.
Added test for invocation of alternate parser when loading from file input stream.
Added support to optionally keep track of errors while tokenising and tree-building.
Test XML (un)known self-closing behaviour.
Changelog for XML and error tracking.
Change what is considered "whitespace"
Changelog and code tweak for whitespace test.
Fixes issue #126 in jsoup, where comments inside table were replicated inside body
Changelog and test for pull #165 (comments in tables).
Fixes #126
Updated parser error tracking to cap the max size of errors tracked. Defaults to 0 (disabled).
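A capped error list with a default of 0 (tracking disabled) can be sketched as below. `ErrorTrackerSketch` and its method names are hypothetical and only illustrate the capping idea described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: track at most maxErrors messages; 0 disables tracking.
final class ErrorTrackerSketch {
    private final int maxErrors;
    private final List<String> errors = new ArrayList<>();

    ErrorTrackerSketch(int maxErrors) {
        this.maxErrors = maxErrors;
    }

    // Records the error only while there is room; returns whether it was kept.
    boolean track(String message) {
        if (errors.size() >= maxErrors) {
            return false;
        }
        return errors.add(message);
    }

    List<String> errors() {
        return Collections.unmodifiableList(errors);
    }

    public static void main(String[] args) {
        ErrorTrackerSketch tracker = new ErrorTrackerSketch(2);
        tracker.track("first");
        tracker.track("second");
        tracker.track("third");                      // dropped: the cap was reached
        System.out.println(tracker.errors().size()); // 2
    }
}
```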
Make ParseError public.
Drop BOM at start of byte data if present after decode.
Fixes #134.
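Java's UTF-8 decoder keeps a leading byte order mark as the character U+FEFF, so it has to be dropped after decoding. A minimal sketch, with `BomSketch` as a hypothetical name and UTF-8 assumed as the charset:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: decode bytes and drop a leading BOM (U+FEFF) if present.
final class BomSketch {
    private static final char BOM = '\uFEFF';

    static String decodeDroppingBom(byte[] data) {
        String decoded = new String(data, StandardCharsets.UTF_8);
        if (!decoded.isEmpty() && decoded.charAt(0) == BOM) {
            return decoded.substring(1);
        }
        return decoded;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'p', '>'};
        System.out.println(decodeDroppingBom(withBom)); // <p>
    }
}
```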
Correctly handle content (ignore it) after frameset end.
Fixes #162
Reduced memory pre-allocation in Node.outerHtml from 32KB to 128B, to reduce memory pressure.
Fixes #143.
Changelog release prep
POM update for git url syntax
[maven-release-plugin] prepare release jsoup-1.6.2
[maven-release-plugin] prepare for next development iteration
Tidied up readme, for github presentation.
Fixed parsing of group-or commas in CSS selectors.
Fixes #179
Updated license date
Fixed precedence parsing of group OR (,) in CSS selectors.
Added tests, and repaired cheekily / hastily / incorrectly modified test.
If a node has no parent, return null on previousSibling and nextSibling instead of throwing a null pointer exception.
Fixes #184
Correct Elements.not() javadoc.
Fixes #177
Fixed HTML entity parser to correctly parse entities like frac14 (letter + number combo).
Fixes #145
Fixed issue where contents of a script tag within a comment could be incorrectly parsed.
Fixes #115
Fixed GAE support: load HTML entities from a file on startup, instead of embedding in the class.
Fix NPE in style fragment parse
Fixes #189
Fixed issue with :all pseudo-tag in HTML sanitizer when cleaning tags
Fixes #156
NPE changelog
Fixed NPE in Parser.parseFragment() when context parameter is null.
Fixes #195.
In HTML whitelists, when defining allowed attributes for a tag, automatically add the tag to the allowed list.
Fixes #192.
Copy .properties files, to fix mvn build of Entities.
Fixes #203
Fixed missing <td> in javadoc
Spelling.
Correct date for the last release.
Release prep.
[maven-release-plugin] prepare release jsoup-1.6.3
[maven-release-plugin] prepare for next development iteration
Reflect actual release date.
Javadoc align headers left
Use English-locale settings when uppercasing charset.
Fixes a problem with Turkish characters, see: http://mattryall.net/blog/2009/02/the-infamous-turkish-locale-bug
Set Locale.ENGLISH when running upper/lowercase methods, to ensure locale independence.
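The Turkish locale problem referenced above is easy to reproduce with the JDK alone: under a Turkish locale, a lowercase "i" uppercases to the dotted capital "İ" (U+0130), so charset names such as "iso-8859-1" stop matching. A small sketch, with `LocaleSketch` as a hypothetical helper:

```java
import java.util.Locale;

// Hypothetical sketch: locale-independent uppercasing of a charset name.
final class LocaleSketch {
    static String upperCaseCharset(String charset) {
        return charset.toUpperCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // With a Turkish locale, 'i' becomes U+0130 rather than 'I'.
        System.out.println("iso-8859-1".toUpperCase(turkish).equals("ISO-8859-1")); // false
        System.out.println(upperCaseCharset("iso-8859-1"));                         // ISO-8859-1
    }
}
```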
Fixed whitespace preservation in <textarea> tags.
Fixes #167.
In jsoup.connect, fail faster if the return content type is not supported.
Fixes #153.
In jsoup.clean, allow custom OutputSettings, to control pretty printing, character set, and entity escaping.
Fixes #148
Use a StringBuilder to accumulate attribute values.
When initially implemented I expected attribute values to normally be read in one sweep,
removing the need to accumulate values. However profiling shows that attributes are
often accumulated, and so the string concatenation implementation was very slow.
Using a StringBuilder here reduces parse times by more than 50%.
Reuse StringBuilders
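The accumulation and reuse pattern from the two commits above can be sketched as follows. `AttributeValueSketch` is a hypothetical class: chunks are appended to one StringBuilder rather than concatenated into new Strings, and the builder is reset with `setLength(0)` so it can be reused for the next value.

```java
// Hypothetical sketch: accumulate an attribute value in a reusable StringBuilder.
final class AttributeValueSketch {
    private final StringBuilder pending = new StringBuilder();

    // Appends a chunk without allocating an intermediate String per chunk.
    void append(String chunk) {
        pending.append(chunk);
    }

    // Returns the accumulated value and resets the builder for reuse.
    String finish() {
        String value = pending.toString();
        pending.setLength(0); // keeps the allocated buffer, drops the content
        return value;
    }

    public static void main(String[] args) {
        AttributeValueSketch value = new AttributeValueSketch();
        value.append("foo");
        value.append("&");
        value.append("bar");
        System.out.println(value.finish()); // foo&bar
    }
}
```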
No longer strip \r before parsing.
This saves memory and CPU time at start of parse.
Changelog for performance improvement.
Make copies of all strings returned, vs returning pointers to substrings of input.
The former method is a little faster, and creates less garbage. However when the input is large,
and apps retain data pulled from the DOM, the app may perceive a memory leak, as even a small string
is actually as large as the original input (although multiple strings are all backed by the one
original input).
So, this implementation is a little less performant, but has a potential for greater safety,
depending on how the library is used.
Replaced Strings with char array in CharacterReader, for well improved parse times.
Faster to scan, and less garbage created.
Only create attribute objects for end tag tokens when required.
Saves a bit of GC time.
Don't create Iterator objects in these tight Evaluator loops.
Saves a fair bit of GC time when selecting.
To save GC time in select, don't wrap childnodes in unmodifiable list.
Check for null body, possible in framesets.
Fixes #154
Changelog catch-up.
Javadoc fix
Upgrade to maven-resources-plugin version 2.4
Eclipse m2e doesn't work with the old version 2.3.
Updated Maven pom.xml to specify jar plugin version.
Fixed an issue when normalising whitespace for strings containing high-surrogate characters.
Don't throw an exception if no content-type specified.
Fixes #213
If the charset from the server is not supported, fall back to UTF-8.
Fixes #215
Remove unnecessary synchronisation in Tag.valueOf
Fixes #238
Micro-optimised Tag.valueOf
Only lower-case and trim tag names if their original form is not found
in the registered tags. Shaves parse time for the bulk of cases where
the tag is already in lowercase.
In DataUtil, check if body length > 0 before looking at docData
Prevents a string out of bounds exception.
Fixes #230
Refactored entity decoding.
Modified the heuristic entity decoder to be less greedy; does not
repeatedly chomp down the string until a match is found, and requires a
semicolon terminator for extended entities.
Updated Entities to use the entity decoder in Tokeniser, vs the legacy
decoder.
Fixes #224.
Extra wrap/unwrap tests
Whitespace normalise document.title() output.
Fixes #168
Tidy up javadoc one-liners for Element
Introduced finer granularity to Jsoup.connect exceptions.
Fixes #229
Don't run URL tests by default.
Confirm cleans Russian characters OK
[maven-release-plugin] prepare release jsoup-1.7.1
[maven-release-plugin] prepare for next development iteration
Refactored the Cleaner to traverse rather than recurse child nodes
Avoids the risk of overflowing the stack
Fixes #246
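Replacing recursion with iteration, as described above, keeps stack depth constant no matter how deeply the input nests. The sketch below uses an explicit `ArrayDeque` as the work stack; that is one standard way to do it and is an assumption here, not necessarily the technique the Cleaner uses. `TraverseSketch` and its `Node` type are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: depth-first, pre-order traversal without recursion.
final class TraverseSketch {
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    static List<String> visitOrder(Node root) {
        List<String> order = new ArrayList<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            order.add(node.name);
            // Push children in reverse so the first child is visited first.
            for (int i = node.children.size() - 1; i >= 0; i--) {
                stack.push(node.children.get(i));
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Node div = new Node("div").add(new Node("span"));
        Node body = new Node("body").add(new Node("p")).add(div);
        Node html = new Node("html").add(body);
        System.out.println(visitOrder(html)); // [html, body, p, div, span]
    }
}
```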
Typo in Doctype node
When parsing in XML mode, preserve XML declarations (<?xml ... ?>).
Fixes #242
Allow Whitelist test methods to be extended
Fixes #85
Javadoc typo
Fixed an issue when parsing <textarea>/RCData tags containing unescaped closing tags that would drop the trailing >.
Added a maximum body response size to Jsoup.Connection
The Connection API is no longer in beta.
Modified maxBodySize to truncate at precise max.
Rather than previous implementation which was up to the internal buffer
size (130K) larger.
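One way to truncate at the precise maximum is to cap each read at the number of bytes still allowed, rather than checking the limit only after a full buffer has been read. The sketch below assumes a hypothetical helper `readUpTo` and a 4096-byte buffer; it does not reproduce jsoup's own reading code or its handling of special limit values.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration; not jsoup's DataUtil.
final class BoundedReadSketch {
    static byte[] readUpTo(InputStream in, int maxSize) {
        byte[] buffer = new byte[4096];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int remaining = maxSize;
        try {
            while (remaining > 0) {
                // Never ask for more than the bytes still allowed, so the
                // result cannot overshoot maxSize by up to a buffer's worth.
                int read = in.read(buffer, 0, Math.min(buffer.length, remaining));
                if (read == -1) {
                    break; // end of stream before the cap was reached
                }
                out.write(buffer, 0, read);
                remaining -= read;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        InputStream body = new ByteArrayInputStream(new byte[10_000]);
        System.out.println(readUpTo(body, 5_000).length);
    }
}
```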
Corrected the javadoc for Element#child() to note that it throws IndexOutOfBounds.
Fixes #277
Added test to verify absolute URLs for file:// URIs
Fixes #255
Added Element.insertChildren
Also tidied up JavaDoc, and returned Node.childNodes to an unmodifiable
list.
Fixes #239 (with alternative implementation)
Added Node.childNodesCopy()
Don't clone the element's classnames
Fixes #278
Limit how far up the stack the formatting adoption agency algorithm will travel
Fixes #234
Modified Element.text() to build text by traversing child nodes rather than recursing.
Fixes #271
Introduced Parser.parseXmlFragment(), to allow easy parsing of XML fragments.
Updated copyright year.
Support ins, del and s tags inline
Tweaked koz's changes in merge prep.
Adds outline mode to Document.OutputSettings.
Fixes #274
Fixed overzealous indenting in outerHtmlTail
When parsing, allow all tags to self-close.
Tags that aren't expected to self-close will get an end tag.
Fixes #2458
changed return type of Tokeniser.consumeCharacterReference from Character to char[], and also changed TokeniserState accordingly
changed Entities.escape to escape String with supplementary characters correctly, and added two test cases to verify
added test cases to verify supplementary characters can be used as attribute name and value, as well as be selected by selector
added a containsOwn test
fixed incorrect code copy-and-paste
Tweaked mingfai's surrogate pair implementation for efficiency.
On the core cases where characters are not surrogate pairs, I've kept
to pushing chars around. This is to try and minimize the number of
short-lived String objects.
Output escape codes in hex instead of decimal.
For characters that need to be escaped, this produces a smaller more
legible escape code, and Unicode chars are generally described in hex,
not dec.
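As a small illustration of the difference, the sketch below emits hexadecimal numeric character references for non-ASCII code points. The class and method names are hypothetical, and escaping everything at or above U+0080 is an assumption made for the example, not a description of which characters Entities.escape actually escapes.

```java
// Hypothetical escaper for illustration; not jsoup's Entities class.
final class HexEscapeSketch {
    static String escapeNonAscii(String input) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            // codePointAt joins a surrogate pair into one code point.
            int codePoint = input.codePointAt(i);
            if (codePoint < 0x80) {
                sb.appendCodePoint(codePoint);
            } else {
                // Hex form, e.g. U+00E9 becomes &#xe9; rather than &#233;
                sb.append("&#x").append(Integer.toHexString(codePoint)).append(';');
            }
            i += Character.charCount(codePoint);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("caf\u00e9"));
        System.out.println(escapeNonAscii("\uD83D\uDE00"));
    }
}
```

Because the hex digits match the usual U+XXXX notation, the escaped output can be checked against a Unicode chart without converting bases.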
Output valid hex escapes this time.
Changelog for supplementary characters
Added support for css pseudo classes
:first-child, :last-child, :nth-child, :nth-last-child,
:first-of-type, :last-of-type, :nth-of-type, :nth-last-of-type,
:only-child, :only-of-type, :empty,
:root
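The :nth-* selectors take an an+b expression, and an element at 1-based position pos matches when pos = a*n + b for some integer n >= 0. A minimal sketch of that arithmetic is below; the class and method names are hypothetical and the code is not taken from jsoup's evaluators.

```java
// Hypothetical matcher for the an+b arithmetic; illustration only.
final class NthMatchSketch {
    // pos is the 1-based position of the element among its siblings
    // (or among siblings of the same type for the *-of-type variants).
    static boolean matches(int a, int b, int pos) {
        if (a == 0) {
            return pos == b; // plain :nth-child(b)
        }
        int diff = pos - b;
        // diff must be a multiple of a, and diff / a must not be negative.
        return diff * a >= 0 && diff % a == 0;
    }

    public static void main(String[] args) {
        // 2n+1 selects the odd positions.
        for (int pos = 1; pos <= 5; pos++) {
            System.out.println(pos + " -> " + matches(2, 1, pos));
        }
    }
}
```

With a = -1 and b = 3 (written -n+3), the same check selects only the first three positions.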
Changelog for structural pseudo selectors
Note the Jsoup.Connect default max size
[maven-release-plugin] prepare release jsoup-1.7.2
[maven-release-plugin] prepare for next development iteration
Changelog 1.7.2 release date
Test that hostless URIs resolve to absolute URLs correctly.
Removed code duplication in data end tag handlers
Reduce code dupes for ScriptDataDoubleEscapeStart and ScriptDataDoubleEscapeEnd
First pass at a FormElement
The FormElement extends Element to provide ready access to a form's
controls, and to allow the form to be submitted. It also connects forms
to their controls in situations when the DOM tree created does not have
the form element be a parent of the control, like when the form tag is
in a TR but the control in a TD. In that case the form tag gets
reparented.
FormElement changelog note
Don't auto submit buttons
Because they weren't clicked on, we shouldn't submit their values. Do
need to find a convenient way to simulate a button press.
Added a forms() convenience method to Elements
This allows one to get at FormElements without casting.
1 parent b25eb52 commit 24080dd
File tree
102 files changed, +29908 -1 lines changed
- src
  - main
    - javadoc
    - java/org/jsoup
      - examples
      - helper
      - nodes
      - parser
      - safety
      - select
  - test
    - java/org/jsoup
      - helper
      - integration
      - nodes
      - parser
      - safety
      - select
    - resources/htmltests