Commit Graph

412 Commits

Author SHA1 Message Date
Nick Wellnhofer
1b6362ea44 io: Fix memory lifetime issue with input buffers
xmlParserInputBufferCreateMem must make a copy of the buffer.

This fixes a regression from 2.11 which could cause reads from freed
memory depending on the use case.

Undeprecate xmlParserInputBufferCreateStatic which can avoid copying
the whole buffer.
2023-12-12 23:58:33 +01:00
Nick Wellnhofer
8672bf253b html: Reenable buggy detection of XML declarations
Switch to UTF-8 if a document starts with '<?xm' to match old behavior.
Also enable this check in the push parser.

Fixes #637.
2023-12-01 17:26:47 +01:00
Nick Wellnhofer
b9db3d7d02 parser: Simplify xmlStringCurrentChar
Start to move away from using this function.
2023-09-22 19:01:11 +02:00
Nick Wellnhofer
8c084ebdc7 doc: Make apibuild.py happy 2023-09-21 22:57:33 +02:00
Nick Wellnhofer
c5890716a6 html: Fix logic in htmlAutoClose
Note that the function is never called with a NULL newtag.

Fixes #591.
2023-09-21 17:01:35 +02:00
Nick Wellnhofer
9b5cce7a71 include: Remove more unnecessary includes 2023-09-21 01:50:53 +02:00
Nick Wellnhofer
11a1839ddd globals: Move remaining globals back to correct header files
This undoes a lot of damage.
2023-09-20 22:06:49 +02:00
Nick Wellnhofer
4e1c13ebfd debug: Remove debugging code
This is barely useful these days and only clutters the code base.
2023-09-19 17:35:09 +02:00
Nick Wellnhofer
e48f2695fe parser: Remove push parser debugging code 2023-08-29 18:17:09 +02:00
Nick Wellnhofer
0d24fc0a47 html: Remove encoding hack in htmlCreateFileParserCtxt
Switch encoding directly instead of calling htmlCheckEncoding with faked
content.
2023-08-14 12:53:49 +02:00
Nick Wellnhofer
5db5a704eb html: Fix UAF in htmlCurrentChar
Short-lived regression found by OSS-Fuzz.
2023-08-09 18:40:25 +02:00
Nick Wellnhofer
95e81a360c parser: Decode all data in xmlCharEncInput
Even with flush set to true, xmlCharEncInput didn't guarantee to decode
all data. This complicated the push parser.

Remove the flush flag and always decode all available data.

Also fix ICU code where the flush flag has a different meaning. Always
set flush to false and retry even with empty input buffers.
2023-08-08 15:21:31 +02:00
Nick Wellnhofer
834b8123ef parser: Stream data when reading from memory
Don't create a copy of the whole input buffer. Read the data chunk by
chunk to save memory.

Historically, it was probably envisioned to read data from memory
without additional copying. This doesn't work reliably with the current
design of the XML parser which requires a terminating null byte at the
end of input buffers. This lead to xmlReadMemory interfaces, which
expect pointer and size arguments, being changed to make a
zero-terminated copy of the input buffer. Interfaces based on
xmlReadDoc, which actually expect a zero-terminated string and
would make zero-copy operation work, were then simplified to rely on
xmlReadMemoryi, resulting in an unnecessary copy.

To avoid copying (possibly gigabytes) of memory temporarily, we now
stream in-memory input just like content read from files in a
chunk-by-chunk fashion (using a somewhat outdated INPUT_CHUNK size of
250 bytes). As a side effect, we also avoid another copy of the whole
input when handling non-UTF-8 data which was made possible by some
earlier commits.

Interfaces expecting zero-terminated strings now make use of strnlen
which unfortunately isn't part of the standard C library and only
mandated since POSIX 2008.
2023-08-08 15:21:28 +02:00
Nick Wellnhofer
facc2a06da parser: Don't overwrite EOF parser state 2023-08-08 15:21:21 +02:00
Nick Wellnhofer
59fa0bb383 parser: Simplify input pointer updates
The base member always points to the beginning of the buffer.
2023-08-08 15:21:14 +02:00
Nick Wellnhofer
ec7be50662 parser: Rework encoding detection
Introduce XML_INPUT_HAS_ENCODING flag for xmlParserInput which is set
when xmlSwitchEncoding is called. The parser can use the flag to
reliably detect whether an encoding was already set via user override,
BOM or other auto-detection. In this case, the encoding declaration
won't be used to switch the encoding.

Before, an inscrutable mix of ctxt->charset, ctxt->input->encoding
and ctxt->input->buf->encoder was used.

Introduce private helper functions to switch encodings used by both the
XML and HTML parser:

- xmlDetectEncoding which skips over the BOM, allowing to remove the
  BOM checks from other encoding functions.
- xmlSetDeclaredEncoding, replacing htmlCheckEncodingDirect, which warns
  about encoding mismatches.

If users override the encoding, store the declared instead of the actual
encoding in xmlDoc. In this case, the actual encoding is known and the
raw value from the doc is more useful.

Also use the input flags to store the ISO-8859-1 fallback state.
Restrict the fallback to cases where no encoding was specified. (The
fallback is only useful in recovery mode and these days broken UTF-8 is
probably more likely than ISO-8859-1, so it might eventually be removed
completely.)

The 'charset' member of xmlParserCtxt is now unused. The 'encoding'
member of xmlParserInput is now unused.

The 'standalone' member of xmlParserInput is renamed to 'flags'.

A new parser state XML_PARSER_XML_DECL is added for the push parser.
2023-08-08 15:19:46 +02:00
Nick Wellnhofer
3a64f39448 html: Remove some debugging code in htmlParseTryOrFinish 2023-08-08 15:19:25 +02:00
Nick Wellnhofer
20f5c73457 parser: Recover more input from encoding errors
Don't halt the parser in xmlParserGrow to allow more input to be
recovered in case of encoding errors.

Fixes #543.
2023-06-07 14:05:34 +02:00
Nick Wellnhofer
320f5084cd parser: Improve handling of encoding and IO errors
Make sure that xmlCharEncInput, xmlParserInputBufferPush and
xmlParserInputBufferGrow set the correct error code in the
xmlParserInputBuffer. Handle errors when calling these functions.
2023-04-30 21:31:54 +02:00
Nick Wellnhofer
1061537efd malloc-fail: Fix buffer overread with HTML doctype declarations
Found by OSS-Fuzz, see #344.
2023-03-26 22:42:13 +02:00
Nick Wellnhofer
7fbd454d9f parser: Grow input buffer earlier when reading characters
Make more bytes available after invoking CUR_CHAR or NEXT.
2023-03-21 21:35:53 +01:00
Nick Wellnhofer
04d1bedd8c parser: Rework shrinking of input buffers
Don't try to grow the input buffer in xmlParserShrink. This makes sure
that no memory allocations are made and the function always succeeds.

Remove unnecessary invocations of SHRINK. Invoke SHRINK at the end of
DTD parsing loops.

Shrink before growing.
2023-03-21 13:19:18 +01:00
Nick Wellnhofer
44ecefc8cc malloc-fail: Fix buffer overread after htmlParseScript
Found by OSS-Fuzz, see #344.
2023-03-20 15:53:42 +01:00
Nick Wellnhofer
067986fa67 parser: Fix regressions from previous commits
- Fix memory leak in xmlParseNmtoken.
- Fix buffer overread after htmlParseCharDataInternal.
2023-03-18 16:51:40 +01:00
Nick Wellnhofer
9ef2a9abf3 html: Rely on CUR_CHAR to grow the input buffer
- Remove useless invocations of GROW.
- Add some error checks.
- Fix invocations of SHRINK.
2023-03-17 14:14:04 +01:00
Nick Wellnhofer
62f199ed7d malloc-fail: Add error check in htmlParseHTMLAttribute
This function must return NULL is an error occurs.

Found by OSS-Fuzz, see #344.
2023-03-17 12:40:46 +01:00
Nick Wellnhofer
8090e58564 malloc-fail: Fix buffer overread in htmlParseScript
Found by OSS-Fuzz, see #344.
2023-03-17 12:27:07 +01:00
Nick Wellnhofer
ca2bfecea9 malloc-fail: Fix buffer overread when reading from input
Found by OSS-Fuzz, see #344.
2023-03-15 17:34:32 +01:00
Nick Wellnhofer
4b3452d171 html: Fix quadratic behavior in htmlParseTryOrFinish
Fix check for end of script content.

Found by OSS-Fuzz.
2023-03-15 17:02:46 +01:00
Nick Wellnhofer
14c62e0dd3 html: Use NEXTL in htmlParseHTMLAttribute
This is more efficient than NEXT.
2023-03-15 17:02:46 +01:00
Nick Wellnhofer
2099441f32 parser: Stop calling xmlParserInputShrink
Introduce xmlParserShrink which takes a parser context to simplify error
handling.
2023-03-13 17:51:13 +01:00
Nick Wellnhofer
cabde70f8b parser: Simplify calculation of available buffer space 2023-03-12 19:07:23 +01:00
Nick Wellnhofer
b75976e029 parser: Use size_t when subtracting input buffer pointers
Avoid integer overflows.
2023-03-12 19:06:19 +01:00
Nick Wellnhofer
9a6ca81612 parser: Check for integer overflow when updating checkIndex
Unfortunately, checkIndex is a long, not a size_t. Check for integer
overflow before updating the value.
2023-03-12 19:03:11 +01:00
Nick Wellnhofer
bd63d730b8 html: Impose some length limits
Impose length limits on names, attribute values, PIs and comments,
similar to the XML parser.
2023-03-12 17:40:55 +01:00
Nick Wellnhofer
3eb6bf0386 parser: Stop calling xmlParserInputGrow
Introduce xmlParserGrow which takes a parser context to simplify error
handling.
2023-03-12 17:05:51 +01:00
Nick Wellnhofer
53d1cc98cf malloc-fail: Fix error code in htmlParseChunk
Found with libFuzzer, see #344.
2023-02-17 17:18:51 +01:00
Nick Wellnhofer
15b0ed0815 malloc-fail: Fix infinite loop in htmlParseDocTypeDecl
Found with libFuzzer, see #344.
2023-02-17 17:18:47 +01:00
Nick Wellnhofer
041789d9ec malloc-fail: Fix null deref in htmlnamePush
Found with libFuzzer, see #344.
2023-02-17 17:18:43 +01:00
Nick Wellnhofer
0ec9c91064 malloc-fail: Fix infinite loop in htmlParseStartTag
Found with libFuzzer, see #344.
2023-02-17 17:18:38 +01:00
Nick Wellnhofer
04c2955197 malloc-fail: Fix infinite loop in htmlParseContentInternal
Found with libFuzzer, see #344.
2023-02-17 17:18:34 +01:00
Nick Wellnhofer
f3e62035d8 malloc-fail: Fix memory leak in htmlCreatePushParserCtxt
Found with libFuzzer, see #344.
2023-02-17 17:18:29 +01:00
Nick Wellnhofer
fc256953d2 malloc-fail: Fix memory leak in htmlCreateMemoryParserCtxt
Found with libFuzzer, see #344.
2023-02-17 17:18:25 +01:00
Nick Wellnhofer
643b4e90eb malloc-fail: Fix infinite loop in htmlParseStartTag
Found with libFuzzer, see #344.
2023-02-17 17:16:52 +01:00
Nick Wellnhofer
59b3366178 error: Limit number of parser errors
Reporting errors is expensive and some abusive test cases can generate
an error for each invalid input byte. This causes the parser to spend
most of the time with error handling. Limit the number of errors and
warnings to 100.
2022-12-27 14:41:19 +01:00
Alex Richardson
4b959ee168 Remove hacky heuristic from b2dc5675e9
Checking whether the context is close to the parent context by hardcoding
250 is not portable (I noticed tests were failing on Morello since the value
is 288 there due to pointers being 128 bits). Instead we should ensure
that the XML_VCTXT_USE_PCTXT flag is not set in cases where the user data
is not actually a parser context (or ideally add a separate field but that
would be an ABI break.

From what I can see in the source, the XML_VCTXT_USE_PCTXT is only set if
the userData field points to a valid context, and if this is not the case
the flag should be cleared when changing userData rather than relying on
the offset between the two. Looking at the history, I think
d7cb33cf44 fixed most of the need for this
workaround, but it looks like there are a few more locations that need
updating; This commit changes two more places to set/clear/copy the
XML_VCTXT_USE_PCTXT flag, so this heuristic should not be needed anymore.
I've also drop two = NULL assignment in xmllint since this is not needed
after a call to memset().

There was also an uninitialized vctxt.flags (and other fields) in
`xmlShellValidate()`, which I've fixed by adding a memset() call.
2022-12-01 15:31:25 +00:00
Alex Richardson
c715ded086 Avoid creating an out-of-bounds pointer by rewriting a check
Creating more than one-past-the-end pointers is undefined behaviour in C
and while this code is unlikely to be miscompiled, I discovered that an
out-of-bounds pointer is being created using UBSan on a CHERI-enabled
system.
2022-12-01 15:30:12 +00:00
Nick Wellnhofer
c7a9b85cbb html: Improve parsing of nested lists
Allow ul/ol as immediate children of ul/ol. This is more in line with
the HTML5 spec.

Fixes #447.
2022-11-30 17:11:33 +01:00
Nick Wellnhofer
e414f82585 html: Fix htmlInitAutoClose documentation 2022-11-27 02:11:07 +01:00
Nick Wellnhofer
c93679381c html: Fix check for end of comment in push parser
Make sure to reset checkIndex. Handle case where "--" or "--!" is at the
end of the buffer. Fix "avail" check in htmlParseOrTryFinish.
2022-11-20 21:27:59 +01:00