pdf2markdown #200

Open
Cheryl33990 opened this issue Jan 15, 2025 · 0 comments

Hello, I followed the project's instructions to run Document Content Extraction,
using the steps shown at https://pdf-extract-kit.readthedocs.io/en/latest/project/pdf_extract.html.

The JSON output is produced correctly, but the Markdown output step fails with a UnicodeEncodeError:
UnicodeEncodeError: 'cp950' codec can't encode character '\u5706' in position 78: illegal multibyte sequence
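Testing with the demo data shows that the Chinese text is what triggers the error. My guess is that the output needs to be written as UTF-8, but I have not managed to debug it yet,
so I would like to ask whether there is a known fix. Thank you!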
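
For reference, a minimal sketch of the UTF-8 idea, assuming the Markdown file is written with Python's built-in `open()` and no explicit `encoding`, so that Windows falls back to the locale code page (cp950 here). The helper `write_markdown` and the file name `demo.md` are hypothetical and are not part of the project's code; where the project actually opens its output file has not been confirmed.

```python
import os
import tempfile


def write_markdown(path: str, text: str) -> None:
    """Hypothetical helper: write Markdown text with an explicit UTF-8 encoding.

    Without encoding="utf-8", open() uses the locale encoding, which is
    cp950 on a Traditional Chinese Windows installation.
    """
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


# The character from the traceback, U+5706 (simplified form), next to its
# Traditional counterpart U+5713.
sample = "\u5713 / \u5706\n"

# Reproduce the failure in isolation: cp950 cannot encode U+5706.
try:
    "\u5706".encode("cp950")
    cp950_ok = True
except UnicodeEncodeError:
    cp950_ok = False

# Writing with UTF-8 succeeds and the text survives a round trip.
out_path = os.path.join(tempfile.mkdtemp(), "demo.md")
write_markdown(out_path, sample)
with open(out_path, encoding="utf-8") as f:
    round_trip = f.read()
```

If the `open()` call cannot be edited, setting the environment variable `PYTHONUTF8=1` before running the script (Python 3.7 or later) should make UTF-8 the default for `open()`; whether that is enough for this project is an assumption that still needs testing.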

(Also, does using Traditional Chinese affect the result?)
