Python 3.0: tokenize & BytesIO

When tokenizing a string in Python 3.0, why do I get an 'utf-8' token before the real tokens start?

According to the Python 3 docs, tokenize should now be used like this:

g = tokenize(BytesIO(s.encode('utf-8')).readline)


However, when I try this, the following happens in the interpreter:

>>> from tokenize import tokenize
>>> from io import BytesIO
>>> g = tokenize(BytesIO('foo'.encode()).readline)
>>> next(g)
(57, 'utf-8', (0, 0), (0, 0), '')
>>> next(g)
(1, 'foo', (1, 0), (1, 3), 'foo')
>>> next(g)
(0, '', (2, 0), (2, 0), '')
>>> next(g)


What about that 'utf-8' marker that precedes the other tokens? Is it always going to be there? If so, should I just skip the first token?

[edit]

I found that token type 57 is tokenize.ENCODING, which can easily be filtered out of the token stream if needed.
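For reference, a minimal sketch of that filtering (assuming a current Python 3, where tokenize yields TokenInfo named tuples with a .type attribute):

```python
from io import BytesIO
from tokenize import tokenize, ENCODING, NAME

source = b"foo = 1\n"

# Drop the leading ENCODING token, keep everything else.
tokens = [tok for tok in tokenize(BytesIO(source).readline)
          if tok.type != ENCODING]

# The first remaining token is now the NAME token for 'foo'.
print(tokens[0].type == NAME and tokens[0].string == "foo")  # True
```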


1 answer


This is the source encoding cookie. You can specify it explicitly in the source:

# -*- coding: utf-8 -*-
do_it()

Otherwise, Python assumes the default encoding, utf-8 in Python 3.
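If you only want the detected encoding without the rest of the tokens, the stdlib also exposes tokenize.detect_encoding (available since Python 3.0), which reads the coding cookie or falls back to utf-8. A small sketch:

```python
from io import BytesIO
from tokenize import detect_encoding

# A file with an explicit coding cookie...
with_cookie = BytesIO(b"# -*- coding: latin-1 -*-\nx = 1\n")
encoding, _ = detect_encoding(with_cookie.readline)
print(encoding)  # iso-8859-1 (the normalized name for latin-1)

# ...and one without: Python 3 falls back to utf-8.
plain = BytesIO(b"x = 1\n")
default, _ = detect_encoding(plain.readline)
print(default)  # utf-8
```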
