Extract Python function source text from the source code string
Suppose I have valid Python source code, as a string:
code_string = """
# A comment.
def foo(a, b):
  return a + b
class Bar(object):
  def __init__(self):
    self.my_list = [
        'a',
        'b',
    ]
""".strip()
Objective: I would like to obtain the lines containing the source code of the function definitions, preserving whitespace. For the code string above, I would like to get the strings
def foo(a, b):
  return a + b
and
  def __init__(self):
    self.my_list = [
        'a',
        'b',
    ]
Or, equivalently, I'd be happy to get the line numbers of functions in the code string: foo spans lines 2-3, and __init__ spans lines 5-9.
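To make the "equivalently" concrete: once the first and last line numbers are known, the source text can be recovered by slicing the lines, which keeps the leading whitespace intact. A minimal sketch; the helper name extract_span is made up for illustration and the spans are the ones stated above.

```python
def extract_span(source, first_line, last_line):
    """Return lines first_line..last_line of source (1-indexed, inclusive)."""
    lines = source.splitlines()
    return "\n".join(lines[first_line - 1:last_line])


code_string = """
# A comment.
def foo(a, b):
  return a + b
class Bar(object):
  def __init__(self):
    self.my_list = [
        'a',
        'b',
    ]
""".strip()

foo_source = extract_span(code_string, 2, 3)
init_source = extract_span(code_string, 5, 9)
print(foo_source)
print(init_source)
```

So the real problem is only finding the last line of each function; the slicing itself is trivial.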
Attempts
I can parse the code string into its AST:
import ast
code_ast = ast.parse(code_string)
And I can find the FunctionDef nodes, e.g.:
function_def_nodes = [node for node in ast.walk(code_ast)
                      if isinstance(node, ast.FunctionDef)]
Each FunctionDef node's lineno attribute tells us the first line for that function. We can estimate the last line of that function with:
last_line = max(node.lineno for node in ast.walk(function_def_node)
                if hasattr(node, 'lineno'))
but this doesn't work perfectly when the function ends with syntactic elements that don't show up as AST nodes, for instance the last ] in __init__.
I doubt there is an approach that only uses the AST, because the AST fundamentally does not have enough information in cases like __init__.
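For readers on Python 3.8 or later, it may be worth checking the end_lineno attribute that AST nodes gained in that release, since it records where a node ends rather than where its last child starts. The sketch below assumes such a version and compares it with the max-lineno estimate; the helper name function_line_info is made up for illustration. On older versions the limitation described above still applies.

```python
import ast

code_string = """
# A comment.
def foo(a, b):
  return a + b
class Bar(object):
  def __init__(self):
    self.my_list = [
        'a',
        'b',
    ]
""".strip()


def function_line_info(source):
    """Return (name, first_line, estimated_last_line, end_lineno) per function."""
    rows = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            # The estimate from the question: largest lineno of any child node.
            estimate = max(n.lineno for n in ast.walk(node)
                           if hasattr(n, 'lineno'))
            # end_lineno is only present on Python 3.8+ (assumption of this sketch).
            rows.append((node.name, node.lineno, estimate, node.end_lineno))
    return rows


rows = function_line_info(code_string)
for row in rows:
    print(row)
```

For __init__ the estimate stops at line 8 (the 'b' element), while end_lineno reports line 9, the line holding the closing bracket.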
I cannot use the inspect module because that only works on "live objects" and I only have the Python code as a string. I cannot eval the code because that's a huge security headache.
In theory I could write a parser for Python but that really seems like overkill.
A heuristic suggested in the comments is to use the leading whitespace of lines. However, that can break for strange but valid functions with weird indentation like:
def baz():
  return [
1,
  ]
class Baz(object):
  def hello(self, x):
    return self.hello(
x - 1)
def my_type_annotated_function(
  my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
  # This function's indentation isn't unusual at all.
  pass
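To see the failure concretely, here is a rough sketch of the leading-whitespace heuristic (my reading of the suggestion, not anyone's exact code; the name spans_by_indentation is made up). It treats a function as running until the first non-blank line that is not indented more deeply than its def line.

```python
import re

DEF_RE = re.compile(r'^(\s*)def\s')


def spans_by_indentation(source):
    """Naive heuristic: return (first_line, last_line) for each def, 1-indexed."""
    lines = source.splitlines()
    spans = []
    for i, line in enumerate(lines):
        match = DEF_RE.match(line)
        if not match:
            continue
        def_indent = len(match.group(1))
        end = i
        for j in range(i + 1, len(lines)):
            stripped = lines[j].lstrip()
            if not stripped:
                continue  # blank lines do not end the block
            if len(lines[j]) - len(stripped) <= def_indent:
                break  # first line that is not indented deeper than the def
            end = j
        spans.append((i + 1, end + 1))
    return spans


ordinary = "def foo(a, b):\n  return a + b\nx = 1\n"
weird = "def baz():\n  return [\n1,\n  ]\n"
print(spans_by_indentation(ordinary))
print(spans_by_indentation(weird))
```

The ordinary function comes out as lines 1-2, which is right, but baz is also cut off at line 2 because the element at column 0 looks like a dedent, even though the function really runs to line 4.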
python
I suppose you could just iterate lines, and when one matches ^(\s*)def\s.*$, extract that matched group (the leading whitespace) and then consume the line and all subsequent lines that startWith(thatWhitespace)
– Blorgbeard
Jan 26 at 0:30
You mean, extract all subsequent lines that start with strictly more than that whitespace? Or else you'd also extract the following functions defined at the same indentation level
– pkpnd
Jan 26 at 0:36
Oops, yes. You get the idea, anyway.
– Blorgbeard
Jan 26 at 0:40
Hmm, doesn't work if the function has weird indentation inside, for example def baz():\n  return [\n1,\n  ]
– pkpnd
Jan 26 at 0:48
Ah, I didn't even realise that was valid python. Looks like there's no simple text-processing method, then.
– Blorgbeard
Jan 26 at 0:51
asked Jan 25 at 23:50 by pkpnd, edited Jan 26 at 1:23
3 Answers
A much more robust solution would be to use the tokenize module. The following code can handle weird indentations, comments, multi-line tokens, single-line function blocks and empty lines within function blocks:
import tokenize
from io import BytesIO
from collections import deque
code_string = """
# A comment.
def foo(a, b):
  return a + b

class Bar(object):
  def __init__(self):

    self.my_list = [
        'a',
        'b',
    ]

  def test(self): pass
  def abc(self):
    '''multi-
    line token'''

def baz():
  return [
1,
  ]

class Baz(object):
  def hello(self, x):
    a = \
1
    return self.hello(
x - 1)

def my_type_annotated_function(
  my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
  pass
  # unmatched parenthesis: (
""".strip()
file = BytesIO(code_string.encode())
tokens = deque(tokenize.tokenize(file.readline))
lines = []
while tokens:
    token = tokens.popleft()
    if token.type == tokenize.NAME and token.string == 'def':
        start_line, _ = token.start
        last_token = token
        # Consume the rest of the def's first logical line.
        while tokens:
            token = tokens.popleft()
            if token.type == tokenize.NEWLINE:
                break
            last_token = token
        # A trailing ':' means an indented block follows; otherwise the
        # body was on the same line as the def.
        if last_token.type == tokenize.OP and last_token.string == ':':
            indents = 0
            while tokens:
                token = tokens.popleft()
                if token.type == tokenize.NL:
                    continue
                if token.type == tokenize.INDENT:
                    indents += 1
                elif token.type == tokenize.DEDENT:
                    indents -= 1
                    if not indents:
                        break
                else:
                    last_token = token
        lines.append((start_line, last_token.end[0]))
print(lines)
This outputs:
[(2, 3), (6, 11), (13, 13), (14, 16), (18, 21), (24, 27), (29, 33)]
Note however that the continuation line:
    a = \
1
is reported by tokenize as one line even though it is written as two lines above. If you print the tokens:
TokenInfo(type=53 (OP), string=':', start=(24, 20), end=(24, 21), line='  def hello(self, x):\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(24, 21), end=(24, 22), line='  def hello(self, x):\n')
TokenInfo(type=5 (INDENT), string='    ', start=(25, 0), end=(25, 4), line='    a = 1\n')
TokenInfo(type=1 (NAME), string='a', start=(25, 4), end=(25, 5), line='    a = 1\n')
TokenInfo(type=53 (OP), string='=', start=(25, 6), end=(25, 7), line='    a = 1\n')
TokenInfo(type=2 (NUMBER), string='1', start=(25, 8), end=(25, 9), line='    a = 1\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(25, 9), end=(25, 10), line='    a = 1\n')
TokenInfo(type=1 (NAME), string='return', start=(26, 4), end=(26, 10), line='    return self.hello(\n')
you can see that the continuation line is literally treated as the single line '    a = 1\n', with only one line number, 25. This looks like a bug or limitation of the tokenize module, but the token text itself points to a different cause: code_string is a regular (non-raw) string literal, so Python removes the backslash and the newline that follows it before tokenize ever sees them. The string being tokenized really does contain a = 1 on a single line. Writing code_string as a raw string (prefix r) should keep the continuation, and the line numbers, intact.
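One way to check where the merged line comes from, offered as a sketch rather than as part of the answer's code: compare a regular triple-quoted literal with a raw one. In the regular literal Python itself drops the backslash and the newline after it, so tokenize never sees a continuation; in the raw literal the backslash survives and tokenize reports physical line numbers. The names plain, raw and token_lines are made up for illustration.

```python
import io
import tokenize

# Regular literal: backslash + newline is removed when the literal is evaluated.
plain = """a = \
1
b = 2
"""

# Raw literal: the backslash and the newline stay in the string.
raw = r"""a = \
1
b = 2
"""


def token_lines(source):
    """Return (token string, starting line number) for NAME and NUMBER tokens."""
    readline = io.StringIO(source).readline
    return [(tok.string, tok.start[0])
            for tok in tokenize.generate_tokens(readline)
            if tok.type in (tokenize.NAME, tokenize.NUMBER)]


print(repr(plain))
print(token_lines(plain))
print(token_lines(raw))
```

With the raw literal the 1 is reported on line 2 and b on line 3, so spans computed from the tokens line up with the text as written.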
This looks promising. Are you sure it works for the "weird indentation" cases? I tried your code and it seems to break on all of the "weird indentation" functions I provided, extracting only the first part of each function.
– pkpnd
Jan 26 at 2:12
Oops did not actually have any logic to handle weird indentation. Added now.
– blhsing
Jan 26 at 4:22
This fails to handle line continuations. Looking for INDENT and DEDENT tokens (and checking for the single-logical-line case, where there is no INDENT) would probably be more robust.
– user2357112
Jan 26 at 4:41
@user2357112 Using INDENT and DEDENT was indeed my first thought too, although it wasn't immediately clear to me how to handle single-logical-line cases easily. I've now rewritten the code so that it uses INDENTs and DEDENTs, but noticed that a continuation line is treated as if it were a single line by tokenize even though it is literally multiple lines, so the line numbers returned by tokenize would be off in such a case. It's apparently a bug/limitation of the tokenize module unfortunately.
– blhsing
Jan 26 at 6:08
Rather than reinventing a parser, I would use Python itself.
Basically, I would use the compile() built-in function, which can check whether a string is valid Python code by compiling it. I pass it a string made of selected lines, starting from each def and extending to the farthest line for which compilation does not fail.
code_string = """
#A comment
def foo(a, b):
  return a + b
def bir(a, b):
  c = a + b
  return c
class Bar(object):
  def __init__(self):
    self.my_list = [
        'a',
        'b',
    ]
def baz():
  return [
1,
  ]
""".strip()
lines = code_string.split('\n')
#looking for lines with 'def' keywords
defidxs = [e[0] for e in enumerate(lines) if 'def' in e[1]]
#getting the indentation of each 'def'
indents = {}
for i in defidxs:
    ll = lines[i].split('def')
    indents[i] = len(ll[0])
#extracting the strings
end = len(lines)-1
while end > 0:
    if end < defidxs[-1]:
        defidxs.pop()
    try:
        start = defidxs[-1]
    except IndexError: #break if there are no more 'def'
        break
    #empty lines between functions will cause an error, let's remove them
    if len(lines[end].strip()) == 0:
        end = end -1
        continue
    try:
        #fix lines removing indentation or compile will not compile
        fixlines = [ll[indents[start]:] for ll in lines[start:end+1]] #remove indentation
        body = '\n'.join(fixlines)
        compile(body, '<string>', 'exec') #if it fails, throws an exception
        print(body)
        end = start #no need to parse less line if it succeed.
    except:
        pass
    end = end -1
It is a bit nasty because of the except clause without specific exceptions, which is usually not recommended, but there is no way to know what may cause compile to fail, so I do not know how to avoid it.
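As a hedged sketch of narrowing that except clause: for text that is merely incomplete or mis-indented, compile raises SyntaxError (IndentationError is a subclass of it), so catching only that covers the cases exercised here. Whether other exception types can show up for arbitrary input is not something this sketch establishes. The helper name compiles is made up for illustration.

```python
def compiles(source):
    """Return True if source compiles as a module, False on a syntax error."""
    try:
        compile(source, '<string>', 'exec')
    except SyntaxError:  # also catches IndentationError and TabError
        return False
    return True


complete = "def baz():\n  return [\n1,\n  ]"
truncated = "def baz():\n  return [\n1,"
indented = "  def f(): pass"

print(compiles(complete))
print(compiles(truncated))
print(compiles(indented))
```

The complete function compiles, while the truncated one (unclosed bracket) and the one with unexpected leading indentation are both rejected without a bare except.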
This prints:
def baz():
  return [
1,
  ]
def __init__(self):
  self.my_list = [
      'a',
      'b',
  ]
def bir(a, b):
  c = a + b
  return c
def foo(a, b):
  return a + b
Note that the functions are printed in the reverse of the order in which they appear in code_string.
This should handle even the weirdly indented code, but I think it will fail if you have nested functions.
I think a small parser is in order, to try to take these weird exceptions into account:
import re
code_string = """
# A comment.
def foo(a, b):
  return a + b

class Bar(object):
  def __init__(self):
    self.my_list = [
        'a',
        'b',
    ]

def baz():
  return [
1,
  ]

class Baz(object):
  def hello(self, x):
    return self.hello(
x - 1)

def my_type_annotated_function(
  my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
  # This function's indentation isn't unusual at all.
  pass

def test_multiline():
  \"\"\"
  asdasdada
sdadd
  \"\"\"
  pass

def test_comment(
  a #)
):
  return [a,
  # ]
a]

def test_escaped_endline():
  return "asdad \
asdsad \
asdas"

def test_nested():
  return {():[[],
{
}
]
}

def test_strings():
  return '\"\"\" asdasd' + \"\"\"
12asd
12312
"asd2" [
\"\"\"

\"\"\"
def test_fake_def_in_multiline()
\"\"\"
  print(123)
a = "def in_string():"
  def after():
    print("NOPE")

\"\"\"Phew this ain't valid syntax\"\"\" def something(): pass
""".strip()
code_string += '\n'
func_list=[]
func = ''
tab = ''
brackets = {'(':0, '[':0, '{':0}
close = {')':'(', ']':'[', '}':'{'}
string=''
tab_f=''
c1=''
c2=''
multiline=False
check=False
for line in code_string.split('\n'):
    tab = re.findall(r'^\s*',line)[0]
    # A def only counts when we are not inside a string
    if re.findall(r'^\s*def', line) and not string and not multiline:
        func += line + '\n'
        tab_f = tab
        check=True
    if func:
        if not check:
            # The function ends at the first line that is not indented deeper
            # than the def, provided no bracket or string is still open
            if sum(brackets.values()) == 0 and not string and not multiline:
                if len(tab) <= len(tab_f):
                    func_list.append(func)
                    func=''
                    c1=''
                    c2=''
                    continue
            func += line + '\n'
        check = False
    # Track string, comment and bracket state character by character
    for c0 in line:
        if c0 == '#' and not string and not multiline:
            break
        if c1 != '\\':
            if c0 in ['"', "'"]:
                if c2 == c1 == c0 == '"' and string != "'":
                    multiline = not multiline
                    string = ''
                    continue
                if not multiline:
                    if c0 in string:
                        string = ''
                    else:
                        if not string:
                            string = c0
            if not string and not multiline:
                if c0 in brackets:
                    brackets[c0] += 1
                if c0 in close:
                    b = close[c0]
                    brackets[b] -= 1
        c2=c1
        c1=c0
for f in func_list:
    print('-'*40)
    print(f)
Output:
----------------------------------------
def foo(a, b):
  return a + b
----------------------------------------
  def __init__(self):
    self.my_list = [
        'a',
        'b',
    ]
----------------------------------------
def baz():
  return [
1,
  ]
----------------------------------------
  def hello(self, x):
    return self.hello(
x - 1)
----------------------------------------
def my_type_annotated_function(
  my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
  # This function's indentation isn't unusual at all.
  pass
----------------------------------------
def test_multiline():
  """
  asdasdada
sdadd
  """
  pass
----------------------------------------
def test_comment(
  a #)
):
  return [a,
  # ]
a]
----------------------------------------
def test_escaped_endline():
  return "asdad asdsad asdas"
----------------------------------------
def test_nested():
  return {():[[],
{
}
]
}
----------------------------------------
def test_strings():
  return '""" asdasd' + """
12asd
12312
"asd2" [
"""
----------------------------------------
  def after():
    print("NOPE")
Writing a parser is hard. I haven't run your code but just by glancing at it, I think it fails for multiline strings (delimited with """) and escaped string delimiters, and it doesn't understand comments (which may contain stray brackets or string delimiters).
– pkpnd
Jan 26 at 2:35
Please do try it i should've included cases including strings and open/close brackets should not count if inside a string. EDIT: the escaped delimiters are an exception i will include it
– Crivella
Jan 26 at 2:38
You aren't checking for comments so there's no way you can tell if a close parenthesis should be counted or not (it shouldn't count if it's inside a comment).
– pkpnd
Jan 26 at 2:40
Included both escaped characters and comments. Sorry i do tend to write parsers by starting simple and adding stuff as i find exception, not the best practice i realize
– Crivella
Jan 26 at 2:44
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54374296%2fextract-python-function-source-text-from-the-source-code-string%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
A much more robust solution would be to use the tokenize
module. The following code can handle weird indentations, comments, multi-line tokens, single-line function blocks and empty lines within function blocks:
import tokenize
from io import BytesIO
from collections import deque
code_string = """
# A comment.
def foo(a, b):
return a + b
class Bar(object):
def __init__(self):
self.my_list = [
'a',
'b',
]
def test(self): pass
def abc(self):
'''multi-
line token'''
def baz():
return [
1,
]
class Baz(object):
def hello(self, x):
a =
1
return self.hello(
x - 1)
def my_type_annotated_function(
my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
pass
# unmatched parenthesis: (
""".strip()
file = BytesIO(code_string.encode())
tokens = deque(tokenize.tokenize(file.readline))
lines =
while tokens:
token = tokens.popleft()
if token.type == tokenize.NAME and token.string == 'def':
start_line, _ = token.start
last_token = token
while tokens:
token = tokens.popleft()
if token.type == tokenize.NEWLINE:
break
last_token = token
if last_token.type == tokenize.OP and last_token.string == ':':
indents = 0
while tokens:
token = tokens.popleft()
if token.type == tokenize.NL:
continue
if token.type == tokenize.INDENT:
indents += 1
elif token.type == tokenize.DEDENT:
indents -= 1
if not indents:
break
else:
last_token = token
lines.append((start_line, last_token.end[0]))
print(lines)
This outputs:
[(2, 3), (6, 11), (13, 13), (14, 16), (18, 21), (24, 27), (29, 33)]
Note however that the continuation line:
a =
1
is treated by tokenize
as one line even though it is in fact two lines, since if you print the tokens:
TokenInfo(type=53 (OP), string=':', start=(24, 20), end=(24, 21), line=' def hello(self, x):n')
TokenInfo(type=4 (NEWLINE), string='n', start=(24, 21), end=(24, 22), line=' def hello(self, x):n')
TokenInfo(type=5 (INDENT), string=' ', start=(25, 0), end=(25, 4), line=' a = 1n')
TokenInfo(type=1 (NAME), string='a', start=(25, 4), end=(25, 5), line=' a = 1n')
TokenInfo(type=53 (OP), string='=', start=(25, 6), end=(25, 7), line=' a = 1n')
TokenInfo(type=2 (NUMBER), string='1', start=(25, 8), end=(25, 9), line=' a = 1n')
TokenInfo(type=4 (NEWLINE), string='n', start=(25, 9), end=(25, 10), line=' a = 1n')
TokenInfo(type=1 (NAME), string='return', start=(26, 4), end=(26, 10), line=' return self.hello(n')
you can see that the continuation line is literally treated as one line of ' a = 1n'
, with only one line number 25
. This is apparently a bug/limitation of the tokenize
module unfortunately.
This looks promising. Are you sure it works for the "weird indentation" cases? I tried your code and it seems to break on all of the "weird indentation" functions I provided, extracting only the first part of each function.
– pkpnd
Jan 26 at 2:12
Oops did not actually have any logic to handle weird indentation. Added now.
– blhsing
Jan 26 at 4:22
This fails to handle line continuations. Looking for INDENT and DEDENT tokens (and checking for the single-logical-line case, where there is no INDENT) would probably be more robust.
– user2357112
Jan 26 at 4:41
@user2357112 Using INDENT and DEDENT was indeed my first thought too, although it wasn't immediately clear to me how to handle single-logical-line cases easily. I've now rewritten the code so that it uses INDENTs and DEDENTs, but noticed that a continuation line is treated as if it were a single line bytokenize
even though it is literally multiple lines, so the line numbers returned bytokenize
would be off in such a case. It's apparently a bug/limitation of thetokenize
module unfortunately.
– blhsing
Jan 26 at 6:08
add a comment |
A much more robust solution would be to use the tokenize
module. The following code can handle weird indentations, comments, multi-line tokens, single-line function blocks and empty lines within function blocks:
import tokenize
from io import BytesIO
from collections import deque
code_string = """
# A comment.
def foo(a, b):
return a + b
class Bar(object):
def __init__(self):
self.my_list = [
'a',
'b',
]
def test(self): pass
def abc(self):
'''multi-
line token'''
def baz():
return [
1,
]
class Baz(object):
def hello(self, x):
a =
1
return self.hello(
x - 1)
def my_type_annotated_function(
my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
pass
# unmatched parenthesis: (
""".strip()
file = BytesIO(code_string.encode())
tokens = deque(tokenize.tokenize(file.readline))
lines =
while tokens:
token = tokens.popleft()
if token.type == tokenize.NAME and token.string == 'def':
start_line, _ = token.start
last_token = token
while tokens:
token = tokens.popleft()
if token.type == tokenize.NEWLINE:
break
last_token = token
if last_token.type == tokenize.OP and last_token.string == ':':
indents = 0
while tokens:
token = tokens.popleft()
if token.type == tokenize.NL:
continue
if token.type == tokenize.INDENT:
indents += 1
elif token.type == tokenize.DEDENT:
indents -= 1
if not indents:
break
else:
last_token = token
lines.append((start_line, last_token.end[0]))
print(lines)
This outputs:
[(2, 3), (6, 11), (13, 13), (14, 16), (18, 21), (24, 27), (29, 33)]
Note however that the continuation line:
a =
1
is treated by tokenize
as one line even though it is in fact two lines, since if you print the tokens:
TokenInfo(type=53 (OP), string=':', start=(24, 20), end=(24, 21), line=' def hello(self, x):n')
TokenInfo(type=4 (NEWLINE), string='n', start=(24, 21), end=(24, 22), line=' def hello(self, x):n')
TokenInfo(type=5 (INDENT), string=' ', start=(25, 0), end=(25, 4), line=' a = 1n')
TokenInfo(type=1 (NAME), string='a', start=(25, 4), end=(25, 5), line=' a = 1n')
TokenInfo(type=53 (OP), string='=', start=(25, 6), end=(25, 7), line=' a = 1n')
TokenInfo(type=2 (NUMBER), string='1', start=(25, 8), end=(25, 9), line=' a = 1n')
TokenInfo(type=4 (NEWLINE), string='n', start=(25, 9), end=(25, 10), line=' a = 1n')
TokenInfo(type=1 (NAME), string='return', start=(26, 4), end=(26, 10), line=' return self.hello(n')
you can see that the continuation line is literally treated as one line of ' a = 1n'
, with only one line number 25
. This is apparently a bug/limitation of the tokenize
module unfortunately.
This looks promising. Are you sure it works for the "weird indentation" cases? I tried your code and it seems to break on all of the "weird indentation" functions I provided, extracting only the first part of each function.
– pkpnd
Jan 26 at 2:12
Oops did not actually have any logic to handle weird indentation. Added now.
– blhsing
Jan 26 at 4:22
This fails to handle line continuations. Looking for INDENT and DEDENT tokens (and checking for the single-logical-line case, where there is no INDENT) would probably be more robust.
– user2357112
Jan 26 at 4:41
@user2357112 Using INDENT and DEDENT was indeed my first thought too, although it wasn't immediately clear to me how to handle single-logical-line cases easily. I've now rewritten the code so that it uses INDENTs and DEDENTs, but noticed that a continuation line is treated as if it were a single line bytokenize
even though it is literally multiple lines, so the line numbers returned bytokenize
would be off in such a case. It's apparently a bug/limitation of thetokenize
module unfortunately.
– blhsing
Jan 26 at 6:08
add a comment |
A much more robust solution would be to use the tokenize
module. The following code can handle weird indentations, comments, multi-line tokens, single-line function blocks and empty lines within function blocks:
import tokenize
from io import BytesIO
from collections import deque
code_string = """
# A comment.
def foo(a, b):
  return a + b

class Bar(object):
  def __init__(self):

    self.my_list = [
        'a',
        'b',
    ]

  def test(self): pass
  def abc(self):
    '''multi-
    line token'''

def baz():
  return [
1,
  ]

class Baz(object):
  def hello(self, x):
    a = \
1
    return self.hello(
x - 1)

def my_type_annotated_function(
  my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
  pass
  # unmatched parenthesis: (
""".strip()
file = BytesIO(code_string.encode())
tokens = deque(tokenize.tokenize(file.readline))
lines = []
while tokens:
    token = tokens.popleft()
    if token.type == tokenize.NAME and token.string == 'def':
        start_line, _ = token.start
        last_token = token
        # consume the header up to the end of its logical line
        while tokens:
            token = tokens.popleft()
            if token.type == tokenize.NEWLINE:
                break
            last_token = token
        # a header ending in ':' has an indented block; otherwise the body is on the same line
        if last_token.type == tokenize.OP and last_token.string == ':':
            indents = 0
            while tokens:
                token = tokens.popleft()
                if token.type == tokenize.NL:
                    continue
                if token.type == tokenize.INDENT:
                    indents += 1
                elif token.type == tokenize.DEDENT:
                    indents -= 1
                    if not indents:
                        break
                else:
                    last_token = token
        lines.append((start_line, last_token.end[0]))
print(lines)
This outputs:
[(2, 3), (6, 11), (13, 13), (14, 16), (18, 21), (24, 27), (29, 33)]
Note however that the continuation line:
    a = \
1
is reported by tokenize as one line even though it is written as two, as you can see if you print the tokens:
TokenInfo(type=53 (OP), string=':', start=(24, 20), end=(24, 21), line='  def hello(self, x):\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(24, 21), end=(24, 22), line='  def hello(self, x):\n')
TokenInfo(type=5 (INDENT), string='    ', start=(25, 0), end=(25, 4), line='    a = 1\n')
TokenInfo(type=1 (NAME), string='a', start=(25, 4), end=(25, 5), line='    a = 1\n')
TokenInfo(type=53 (OP), string='=', start=(25, 6), end=(25, 7), line='    a = 1\n')
TokenInfo(type=2 (NUMBER), string='1', start=(25, 8), end=(25, 9), line='    a = 1\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(25, 9), end=(25, 10), line='    a = 1\n')
TokenInfo(type=1 (NAME), string='return', start=(26, 4), end=(26, 10), line='    return self.hello(\n')
The continuation line shows up as the single line '    a = 1\n', with only one line number, 25. This is not a limitation of the tokenize module, though. code_string is a regular (non-raw) triple-quoted literal, so Python removes the backslash and the newline after it when it evaluates the literal, and tokenize really does receive a = 1 on one line. If the source is written as a raw string (r"""...""") or read from a file, the continuation is preserved and the reported line numbers match the text as written.
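A minimal sketch for checking where the single line number comes from: the same snippet is tokenized once from a raw string literal and once from a regular one. In the regular literal, Python drops the backslash-newline pair before tokenize ever sees the text. The names raw_src, plain_src and return_line are made up for this illustration.

```python
import io
import tokenize

# Raw literal: the backslash-newline pair stays in the text handed to tokenize.
raw_src = r"""
def hello(x):
    a = \
1
    return a + x
""".lstrip()

# Regular literal: Python removes the backslash-newline pair while
# evaluating the literal, so the text contains "a = 1" on a single line.
plain_src = """
def hello(x):
    a = \
1
    return a + x
""".lstrip()


def return_line(source):
    """Return the line number on which tokenize sees the first `return`."""
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and tok.string == "return":
            return tok.start[0]
    return None


print(len(raw_src.splitlines()), return_line(raw_src))      # 4 4
print(len(plain_src.splitlines()), return_line(plain_src))  # 3 3
```

With the raw literal the `return` lands on line 4, one line later than with the regular literal, which suggests tokenize counts physical lines correctly when the continuation is actually present in its input.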
edited Jan 26 at 14:58
answered Jan 26 at 2:06
blhsing
Rather than reinventing a parser, I would use Python itself.
Basically I would use the compile() built-in function, which can check whether a string is valid Python code by compiling it. I pass it a string made of selected lines, starting from each def and extending to the farthest line that still compiles.
code_string = """
#A comment
def foo(a, b):
  return a + b

def bir(a, b):
  c = a + b
  return c

class Bar(object):
  def __init__(self):
    self.my_list = [
        'a',
        'b',
    ]

def baz():
  return [
1,
  ]

""".strip()

lines = code_string.split('\n')

#looking for lines with 'def' keywords
defidxs = [e[0] for e in enumerate(lines) if 'def' in e[1]]

#getting the indentation of each 'def'
indents = {}
for i in defidxs:
    ll = lines[i].split('def')
    indents[i] = len(ll[0])

#extracting the strings
end = len(lines)-1
while end > 0:
    if end < defidxs[-1]:
        defidxs.pop()
    try:
        start = defidxs[-1]
    except IndexError: #break if there are no more 'def'
        break

    #empty lines between functions will cause an error, let's remove them
    if len(lines[end].strip()) == 0:
        end = end -1
        continue

    try:
        #fix lines removing indentation or compile will not compile
        fixlines = [ll[indents[start]:] for ll in lines[start:end+1]] #remove indentation
        body = '\n'.join(fixlines)
        compile(body, '<string>', 'exec') #if it fails, throws an exception
        print(body)
        end = start #no need to parse fewer lines if it succeeds
    except:
        pass

    end = end -1
It is a bit nasty because of the bare except clause, which is usually not recommended, but I could not tell in advance everything that may cause compile to fail, so I do not know how to avoid it.
This prints:
def baz():
  return [
1,
  ]
def __init__(self):
  self.my_list = [
      'a',
      'b',
  ]
def bir(a, b):
  c = a + b
  return c
def foo(a, b):
  return a + b
Note that the functions are printed in the reverse of the order in which they appear in code_string.
This should handle even the weirdly indented code, but I think it will fail if you have nested functions.
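On the bare except clause: one possible way to narrow it, offered as a sketch rather than a drop-in fix. The compile() built-in is documented to raise SyntaxError for invalid source (IndentationError and TabError are subclasses of it) and ValueError when the source contains null bytes, so catching those two is probably enough. The helper name compiles is hypothetical.

```python
def compiles(source):
    """Return True if `source` compiles as a module, False otherwise."""
    try:
        compile(source, "<string>", "exec")
    except (SyntaxError, ValueError):
        # SyntaxError also covers IndentationError and TabError.
        return False
    return True


print(compiles("def f():\n    return 1\n"))  # True
print(compiles("def f():\n"))                # False: the body is missing
print(compiles("x = [1,\n"))                 # False: the bracket is never closed
```

Catching only these exceptions keeps unrelated errors, such as a KeyboardInterrupt or a typo in the surrounding loop, from being silently swallowed.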
answered Jan 26 at 3:08
Valentino
I think a small parser is in order, to try to take these weird exceptions into account:
import re

code_string = """
# A comment.
def foo(a, b):
  return a + b

class Bar(object):
  def __init__(self):
    self.my_list = [
        'a',
        'b',
    ]

def baz():
  return [
1,
  ]

class Baz(object):
  def hello(self, x):
    return self.hello(
x - 1)

def my_type_annotated_function(
  my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
  # This function's indentation isn't unusual at all.
  pass

def test_multiline():
    \"""
    asdasdada
sdadd
    \"""
    pass

def test_comment(
    a #)
):
    return [a,
    # ]
a]

def test_escaped_endline():
    return "asdad \
asdsad \
asdas"

def test_nested():
    return {():[[],
{
}
]
}

def test_strings():
    return '\""" asdasd' + \"""
12asd
12312
"asd2" [
\"""

\"""
def test_fake_def_in_multiline()
\"""
    print(123)
a = "def in_string():"
  def after():
    print("NOPE")

\"""Phew this ain't valid syntax\""" def something(): pass
""".strip()
code_string += '\n'

func_list=[]
func = ''
tab = ''
brackets = {'(':0, '[':0, '{':0}
close = {')':'(', ']':'[', '}':'{'}
string=''
tab_f=''
c1=''
c2=''  # initialised so the triple-quote check cannot hit an undefined name
multiline=False
check=False
for line in code_string.split('\n'):
    tab = re.findall(r'^\s*',line)[0]
    # a def only counts when we are not inside a string
    if re.findall(r'^\s*def', line) and not string and not multiline:
        func += line + '\n'
        tab_f = tab
        check=True
    if func:
        if not check:
            # the function ends at the first line that is not indented deeper
            # than the def, once all brackets and strings are closed
            if sum(brackets.values()) == 0 and not string and not multiline:
                if len(tab) <= len(tab_f):
                    func_list.append(func)
                    func=''
                    c1=''
                    c2=''
                    continue
            func += line + '\n'
        check = False
    for c0 in line:
        if c0 == '#' and not string and not multiline:
            break
        if c1 != '\\':
            if c0 in ['"', "'"]:
                if c2 == c1 == c0 == '"' and string != "'":
                    multiline = not multiline
                    string = ''
                    continue
                if not multiline:
                    if c0 in string:
                        string = ''
                    else:
                        if not string:
                            string = c0
            if not string and not multiline:
                if c0 in brackets:
                    brackets[c0] += 1
                if c0 in close:
                    b = close[c0]
                    brackets[b] -= 1
        c2=c1
        c1=c0

for f in func_list:
    print('-'*40)
    print(f)
output:
----------------------------------------
def foo(a, b):
  return a + b
----------------------------------------
  def __init__(self):
    self.my_list = [
        'a',
        'b',
    ]
----------------------------------------
def baz():
  return [
1,
  ]
----------------------------------------
  def hello(self, x):
    return self.hello(
x - 1)
----------------------------------------
def my_type_annotated_function(
  my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
  # This function's indentation isn't unusual at all.
  pass
----------------------------------------
def test_multiline():
    """
    asdasdada
sdadd
    """
    pass
----------------------------------------
def test_comment(
    a #)
):
    return [a,
    # ]
a]
----------------------------------------
def test_escaped_endline():
    return "asdad asdsad asdas"
----------------------------------------
def test_nested():
    return {():[[],
{
}
]
}
----------------------------------------
def test_strings():
    return '""" asdasd' + """
12asd
12312
"asd2" [
"""
----------------------------------------
  def after():
    print("NOPE")
Writing a parser is hard. I haven't run your code but just by glancing at it, I think it fails for multiline strings (delimited with """) and escaped string delimiters, and it doesn't understand comments (which may contain stray brackets or string delimiters).
– pkpnd
Jan 26 at 2:35
Please do try it. I should have included test cases with strings; open/close brackets should not count if they are inside a string. EDIT: escaped delimiters are an exception, I will include them.
– Crivella
Jan 26 at 2:38
You aren't checking for comments so there's no way you can tell if a close parenthesis should be counted or not (it shouldn't count if it's inside a comment).
– pkpnd
Jan 26 at 2:40
Included both escaped characters and comments. Sorry, I do tend to write parsers by starting simple and adding things as I find exceptions, which I realize is not the best practice.
– Crivella
Jan 26 at 2:44
edited Jan 26 at 9:39
answered Jan 26 at 2:28
Crivella
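For what it's worth, on Python 3.8 and later the AST nodes carry an end_lineno attribute next to lineno, so the line span asked for in the question can probably be read straight from ast.parse without executing anything. The following is a minimal sketch under that version assumption; the helper names function_spans and function_text are made up for illustration, and decorator lines are not included because lineno points at the def line.

```python
import ast

sample = """
# A comment.
def foo(a, b):
  return a + b
class Bar(object):
  def __init__(self):
    self.my_list = [
        'a',
        'b',
    ]
""".strip()


def function_spans(source):
    """Return (name, first_line, last_line) for each def, in source order."""
    spans = [
        (node.name, node.lineno, node.end_lineno)
        for node in ast.walk(ast.parse(source))
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    return sorted(spans, key=lambda span: span[1])


def function_text(source, first_line, last_line):
    """Slice whole physical lines so the leading whitespace is preserved."""
    return "\n".join(source.splitlines()[first_line - 1:last_line])


for name, first, last in function_spans(sample):
    print(name, first, last)
    print(function_text(sample, first, last))
```

For this sample the spans come out as lines 2-3 for foo and 5-9 for __init__, with the closing ] included, because the end position of the last statement in the body covers the whole bracketed expression.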
Writing a parser is hard. I haven't run your code but just by glancing at it, I think it fails for multiline strings (delimited with"""
) and escaped string delimiters, and it doesn't understand comments (which may contain stray brackets or string delimiters).
– pkpnd
Jan 26 at 2:35
Please do try it i should've included cases including strings and open/close brackets should not count if inside a string. EDIT: the escaped delimiters are an exception i will include it
– Crivella
Jan 26 at 2:38
You aren't checking for comments so there's no way you can tell if a close parenthesis should be counted or not (it shouldn't count if it's inside a comment).
– pkpnd
Jan 26 at 2:40
1
Included both escaped characters and comments. Sorry i do tend to write parsers by starting simple and adding stuff as i find exception, not the best practice i realize
– Crivella
Jan 26 at 2:44
I suppose you could just iterate lines, and when one matches
^(\s*)def\s.*$
, extract that matched group (the leading whitespace) and then consume the line and all subsequent lines that startWith(thatWhitespace)
– Blorgbeard
Jan 26 at 0:30
You mean, extract all subsequent lines that start with strictly more than that whitespace? Or else you'd also extract the following functions defined at the same indentation level
– pkpnd
Jan 26 at 0:36
Oops, yes. You get the idea, anyway.
– Blorgbeard
Jan 26 at 0:40
Hmm, doesn't work if the function has weird indentation inside, for example
def baz():\n    return [\n1,\n    ]
– pkpnd
Jan 26 at 0:48
Ah, I didn't even realise that was valid python. Looks like there's no simple text-processing method, then.
– Blorgbeard
Jan 26 at 0:51
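For reference, the indentation heuristic discussed in these comments can be sketched roughly as follows. This is only an illustration of the idea, not a recommended solution; the function name extract_by_indent and the two sample strings are hypothetical. The second call shows the failure on baz:

```python
import re

# A line that starts a function definition; group 1 is its leading whitespace.
DEF_RE = re.compile(r"^(\s*)def\s.*$")

def extract_by_indent(source):
    """Naive heuristic: a function's body is every following line that is
    blank or indented strictly deeper than its 'def' line."""
    lines = source.splitlines()
    functions = []
    for i, line in enumerate(lines):
        match = DEF_RE.match(line)
        if match is None:
            continue
        indent = len(match.group(1))
        end = i + 1
        while end < len(lines):
            nxt = lines[end]
            # Stop at the first non-blank line that is not indented deeper.
            if nxt.strip() and len(nxt) - len(nxt.lstrip()) <= indent:
                break
            end += 1
        functions.append("\n".join(lines[i:end]).rstrip())
    return functions

good = (
    "def foo(a, b):\n"
    "    return a + b\n"
    "\n"
    "class Bar(object):\n"
    "    def __init__(self):\n"
    "        self.my_list = [\n"
    "            'a',\n"
    "        ]\n"
)
weird = "def baz():\n    return [\n1,\n    ]\n"

print(extract_by_indent(good))   # both functions are recovered in full
print(extract_by_indent(weird))  # ['def baz():\n    return ['], the rest is cut off
```

The heuristic works on conventionally indented code but truncates baz at the line containing 1, because that line is valid Python even though it sits at column 0 inside the brackets.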