Extract Python function source text from the source code string












Suppose I have valid Python source code, as a string:



code_string = """
# A comment.
def foo(a, b):
return a + b
class Bar(object):
def __init__(self):
self.my_list = [
'a',
'b',
]
""".strip()


Objective: I would like to obtain the lines containing the source code of the function definitions, preserving whitespace. For the code string above, I would like to get the strings



def foo(a, b):
  return a + b


and



  def __init__(self):
    self.my_list = [
      'a',
      'b',
    ]


Or, equivalently, I'd be happy to get the line numbers of functions in the code string: foo spans lines 2-3, and __init__ spans lines 5-9.



Attempts



I can parse the code string into its AST:



import ast

code_ast = ast.parse(code_string)


And I can find the FunctionDef nodes, e.g.:



function_def_nodes = [node for node in ast.walk(code_ast)
                      if isinstance(node, ast.FunctionDef)]


Each FunctionDef node's lineno attribute tells us the first line for that function. We can estimate the last line of that function with:



last_line = max(node.lineno for node in ast.walk(function_def_node)
                if hasattr(node, 'lineno'))


but this doesn't work perfectly when the function ends with syntactic elements that don't show up as AST nodes, for instance the last ] in __init__.



I doubt there is an approach that only uses the AST, because the AST fundamentally does not have enough information in cases like __init__.
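
Note: on Python 3.8 and newer the AST does carry this information: every node records end_lineno and end_col_offset, and ast.get_source_segment can slice the exact text back out of the string. A minimal sketch, assuming a 3.8+ interpreter and the code_string above:

import ast

code_ast = ast.parse(code_string)
for node in ast.walk(code_ast):
    if isinstance(node, ast.FunctionDef):
        # 1-based, inclusive line span of the whole def, including the final ]
        print(node.name, node.lineno, node.end_lineno)
        # padded=True keeps the leading whitespace of the first line
        print(ast.get_source_segment(code_string, node, padded=True))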



I cannot use the inspect module because that only works on "live objects" and I only have the Python code as a string. I cannot eval the code because that's a huge security headache.



In theory I could write a parser for Python but that really seems like overkill.



A heuristic suggested in the comments is to use the leading whitespace of lines. However, that can break for strange but valid functions with weird indentation like:



def baz():
  return [
1,
  ]

class Baz(object):
  def hello(self, x):
    return self.hello(
x - 1)

def my_type_annotated_function(
    my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
  # This function's indentation isn't unusual at all.
  pass
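
For concreteness, the leading-whitespace heuristic amounts to something like the sketch below (the helper name and details are mine, not from the comments); baz, hello and my_type_annotated_function above are exactly the cases where it stops too early:

import re

def extract_defs_by_indent(source):
    # Naive heuristic: a def line opens a block, and the block continues
    # while lines are blank or indented strictly deeper than the def itself.
    lines = source.split('\n')
    blocks, i = [], 0
    while i < len(lines):
        match = re.match(r'^(\s*)def\s', lines[i])
        if not match:
            i += 1
            continue
        indent = match.group(1)
        block = [lines[i]]
        i += 1
        while i < len(lines) and (not lines[i].strip()
                                  or (lines[i].startswith(indent)
                                      and lines[i][len(indent)] in ' \t')):
            block.append(lines[i])
            i += 1
        blocks.append('\n'.join(block))
    return blocks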









  • I suppose you could just iterate lines, and when one matches ^(\s*)def\s.*$, extract that matched group (the leading whitespace) and then consume the line and all subsequent lines that startWith(thatWhitespace)

    – Blorgbeard
    Jan 26 at 0:30











  • You mean, extract all subsequent lines that start with strictly more than that whitespace? Or else you'd also extract the following functions defined at the same indentation level

    – pkpnd
    Jan 26 at 0:36











  • Oops, yes. You get the idea, anyway.

    – Blorgbeard
    Jan 26 at 0:40











  • Hmm, doesn't work if the function has weird indentation inside, for example def baz():\n return [\n1,\n ]

    – pkpnd
    Jan 26 at 0:48











  • Ah, I didn't even realise that was valid python. Looks like there's no simple text-processing method, then.

    – Blorgbeard
    Jan 26 at 0:51
















3 Answers


















A much more robust solution would be to use the tokenize module. The following code can handle weird indentations, comments, multi-line tokens, single-line function blocks and empty lines within function blocks:



import tokenize
from io import BytesIO
from collections import deque

code_string = """
# A comment.
def foo(a, b):
  return a + b

class Bar(object):
  def __init__(self):

    self.my_list = [
      'a',
      'b',
    ]

  def test(self): pass
  def abc(self):
    '''multi-
line token'''

def baz():
  return [
1,
  ]

class Baz(object):
  def hello(self, x):
    a = \
1
    return self.hello(
x - 1)

def my_type_annotated_function(
    my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
  pass
# unmatched parenthesis: (
""".strip()

file = BytesIO(code_string.encode())
tokens = deque(tokenize.tokenize(file.readline))
lines = []
while tokens:
    token = tokens.popleft()
    if token.type == tokenize.NAME and token.string == 'def':
        start_line, _ = token.start
        last_token = token
        # scan to the NEWLINE that ends the def's logical line, remembering
        # the last token before it
        while tokens:
            token = tokens.popleft()
            if token.type == tokenize.NEWLINE:
                break
            last_token = token
        # a multi-line def header ends with ':'; track INDENT/DEDENT depth
        # until its indented block closes
        if last_token.type == tokenize.OP and last_token.string == ':':
            indents = 0
            while tokens:
                token = tokens.popleft()
                if token.type == tokenize.NL:
                    continue
                if token.type == tokenize.INDENT:
                    indents += 1
                elif token.type == tokenize.DEDENT:
                    indents -= 1
                    if not indents:
                        break
                else:
                    last_token = token
        # last_token now marks the end of the def: the end of its block, or,
        # for a single-line def like 'def test(self): pass', of its logical line
        lines.append((start_line, last_token.end[0]))
print(lines)


This outputs:



[(2, 3), (6, 11), (13, 13), (14, 16), (18, 21), (24, 27), (29, 33)]


Note however that the continuation line:



    a = \
1


is treated by tokenize as one line even though it is in fact two lines in the source. If you print the tokens:



TokenInfo(type=53 (OP), string=':', start=(24, 20), end=(24, 21), line='  def hello(self, x):\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(24, 21), end=(24, 22), line='  def hello(self, x):\n')
TokenInfo(type=5 (INDENT), string='    ', start=(25, 0), end=(25, 4), line='    a = 1\n')
TokenInfo(type=1 (NAME), string='a', start=(25, 4), end=(25, 5), line='    a = 1\n')
TokenInfo(type=53 (OP), string='=', start=(25, 6), end=(25, 7), line='    a = 1\n')
TokenInfo(type=2 (NUMBER), string='1', start=(25, 8), end=(25, 9), line='    a = 1\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(25, 9), end=(25, 10), line='    a = 1\n')
TokenInfo(type=1 (NAME), string='return', start=(26, 4), end=(26, 10), line='    return self.hello(\n')


You can see that the continuation line is literally treated as one line of '    a = 1\n', with only one line number, 25. This is apparently a bug/limitation of the tokenize module, unfortunately.
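
If the goal is the source text itself rather than the line numbers, the (start, end) pairs can be mapped back onto code_string with a small follow-up sketch (the names here are mine, not part of the answer above):

source_lines = code_string.split('\n')
for start, end in lines:  # 1-based, inclusive line spans from above
    print('\n'.join(source_lines[start - 1:end]))
    print('-' * 40)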






  • This looks promising. Are you sure it works for the "weird indentation" cases? I tried your code and it seems to break on all of the "weird indentation" functions I provided, extracting only the first part of each function.

    – pkpnd
    Jan 26 at 2:12











  • Oops did not actually have any logic to handle weird indentation. Added now.

    – blhsing
    Jan 26 at 4:22











  • This fails to handle line continuations. Looking for INDENT and DEDENT tokens (and checking for the single-logical-line case, where there is no INDENT) would probably be more robust.

    – user2357112
    Jan 26 at 4:41











  • @user2357112 Using INDENT and DEDENT was indeed my first thought too, although it wasn't immediately clear to me how to handle single-logical-line cases easily. I've now rewritten the code so that it uses INDENTs and DEDENTs, but noticed that a continuation line is treated as if it were a single line by tokenize even though it is literally multiple lines, so the line numbers returned by tokenize would be off in such a case. It's apparently a bug/limitation of the tokenize module unfortunately.

    – blhsing
    Jan 26 at 6:08





















Rather than reinventing a parser, I would use Python itself.



Basically I would use the compile() built-in function, which can check whether a string is valid Python code by compiling it. I pass it a string made of selected lines, starting from each def down to the farthest line that does not fail to compile.



code_string = """
#A comment
def foo(a, b):
return a + b

def bir(a, b):
c = a + b
return c

class Bar(object):
def __init__(self):
self.my_list = [
'a',
'b',
]

def baz():
return [
1,
]

""".strip()

lines = code_string.split('n')

#looking for lines with 'def' keywords
defidxs = [e[0] for e in enumerate(lines) if 'def' in e[1]]

#getting the indentation of each 'def'
indents = {}
for i in defidxs:
ll = lines[i].split('def')
indents[i] = len(ll[0])

#extracting the strings
end = len(lines)-1
while end > 0:
if end < defidxs[-1]:
defidxs.pop()
try:
start = defidxs[-1]
except IndexError: #break if there are no more 'def'
break

#empty lines between functions will cause an error, let's remove them
if len(lines[end].strip()) == 0:
end = end -1
continue

try:
#fix lines removing indentation or compile will not compile
fixlines = [ll[indents[start]:] for ll in lines[start:end+1]] #remove indentation
body = 'n'.join(fixlines)
compile(body, '<string>', 'exec') #if it fails, throws an exception
print(body)
end = start #no need to parse less line if it succeed.
except:
pass

end = end -1


It is a bit nasty because of the except clause without specific exceptions, which is usually not recommended, but there is no way to know what may cause compile to fail, so I do not know how to avoid it.
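
For what it's worth, compile() documents its failure modes: it raises SyntaxError for invalid source and ValueError for source containing null bytes, so the bare except in the loop above could be narrowed to something like:

try:
    compile(body, '<string>', 'exec')
except (SyntaxError, ValueError):  # the exceptions compile() is documented to raise
    pass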



This prints:



def baz():
  return [
1,
  ]
def __init__(self):
  self.my_list = [
    'a',
    'b',
  ]
def bir(a, b):
  c = a + b
  return c
def foo(a, b):
  return a + b


Note that the functions are printed in the reverse of the order in which they appear inside code_string.



This should handle even the weirdly indented code, but I think it will fail if you have nested functions.






I think a small parser is in order to try and take into account these weird exceptions:



import re

code_string = """
# A comment.
def foo(a, b):
  return a + b
class Bar(object):
  def __init__(self):
    self.my_list = [
      'a',
      'b',
    ]

def baz():
  return [
1,
  ]

class Baz(object):
  def hello(self, x):
    return self.hello(
x - 1)

def my_type_annotated_function(
    my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
  # This function's indentation isn't unusual at all.
  pass

def test_multiline():
  """
  asdasdada
  sdadd
  """
  pass

def test_comment(
  a #)
  ):
  return [a,
  # ]
  a]

def test_escaped_endline():
  return "asdad \
asdsad \
asdas"

def test_nested():
  return {():[,
  {
  }
  ]
  }

def test_strings():
  return '""" asdasd' + """
  12asd
  12312
  "asd2" [
  """

"""
def test_fake_def_in_multiline()
"""
print(123)
a = "def in_string():"
def after():
  print("NOPE")

"""Phew this ain't valid syntax""" def something(): pass

""".strip()

code_string += '\n'


func_list = []
func = ''
tab = ''
brackets = {'(': 0, '[': 0, '{': 0}
close = {')': '(', ']': '[', '}': '{'}
string = ''
tab_f = ''
c1 = ''
multiline = False
check = False
for line in code_string.split('\n'):
    tab = re.findall(r'^\s*', line)[0]
    if re.findall(r'^\s*def', line) and not string and not multiline:
        func += line + '\n'
        tab_f = tab
        check = True
    if func:
        if not check:
            # a def block ends at the first line back at (or above) the def's
            # indentation, outside any string, multiline string or brackets
            if sum(brackets.values()) == 0 and not string and not multiline:
                if len(tab) <= len(tab_f):
                    func_list.append(func)
                    func = ''
                    c1 = ''
                    c2 = ''
                    continue
            func += line + '\n'
    check = False
    # track comment, string and bracket state character by character
    for c0 in line:
        if c0 == '#' and not string and not multiline:
            break
        if c1 != '\\':
            if c0 in ['"', "'"]:
                if c2 == c1 == c0 == '"' and string != "'":
                    multiline = not multiline
                    string = ''
                    continue
                if not multiline:
                    if c0 in string:
                        string = ''
                    else:
                        if not string:
                            string = c0
        if not string and not multiline:
            if c0 in brackets:
                brackets[c0] += 1
            if c0 in close:
                b = close[c0]
                brackets[b] -= 1
        c2 = c1
        c1 = c0

for f in func_list:
    print('-' * 40)
    print(f)


output:

----------------------------------------
def foo(a, b):
  return a + b

----------------------------------------
  def __init__(self):
    self.my_list = [
      'a',
      'b',
    ]

----------------------------------------
def baz():
  return [
1,
  ]

----------------------------------------
  def hello(self, x):
    return self.hello(
x - 1)

----------------------------------------
def my_type_annotated_function(
    my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
  # This function's indentation isn't unusual at all.
  pass

----------------------------------------
def test_multiline():
  """
  asdasdada
  sdadd
  """
  pass

----------------------------------------
def test_comment(
  a #)
  ):
  return [a,
  # ]
  a]

----------------------------------------
def test_escaped_endline():
  return "asdad asdsad asdas"

----------------------------------------
def test_nested():
  return {():[,
  {
  }
  ]
  }

----------------------------------------
def test_strings():
  return '""" asdasd' + """
  12asd
  12312
  "asd2" [
  """

----------------------------------------
def after():
  print("NOPE")





    • Writing a parser is hard. I haven't run your code but just by glancing at it, I think it fails for multiline strings (delimited with """) and escaped string delimiters, and it doesn't understand comments (which may contain stray brackets or string delimiters).

      – pkpnd
      Jan 26 at 2:35











    • Please do try it; I should've included cases with strings, and open/close brackets should not count if inside a string. EDIT: the escaped delimiters are an exception, I will include them.

      – Crivella
      Jan 26 at 2:38













    • You aren't checking for comments so there's no way you can tell if a close parenthesis should be counted or not (it shouldn't count if it's inside a comment).

      – pkpnd
      Jan 26 at 2:40






    • Included both escaped characters and comments. Sorry, I do tend to write parsers by starting simple and adding stuff as I find exceptions; not the best practice, I realize.

      – Crivella
      Jan 26 at 2:44











    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54374296%2fextract-python-function-source-text-from-the-source-code-string%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    3 Answers
    3






    active

    oldest

    votes








    3 Answers
    3






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    5














    A much more robust solution would be to use the tokenize module. The following code can handle weird indentations, comments, multi-line tokens, single-line function blocks and empty lines within function blocks:



    import tokenize
    from io import BytesIO
    from collections import deque
    code_string = """
    # A comment.
    def foo(a, b):
    return a + b

    class Bar(object):
    def __init__(self):

    self.my_list = [
    'a',
    'b',
    ]

    def test(self): pass
    def abc(self):
    '''multi-
    line token'''

    def baz():
    return [
    1,
    ]

    class Baz(object):
    def hello(self, x):
    a =
    1
    return self.hello(
    x - 1)

    def my_type_annotated_function(
    my_long_argument_name: SomeLongArgumentTypeName
    ) -> SomeLongReturnTypeName:
    pass
    # unmatched parenthesis: (
    """.strip()
    file = BytesIO(code_string.encode())
    tokens = deque(tokenize.tokenize(file.readline))
    lines =
    while tokens:
    token = tokens.popleft()
    if token.type == tokenize.NAME and token.string == 'def':
    start_line, _ = token.start
    last_token = token
    while tokens:
    token = tokens.popleft()
    if token.type == tokenize.NEWLINE:
    break
    last_token = token
    if last_token.type == tokenize.OP and last_token.string == ':':
    indents = 0
    while tokens:
    token = tokens.popleft()
    if token.type == tokenize.NL:
    continue
    if token.type == tokenize.INDENT:
    indents += 1
    elif token.type == tokenize.DEDENT:
    indents -= 1
    if not indents:
    break
    else:
    last_token = token
    lines.append((start_line, last_token.end[0]))
    print(lines)


    This outputs:



    [(2, 3), (6, 11), (13, 13), (14, 16), (18, 21), (24, 27), (29, 33)]


    Note however that the continuation line:



    a = 
    1


    is treated by tokenize as one line even though it is in fact two lines, since if you print the tokens:



    TokenInfo(type=53 (OP), string=':', start=(24, 20), end=(24, 21), line='  def hello(self, x):n')
    TokenInfo(type=4 (NEWLINE), string='n', start=(24, 21), end=(24, 22), line=' def hello(self, x):n')
    TokenInfo(type=5 (INDENT), string=' ', start=(25, 0), end=(25, 4), line=' a = 1n')
    TokenInfo(type=1 (NAME), string='a', start=(25, 4), end=(25, 5), line=' a = 1n')
    TokenInfo(type=53 (OP), string='=', start=(25, 6), end=(25, 7), line=' a = 1n')
    TokenInfo(type=2 (NUMBER), string='1', start=(25, 8), end=(25, 9), line=' a = 1n')
    TokenInfo(type=4 (NEWLINE), string='n', start=(25, 9), end=(25, 10), line=' a = 1n')
    TokenInfo(type=1 (NAME), string='return', start=(26, 4), end=(26, 10), line=' return self.hello(n')


    you can see that the continuation line is literally treated as one line of ' a = 1n', with only one line number 25. This is apparently a bug/limitation of the tokenize module unfortunately.






    share|improve this answer


























    • This looks promising. Are you sure it works for the "weird indentation" cases? I tried your code and it seems to break on all of the "weird indentation" functions I provided, extracting only the first part of each function.

      – pkpnd
      Jan 26 at 2:12











    • Oops did not actually have any logic to handle weird indentation. Added now.

      – blhsing
      Jan 26 at 4:22











    • This fails to handle line continuations. Looking for INDENT and DEDENT tokens (and checking for the single-logical-line case, where there is no INDENT) would probably be more robust.

      – user2357112
      Jan 26 at 4:41











    • @user2357112 Using INDENT and DEDENT was indeed my first thought too, although it wasn't immediately clear to me how to handle single-logical-line cases easily. I've now rewritten the code so that it uses INDENTs and DEDENTs, but noticed that a continuation line is treated as if it were a single line by tokenize even though it is literally multiple lines, so the line numbers returned by tokenize would be off in such a case. It's apparently a bug/limitation of the tokenize module unfortunately.

      – blhsing
      Jan 26 at 6:08


















    5














    A much more robust solution would be to use the tokenize module. The following code can handle weird indentations, comments, multi-line tokens, single-line function blocks and empty lines within function blocks:



    import tokenize
    from io import BytesIO
    from collections import deque
    code_string = """
    # A comment.
    def foo(a, b):
    return a + b

    class Bar(object):
    def __init__(self):

    self.my_list = [
    'a',
    'b',
    ]

    def test(self): pass
    def abc(self):
    '''multi-
    line token'''

    def baz():
    return [
    1,
    ]

    class Baz(object):
    def hello(self, x):
    a =
    1
    return self.hello(
    x - 1)

    def my_type_annotated_function(
    my_long_argument_name: SomeLongArgumentTypeName
    ) -> SomeLongReturnTypeName:
    pass
    # unmatched parenthesis: (
    """.strip()
    file = BytesIO(code_string.encode())
    tokens = deque(tokenize.tokenize(file.readline))
    lines =
    while tokens:
    token = tokens.popleft()
    if token.type == tokenize.NAME and token.string == 'def':
    start_line, _ = token.start
    last_token = token
    while tokens:
    token = tokens.popleft()
    if token.type == tokenize.NEWLINE:
    break
    last_token = token
    if last_token.type == tokenize.OP and last_token.string == ':':
    indents = 0
    while tokens:
    token = tokens.popleft()
    if token.type == tokenize.NL:
    continue
    if token.type == tokenize.INDENT:
    indents += 1
    elif token.type == tokenize.DEDENT:
    indents -= 1
    if not indents:
    break
    else:
    last_token = token
    lines.append((start_line, last_token.end[0]))
    print(lines)


    This outputs:



    [(2, 3), (6, 11), (13, 13), (14, 16), (18, 21), (24, 27), (29, 33)]


    Note however that the continuation line:



    a = 
    1


    is treated by tokenize as one line even though it is in fact two lines, since if you print the tokens:



    TokenInfo(type=53 (OP), string=':', start=(24, 20), end=(24, 21), line='  def hello(self, x):n')
    TokenInfo(type=4 (NEWLINE), string='n', start=(24, 21), end=(24, 22), line=' def hello(self, x):n')
    TokenInfo(type=5 (INDENT), string=' ', start=(25, 0), end=(25, 4), line=' a = 1n')
    TokenInfo(type=1 (NAME), string='a', start=(25, 4), end=(25, 5), line=' a = 1n')
    TokenInfo(type=53 (OP), string='=', start=(25, 6), end=(25, 7), line=' a = 1n')
    TokenInfo(type=2 (NUMBER), string='1', start=(25, 8), end=(25, 9), line=' a = 1n')
    TokenInfo(type=4 (NEWLINE), string='n', start=(25, 9), end=(25, 10), line=' a = 1n')
    TokenInfo(type=1 (NAME), string='return', start=(26, 4), end=(26, 10), line=' return self.hello(n')


    you can see that the continuation line is literally treated as one line of ' a = 1n', with only one line number 25. This is apparently a bug/limitation of the tokenize module unfortunately.






    share|improve this answer


























    • This looks promising. Are you sure it works for the "weird indentation" cases? I tried your code and it seems to break on all of the "weird indentation" functions I provided, extracting only the first part of each function.

      – pkpnd
      Jan 26 at 2:12











    • Oops did not actually have any logic to handle weird indentation. Added now.

      – blhsing
      Jan 26 at 4:22











    • This fails to handle line continuations. Looking for INDENT and DEDENT tokens (and checking for the single-logical-line case, where there is no INDENT) would probably be more robust.

      – user2357112
      Jan 26 at 4:41











    • @user2357112 Using INDENT and DEDENT was indeed my first thought too, although it wasn't immediately clear to me how to handle single-logical-line cases easily. I've now rewritten the code so that it uses INDENTs and DEDENTs, but noticed that a continuation line is treated as if it were a single line by tokenize even though it is literally multiple lines, so the line numbers returned by tokenize would be off in such a case. It's apparently a bug/limitation of the tokenize module unfortunately.

      – blhsing
      Jan 26 at 6:08
















    5












    5








    5







    A much more robust solution would be to use the tokenize module. The following code can handle weird indentations, comments, multi-line tokens, single-line function blocks and empty lines within function blocks:



    import tokenize
    from io import BytesIO
    from collections import deque
    code_string = """
    # A comment.
    def foo(a, b):
    return a + b

    class Bar(object):
    def __init__(self):

    self.my_list = [
    'a',
    'b',
    ]

    def test(self): pass
    def abc(self):
    '''multi-
    line token'''

    def baz():
    return [
    1,
    ]

    class Baz(object):
    def hello(self, x):
    a =
    1
    return self.hello(
    x - 1)

    def my_type_annotated_function(
    my_long_argument_name: SomeLongArgumentTypeName
    ) -> SomeLongReturnTypeName:
    pass
    # unmatched parenthesis: (
    """.strip()
    file = BytesIO(code_string.encode())
    tokens = deque(tokenize.tokenize(file.readline))
    lines =
    while tokens:
    token = tokens.popleft()
    if token.type == tokenize.NAME and token.string == 'def':
    start_line, _ = token.start
    last_token = token
    while tokens:
    token = tokens.popleft()
    if token.type == tokenize.NEWLINE:
    break
    last_token = token
    if last_token.type == tokenize.OP and last_token.string == ':':
    indents = 0
    while tokens:
    token = tokens.popleft()
    if token.type == tokenize.NL:
    continue
    if token.type == tokenize.INDENT:
    indents += 1
    elif token.type == tokenize.DEDENT:
    indents -= 1
    if not indents:
    break
    else:
    last_token = token
    lines.append((start_line, last_token.end[0]))
    print(lines)


    This outputs:



    [(2, 3), (6, 11), (13, 13), (14, 16), (18, 21), (24, 27), (29, 33)]


    Note however that the continuation line:



    a = 
    1


    is treated by tokenize as one line even though it is in fact two lines, since if you print the tokens:



    TokenInfo(type=53 (OP), string=':', start=(24, 20), end=(24, 21), line='  def hello(self, x):n')
    TokenInfo(type=4 (NEWLINE), string='n', start=(24, 21), end=(24, 22), line=' def hello(self, x):n')
    TokenInfo(type=5 (INDENT), string=' ', start=(25, 0), end=(25, 4), line=' a = 1n')
    TokenInfo(type=1 (NAME), string='a', start=(25, 4), end=(25, 5), line=' a = 1n')
    TokenInfo(type=53 (OP), string='=', start=(25, 6), end=(25, 7), line=' a = 1n')
    TokenInfo(type=2 (NUMBER), string='1', start=(25, 8), end=(25, 9), line=' a = 1n')
    TokenInfo(type=4 (NEWLINE), string='n', start=(25, 9), end=(25, 10), line=' a = 1n')
    TokenInfo(type=1 (NAME), string='return', start=(26, 4), end=(26, 10), line=' return self.hello(n')


    you can see that the continuation line is literally treated as one line of ' a = 1n', with only one line number 25. This is apparently a bug/limitation of the tokenize module unfortunately.






    share|improve this answer















    A much more robust solution would be to use the tokenize module. The following code can handle weird indentations, comments, multi-line tokens, single-line function blocks and empty lines within function blocks:



    import tokenize
    from io import BytesIO
    from collections import deque
    code_string = """
    # A comment.
    def foo(a, b):
    return a + b

    class Bar(object):
    def __init__(self):

    self.my_list = [
    'a',
    'b',
    ]

    def test(self): pass
    def abc(self):
    '''multi-
    line token'''

    def baz():
    return [
    1,
    ]

    class Baz(object):
    def hello(self, x):
    a =
    1
    return self.hello(
    x - 1)

    def my_type_annotated_function(
    my_long_argument_name: SomeLongArgumentTypeName
    ) -> SomeLongReturnTypeName:
    pass
    # unmatched parenthesis: (
    """.strip()
    file = BytesIO(code_string.encode())
    tokens = deque(tokenize.tokenize(file.readline))
    lines =
    while tokens:
    token = tokens.popleft()
    if token.type == tokenize.NAME and token.string == 'def':
    start_line, _ = token.start
    last_token = token
    while tokens:
    token = tokens.popleft()
    if token.type == tokenize.NEWLINE:
    break
    last_token = token
    if last_token.type == tokenize.OP and last_token.string == ':':
    indents = 0
    while tokens:
    token = tokens.popleft()
    if token.type == tokenize.NL:
    continue
    if token.type == tokenize.INDENT:
    indents += 1
    elif token.type == tokenize.DEDENT:
    indents -= 1
    if not indents:
    break
    else:
    last_token = token
    lines.append((start_line, last_token.end[0]))
    print(lines)


    This outputs:



    [(2, 3), (6, 11), (13, 13), (14, 16), (18, 21), (24, 27), (29, 33)]


    Note however that the continuation line:



    a = 
    1


    is treated by tokenize as one line even though it is in fact two lines, since if you print the tokens:



    TokenInfo(type=53 (OP), string=':', start=(24, 20), end=(24, 21), line='  def hello(self, x):n')
    TokenInfo(type=4 (NEWLINE), string='n', start=(24, 21), end=(24, 22), line=' def hello(self, x):n')
    TokenInfo(type=5 (INDENT), string=' ', start=(25, 0), end=(25, 4), line=' a = 1n')
    TokenInfo(type=1 (NAME), string='a', start=(25, 4), end=(25, 5), line=' a = 1n')
    TokenInfo(type=53 (OP), string='=', start=(25, 6), end=(25, 7), line=' a = 1n')
    TokenInfo(type=2 (NUMBER), string='1', start=(25, 8), end=(25, 9), line=' a = 1n')
    TokenInfo(type=4 (NEWLINE), string='n', start=(25, 9), end=(25, 10), line=' a = 1n')
    TokenInfo(type=1 (NAME), string='return', start=(26, 4), end=(26, 10), line=' return self.hello(n')


    you can see that the continuation line is literally treated as one line of ' a = 1n', with only one line number 25. This is apparently a bug/limitation of the tokenize module unfortunately.







    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Jan 26 at 14:58

























    answered Jan 26 at 2:06









    blhsingblhsing

    33.5k41437




    33.5k41437













    • This looks promising. Are you sure it works for the "weird indentation" cases? I tried your code and it seems to break on all of the "weird indentation" functions I provided, extracting only the first part of each function.

      – pkpnd
      Jan 26 at 2:12











    • Oops did not actually have any logic to handle weird indentation. Added now.

      – blhsing
      Jan 26 at 4:22











    • This fails to handle line continuations. Looking for INDENT and DEDENT tokens (and checking for the single-logical-line case, where there is no INDENT) would probably be more robust.

      – user2357112
      Jan 26 at 4:41











    • @user2357112 Using INDENT and DEDENT was indeed my first thought too, although it wasn't immediately clear to me how to handle single-logical-line cases easily. I've now rewritten the code so that it uses INDENTs and DEDENTs, but noticed that a continuation line is treated as if it were a single line by tokenize even though it is literally multiple lines, so the line numbers returned by tokenize would be off in such a case. It's apparently a bug/limitation of the tokenize module unfortunately.

      – blhsing
      Jan 26 at 6:08





















    • This looks promising. Are you sure it works for the "weird indentation" cases? I tried your code and it seems to break on all of the "weird indentation" functions I provided, extracting only the first part of each function.

      – pkpnd
      Jan 26 at 2:12











    • Oops did not actually have any logic to handle weird indentation. Added now.

      – blhsing
      Jan 26 at 4:22











    • This fails to handle line continuations. Looking for INDENT and DEDENT tokens (and checking for the single-logical-line case, where there is no INDENT) would probably be more robust.

      – user2357112
      Jan 26 at 4:41











    • @user2357112 Using INDENT and DEDENT was indeed my first thought too, although it wasn't immediately clear to me how to handle single-logical-line cases easily. I've now rewritten the code so that it uses INDENTs and DEDENTs, but noticed that a continuation line is treated as if it were a single line by tokenize even though it is literally multiple lines, so the line numbers returned by tokenize would be off in such a case. It's apparently a bug/limitation of the tokenize module unfortunately.

      – blhsing
      Jan 26 at 6:08



















    This looks promising. Are you sure it works for the "weird indentation" cases? I tried your code and it seems to break on all of the "weird indentation" functions I provided, extracting only the first part of each function.

    – pkpnd
    Jan 26 at 2:12





    This looks promising. Are you sure it works for the "weird indentation" cases? I tried your code and it seems to break on all of the "weird indentation" functions I provided, extracting only the first part of each function.

    – pkpnd
    Jan 26 at 2:12













    Oops did not actually have any logic to handle weird indentation. Added now.

    – blhsing
    Jan 26 at 4:22





    Oops did not actually have any logic to handle weird indentation. Added now.

    – blhsing
    Jan 26 at 4:22













    This fails to handle line continuations. Looking for INDENT and DEDENT tokens (and checking for the single-logical-line case, where there is no INDENT) would probably be more robust.

    – user2357112
    Jan 26 at 4:41





    This fails to handle line continuations. Looking for INDENT and DEDENT tokens (and checking for the single-logical-line case, where there is no INDENT) would probably be more robust.

    – user2357112
    Jan 26 at 4:41













    @user2357112 Using INDENT and DEDENT was indeed my first thought too, although it wasn't immediately clear to me how to handle single-logical-line cases easily. I've now rewritten the code so that it uses INDENTs and DEDENTs, but noticed that a continuation line is treated as if it were a single line by tokenize even though it is literally multiple lines, so the line numbers returned by tokenize would be off in such a case. It's apparently a bug/limitation of the tokenize module unfortunately.

    – blhsing
    Jan 26 at 6:08







    @user2357112 Using INDENT and DEDENT was indeed my first thought too, although it wasn't immediately clear to me how to handle single-logical-line cases easily. I've now rewritten the code so that it uses INDENTs and DEDENTs, but noticed that a continuation line is treated as if it were a single line by tokenize even though it is literally multiple lines, so the line numbers returned by tokenize would be off in such a case. It's apparently a bug/limitation of the tokenize module unfortunately.

    – blhsing
    Jan 26 at 6:08















    1














    Rather than reinventing a parser, I would use python itself.



    Basically I would use the compile() built-in function, which can check if a string is a valid python code by compiling it. I pass to it a string made of selected lines, starting from each def to the farther line which does not fail to compile.



    code_string = """
    #A comment
    def foo(a, b):
    return a + b

    def bir(a, b):
    c = a + b
    return c

    class Bar(object):
    def __init__(self):
    self.my_list = [
    'a',
    'b',
    ]

    def baz():
    return [
    1,
    ]

    """.strip()

    lines = code_string.split('n')

    #looking for lines with 'def' keywords
    defidxs = [e[0] for e in enumerate(lines) if 'def' in e[1]]

    #getting the indentation of each 'def'
    indents = {}
    for i in defidxs:
    ll = lines[i].split('def')
    indents[i] = len(ll[0])

    #extracting the strings
    end = len(lines)-1
    while end > 0:
    if end < defidxs[-1]:
    defidxs.pop()
    try:
    start = defidxs[-1]
    except IndexError: #break if there are no more 'def'
    break

    #empty lines between functions will cause an error, let's remove them
    if len(lines[end].strip()) == 0:
    end = end -1
    continue

    try:
    #fix lines removing indentation or compile will not compile
    fixlines = [ll[indents[start]:] for ll in lines[start:end+1]] #remove indentation
    body = 'n'.join(fixlines)
    compile(body, '<string>', 'exec') #if it fails, throws an exception
    print(body)
    end = start #no need to parse less line if it succeed.
    except:
    pass

    end = end -1


    It is a bit nasty because of the except clause without specific exceptions, which is usually not recommended, but there is no way to know what may cause compile to fail, so I do not know how to avoid it.



    This will prints



    def baz():
    return [
    1,
    ]
    def __init__(self):
    self.my_list = [
    'a',
    'b',
    ]
    def bir(a, b):
    c = a + b
    return c
    def foo(a, b):
    return a + b


    Note that the functions are printed in reverse order than those they appear inside code_strings



    This should handle even the weird indentation code, but I think it will fails if you have nested functions.






    share|improve this answer




























      1














      Rather than reinventing a parser, I would use python itself.



      Basically I would use the compile() built-in function, which can check if a string is a valid python code by compiling it. I pass to it a string made of selected lines, starting from each def to the farther line which does not fail to compile.



      code_string = """
      #A comment
      def foo(a, b):
      return a + b

      def bir(a, b):
      c = a + b
      return c

      class Bar(object):
      def __init__(self):
      self.my_list = [
      'a',
      'b',
      ]

      def baz():
      return [
      1,
      ]

      """.strip()

      lines = code_string.split('n')

      #looking for lines with 'def' keywords
      defidxs = [e[0] for e in enumerate(lines) if 'def' in e[1]]

      #getting the indentation of each 'def'
      indents = {}
      for i in defidxs:
      ll = lines[i].split('def')
      indents[i] = len(ll[0])

      #extracting the strings
      end = len(lines)-1
      while end > 0:
      if end < defidxs[-1]:
      defidxs.pop()
      try:
      start = defidxs[-1]
      except IndexError: #break if there are no more 'def'
      break

      #empty lines between functions will cause an error, let's remove them
      if len(lines[end].strip()) == 0:
      end = end -1
      continue

      try:
      #fix lines removing indentation or compile will not compile
      fixlines = [ll[indents[start]:] for ll in lines[start:end+1]] #remove indentation
      body = 'n'.join(fixlines)
      compile(body, '<string>', 'exec') #if it fails, throws an exception
      print(body)
      end = start #no need to parse less line if it succeed.
      except:
      pass

      end = end -1


      It is a bit nasty because of the except clause without specific exceptions, which is usually not recommended, but there is no way to know what may cause compile to fail, so I do not know how to avoid it.



      This will prints



      def baz():
      return [
      1,
      ]
      def __init__(self):
      self.my_list = [
      'a',
      'b',
      ]
      def bir(a, b):
      c = a + b
      return c
      def foo(a, b):
      return a + b


      Note that the functions are printed in reverse order than those they appear inside code_strings



      This should handle even the weird indentation code, but I think it will fails if you have nested functions.






      share|improve this answer


























        1












        1








        1







        Rather than reinventing a parser, I would use python itself.



        Basically I would use the compile() built-in function, which can check if a string is a valid python code by compiling it. I pass to it a string made of selected lines, starting from each def to the farther line which does not fail to compile.



        code_string = """
        #A comment
        def foo(a, b):
        return a + b

        def bir(a, b):
        c = a + b
        return c

        class Bar(object):
        def __init__(self):
        self.my_list = [
        'a',
        'b',
        ]

        def baz():
        return [
        1,
        ]

        """.strip()

        lines = code_string.split('n')

        #looking for lines with 'def' keywords
        defidxs = [e[0] for e in enumerate(lines) if 'def' in e[1]]

        #getting the indentation of each 'def'
        indents = {}
        for i in defidxs:
        ll = lines[i].split('def')
        indents[i] = len(ll[0])

        #extracting the strings
        end = len(lines)-1
        while end > 0:
        if end < defidxs[-1]:
        defidxs.pop()
        try:
        start = defidxs[-1]
        except IndexError: #break if there are no more 'def'
        break

        #empty lines between functions will cause an error, let's remove them
        if len(lines[end].strip()) == 0:
        end = end -1
        continue

        try:
        #fix lines removing indentation or compile will not compile
        fixlines = [ll[indents[start]:] for ll in lines[start:end+1]] #remove indentation
        body = 'n'.join(fixlines)
        compile(body, '<string>', 'exec') #if it fails, throws an exception
        print(body)
        end = start #no need to parse less line if it succeed.
        except:
        pass

        end = end -1


        It is a bit nasty because of the except clause without specific exceptions, which is usually not recommended, but there is no way to know what may cause compile to fail, so I do not know how to avoid it.



        This will prints



        def baz():
        return [
        1,
        ]
        def __init__(self):
        self.my_list = [
        'a',
        'b',
        ]
        def bir(a, b):
        c = a + b
        return c
        def foo(a, b):
        return a + b


        Note that the functions are printed in reverse order than those they appear inside code_strings



        This should handle even the weird indentation code, but I think it will fails if you have nested functions.






        share|improve this answer













        Rather than reinventing a parser, I would use python itself.



        Basically I would use the compile() built-in function, which can check if a string is a valid python code by compiling it. I pass to it a string made of selected lines, starting from each def to the farther line which does not fail to compile.



        code_string = """
        #A comment
        def foo(a, b):
        return a + b

        def bir(a, b):
        c = a + b
        return c

        class Bar(object):
        def __init__(self):
        self.my_list = [
        'a',
        'b',
        ]

        def baz():
        return [
        1,
        ]

        """.strip()

        lines = code_string.split('n')

        #looking for lines with 'def' keywords
        defidxs = [e[0] for e in enumerate(lines) if 'def' in e[1]]

        #getting the indentation of each 'def'
        indents = {}
        for i in defidxs:
        ll = lines[i].split('def')
        indents[i] = len(ll[0])

        #extracting the strings
        end = len(lines)-1
        while end > 0:
        if end < defidxs[-1]:
        defidxs.pop()
        try:
        start = defidxs[-1]
        except IndexError: #break if there are no more 'def'
        break

        #empty lines between functions will cause an error, let's remove them
        if len(lines[end].strip()) == 0:
        end = end -1
        continue

        try:
        #fix lines removing indentation or compile will not compile
        fixlines = [ll[indents[start]:] for ll in lines[start:end+1]] #remove indentation
        body = 'n'.join(fixlines)
        compile(body, '<string>', 'exec') #if it fails, throws an exception
        print(body)
        end = start #no need to parse less line if it succeed.
        except:
        pass

        end = end -1


        It is a bit nasty because of the except clause without specific exceptions, which is usually not recommended, but there is no way to know what may cause compile to fail, so I do not know how to avoid it.



        This will prints



        def baz():
        return [
        1,
        ]
        def __init__(self):
        self.my_list = [
        'a',
        'b',
        ]
        def bir(a, b):
        c = a + b
        return c
        def foo(a, b):
        return a + b


        Note that the functions are printed in reverse order than those they appear inside code_strings



        This should handle even the weird indentation code, but I think it will fails if you have nested functions.







        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Jan 26 at 3:08









        ValentinoValentino

        7131313




        7131313























            1














            I think a small parser is in order to try and take into account this weird exceptions:



import re

# note: triple quotes inside the test source are escaped (\"\"\") so that the
# outer string literal does not end early
code_string = """
# A comment.
def foo(a, b):
    return a + b
class Bar(object):
    def __init__(self):
        self.my_list = [
            'a',
            'b',
        ]

def baz():
      return [
    1,
       ]

class Baz(object):
    def hello(self, x):
        return self.hello(
            x - 1)

def my_type_annotated_function(
        my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
    # This function's indentation isn't unusual at all.
    pass

def test_multiline():
    \"\"\"
    asdasdada
    sdadd
    \"\"\"
    pass

def test_comment(
        a #)
):
    return [a,
            # ]
            a]

def test_escaped_endline():
    return "asdad \
asdsad \
asdas"

def test_nested():
    return {():[,
        {
        }
        ]
    }

def test_strings():
    return '\"\"\" asdasd' + \"\"\"
    12asd
    12312
    "asd2" [
    \"\"\"

\"\"\"
def test_fake_def_in_multiline()
\"\"\"
print(123)
a = "def in_string():"
def after():
    print("NOPE")

\"\"\"Phew this ain't valid syntax\"\"\" def something(): pass

""".strip()

code_string += '\n'


func_list = []
func = ''
tab = ''
brackets = {'(': 0, '[': 0, '{': 0}
close = {')': '(', ']': '[', '}': '{'}
string = ''
tab_f = ''
c1 = ''
c2 = ''
multiline = False
check = False
for line in code_string.split('\n'):
    tab = re.findall(r'^\s*', line)[0]
    # a new def starts only if we are not inside a string or a multiline string
    if re.findall(r'^\s*def', line) and not string and not multiline:
        func += line + '\n'
        tab_f = tab
        check = True
    if func:
        if not check:
            # the function ends when brackets are balanced, no string is open,
            # and the indentation falls back to the def's level or less
            if sum(brackets.values()) == 0 and not string and not multiline:
                if len(tab) <= len(tab_f):
                    func_list.append(func)
                    func = ''
                    c1 = ''
                    c2 = ''
                    continue
            func += line + '\n'
    check = False
    # scan the characters to track comments, strings and bracket balance
    for c0 in line:
        if c0 == '#' and not string and not multiline:
            break
        if c1 != '\\':
            if c0 in ['"', "'"]:
                if c2 == c1 == c0 == '"' and string != "'":
                    multiline = not multiline
                    string = ''
                    continue
                if not multiline:
                    if c0 in string:
                        string = ''
                    else:
                        if not string:
                            string = c0
        if not string and not multiline:
            if c0 in brackets:
                brackets[c0] += 1
            if c0 in close:
                b = close[c0]
                brackets[b] -= 1
        c2 = c1
        c1 = c0

for f in func_list:
    print('-' * 40)
    print(f)


output:



----------------------------------------
def foo(a, b):
    return a + b

----------------------------------------
    def __init__(self):
        self.my_list = [
            'a',
            'b',
        ]

----------------------------------------
def baz():
      return [
    1,
       ]

----------------------------------------
    def hello(self, x):
        return self.hello(
            x - 1)

----------------------------------------
def my_type_annotated_function(
        my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
    # This function's indentation isn't unusual at all.
    pass

----------------------------------------
def test_multiline():
    """
    asdasdada
    sdadd
    """
    pass

----------------------------------------
def test_comment(
        a #)
):
    return [a,
            # ]
            a]

----------------------------------------
def test_escaped_endline():
    return "asdad asdsad asdas"

----------------------------------------
def test_nested():
    return {():[,
        {
        }
        ]
    }

----------------------------------------
def test_strings():
    return '""" asdasd' + """
    12asd
    12312
    "asd2" [
    """

----------------------------------------
def after():
    print("NOPE")





answered Jan 26 at 2:28 (last edited Jan 26 at 9:39) by Crivella


























            • Writing a parser is hard. I haven't run your code but just by glancing at it, I think it fails for multiline strings (delimited with """) and escaped string delimiters, and it doesn't understand comments (which may contain stray brackets or string delimiters).

              – pkpnd
              Jan 26 at 2:35











            • Please do try it; I should have included test cases with strings, and open/close brackets should not be counted when they are inside a string. EDIT: the escaped delimiters are an exception, I will include them.

              – Crivella
              Jan 26 at 2:38













            • You aren't checking for comments so there's no way you can tell if a close parenthesis should be counted or not (it shouldn't count if it's inside a comment).

              – pkpnd
              Jan 26 at 2:40











              Included both escaped characters and comments. Sorry, I do tend to write parsers by starting simple and adding things as I find exceptions; not the best practice, I realize.

              – Crivella
              Jan 26 at 2:44
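
            (A separate sketch, not from either answer above: the string-and-comment bookkeeping discussed in these comments is exactly what the standard library's tokenize module already does. This assumes code_string holds the question's example source.)

            import io
            import tokenize

            # The tokenizer classifies strings and comments for us, so a 'def'
            # that only appears inside a string or a comment is never reported
            # as a NAME token.
            tokens = tokenize.generate_tokens(io.StringIO(code_string).readline)
            for tok in tokens:
                if tok.type == tokenize.NAME and tok.string == 'def':
                    print('def starts on line', tok.start[0])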















