Jtdylktuy

Question

Suppose I have valid Python source code, as a string:

code_string = """
# A comment.
def foo(a, b):
  return a + b
class Bar(object):
  def __init__(self):
    self.my_list = [
        'a',
        'b',
    ]
""".strip()

Objective: I would like to obtain the lines containing the source code of the function definitions, preserving whitespace. For the code string above, I would like to get the strings

def foo(a, b):
  return a + b

and

  def __init__(self):
    self.my_list = [
        'a',
        'b',
    ]

Or, equivalently, I'd be happy to get the line numbers of functions in the code string: foo spans lines 2-3, and __init__ spans lines 5-9.

Attempts

I can parse the code string into its AST:

import ast

code_ast = ast.parse(code_string)

And I can find the FunctionDef nodes, e.g.:

function_def_nodes = [node for node in ast.walk(code_ast)
                      if isinstance(node, ast.FunctionDef)]

Each FunctionDef node's lineno attribute tells us the first line for that function. We can estimate the last line of that function with:

last_line = max(node.lineno for node in ast.walk(function_def_node)
                if hasattr(node, 'lineno'))

but this doesn't work perfectly when the function ends with syntactic elements that don't show up as AST nodes, for instance the last ] in __init__.

I doubt there is an approach that only uses the AST, because the AST fundamentally does not have enough information in cases like __init__.
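For readers on Python 3.8 or later (an assumption; nothing above states a version), AST nodes also carry end_lineno and end_col_offset, and ast.get_source_segment() slices the original text for a node. A minimal sketch under that assumption, reusing the sample from the question; the names source, spans and segments are illustrative only:

```python
import ast

source = """
# A comment.
def foo(a, b):
  return a + b
class Bar(object):
  def __init__(self):
    self.my_list = [
        'a',
        'b',
    ]
""".strip()

tree = ast.parse(source)
spans = {}     # function name -> (first line, last line), 1-based
segments = {}  # function name -> source text of the definition
for node in ast.walk(tree):
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
        spans[node.name] = (node.lineno, node.end_lineno)
        # padded=True keeps the leading whitespace of the first line.
        segments[node.name] = ast.get_source_segment(source, node, padded=True)

print(spans)  # {'foo': (2, 3), '__init__': (5, 9)}
print(segments['__init__'])
```

The end position is that of the last statement in the body, so a comment trailing the body would not be included.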

I cannot use the inspect module because that only works on "live objects" and I only have the Python code as a string. I cannot eval the code because that's a huge security headache.

In theory I could write a parser for Python but that really seems like overkill.

A heuristic suggested in the comments is to use the leading whitespace of lines. However, that can break for strange but valid functions with weird indentation like:

def baz():
  return [
1,
  ]

class Baz(object):
  def hello(self, x):
    return self.hello(
x - 1)

def my_type_annotated_function(
  my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
  # This function's indentation isn't unusual at all.
  pass
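To make the failure concrete, here is a rough sketch of the "strictly more indented" heuristic applied to baz. The helper naive_end_index is hypothetical and only serves to illustrate the problem:

```python
def naive_end_index(lines, start):
    """Index of the last line kept by the 'strictly more indented' heuristic."""
    def indent(line):
        return len(line) - len(line.lstrip())

    base = indent(lines[start])
    end = start
    for i in range(start + 1, len(lines)):
        # Stop at the first non-blank line that is not indented deeper than the def.
        if lines[i].strip() and indent(lines[i]) <= base:
            break
        end = i
    return end

baz_lines = ["def baz():", "  return [", "1,", "  ]"]
print(naive_end_index(baz_lines, 0))  # 1: the scan stops before "1,"
```

The function really ends at index 3, so the heuristic cuts it off inside the list literal.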

I suppose you could just iterate lines, and when one matches ^(\s*)def\s.*$, extract that matched group (the leading whitespace) and then consume the line and all subsequent lines that startWith(thatWhitespace) — Jan 26 at 0:30
You mean, extract all subsequent lines that start with strictly more than that whitespace? Or else you'd also extract the following functions defined at the same indentation level — Jan 26 at 0:36
Hmm, doesn't work if the function has weird indentation inside, for example def baz():\n  return [\n1,\n  ] — Jan 26 at 0:48
Ah, I didn't even realise that was valid python. Looks like there's no simple text-processing method, then. — Jan 26 at 0:51

score 5 · Accepted Answer · 2019-01-26 14:58:06Z

A much more robust solution would be to use the tokenize module. The following code can handle weird indentations, comments, multi-line tokens, single-line function blocks and empty lines within function blocks:

import tokenize
from io import BytesIO
from collections import deque

code_string = """
# A comment.
def foo(a, b):
  return a + b

class Bar(object):
  def __init__(self):

    self.my_list = [
        'a',
        'b',
    ]

  def test(self): pass
  def abc(self):
    '''multi-
    line token'''

def baz():
  return [
1,
  ]

class Baz(object):
  def hello(self, x):
    a = \
1
    return self.hello(
x - 1)

def my_type_annotated_function(
  my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
  pass
  # unmatched parenthesis: (
""".strip()

file = BytesIO(code_string.encode())
tokens = deque(tokenize.tokenize(file.readline))
lines = []
while tokens:
    token = tokens.popleft()
    if token.type == tokenize.NAME and token.string == 'def':
        start_line, _ = token.start
        last_token = token
        # Consume the rest of the logical line that holds the signature.
        while tokens:
            token = tokens.popleft()
            if token.type == tokenize.NEWLINE:
                break
            last_token = token
        # A line ending in ':' opens an indented block; otherwise the body
        # sits on the same logical line (e.g. "def test(self): pass").
        if last_token.type == tokenize.OP and last_token.string == ':':
            indents = 0
            while tokens:
                token = tokens.popleft()
                if token.type == tokenize.NL:
                    continue
                if token.type == tokenize.INDENT:
                    indents += 1
                elif token.type == tokenize.DEDENT:
                    indents -= 1
                    # Back at the indentation of the def: the block is over.
                    if not indents:
                        break
                else:
                    last_token = token
        lines.append((start_line, last_token.end[0]))
print(lines)

This outputs:

[(2, 3), (6, 11), (13, 13), (14, 16), (18, 21), (24, 27), (29, 33)]
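Since the question also asks for the source text itself, (start, end) pairs like these can be turned back into strings by slicing the physical lines. A small sketch; extract_line_ranges is a hypothetical helper name, and the pairs are assumed to be 1-based and inclusive, as in the output above:

```python
def extract_line_ranges(source, ranges):
    """Return the text of each 1-based, inclusive (start, end) line range."""
    # keepends=True preserves the original line endings and whitespace.
    physical_lines = source.splitlines(keepends=True)
    return ["".join(physical_lines[start - 1:end]) for start, end in ranges]

sample = "# A comment.\ndef foo(a, b):\n  return a + b\nx = 1\n"
print(extract_line_ranges(sample, [(2, 3)]))
# ['def foo(a, b):\n  return a + b\n']
```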

Note however that the continuation line:

a = \
1

is treated by tokenize as one line even though it is in fact two lines, since if you print the tokens:

TokenInfo(type=53 (OP), string=':', start=(24, 20), end=(24, 21), line='  def hello(self, x):\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(24, 21), end=(24, 22), line='  def hello(self, x):\n')
TokenInfo(type=5 (INDENT), string='    ', start=(25, 0), end=(25, 4), line='    a = 1\n')
TokenInfo(type=1 (NAME), string='a', start=(25, 4), end=(25, 5), line='    a = 1\n')
TokenInfo(type=53 (OP), string='=', start=(25, 6), end=(25, 7), line='    a = 1\n')
TokenInfo(type=2 (NUMBER), string='1', start=(25, 8), end=(25, 9), line='    a = 1\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(25, 9), end=(25, 10), line='    a = 1\n')
TokenInfo(type=1 (NAME), string='return', start=(26, 4), end=(26, 10), line='    return self.hello(\n')

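you can see that the continuation line is literally treated as one line of '    a = 1\n', with only one line number, 25. The merge happens before tokenize ever sees the text, though: code_string is an ordinary (non-raw) string literal, and inside such a literal a backslash followed by a newline is an escape sequence that Python drops, so the string really contains the single line '    a = 1'. When the source is read from a file or written as a raw literal (r"""..."""), the backslash survives and tokenize reports the two physical lines separately.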
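The effect can be reproduced in isolation. The sketch below (illustrative names; not part of the answer's code) tokenizes the same two-line assignment once from an ordinary string literal and once from a raw one, and records the line on which the NUMBER token starts:

```python
import io
import tokenize

def number_token_lines(source):
    """Return the starting line of every NUMBER token in source."""
    readline = io.BytesIO(source.encode()).readline
    return [tok.start[0] for tok in tokenize.tokenize(readline)
            if tok.type == tokenize.NUMBER]

# Ordinary literal: the backslash-newline pair is consumed by the literal itself.
plain = """
def f():
    a = \
1
    return a
""".strip()

# Raw literal: the backslash and the newline both stay in the string.
raw = r"""
def f():
    a = \
1
    return a
""".strip()

print(plain.count("\n"), number_token_lines(plain))  # 2 [2]
print(raw.count("\n"), number_token_lines(raw))      # 3 [3]
```

With the ordinary literal the text handed to tokenize has only three lines; with the raw literal the backslash is kept and the number is reported on line 3.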

This looks promising. Are you sure it works for the "weird indentation" cases? I tried your code and it seems to break on all of the "weird indentation" functions I provided, extracting only the first part of each function. — Jan 26 at 2:12
Oops did not actually have any logic to handle weird indentation. Added now. — Jan 26 at 4:22
This fails to handle line continuations. Looking for INDENT and DEDENT tokens (and checking for the single-logical-line case, where there is no INDENT) would probably be more robust. — Jan 26 at 4:41
@user2357112 Using INDENT and DEDENT was indeed my first thought too, although it wasn't immediately clear to me how to handle single-logical-line cases easily. I've now rewritten the code so that it uses INDENTs and DEDENTs, but noticed that a continuation line is treated as if it were a single line by tokenize even though it is literally multiple lines, so the line numbers returned by tokenize would be off in such a case. It's apparently a bug/limitation of the tokenize module unfortunately. — Jan 26 at 6:08

Valentino · Accepted Answer · 2019-01-26 03:08:41Z

Rather than reinventing a parser, I would use Python itself.

Basically, I would use the built-in compile() function, which checks whether a string is valid Python code by compiling it. I pass it a string made of selected lines, starting at each def and extending to the farthest line for which compilation still succeeds.

code_string = """
#A comment
def foo(a, b):
  return a + b

def bir(a, b):
  c = a + b
  return c

class Bar(object):
  def __init__(self):
    self.my_list = [
        'a',
        'b',
    ]

def baz():
  return [
1,
  ]

""".strip()

lines = code_string.split('\n')

# look for lines containing the 'def' keyword
defidxs = [e[0] for e in enumerate(lines) if 'def' in e[1]]

# record the indentation of each 'def'
indents = {}
for i in defidxs:
    ll = lines[i].split('def')
    indents[i] = len(ll[0])

# extract the strings, scanning from the last line backwards
end = len(lines)-1
while end > 0:
    if end < defidxs[-1]:
        defidxs.pop()
    try:
        start = defidxs[-1]
    except IndexError: # break if there are no more 'def'
        break

    # empty lines between functions will cause an error, so skip them
    if len(lines[end].strip()) == 0:
        end = end -1
        continue

    try:
        # strip the indentation of the 'def' line, or compile() will fail
        fixlines = [ll[indents[start]:] for ll in lines[start:end+1]]
        body = '\n'.join(fixlines)
        compile(body, '<string>', 'exec') # raises an exception if it fails
        print(body)
        end = start # no need to try fewer lines once it succeeds
    except:
        pass

    end = end -1

It is a bit nasty because of the except clause without specific exceptions, which is usually not recommended, but there is no way to know what may cause compile to fail, so I do not know how to avoid it.

This prints:

def baz():
  return [
1,
  ]
def __init__(self):
  self.my_list = [
      'a',
      'b',
  ]
def bir(a, b):
  c = a + b
  return c
def foo(a, b):
  return a + b

Note that the functions are printed in the reverse of the order in which they appear in code_string.

This should handle even the weirdly indented code, but I think it will fail if you have nested functions.
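On the bare except mentioned above: the documentation for compile() states that invalid source raises SyntaxError (IndentationError is a subclass of it) and that source containing null bytes raises ValueError, so a narrower clause is probably enough. A hedged sketch, with compiles as a hypothetical helper name:

```python
def compiles(source):
    """Return True if source compiles as a module. The code is never executed."""
    try:
        compile(source, "<string>", "exec")
    except (SyntaxError, ValueError):
        return False
    return True

print(compiles("def baz():\n  return [\n1,\n  ]"))  # True
print(compiles("def baz():\n  return ["))            # False
```

Pathological inputs might still raise other exceptions (for example on extremely deep nesting), so this is a narrowing of the clause rather than a guarantee.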

score 1 · Accepted Answer · 2019-01-26 09:39:05Z

I think a small parser is in order, to try to take these weird exceptions into account:

import re

code_string = """
# A comment.
def foo(a, b):
  return a + b
class Bar(object):
  def __init__(self):
    self.my_list = [
        'a',
        'b',
    ]

def baz():
  return [
1,
  ]

class Baz(object):
  def hello(self, x):
    return self.hello(
x - 1)

def my_type_annotated_function(
  my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
  # This function's indentation isn't unusual at all.
  pass

def test_multiline():
    \"\"\"
    asdasdada
sdadd
    \"\"\"
    pass

def test_comment(
    a #)
):
    return [a,
    # ]
a]

def test_escaped_endline():
    return "asdad \
asdsad \
asdas"

def test_nested():
    return {():[[],
{
}
]
}

def test_strings():
    return '\"\"\" asdasd' + \"\"\"
12asd
12312
"asd2" [
\"\"\"

\"\"\"
def test_fake_def_in_multiline()
\"\"\"
    print(123)
a = "def in_string():"
  def after():
    print("NOPE")

\"\"\"Phew this ain't valid syntax\"\"\" def something(): pass

""".strip()

code_string += '\n'

func_list = []
func = ''
tab  = ''
brackets = {'(':0, '[':0, '{':0}
close = {')':'(', ']':'[', '}':'{'}
string=''
tab_f=''
# c1 and c2 hold the previous two characters seen by the scanner
c1=''
c2=''
multiline=False
check=False
for line in code_string.split('\n'):
    tab = re.findall(r'^\s*',line)[0]
    # a 'def' only counts when we are not inside a string
    if re.findall(r'^\s*def', line) and not string and not multiline:
        func += line + '\n'
        tab_f = tab
        check=True
    if func:
        if not check:
            # the function ends at the first line that is not indented deeper
            # than its 'def', once all brackets and strings are closed
            if sum(brackets.values()) == 0 and not string and not multiline:
                if len(tab) <= len(tab_f):
                    func_list.append(func)
                    func=''
                    c1=''
                    c2=''
                    continue
            func += line + '\n'
        check = False
    for c0 in line:
        # ignore the rest of the line after a comment marker
        if c0 == '#' and not string and not multiline:
            break
        if c1 != '\\':
            if c0 in ['"', "'"]:
                if c2 == c1 == c0 == '"' and string != "'":
                    multiline = not multiline
                    string = ''
                    continue
                if not multiline:
                    if c0 in string:
                        string = ''
                    else:
                        if not string:
                            string = c0
            if not string and not multiline:
                if c0 in brackets:
                    brackets[c0] += 1
                if c0 in close:
                    b = close[c0]
                    brackets[b] -= 1
        c2=c1
        c1=c0

for f in func_list:
    print('-'*40)
    print(f)

output:

----------------------------------------
def foo(a, b):
  return a + b

----------------------------------------
  def __init__(self):
    self.my_list = [
        'a',
        'b',
    ]

----------------------------------------
def baz():
  return [
1,
  ]

----------------------------------------
  def hello(self, x):
    return self.hello(
x - 1)

----------------------------------------
def my_type_annotated_function(
  my_long_argument_name: SomeLongArgumentTypeName
) -> SomeLongReturnTypeName:
  # This function's indentation isn't unusual at all.
  pass

----------------------------------------
def test_multiline():
    """
    asdasdada
sdadd
    """
    pass

----------------------------------------
def test_comment(
    a #)
):
    return [a,
    # ]
a]

----------------------------------------
def test_escaped_endline():
    return "asdad asdsad asdas"

----------------------------------------
def test_nested():
    return {():[[],
{
}
]
}

----------------------------------------
def test_strings():
    return '""" asdasd' + """
12asd
12312
"asd2" [
"""

----------------------------------------
  def after():
    print("NOPE")

Writing a parser is hard. I haven't run your code but just by glancing at it, I think it fails for multiline strings (delimited with """) and escaped string delimiters, and it doesn't understand comments (which may contain stray brackets or string delimiters). — Jan 26 at 2:35
Please do try it i should've included cases including strings and open/close brackets should not count if inside a string. EDIT: the escaped delimiters are an exception i will include it — Jan 26 at 2:38
You aren't checking for comments so there's no way you can tell if a close parenthesis should be counted or not (it shouldn't count if it's inside a comment). — Jan 26 at 2:40
Included both escaped characters and comments. Sorry i do tend to write parsers by starting simple and adding stuff as i find exception, not the best practice i realize — Jan 26 at 2:44
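As a possible cross-check for the string and comment handling discussed in these comments, the standard tokenize module can report where def appears as an actual keyword token. This swaps in the tokenizer for the hand-written character scan above and is only a sketch; def_keyword_lines and sample are illustrative names:

```python
import io
import tokenize

def def_keyword_lines(source):
    """Return the 1-based lines on which 'def' occurs as a NAME token."""
    readline = io.StringIO(source).readline
    return [tok.start[0] for tok in tokenize.generate_tokens(readline)
            if tok.type == tokenize.NAME and tok.string == "def"]

# 'def' inside a string or a comment is not a NAME token, so only line 3 counts.
sample = 'a = "def in_string():"\n# def in_comment():\ndef real():\n    pass\n'
print(def_keyword_lines(sample))  # [3]
```

Unlike the character scan, tokenize raises an error on some malformed input, such as an unterminated string, so it suits source that is known to be valid.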

Extract Python function source text from the source code string
