How can I get a Wikipedia article's text using Python 3 with Beautiful Soup?

14 votes
I have this script written in Python 3:

response = simple_get("https://en.wikipedia.org/wiki/Mathematics")  # simple_get: helper defined elsewhere that returns the page HTML, or None on failure
result = {}
result["url"] = url
if response is not None:
    html = BeautifulSoup(response, 'html.parser')
    title = html.select("#firstHeading")[0].text

As you can see, I can get the title of the article, but I cannot figure out how to get the text from "Mathematics (from Greek μά..." down to the contents table.










python web-scraping beautifulsoup


asked Dec 16 at 17:17 by wiki one, edited Dec 17 at 13:43 by Boann


6 Answers


13 votes














Select the <p> tags. There are 52 of them. Not sure if you want the whole thing, but you can iterate through those tags to store them however you like. I just chose to print each of them to show the output.

import bs4
import requests

response = requests.get("https://en.wikipedia.org/wiki/Mathematics")

if response is not None:
    html = bs4.BeautifulSoup(response.text, 'html.parser')

    title = html.select("#firstHeading")[0].text
    paragraphs = html.select("p")
    for para in paragraphs:
        print(para.text)

    # just grab the text up to contents as stated in question
    intro = '\n'.join([para.text for para in paragraphs[0:5]])
    print(intro)





answered Dec 16 at 17:24 by chitown88 (edited Dec 16 at 17:33)

• (3) if response is not None can be rewritten as if response. Also, since the content may change in the future, I would suggest getting the entire div, reading only the p elements, and stopping when you reach the div with class "toclimit-3". – PinoSan, Dec 16 at 17:42
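
Not part of the answer above, but to make PinoSan's suggestion concrete, here is a minimal sketch of that approach. It assumes the article body lives in a div with class mw-parser-output and that the table of contents sits inside a div carrying the toclimit-3 class mentioned in the comment:

import requests
from bs4 import BeautifulSoup

res = requests.get("https://en.wikipedia.org/wiki/Mathematics")
res.raise_for_status()  # fail loudly on HTTP errors instead of checking for None

soup = BeautifulSoup(res.text, "html.parser")
content = soup.find("div", class_="mw-parser-output")  # assumed main content container

intro = []
for child in content.find_all(["p", "div"], recursive=False):
    # stop once we hit the table-of-contents wrapper mentioned in the comment
    if child.name == "div" and "toclimit-3" in child.get("class", []):
        break
    if child.name == "p":
        intro.append(child.get_text())

print("\n".join(intro))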






• (4) @PinoSan I think it doesn't hurt to check for None explicitly. For example, bool('' is not None) is not the same as bool(''). However, in this case the None check is completely unnecessary, because response will always be a requests.models.Response object; if the request fails, an exception will be raised. – t.m.adam, Dec 16 at 18:42

• @t.m.adam What you are saying is true, but as you said, the response is not a string. So you just want to check that it is a valid object, not an empty string, None, or an empty dictionary. About the exceptions: I agree we should check for exceptions in case of network errors, but we should also check that the status code is 200. – PinoSan, Dec 16 at 19:17

• @PinoSan Of course, and I too prefer the if response style, but you know, "Explicit is better than implicit." The problem with if response is that it may produce strange errors that are difficult to debug. But yes, in most cases a simple boolean check should be enough. – t.m.adam, Dec 16 at 19:22

• Just because you can scrape the page doesn't mean you should. The Wikipedia API has Python packages that allow easy, direct access to articles without undue load on the site or extra work on your end. – Noah B. Johnson, Dec 18 at 1:56



















25 votes

There is a much, much easier way to get information from Wikipedia: the Wikipedia API.

There is a Python wrapper for it, wikipediaapi, which lets you do this in just a few lines with zero HTML parsing:



          import wikipediaapi

          wiki_wiki = wikipediaapi.Wikipedia('en')

          page = wiki_wiki.page('Mathematics')
          print(page.summary)


          Prints:




          Mathematics (from Greek μάθημα máthēma, "knowledge, study, learning")
          includes the study of such topics as quantity, structure, space, and
          change...(omitted intentionally)




          And, in general, try to avoid screen-scraping if there's a direct API available.






answered Dec 16 at 17:47 by alecxe (edited Dec 17 at 0:10)

13 votes

Use the wikipedia library:

import wikipedia
#print(wikipedia.summary("Mathematics"))
#wikipedia.search("Mathematics")
print(wikipedia.page("Mathematics").content)





answered Dec 16 at 17:26 by QHarr (edited Dec 16 at 17:34)

• (3) I'd use wikipediaapi instead; the wikipedia module seems to be unmaintained. Though both would do the job in a similar manner. – alecxe, Dec 16 at 17:49





















4 votes

You can get the desired output using the lxml library, as follows:

import requests
from lxml.html import fromstring

url = "https://en.wikipedia.org/wiki/Mathematics"

res = requests.get(url)
source = fromstring(res.content)
paragraph = '\n'.join([item.text_content() for item in source.xpath('//p[following::h2[2][span="History"]]')])
print(paragraph)

Using BeautifulSoup:

from bs4 import BeautifulSoup
import requests

res = requests.get("https://en.wikipedia.org/wiki/Mathematics")
soup = BeautifulSoup(res.text, 'html.parser')
for item in soup.find_all("p"):
    if item.text.startswith("The history"): break
    print(item.text)





answered Dec 16 at 19:14 by SIM (edited Dec 16 at 19:54)


2 votes

What you seem to want is the (HTML) page content without the surrounding navigation elements. As I described in this earlier answer from 2013, there are (at least) two ways to get it:




              • Probably the easiest way in your case is to include the parameter action=render in the URL, as in https://en.wikipedia.org/wiki/Mathematics?action=render. This will give you just the content HTML, and nothing else.


              • Alternatively, you can also obtain the page content via the MediaWiki API, as in https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Mathematics.



              The advantage of using the API is that it can also give you a lot of other information about the page that you may find useful. For example, if you'd like to have a list of the interlanguage links normally shown in the page's sidebar, or the categories normally shown below the content area, you can get those from the API like this:



              https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Mathematics&prop=langlinks|categories



              (To also get the page content with the same request, use prop=langlinks|categories|text.)



There are several Python libraries for using the MediaWiki API that can automate some of the nitty-gritty details, although the feature sets they support can vary. That said, using the API directly from your code, without a library in between, is perfectly possible too.
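
As a rough illustration of the API route (this sketch is not part of the original answer), the same parse request can be issued from Python with requests, asking for JSON output instead of XML. The prop value matches the one suggested above, and the parse / text / "*" key layout follows the JSON structure shown later in this thread:

import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "format": "json",                     # JSON is convenient to consume from Python
    "action": "parse",
    "page": "Mathematics",
    "prop": "langlinks|categories|text",  # content HTML plus the extras mentioned above
}

data = requests.get(API, params=params).json()
parse = data["parse"]

html = parse["text"]["*"]         # the rendered content HTML
langlinks = parse["langlinks"]    # interlanguage links
categories = parse["categories"]  # categories

print(parse["title"], len(html), len(langlinks), len(categories))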






answered Dec 17 at 2:17 by Ilmari Karonen


2 votes

To do this in a proper, reusable way, you can use the JSON API offered by Wikipedia:

from urllib.request import urlopen
from urllib.parse import urlencode
from json import loads


def getJSON(page):
    params = urlencode({
        'format': 'json',
        'action': 'parse',
        'prop': 'text',
        'redirects': 'true',
        'page': page})
    API = "https://en.wikipedia.org/w/api.php"
    response = urlopen(API + "?" + params)
    return response.read().decode('utf-8')


def getRawPage(page):
    parsed = loads(getJSON(page))
    try:
        title = parsed['parse']['title']
        content = parsed['parse']['text']['*']
        return title, content
    except KeyError:
        # The page doesn't exist
        return None, None


title, content = getRawPage("Mathematics")





                You can then parse it with any library you want to extract what you need :)
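
As a small illustration of that last step (not part of the original answer), the returned HTML can be handed to Beautiful Soup to pull out the lead paragraphs; this assumes the getRawPage function defined above:

from bs4 import BeautifulSoup

title, content = getRawPage("Mathematics")
soup = BeautifulSoup(content, "html.parser")

# Collect the non-empty paragraphs and print the first few, as in the question.
paragraphs = [p.get_text().strip() for p in soup.find_all("p")]
paragraphs = [text for text in paragraphs if text]

print(title)
print("\n".join(paragraphs[:5]))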






answered Dec 17 at 17:55 by LaSul

