Command line tool to search and replace text on a PDF
I have a PDF that has my name as an obnoxious watermark throughout a rather long file. I tried replacing the text with blanks in LibreOffice Draw, but although my name does appear as text, the find-and-replace function seems to tank my computer, taking significant RAM and CPU time.
Is there a command-line way to remove strings from a PDF? Hmm... can sed do that?
command-line libreoffice pdf
asked Dec 14 at 21:45 by j0h; edited 2 days ago by Pablo Bianchi
2 Answers
In many cases the watermark is stored as plain text, so you can often remove it with sed or, in fact, any text editor. Let's say it says “watermark”:
sed 's/watermark//g' in.pdf >out.pdf
If your PDF file is compressed, this doesn't work; you need to uncompress it first, e.g. with pdftk (How can I install pdftk in Ubuntu 18.04 and later?):
pdftk in.pdf output out.pdf uncompress
If sed's output is not readable with your preferred PDF reader, try repairing it with pdftk:
pdftk out.pdf output out_pdftk.pdf
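Whether the sed substitution can match at all is easy to check first. The following is a minimal sketch, not part of the original answer: has_literal is a hypothetical helper name, and uncompressed.pdf in the usage comment is a placeholder for whatever file the pdftk uncompress step produced. It only reports whether the search string occurs byte for byte in the file; if it does not, sed would leave the file unchanged.

```shell
#!/bin/sh
# has_literal FILE STRING: succeed if STRING occurs byte for byte in FILE.
#   -a  treat binary data as text
#   -F  take STRING literally instead of as a regular expression
#   -q  print nothing, report only through the exit status
has_literal() {
    LC_ALL=C grep -a -q -F -- "$2" "$1"
}

# Usage (placeholder file name):
#   if has_literal uncompressed.pdf 'watermark'; then
#       echo "literal match found, sed has a chance"
#   else
#       echo "no literal match, sed will change nothing"
#   fi
```

A positive result is necessary but not sufficient: the string may also occur in places other than the watermark.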
Further reading: How to Edit PDFs?
Source: How to remove watermark from pdf using pdftk • Super User
Sorry, your answer is as wrong as it could be. What appears to be ASCII text in a PDF viewer's visual representation may be hex encoded inside the PDF source code, or its characters might be placed individually, each with its own coordinate information sprinkled in between... Hence your sed command will not succeed -- not even after uncompressing the PDF.
– Kurt Pfeifle
2 days ago
@KurtPfeifle It may be, but it may also be just ASCII text; PDF is far from being standardized enough to be sure. I changed the wording to make it sound less absolute. Of course you're right that there are cases where sed fails here. You're familiar with the topic; do you have a better answer for OP? I'd love to learn about it!
– dessert
2 days ago
PDF is standardized enough for me to be sure about what I wrote above and below. I'm quite familiar with this standard. The cases where the sed method will fail are the overwhelming majority. The OP's requirements cannot be fulfilled by a CLI tool.
– Kurt Pfeifle
2 days ago
Accepted answer will work only in rare cases
Sorry, the answer given by @dessert is as wrong as it could be as general advice. It will not work for the general case of text replacement in PDFs (watermarks or not), and you'll have to be very lucky to encounter one of the very rare PDFs where it would work. (Moreover, watermarks inserted by LibreOffice are frequently converted into vector or pixel graphics, even if they look like text when printed or viewed on screen. I'll not discuss that case any further; below I deal only with real text content in a PDF.)
Reasons
The reasons for this are these:
What appears to be ASCII text in a PDF viewer's visual representation very likely will not be ASCII text inside the PDF source code. Instead it may be hex encoded.
Additionally, an ASCII string's characters might be placed on the page in consecutive order, but they may just as easily be placed individually, each with its own coordinate information sprinkled in between the characters.
Also, the hex encoding of the ASCII (and non-ASCII) character table (the "mapping") is not predictable, and it may change from font to font.
Hence in all these cases your sed command will not succeed -- not even after uncompressing the PDF.
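To illustrate the hex-encoding point: even in the simplest case, where a font's character codes happen to coincide with ASCII (an assumption made here for illustration, and by no means guaranteed), a PDF may write the string in hex notation. This sketch uses a hypothetical helper, to_pdf_hex, to show what the word would look like then. A substitution such as sed 's/Watermark//g' has nothing to match.

```shell
#!/bin/sh
# to_pdf_hex STRING: print STRING as a PDF hex string, <...>, two hex digits per byte.
# Assumes the font's codes equal the ASCII byte values, which is only one of
# many possibilities in real files.
to_pdf_hex() {
    # od prints the bytes as hex; -v disables its collapsing of repeated lines.
    hex=$(printf '%s' "$1" | od -An -v -tx1 | tr -d '[:space:]')
    printf '<%s>\n' "$hex"
}

to_pdf_hex 'Watermark'   # prints <57617465726d61726b>
```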
Example
Here is an example of how the "string" Watermark can appear inside a PDF created with LibreOffice:
56.8 726.989 Td /F2 16 Tf[<01>29<0203>-2<0405>6<06>-1<020507>]TJ
I'll dissect for you what that means:
56.8 726.989 Td: Td is an operator to move the text position on the page; 56.8 726.989 are the x-/y-coordinates describing that exact position.
/F2 16 Tf: Tf is an operator to set a certain font as well as its size as the currently active one; in this case it is the font tagged elsewhere with the name /F2, and its size should be 16 pt.
[<01>29<0203>-2<0405>6<06>-1<020507>]TJ: TJ is an operator to show text while at the same time allowing for individual glyph positioning. The meanings of the hex snippets enclosed in angle brackets are the following, according to the 'charmap' table specific to that PDF and the font used:
<01>: this is the 'W'.
<0203>: this is the 'at'.
<0405>: this is the 'er'.
<06>: this is the 'm'.
<020507>: this is the 'ark'.
The numbers in between these hex snippets (29, -2, 6 and -1) are correction values which determine the individual spacings of the different characters.
Now you show me how you'd replace that "string" with something else by using sed... Remember, you do not know the encoding in advance, nor the placement correction numbers, when you deal with an arbitrary PDF. You can only find out by opening its source code in an editor and analysing its content.
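The dissection above can be replayed mechanically. The sketch below hard-codes the code-to-character table from this one example (decode_code and decode_tj are hypothetical helper names, not part of any PDF tool) and recovers the visible text from the TJ array. In a real PDF the table would first have to be dug out of the font's mapping data, which is exactly the per-file analysis described here.

```shell
#!/bin/sh
# decode_code CODE: map one two-digit code to its character.
# The table is copied from the example above and is valid for that font only.
decode_code() {
    case "$1" in
        01) printf 'W' ;;
        02) printf 'a' ;;
        03) printf 't' ;;
        04) printf 'e' ;;
        05) printf 'r' ;;
        06) printf 'm' ;;
        07) printf 'k' ;;
        *)  printf '?' ;;   # code not present in the example table
    esac
}

# decode_tj ARRAY: print the text that a TJ array of hex strings would show.
decode_tj() {
    # grep -o keeps only the <...> hex strings (kerning numbers are dropped),
    # tr strips the angle brackets, fold splits the digits into two-digit codes.
    printf '%s\n' "$1" |
        grep -o '<[0-9A-Fa-f]*>' |
        tr -d '<>' |
        fold -w 2 |
        while read -r code; do
            decode_code "$code"
        done
    printf '\n'
}

decode_tj '[<01>29<0203>-2<0405>6<06>-1<020507>]'   # prints Watermark
```

Going the other way, from a search string to the bytes to look for, needs the same table in reverse plus the kerning numbers, which is why a fixed sed pattern cannot be written in advance.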
Executive Summary
No, there is no command line way to reliably remove unwanted strings from a PDF!
You can only do this if...
(a) ...you are a PDF expert skilled at reading PDF source code;
(b) ...you are prepared to analyse the PDF file in question individually;
(c) ...you use a text editor to modify its contents after uncompressing the PDF source code.
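For the example operator shown earlier, step (c) might look like the following. This is a sketch under strong assumptions: the pattern is the exact TJ array found by reading that particular uncompressed content stream, blank_tj is a hypothetical helper name, and the same pattern is useless for any other file or font. Editing also changes byte counts, so stream lengths and cross-reference offsets recorded in the file no longer match, and a repair pass such as the pdftk one from the other answer is then needed.

```shell
#!/bin/sh
# blank_tj: read PDF source on stdin and empty one specific TJ array.
# The pattern is hand-copied from the example; the leading bracket is escaped
# so that sed treats it literally. An empty array followed by TJ shows no text.
blank_tj() {
    LC_ALL=C sed 's/\[<01>29<0203>-2<0405>6<06>-1<020507>]TJ/[]TJ/g'
}

# Usage (placeholder file names):
#   blank_tj < uncompressed.pdf > edited.pdf
```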
WARNING: The answer currently marked as 'accepted' might have worked for the specific PDF of the OP. However, it will not work in the general case. Don't take the "recipe" it advertises for granted!
Answer 1 (sed and pdftk): answered Dec 14 at 21:58 by dessert; edited 2 days ago
Answer 2 (Accepted answer will work only in rare cases): answered 2 days ago by Kurt Pfeifle; edited 2 days ago