Read XML tags and then remove XML tags using a shell script [closed]
Given the following input:
<start>
<header>
This is header section
</header>
<body>
<body_start>
This is body section
<a>
<b>
<c>
<st>111</st>
</c>
<d>
<st>blank</st>
</d>
</b>
</a>
</body_start>
<body_section>
This is body section
<a>
<b>
<c>
<st>5</st>
</c>
<d>
<st>666</st>
</d>
</b>
<b>
<c>
<st>154</st>
</c>
<d>
<st>1457954</st>
</d>
</b>
<b>
<c>
<st>845034</st>
</c>
<d>
<st>blank</st>
</d>
</b>
</a>
</body_section>
</body>
</start>
I'd like to perform the following parsing.
If the st value of the c tag is 154, then the whole <b> to </b> block needs to be removed. Note that the value 154 may or may not be present in the file.
So, if the value 154 is present, then the removal of the following part is needed:
<b>
<c>
<st>154</st>
</c>
<d>
<st>1457954</st>
</d>
</b>
I want to do the coding in a shell script. I cannot use XSLT because my system does not support it.
unix shell-script sed awk
closed as off-topic by JakeGould, Burgi, Rajesh S, music2myear, Anaksunaman Feb 28 at 11:51
This question appears to be off-topic. The users who voted to close gave this specific reason:
- "Questions seeking product, service, or learning material recommendations are off-topic because they become outdated quickly and attract opinion-based answers. Instead, describe your situation and the specific problem you're trying to solve. Share your research. Here are a few suggestions on how to properly ask this type of question." – JakeGould, Burgi, Rajesh S, music2myear, Anaksunaman
If this question can be reworded to fit the rules in the help center, please edit the question.
I think sed isn't the ideal tool for this task. You should use Perl, PHP or a similar language, or an XML-related tool.
– uzsolt Jan 18 '17 at 8:33
Why reinvent the bicycle when nearly all Unix-based systems have xmlstarlet in their repos?
– Alex Jan 18 '17 at 8:35
edited Feb 22 at 16:03 by kenorb
asked Jan 18 '17 at 6:57 by rjg
1 Answer
You can use pup, a command-line tool for processing HTML. For XML you can use xpup.
For example, to find parts for removal, run:
$ pup ':parent-of(:parent-of(:contains("154")))' <file.html
<b>
<c>
<st>
154
</st>
</c>
<d>
<st>
1457954
</st>
</d>
</b>
To remove this section from the input using sed (where file.html is your HTML file), run:
sed "s@$(pup ':parent-of(:parent-of(:contains("154")))' <file.html | xargs | tr -d " ")@@g" <(xargs <file.html | tr -d " ")
Notes:
- We use xargs <file.html | tr -d " " to flatten the file into a single line without spaces.
- We use the pup command shown above to find the pattern for removal.
- We use sed to remove the pattern: sed "s@PATTERN@@g" <(input).
- To replace in-place (by modifying the file), sed needs a regular file as input rather than the process substitution above; then add -i for GNU sed, or -i'.bak' for BSD sed.
For easier understanding, the following script can be used:
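# Flatten stdin into a single line and strip all spaces.
function flat_it() { xargs | tr -d " "; }
input=$(flat_it <file.html)
# Extract the <b> block to remove, flattened the same way.
remove=$(pup ':parent-of(:parent-of(:contains("154")))' <<<"$input" | flat_it)
sed "s@$remove@@g" <<<"$input"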
Note: The disadvantage of the above method is that all spaces are removed, including those inside the content. To avoid this, the input needs to be flattened in some other way.
So instead of xargs | tr -d " ", you can use sed, ex or paste.
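As a rough illustration of the paste option, here is a minimal sketch of a flattening function that keeps spaces inside the content. The name flat_keep_spaces is hypothetical and not part of the original answer. It assumes that only leading and trailing whitespace on each line is insignificant. It has not been checked against pup's re-serialized output for arbitrary files.

```shell
#!/bin/sh
# Hypothetical helper: flatten stdin into one line while keeping
# spaces that occur inside the text content.
flat_keep_spaces() {
    # 1. Strip leading and trailing whitespace from every line
    #    (this removes indentation such as the one pup adds).
    # 2. Join all lines with an empty delimiter ('\0' means "no
    #    delimiter" to paste, not a NUL byte).
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' | paste -s -d '\0' -
}
```

Under those assumptions it could stand in for flat_it in the script above, so that a line such as "This is header section" keeps its spaces.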
Here is an example using ex:
ex +%j +"s/[><]\zs //g" +%p -scq! file.html
And here is the version as a shell function (which can replace the previous version of flat_it):
function flat_it() { ex +%j +"s/[><]\zs //g" +%p -scq! /dev/stdin; }
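If neither pup nor an XML tool is available, a line-oriented awk filter is another possibility that avoids flattening altogether. The sketch below is not part of the original answer, and the function name remove_b_block is hypothetical. It assumes the file keeps one tag per line, as in the question's sample, and that <b> blocks are never nested. It buffers each <b> block and prints it only if its <c> element does not hold the given st value.

```shell
#!/bin/sh
# Hypothetical helper: copy stdin to stdout, dropping every <b>...</b>
# block whose <c> element contains <st>VALUE</st>.
# Assumptions: one tag per line, no nested <b> blocks.
remove_b_block() {
    awk -v val="$1" '
        # Start buffering at an opening <b> tag.
        /<b>/ && !in_b { in_b = 1; buf = ""; drop = 0; in_c = 0 }
        in_b {
            buf = buf $0 "\n"
            if ($0 ~ /<c>/) in_c = 1
            # Only a match inside <c> marks the block for removal.
            if (in_c && index($0, "<st>" val "</st>") > 0) drop = 1
            if ($0 ~ /<\/c>/) in_c = 0
            if ($0 ~ /<\/b>/) {
                if (!drop) printf "%s", buf
                in_b = 0
            }
            next
        }
        # Lines outside any <b> block pass through unchanged.
        { print }
    '
}
```

It could be called as remove_b_block 154 <file.html >file.new.html. Because the match is restricted to the <c> element, a 154 that appears only under <d> leaves the block untouched.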
edited Feb 22 at 16:26
answered Feb 22 at 16:01 by kenorb