Read XML tags and then remove XML tags using a shell script [closed]

Given the following input:

<start>

   <header>

      This is header section

   </header>

   <body>

      <body_start>

         This is body section

         <a>

            <b>

               <c>

                  <st>111</st>

               </c>

               <d>

                  <st>blank</st>

               </d>

            </b>

         </a>

      </body_start>

      <body_section>

         This is body section

         <a>

            <b>

               <c>

                  <st>5</st>

               </c>

               <d>

                  <st>666</st>

               </d>

            </b>

            <b>

               <c>

                  <st>154</st>

               </c>

               <d>

                  <st>1457954</st>

               </d>

            </b>

            <b>

               <c>

                  <st>845034</st>

               </c>

               <d>

                  <st>blank</st>

               </d>

            </b>

         </a>

      </body_section>

   </body>

</start>

I'd like to perform the following parsing.

If the st value inside the c tag is 154, then the whole <b> to </b> element needs to be removed. Note that the value 154 may or may not be present in the file.

So, if the value 154 is present, then the removal of the following part is needed:

<b>

   <c>

      <st>154</st>

   </c>

   <d>

      <st>1457954</st>

   </d>

</b>

I want to do this in a shell script. I cannot use XSLT because my system does not support it.

edited Feb 22 at 16:03

kenorb

asked Jan 18 '17 at 6:57

rjg

closed as off-topic by JakeGould, Burgi, Rajesh S, music2myear, Anaksunaman Feb 28 at 11:51

This question appears to be off-topic. The users who voted to close gave this specific reason:

"Questions seeking product, service, or learning material recommendations are off-topic because they become outdated quickly and attract opinion-based answers. Instead, describe your situation and the specific problem you're trying to solve. Share your research. Here are a few suggestions on how to properly ask this type of question." – JakeGould, Burgi, Rajesh S, music2myear, Anaksunaman

If this question can be reworded to fit the rules in the help center, please edit the question.

I think sed isn't the ideal tool for this task. You should use perl, php or a similar language, or an XML-related tool.

– uzsolt
Jan 18 '17 at 8:33

Why reinvent the wheel when nearly all Unix-based systems have xmlstarlet in their repos?

– Alex
Jan 18 '17 at 8:35

unix shell-script sed awk

1 Answer

You can use pup, a command line tool for processing HTML. For XML you can use xpup.

For example, to find parts for removal, run:

$ pup ':parent-of(:parent-of(:contains("154")))' <file.html

<b>

 <c>

  <st>

   154

  </st>

 </c>

 <d>

  <st>

   1457954

  </st>

 </d>

</b>

To remove this section from the input using sed (where file.html is your HTML file), run:

 sed "s@$(pup ':parent-of(:parent-of(:contains("154")))' <file.html | xargs | tr -d " ")@@g" <(xargs <file.html | tr -d " ")

Notes:

We use xargs <file.html | tr -d " " to flatten the file into a single line without spaces.

We use the pup command shown above to find the pattern to remove.

We use sed to remove the pattern by: sed "s@PATTERN@@g" <(input).

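To edit the file in place, add -i for GNU sed, or -i'.bak' for BSD sed. In-place editing needs a regular file as input, so it cannot be combined with the <(...) process substitution used above; write the flattened text to a file first.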
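To make the flattening step concrete, here is a small sketch run on fragments of the sample input. The helper name flatten_demo is made up for this illustration. Note that xargs also treats quotes and backslashes specially, so this is only safe on input without them.

```shell
# Hypothetical helper: join stdin onto one line, then delete every space.
flatten_demo() { xargs | tr -d ' '; }

# Tags spread over several indented lines collapse into one string.
printf '%s\n' '<b>' '   <c>' '      <st>154</st>' '   </c>' '</b>' | flatten_demo
# prints: <b><c><st>154</st></c></b>

# Text content loses its inner spaces as well.
printf '%s\n' '<header>' '   This is header section' '</header>' | flatten_demo
# prints: <header>Thisisheadersection</header>
```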

For easier understanding, the following script can be used:

function flat_it() { xargs | tr -d " "; }

input=$(flat_it <file.html)

remove=$(pup ':parent-of(:parent-of(:contains("154")))' <<<"$input" | flat_it)

sed "s@$remove@@g" <<<"$input"
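The question notes that 154 may be absent. In that case the fragment to remove would be empty, and sed rejects an empty pattern. The fragment is also used as a regular expression, so characters such as . or @ in it would be misread. The sketch below guards against both; strip_fragment is a hypothetical helper, not part of pup or sed, and it takes the already flattened input and the fragment as plain arguments.

```shell
# strip_fragment FLAT_INPUT FRAGMENT   (hypothetical helper)
# Prints FLAT_INPUT with every literal occurrence of FRAGMENT removed.
strip_fragment() {
  sf_input=$1
  sf_fragment=$2
  # Nothing to remove (for example, value 154 is absent): print unchanged.
  if [ -z "$sf_fragment" ]; then
    printf '%s\n' "$sf_input"
    return 0
  fi
  # Escape basic-regex metacharacters and the @ delimiter so the
  # fragment is matched literally.
  sf_escaped=$(printf '%s' "$sf_fragment" | sed 's/[][\.*^$@]/\\&/g')
  printf '%s\n' "$sf_input" | sed "s@${sf_escaped}@@g"
}
```

It could then be called as strip_fragment "$input" "$remove" in place of the final sed line.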

Note: The disadvantage of the above method is that all spaces are removed, including in the content. To make it better, some other way of flattening input needs to be used.

So instead of xargs | tr -d " ", sed, ex or paste can be used.
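As one possible variant with tr and sed (a sketch, not taken from the original answer; the name flat_keep_text is made up here), whitespace can be deleted only where it touches a tag, so spaces inside text content survive. It assumes the text content contains no literal > character.

```shell
# Hypothetical helper: join all lines, then strip whitespace only
# directly after ">" and directly before "<".
flat_keep_text() {
  tr '\n' ' ' | sed 's/>[[:space:]]*/>/g; s/[[:space:]]*</</g'
}

printf '%s\n' '<start>' '   <header>' '      This is header section' \
  '   </header>' '</start>' | flat_keep_text
# prints: <start><header>This is header section</header></start>
```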

Here is the example using ex:

ex +%j +"s/[><]\zs //g" +%p -scq! file.html

And here is the version with shell function (which can replace the previous version):

function flat_it() { ex +%j +"s/[><]\zs //g" +%p -scq! /dev/stdin; }
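A different route, which avoids flattening altogether, is to buffer each <b> block in awk and print it only if its <c> element does not hold the target value. This is a line-oriented sketch, not a real XML parser and not the method above: it assumes that <b> blocks are not nested and that no unrelated markup shares a line with an opening or closing b tag. The name drop_b_blocks is made up for this illustration.

```shell
# drop_b_blocks VALUE < input   (hypothetical helper)
# Copies stdin to stdout, omitting each <b>...</b> block whose <c>
# element contains <st>VALUE</st>. Original whitespace is preserved.
drop_b_blocks() {
  awk -v target="$1" '
    /<b>/ { in_b = 1; buf = ""; drop = 0 }   # start buffering a block
    in_b {
      buf = buf $0 ORS
      if ($0 ~ /<c>/) in_c = 1
      # Only a match inside <c> marks the block; <d> values are ignored.
      if (in_c && index($0, "<st>" target "</st>")) drop = 1
      if ($0 ~ /<\/c>/) in_c = 0
      if ($0 ~ /<\/b>/) { if (!drop) printf "%s", buf; in_b = 0 }
      next
    }
    { print }                                # lines outside any block
  '
}
```

Usage would be along the lines of drop_b_blocks 154 <file.html >file.new. If 154 does not occur, the input passes through unchanged.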

edited Feb 22 at 16:26

answered Feb 22 at 16:01

kenorb
