Read XML tags and then remove XML tags using a shell script [closed]



























Given the following input:



<start>
<header>
This is header section
</header>
<body>
<body_start>
This is body section
<a>
<b>
<c>
<st>111</st>
</c>
<d>
<st>blank</st>
</d>
</b>
</a>
</body_start>
<body_section>
This is body section
<a>
<b>
<c>
<st>5</st>
</c>
<d>
<st>666</st>
</d>
</b>
<b>
<c>
<st>154</st>
</c>
<d>
<st>1457954</st>
</d>
</b>
<b>
<c>
<st>845034</st>
</c>
<d>
<st>blank</st>
</d>
</b>
</a>
</body_section>
</body>
</start>


I'd like to perform the following parsing.



If the st value inside a c tag is 154, then the whole <b> to </b> element needs to be removed. Note that the value 154 may or may not be present in the file.



So, if the value 154 is present, then the removal of the following part is needed:



<b>
<c>
<st>154</st>
</c>
<d>
<st>1457954</st>
</d>
</b>


I want to do this in a shell script. I cannot use XSLT because my system does not support it.










closed as off-topic by JakeGould, Burgi, Rajesh S, music2myear, Anaksunaman Feb 28 at 11:51


This question appears to be off-topic. The users who voted to close gave this specific reason:


  • "Questions seeking product, service, or learning material recommendations are off-topic because they become outdated quickly and attract opinion-based answers. Instead, describe your situation and the specific problem you're trying to solve. Share your research. Here are a few suggestions on how to properly ask this type of question." – JakeGould, Burgi, Rajesh S, music2myear, Anaksunaman

If this question can be reworded to fit the rules in the help center, please edit the question.

















  • I think sed isn't the ideal tool for this task. You should use Perl, PHP or a similar language, or an XML-related tool.

    – uzsolt
    Jan 18 '17 at 8:33






  • Why reinvent the wheel when nearly every Unix-based system has xmlstarlet in its repositories?

    – Alex
    Jan 18 '17 at 8:35
















unix shell-script sed awk






edited Feb 22 at 16:03 by kenorb

asked Jan 18 '17 at 6:57 by rjg





1 Answer
You can use pup, a command line tool for processing HTML. For XML you can use xpup.



For example, to find parts for removal, run:



$ pup ':parent-of(:parent-of(:contains("154")))' <file.html
<b>
<c>
<st>
154
</st>
</c>
<d>
<st>
1457954
</st>
</d>
</b>


To remove this section from the input using sed (where file.html is your HTML file), run:



 sed "s@$(pup ':parent-of(:parent-of(:contains("154")))' <file.html | xargs | tr -d " ")@@g" <(xargs <file.html | tr -d " ")


Notes:




  • We use xargs <file.html | tr -d " " to flatten the file into a single line without spaces.

  • We use the pup command shown above to produce the text to remove.

  • We use sed to delete that text: sed "s@PATTERN@@g" <(input).

  • To edit the file in place, point sed at a regular file that has already been flattened instead of the <(...) process substitution, and add -i for GNU sed or -i'.bak' for BSD sed.




For easier understanding, the following script can be used:



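# Flatten stdin into one line with all spaces removed.
function flat_it() { xargs | tr -d " "; }
input=$(flat_it <file.html)
# Flattened markup of the <b> block to delete.
remove=$(pup ':parent-of(:parent-of(:contains("154")))' <<<"$input" | flat_it)
sed "s@$remove@@g" <<<"$input"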




Note: The disadvantage of the above method is that all spaces are removed, including those inside the text content. To avoid this, the input needs to be flattened some other way.
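To see the side effect concretely, here is a small sketch that runs the same xargs | tr pipeline on three lines from the question's sample input. The function name flatten and the variable names are only for illustration.

```shell
#!/usr/bin/env bash
# Same flattening pipeline as flat_it above: xargs joins all input lines
# with single spaces, then tr deletes every space character.
flatten() { xargs | tr -d " "; }

# Three lines taken from the question's sample input.
sample=$'<header>\nThis is header section\n</header>'
flat=$(flatten <<<"$sample")
printf '%s\n' "$flat"   # prints <header>Thisisheadersection</header>
```

The words of the header text are fused together, which is why a flattening step that only strips the spaces next to tags is preferable.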



So instead of xargs | tr -d " ", sed, ex or paste can be used.



Here is an example using ex, which joins all lines and then removes only the spaces that directly follow a < or >:



ex +%j +"s/[><]\zs //g" +%p -scq! file.html


And here is the same thing as a shell function (it can replace the previous flat_it):



function flat_it() { ex +%j +"s/[><]\zs //g" +%p -scq! /dev/stdin; }
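
If pup is not available, a line-oriented filter may be enough for input shaped exactly like the sample in the question. The following is a minimal sketch, not the method used above and not a real XML parser: it assumes every tag sits on its own line and that <b> elements are never nested. The helper name remove_b_block is hypothetical. It buffers each <b> block and prints it only if the block's <c> child does not hold <st>VALUE</st>.

```shell
#!/usr/bin/env bash
# Usage: remove_b_block VALUE <input.xml >output.xml
# Hypothetical helper: drops every <b>...</b> block whose <c> child contains
# <st>VALUE</st>. Assumes one tag per line and no nested <b> elements.
remove_b_block() {
  awk -v target="$1" '
    # Work on a copy of the line with leading/trailing blanks removed.
    { line = $0; gsub(/^[ \t]+|[ \t]+$/, "", line) }

    # Opening <b>: start buffering a new block.
    line == "<b>" { inb = 1; inc = 0; drop = 0; buf = "" }

    inb {
      buf = buf $0 "\n"
      if (line == "<c>")  inc = 1
      if (line == "</c>") inc = 0
      # Mark the block for removal only when the match is inside <c>.
      if (inc && line == ("<st>" target "</st>")) drop = 1
      if (line == "</b>") {
        if (!drop) printf "%s", buf
        inb = 0
      }
      next
    }

    # Lines outside any <b> block pass through unchanged.
    { print }
  '
}
```

For the question's case this would be called as remove_b_block 154 <file.xml. Because the comparison is against the whole line, values such as 1540 are not matched, and a 154 that appears under <d> rather than <c> leaves the block in place.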





edited Feb 22 at 16:26

answered Feb 22 at 16:01 by kenorb














